This is on DNAnexus, of course.
Running hl.import_vcf on filenames containing brackets, e.g. file:///mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr1/ukb24310_c1_b92_v1.vcf.gz (why anyone would use this folder name in the first place is another question), throws HailException: arguments refer to no files.
It might just be my regex skills, but none of the escapes I tried for the bracketed part works: "[500k release]", "[[]500k release[]]", "\[500k release\]", "\\[500k release\\]".
I have not tried to reproduce it outside their environment, but I suspect the issue might be general.
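To illustrate why I suspect the brackets are the problem: as far as I understand, Hail passes paths through Hadoop-style glob matching, where [...] is a character class. The sketch below uses Python's own fnmatch and glob modules as a stand-in, so it only demonstrates the general glob behaviour, not Hadoop's exact rules (which evidently differ, since the analogous escape did not work for me in Hail).

```python
import fnmatch
import glob

# Folder name taken from the failing path.
name = "DRAGEN population level WGS variants, pVCF format [500k release]"

# Used as a glob pattern, "[500k release]" is a character class that matches
# exactly ONE character from the set, so the pattern cannot match its own
# literal folder name.
matches_itself = fnmatch.fnmatchcase(name, name)

# glob.escape wraps the special character "[" in its own class: "[" -> "[[]".
escaped = glob.escape(name)
matches_escaped = fnmatch.fnmatchcase(name, escaped)

print(matches_itself)
print(escaped)
print(matches_escaped)
```

In Python's glob dialect the escaped pattern matches the literal name again, which is what I was hoping to achieve with the variants listed above.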
from pyspark.sql import SparkSession
import hail as hl

builder = (
    SparkSession
    .builder
    .enableHiveSupport()
)
spark = builder.getOrCreate()
hl.init(sc=spark.sparkContext)

WGS_PATH = 'file:///mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr1/ukb24310_c1_b92_v1.vcf.gz'

mt = hl.import_vcf(
    WGS_PATH,
    force_bgz=True,
    reference_genome="GRCh38",
    array_elements_required=False,
)
> 2024-01-18 10:21:07.456 Hail: WARN: 'file:///mnt/project/Bulk/DRAGEN WGS/DRAGEN population level WGS variants, pVCF format [500k release]/chr1/ukb24310_c1_b92_v1.vcf.gz' refers to no files
Here is the response from support:
Hello Jakob,
We have noticed that HAIL has problem when file paths have special characters like “[” and “]”. This character is only introduced in the 500k release.
The workaround now would be to change the corresponding folder name, in the project, to remove these two characters, and try running the code again.
To move/rename a folder, please go to a local terminal and use dx mv command:
dx mv -h
usage: dx mv [-h] [--env-help] [-a] source [source ...] destination
Move or rename data objects and/or folders inside a single project.
Please do note that this moving command must be carried out before a jupyterlab session is opened.
Has anyone encountered something similar? Or even better found a fix that does not involve duplicating a very large amount of data?
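One idea I have not tested: instead of renaming or copying anything, create a symlink with a bracket-free name that points at the offending folder, and hand Hail the path through the symlink. The sketch below is only that, a sketch. The helper name make_bracket_free_alias and the alias location are hypothetical, and it assumes the driver and all Spark executors see the same local filesystem, which may not hold on a multi-node cluster.

```python
import os


def make_bracket_free_alias(source_dir: str, alias_path: str) -> str:
    """Create a symlink at alias_path pointing to source_dir; return alias_path.

    Hypothetical helper: the point is only that alias_path contains no
    glob-special brackets, while the data itself stays where it is.
    """
    if any(ch in alias_path for ch in "[]"):
        raise ValueError("alias path must not contain brackets")
    # Reuse an existing symlink; otherwise create one (no data is copied).
    if not os.path.islink(alias_path):
        os.symlink(source_dir, alias_path, target_is_directory=True)
    return alias_path
```

With an alias such as /tmp/wgs_500k (a made-up location) pointing at the "[500k release]" folder, the path passed to hl.import_vcf would become file:///tmp/wgs_500k/chr1/ukb24310_c1_b92_v1.vcf.gz. Whether Hail then resolves it correctly is exactly what I do not know.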
Best,
Jakob
EDIT: hail==0.2.116