Convert Clair3 gVCF to VDS failed with NumberFormatException error

Hi,

I hope to get some insight into the issue in the title. Thanks in advance.

I was trying to import two gVCF files, “sample1.wf_snp.gvcf.gz” and “sample2.wf_snp.gvcf.gz”, into Hail and convert them into a VDS with the code below:

import hail as hl

sample1_gvcf_uri = <s3_location_of_gvcf>
sample2_gvcf_uri = <s3_location_of_gvcf>

gvcfs = [
    sample1_gvcf_uri,
    sample2_gvcf_uri,
]

combiner = hl.vds.new_combiner(
    output_path=vds_uri,
    temp_path=f'{vds_prefix}/checkpoints/',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
    reference_genome='GRCh38',
    branch_factor=50, # number of inputs combined in one VDS
    target_records=500_000 # number of rows per partition
)

This simple run produces the error below, shown together with the line that triggered it:

...error while parsing line
chr2	64442119	.	C	CAAAAAAAAAAAA,CAAAAAAAAA,<NON_REF>	8.49	PASS	F	GT:GQ:DP:AD:AF:PL	1/2:8:21:0,6,5,0:0.2857,0.2381:59,28,4,28,0,4,990,990,990,990

NumberFormatException: For input string: "0.2857,0.2381"

The error probably refers to the AF FORMAT field.

I have tried to import this vcf with

mt = hl.import_vcf(sample1_gvcf_uri, reference_genome="GRCh38", force_bgz=True)
mt.describe()

This works, with a warning that my gVCF file name should end with “.vcf[.bgz, .gz]”. Since mt.describe() prints the schema it is supposed to, I assume the import itself worked.

Is there any way I can bypass this issue with the new_combiner?

That looks malformed: I would expect AF to have the same number of elements as AD. What’s your full stack trace for this error? Also, what’s the header line for AF?

vds_combiner_errorTrace.txt (39.0 KB)

Hi,

The full error trace is attached above.

The header line for AF is:

##FORMAT=<ID=AF,Number=1,Type=Float,Description="Observed allele frequency in reads, for each ALT allele, in the same order as listed, or the REF allele for a RefCall">
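To make the mismatch concrete, here is a minimal sketch in plain Python (no Hail involved) that compares the Number declared in that header line with the number of AF values in the record from the original post. The helper names declared_number and format_values are made up for illustration, and the record is assumed to be tab-separated with a single sample column.

```python
import re

def declared_number(header_line: str) -> str:
    """Return the Number= attribute of a ##FORMAT or ##INFO header line."""
    match = re.search(r"Number=([^,>]+)", header_line)
    if match is None:
        raise ValueError("no Number= attribute found")
    return match.group(1)

def format_values(record: str, field: str, sample_index: int = 0) -> list:
    """Return the comma-separated values of one FORMAT field for one sample."""
    cols = record.rstrip("\n").split("\t")
    keys = cols[8].split(":")                    # FORMAT column
    values = cols[9 + sample_index].split(":")   # sample column
    return values[keys.index(field)].split(",")

header = ('##FORMAT=<ID=AF,Number=1,Type=Float,Description="Observed allele '
          'frequency in reads, for each ALT allele, in the same order as '
          'listed, or the REF allele for a RefCall">')
record = ("chr2\t64442119\t.\tC\tCAAAAAAAAAAAA,CAAAAAAAAA,<NON_REF>\t8.49\tPASS\tF\t"
          "GT:GQ:DP:AD:AF:PL\t"
          "1/2:8:21:0,6,5,0:0.2857,0.2381:59,28,4,28,0,4,990,990,990,990")

number = declared_number(header)
af = format_values(record, "AF")
mismatch = number == "1" and len(af) != 1
print(f"declared Number={number}, observed {len(af)} AF values, mismatch={mismatch}")

# A parser that trusts Number=1 treats the whole field as one float, which fails:
try:
    float(",".join(af))
    parsed_as_single_float = True
except ValueError:
    parsed_as_single_float = False
```

The failed float() call is the Python analogue of the NumberFormatException above: the header promises one float, but the field holds two.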

Thanks

ZH

Your files are malformed. Number=1 means that there should only be a single value. We are correctly rejecting 0.50,0.50 as not being a properly formatted float.
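One possible way around this, sketched here as an assumption rather than a confirmed fix, is to correct the header before handing the files to the combiner so that AF is declared with a variable number of values (Number=. in the VCF specification). The function relax_format_number below is a hypothetical helper that rewrites a single header line. Applying it to a whole bgzipped gVCF (for example by writing a corrected header and using a tool such as bcftools reheader) is left out, and whether the combiner then handles the resulting array-valued AF as desired has not been verified here.

```python
import re

def relax_format_number(header_line: str, field_id: str) -> str:
    """Rewrite Number=... to Number=. for one FORMAT field; pass other lines through."""
    prefix = f"##FORMAT=<ID={field_id},"
    if not header_line.startswith(prefix):
        return header_line
    # Only the first Number= attribute is touched; the description is left intact.
    return re.sub(r"Number=[^,>]+", "Number=.", header_line, count=1)

af_header = ('##FORMAT=<ID=AF,Number=1,Type=Float,Description="Observed allele '
             'frequency in reads, for each ALT allele, in the same order as '
             'listed, or the REF allele for a RefCall">')
other_header = "##contig=<ID=chr2>"  # made-up unrelated header line for illustration

fixed = relax_format_number(af_header, "AF")
untouched = relax_format_number(other_header, "AF")
print(fixed)
```

The cleaner long-term fix is for the tool that wrote the gVCF to emit a header that matches its data, since any strict VCF parser will reject the current combination.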