VariantDatasetCombiner - "dataset contains both multiallelic variants and duplicated loci" error

Hello,

I am trying to run VariantDatasetCombiner:

import hail as hl

path_to_input_list = 'input_files.txt'  # a file with one GVCF path per line

gvcfs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        print(line)
        gvcfs.append(line.strip())

combiner = hl.vds.new_combiner(
    output_path='dataset.vds',
    temp_path='my-temp-bucket',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
    reference_genome='GRCh38',
)

combiner.run()

But an error showed up:

Current key: { locus: { contig: chrM, position: 302 }, alleles: [3; A, AC, ACC] }
Previous key: { locus: { contig: chrM, position: 302 }, alleles: [3; A, AC, C] }
This error can occur after a split_multi if the dataset
contains both multiallelic variants and duplicated loci.

How can I tell the combiner not to split_multi, or how should I preprocess the GVCF files?

In addition, I would like to mention that hl.experimental.run_combiner worked with the same files:

import hail as hl

path_to_input_list = 'input_files.txt'  # a file with one GVCF path per line

inputs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        print(line)
        inputs.append(line.strip())

output_file = 'output.mt'  # output destination
temp_bucket = 'my-temp-bucket'  # bucket for storing intermediate files
hl.experimental.run_combiner(
    inputs,
    out_file=output_file,
    tmp_path=temp_bucket,
    reference_genome='GRCh38',
    use_genome_default_intervals=True,
    overwrite=True,
)

But my analysis needs the methods that VariantDataset provides.

I would appreciate your help.

Can you share the full stack trace for this error? I'm certainly concerned to see this error coming out of the VDS combiner.

It is too long, so I am sending it in a text file.

And thank you for trying to help me. 🙂
error_hail.txt (40.6 KB)

Is it possible that your GVCFs have duplicate loci? Could you use tabix to look up chrM:302 for each GVCF and ensure that each GVCF has at most one record for that site?
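For example, something along these lines should work (a minimal sketch using pysam rather than the tabix command line; the paths are placeholders for your actual files):

import pysam  # assumes each GVCF is bgzipped and has a tabix (.tbi) index

# Placeholder paths; substitute your actual GVCF files.
for path in ['1.g.vcf.gz', '2.g.vcf.gz', '3.g.vcf.gz']:
    tbx = pysam.TabixFile(path)
    # fetch() takes a 0-based, half-open interval, so chrM:302 is (301, 302)
    records = list(tbx.fetch('chrM', 301, 302))
    tbx.close()
    print(f'{path}: {len(records)} record(s) at chrM:302')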

1.gvcf:
[screenshot of tabix output for chrM:302]

2.gvcf:
[screenshot of tabix output for chrM:302]

3.gvcf does not have chrM:302:
[screenshot of tabix output for chrM:302]

Ack, this is the reason. Hail's combiner data model assumes that GVCFs use a multiallelic rather than a split biallelic representation for multiple polymorphisms at the same locus. Which caller are these from? Is there an unsplit (multiallelic) version you could use?
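If an unsplit version is not available, one possible workaround (an untested sketch on my part, so please sanity-check the merged records; bcftools may not reconstruct the caller's original INFO/FORMAT fields exactly) is to merge the split records back into multiallelic ones with bcftools norm before combining. The paths below are placeholders:

import subprocess

# "bcftools norm -m +any" joins split biallelic records at the same
# position back into a single multiallelic record.
src = 'sample.g.vcf.gz'
dst = 'sample.merged.g.vcf.gz'
subprocess.run(['bcftools', 'norm', '-m', '+any', '-Oz', '-o', dst, src], check=True)
subprocess.run(['tabix', '-p', 'vcf', dst], check=True)  # re-index the merged GVCF

That said, re-calling the samples with multiallelic output, if that is an option, would be the safer fix.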

To be honest, I do not know; I will need to get this information from the leader of my project, probably tomorrow.

And if we do not have another version of the GVCFs, is there any way around this?