VariantDatasetCombiner - dataset contains both multiallelic variants and duplicated loci for review

Hello,

I am trying to run VariantDatasetCombiner:

import hail as hl

path_to_input_list = 'input_files.txt'  # a file with one GVCF path per line

# Read the GVCF paths, one per line
gvcfs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        print(line)
        gvcfs.append(line.strip())

combiner = hl.vds.new_combiner(
    output_path='dataset.vds',
    temp_path='my-temp-bucket',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
    reference_genome='GRCh38',
)

combiner.run()

But an error showed up:

Current key: { locus: { contig: chrM, position: 302 }, alleles: [3; A, AC, ACC] }
Previous key: { locus: { contig: chrM, position: 302 }, alleles: [3; A, AC, C] }
This error can occur after a split_multi if the dataset
contains both multiallelic variants and duplicated loci.

How can I tell the combiner not to run split_multi, or how should I preprocess the GVCF files?

In addition, I would like to mention that hl.experimental.run_combiner worked with the same files:

import hail as hl

path_to_input_list = 'input_files.txt'  # a file with one GVCF path per line

# Read the GVCF paths, one per line
inputs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        print(line)
        inputs.append(line.strip())

output_file = 'output.mt'  # output destination
temp_bucket = 'my-temp-bucket'  # bucket for storing intermediate files
hl.experimental.run_combiner(
    inputs,
    out_file=output_file,
    tmp_path=temp_bucket,
    reference_genome='GRCh38',
    use_genome_default_intervals=True,
    overwrite=True,
)

But I need to use VariantDataset methods in my analysis.

I would appreciate your help.

Can you share the full stack trace for this error? I’m certainly concerned to see this error coming out of the VDS combiner.

It is too long, so I am sending it as a text file.

And thank you for trying to help me. :slight_smile:
error_hail.txt (40.6 KB)

Is it possible that your GVCFs have duplicate loci? Could you use tabix to look up chrM:302 for each GVCF and ensure that each GVCF has at most one record for that site?
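If tabix is not at hand, the same check can be roughly sketched in plain Python. This is only an illustration: the helper names `records_at_locus` and `count_records_per_gvcf` are made up for this post, the files are assumed to be local, and the match is on the exact POS value, so a reference block that starts before the position and spans it would not be reported (tabix would return it).

```python
import gzip
from typing import Dict, Iterable, List


def records_at_locus(lines: Iterable[str], contig: str, position: int) -> List[str]:
    """Return every GVCF data line whose CHROM and POS match the given locus."""
    hits = []
    for line in lines:
        if line.startswith('#'):  # skip header lines
            continue
        fields = line.rstrip('\n').split('\t', 2)
        if len(fields) >= 2 and fields[0] == contig and fields[1] == str(position):
            hits.append(line.rstrip('\n'))
    return hits


def count_records_per_gvcf(paths: Iterable[str], contig: str, position: int) -> Dict[str, int]:
    """Map each local GVCF path to the number of records it has at the locus."""
    counts = {}
    for path in paths:
        # bgzip output is gzip-compatible, so gzip.open can read .gz GVCFs
        opener = gzip.open if path.endswith('.gz') else open
        with opener(path, 'rt') as f:
            counts[path] = len(records_at_locus(f, contig, position))
    return counts
```

Any path whose count is greater than 1 for chrM:302 has more than one record at that site.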

1. gvcf: [image]

2. gvcf: [image]

3. gvcf does not have chrM:302: [image]

Ack, this is the reason. Hail’s combiner data model assumes that GVCFs use multiallelic rather than split biallelic representations for multiple polymorphisms at the same locus. What caller are these from? Is there an unsplit (multiallelic) version you could use?
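To get a feel for how many sites are affected beyond chrM:302, a scan along these lines might help. It is a sketch, not part of Hail: `duplicated_loci` is a hypothetical helper, and it assumes the GVCF is coordinate-sorted so that records sharing a locus sit next to each other.

```python
from typing import Iterable, List, Tuple


def duplicated_loci(lines: Iterable[str]) -> List[Tuple[str, int, int]]:
    """Return (contig, position, record_count) for each locus with more than one record.

    Assumes the input is coordinate-sorted, so duplicates are adjacent.
    """
    duplicates = []
    previous = None
    run_length = 0
    for line in lines:
        if line.startswith('#'):  # skip header lines
            continue
        fields = line.split('\t', 2)
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        locus = (fields[0], int(fields[1]))
        if locus == previous:
            run_length += 1
        else:
            # Close out the previous run before starting a new one
            if run_length > 1:
                duplicates.append((previous[0], previous[1], run_length))
            previous = locus
            run_length = 1
    # Close out the final run
    if run_length > 1:
        duplicates.append((previous[0], previous[1], run_length))
    return duplicates
```

An empty result would suggest the file already has at most one record per locus; a long list would suggest the caller emitted split records throughout.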

To be honest, I do not know. I will need to get this information from my project leader, probably tomorrow.

And if we do not have another version of the GVCFs, is there any way around this?

I got the same error at the chrM:302 locus. It is a multiallelic site. My gVCFs are from DRAGEN v4.0.

Is there any way to use specific contigs or exclude chrM in hl.vds.new_combiner?

I thought contig_recoding might work on a specific contig, but it did not (it is meant for renaming contigs).

combiner = hl.vds.new_combiner(
    output_path='/Users/joonan/tmp/output/dataset.vds',
    temp_path='/Users/joonan/tmp/temp',
    gvcf_paths=gvcfs,
    reference_genome='GRCh38',
    contig_recoding={'chr1': 'chr1'},
    use_genome_default_intervals=True,
)

Thanks,

There is not currently a way to filter GVCFs during combination (I’ve created an issue for this).

Can you confirm that your GVCF file has multiple lines with the 302 locus? Do you know if DRAGEN has a configuration that prevents this?

In the short term, I think your best bet is to filter the GVCFs by hand with awk or something. Apologies!
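For anyone who would rather do the filtering in Python than awk, here is a minimal sketch that drops every record on chosen contigs (for example chrM) and leaves the header untouched. The function names `drop_contigs` and `filter_gvcf` are made up for illustration. The output is written uncompressed, so it would most likely still need to be compressed with bgzip and indexed with tabix before the combiner can read it.

```python
import gzip
from typing import Iterable, Iterator, Set


def drop_contigs(lines: Iterable[str], excluded: Set[str]) -> Iterator[str]:
    """Yield GVCF lines, skipping data records on the excluded contigs."""
    for line in lines:
        if line.startswith('#'):
            yield line  # keep all header lines unchanged
            continue
        contig = line.split('\t', 1)[0]
        if contig not in excluded:
            yield line


def filter_gvcf(in_path: str, out_path: str, excluded: Set[str]) -> None:
    """Write an uncompressed copy of a local GVCF without the excluded contigs."""
    # bgzip output is gzip-compatible, so gzip.open can read .gz GVCFs
    opener = gzip.open if in_path.endswith('.gz') else open
    with opener(in_path, 'rt') as src, open(out_path, 'w') as dst:
        dst.writelines(drop_contigs(src, excluded))
```

A call such as `filter_gvcf('sample.g.vcf.gz', 'sample.nochrM.g.vcf', {'chrM'})` (file names hypothetical) would remove all chrM records, which discards the mitochondrial calls entirely rather than repairing them.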

Hi Dan,

Thanks for the prompt reply. I am not sure whether DRAGEN has a configuration for this, but re-processing thousands of samples with a new configuration would be difficult because of the cost. awk or bcftools are options I can work with, but they are not nearly as fast as Hail, so it would be painful. It would be great if a future release included a way to select specific contigs.

Best,

Hey @joonan30! I’m tracking this down a bit more to hopefully save others this trouble in the future. Did you use a mitochondrial-specific version of the DRAGEN caller, or did this use the regular settings on data from chrM?

@joonan30, any chance you could share the settings you used for DRAGEN?