VariantDatasetCombiner - "dataset contains both multiallelic variants and duplicated loci" error

Hello,

I am trying to run VariantDatasetCombiner:

import hail as hl

path_to_input_list = 'input_files.txt'  # a file with one GVCF path per line

gvcfs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        print(line)
        gvcfs.append(line.strip())

combiner = hl.vds.new_combiner(
    output_path='dataset.vds',
    temp_path='my-temp-bucket',
    gvcf_paths=gvcfs,
    use_genome_default_intervals=True,
    reference_genome='GRCh38',
)

combiner.run()

But an error showed up:

Current key: { locus: { contig: chrM, position: 302 }, alleles: [3; A, AC, ACC] }
Previous key: { locus: { contig: chrM, position: 302 }, alleles: [3; A, AC, C] }
This error can occur after a split_multi if the dataset
contains both multiallelic variants and duplicated loci.

How can I tell the combiner not to split_multi, or how should I preprocess the GVCF files?

In addition, I would like to mention that hl.experimental.run_combiner worked with the same files:

import hail as hl

path_to_input_list = 'input_files.txt'  # a file with one GVCF path per line

inputs = []
with hl.hadoop_open(path_to_input_list, 'r') as f:
    for line in f:
        print(line)
        inputs.append(line.strip())

output_file = 'output.mt'  # output destination
temp_bucket = 'my-temp-bucket'  # bucket for storing intermediate files
hl.experimental.run_combiner(
    inputs,
    out_file=output_file,
    tmp_path=temp_bucket,
    reference_genome='GRCh38',
    use_genome_default_intervals=True,
    overwrite=True,
)

But my analysis needs the methods that VariantDataset provides.

I would appreciate your help.

Can you share the full stack trace for this error? I'm certainly concerned to see this error coming out of the VDS combiner.

It is too long, so I am sending it in a text file.

And thank you for trying to help me. 🙂
error_hail.txt (40.6 KB)

Is it possible that your GVCFs have duplicate loci? Could you use tabix to look up chrM:302 for each GVCF and ensure that each GVCF has at most one record for that site?
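For example, something along these lines should work (a minimal sketch using pysam rather than the tabix command line; the paths are placeholders for your actual files):

import pysam  # assumes each GVCF is bgzipped and has a tabix (.tbi) index

# Placeholder paths; substitute your actual GVCF files.
for path in ['1.g.vcf.gz', '2.g.vcf.gz', '3.g.vcf.gz']:
    tbx = pysam.TabixFile(path)
    # fetch() takes a 0-based, half-open interval, so chrM:302 is (301, 302)
    records = list(tbx.fetch('chrM', 301, 302))
    tbx.close()
    print(f'{path}: {len(records)} record(s) at chrM:302')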

1.gvcf:
[screenshot of tabix output for chrM:302]

2.gvcf:
[screenshot of tabix output for chrM:302]

3.gvcf does not have chrM:302:
[screenshot of tabix output for chrM:302]

Ack, this is the reason. Hail's combiner data model assumes that GVCFs use a multiallelic rather than a split biallelic representation for multiple polymorphisms at the same locus. Which caller are these from? Is there an unsplit (multiallelic) version you could use?
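If an unsplit version is not available, one possible workaround (an untested sketch on my part, so please sanity-check the merged records; bcftools may not reconstruct the caller's original INFO/FORMAT fields exactly) is to merge the split records back into multiallelic ones with bcftools norm before combining. The paths below are placeholders:

import subprocess

# "bcftools norm -m +any" joins split biallelic records at the same
# position back into a single multiallelic record.
src = 'sample.g.vcf.gz'
dst = 'sample.merged.g.vcf.gz'
subprocess.run(['bcftools', 'norm', '-m', '+any', '-Oz', '-o', dst, src], check=True)
subprocess.run(['tabix', '-p', 'vcf', dst], check=True)  # re-index the merged GVCF

That said, re-calling the samples with multiallelic output, if that is an option, would be the safer fix.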

To be honest, I do not know; I will need to get this information from the leader of my project, probably tomorrow.

And if we do not have another version of the GVCFs, is there any way around this?