PL of haploid calls causes an error in split_multi_hts() on a VCF produced by the Dragen pipeline

Hi All,

I’m processing a VCF produced by the Dragen pipeline with Hail. When I use split_multi_hts() to split multi-allelic sites, I get an index-out-of-bounds error like the one below.

HailUserError: Error summary: HailException: array index out of bounds: index=10, length=7
------------
Hail stack trace:
  File "<ipython-input-5-6c3d12705311>", line 2, in <module>
    mt = hl.split_multi_hts(mt)

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/methods/statgen.py", line 2168, in split_multi_hts
    (hl.range(0, 3).map(lambda i:

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/methods/statgen.py", line 2172, in <lambda>
    ).map(lambda j: split.PL[j]))))))

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/methods/statgen.py", line 2172, in <lambda>
    ).map(lambda j: split.PL[j]))))))

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/typed_expressions.py", line 481, in __getitem__
    return self._method("indexArray", self.dtype.element_type, item)

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py", line 695, in _method
    x = ir.Apply(name, ret_type, self._ir, *(a._ir for a in args))

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/ir/ir.py", line 2628, in __init__
    self.save_error_info()

I saw a similar case (Error index out of bounds). In that thread, the error came from the PL field. So, as the error message suggests, I filtered for calls whose PL has length 7 with the following code:

tmp.filter_entries(tmp.PL.length() == 7).entries().show()

I found that those calls all came from the sex chromosomes of male samples, so they were all haploid calls.
In my VCF, haploid calls are written as haploid genotypes. This can cause trouble in split_multi_hts() because the function assumes that all calls are diploid. I still need to look into it further, but the Dragen pipeline, unlike GATK, generally represents haploid calls this way when producing a VCF. For the same reason, impute_sex() also does not work. I think this could be a problem for anyone who uses Hail to process Dragen-produced VCFs.
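
To make the length mismatch concrete, here is a small plain-Python sketch (not Hail code; the helper names are made up for illustration). It assumes the usual VCF convention: a call with ploidy P at a site with n alleles carries C(n + P - 1, P) likelihoods, and the diploid genotype j/k (with j <= k) sits at index k(k + 1)/2 + j.

```python
def expected_pl_length(n_alleles: int, ploidy: int) -> int:
    """Number of PL entries for a call: C(n_alleles + ploidy - 1, ploidy)."""
    result = 1
    for i in range(1, ploidy + 1):
        # Build the binomial coefficient step by step; each step is an exact integer.
        result = result * (n_alleles - 1 + i) // i
    return result


def diploid_pl_index(j: int, k: int) -> int:
    """Position of the unphased diploid genotype j/k in a diploid PL array."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


# A site with 7 alleles: a haploid call has 7 PLs, a diploid call has 28.
print(expected_pl_length(7, 1))  # 7
print(expected_pl_length(7, 2))  # 28

# Index 10 in the diploid layout is genotype 0/4, which a 7-entry array cannot hold.
print(diploid_pl_index(0, 4))    # 10
```

So code that walks the diploid layout will run past the end of any haploid PL array, which matches "index=10, length=7".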

So here is my question: is there a way to change haploid genotypes to diploid? Or is there any way to handle non-diploid calls when applying Hail functions?


Hey @Leehyeji789!

Sorry you’re running into this. I’ll look into the possibility of improving Hail to handle mixed diploid-haploid datasets. In the meantime, see below.


EDIT: I chatted with some analysts here. Until we have bandwidth to fix Hail’s methods to support haploids, your best bet is to recode to diploid:

# Recode haploid calls (ploidy 1) as homozygous diploid calls; leave other calls unchanged.
mt = mt.annotate_entries(
    GT=hl.if_else(
        mt.GT.ploidy == 1,
        hl.call(mt.GT[0], mt.GT[0]),
        mt.GT,
    )
)
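
One thing that may be worth checking: the recode above only touches GT, while the stack trace points at split.PL[j], so PL may need similar treatment. The plain-Python sketch below (not Hail code) shows one possible convention for laying a haploid PL array out in diploid order. The function name and the choice to leave heterozygous entries as None are assumptions made for illustration, not something Hail or Dragen does.

```python
from typing import List, Optional


def haploid_pl_to_diploid_layout(pl: List[int]) -> List[Optional[int]]:
    """Place haploid PL[k] at the homozygous k/k slot of a diploid-ordered array.

    Heterozygous slots have no haploid counterpart, so they are left as None
    (an assumed convention for this sketch).
    """
    n = len(pl)
    out: List[Optional[int]] = [None] * (n * (n + 1) // 2)
    for k, value in enumerate(pl):
        # Genotype k/k sits at index k(k + 1)/2 + k in the diploid ordering.
        out[k * (k + 1) // 2 + k] = value
    return out


print(haploid_pl_to_diploid_layout([0, 30]))      # [0, None, 30]
print(haploid_pl_to_diploid_layout([0, 30, 40]))  # [0, None, 30, None, None, 40]
```

The same mapping would have to be written with Hail expressions to apply it to a MatrixTable entry field.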

Can you share the script you executed? That will help us understand exactly what the issues are. I think the simplest thing for you to do is to analyze the haploid Y samples separately. Concretely,

mt = hl.read_matrix_table(...)
mt = mt.filter_rows(mt.locus.contig != 'chrY') # or 'Y' depending on your reference

And then just do your analysis on diploids. Separately, you can analyze the Y chromosome:

mt = hl.read_matrix_table(...)
mt = mt.filter_rows(mt.locus.contig == 'chrY')

Thank you for the answer! I will apply the workaround you suggested.
Here is the code I executed, along with the entire log file.

dragen_test_vcf_log.txt (1.6 MB)

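import hail as hl

# Import the Dragen joint-called VCF against GRCh38.
mt = hl.import_vcf("dragen-joint-test.vcf.gz",
                   reference_genome='GRCh38',
                   force_bgz=True,
                   skip_invalid_loci=True,
                   n_partitions=144)

# Split multi-allelic sites; the error is raised when the result is written.
d = hl.split_multi_hts(mt)
d.write("dragne-wgs-test.mt", overwrite=True)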

--------------------------------------------------------------------------
HailUserError                             Traceback (most recent call last)
<ipython-input-5-6c3d12705311> in <module>
      1 mt = hl.split_multi_hts(mt)
----> 2 mt.write("dragne-wgs-test.mt", overwrite=True)

<decorator-gen-1278> in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)

~/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

~/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/matrixtable.py in write(self, output, overwrite, stage_locally, _codec_spec, _partitions, _checkpoint_file)
   2556 
   2557         writer = ir.MatrixNativeWriter(output, overwrite, stage_locally, _codec_spec, _partitions, _partitions_type, _checkpoint_file)
-> 2558         Env.backend().execute(ir.MatrixWrite(self._mir, writer))
   2559 
   2560     class _Show:

~/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
    102             return (value, timings) if timed else value
    103         except FatalError as e:
--> 104             self._handle_fatal_error_from_backend(e, ir)
    105 
    106     async def _async_execute(self, ir, timed=False):

~/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/backend/backend.py in _handle_fatal_error_from_backend(self, err, ir)
    187                              'Hail stack trace:\n'
    188                              f'{better_stack_trace}')
--> 189         raise HailUserError(message_and_trace) from None

HailUserError: Error summary: HailException: array index out of bounds: index=2, length=2
------------
Hail stack trace:
  File "<ipython-input-5-6c3d12705311>", line 2, in <module>
    mt = hl.split_multi_hts(mt)

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/methods/statgen.py", line 2168, in split_multi_hts
    (hl.range(0, 3).map(lambda i:

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/methods/statgen.py", line 2172, in <lambda>
    ).map(lambda j: split.PL[j]))))))

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/methods/statgen.py", line 2172, in <lambda>
    ).map(lambda j: split.PL[j]))))))

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/typed_expressions.py", line 481, in __getitem__
    return self._method("indexArray", self.dtype.element_type, item)

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py", line 695, in _method
    x = ir.Apply(name, ret_type, self._ir, *(a._ir for a in args))

  File "/home/sonic/bin/anaconda/envs/hail/lib/python3.7/site-packages/hail/ir/ir.py", line 2628, in __init__
    self.save_error_info()