Hi Hail team,
Thanks as always for the great tool and support! I’m currently working with a joint VCF generated from the DRAGEN pipeline and performing genotype QC using Hail (v0.2.135). When trying to annotate the AD field from LAD, I encountered an error during .write() due to local_to_global reporting out-of-bound allele indices.
Here’s the relevant part of my pipeline:
mt = hl.read_matrix_table(dir_output + '/data/' + prefix + '.VQ1.mt')

# LA: prepend the reference allele index (0) to LAA.
mt = mt.annotate_entries(
    LA=hl.if_else(
        hl.is_defined(mt.LAA),
        hl.array([0]).extend(mt.LAA),
        hl.missing(hl.tarray(hl.tint32)),
    )
)

# AD: expand LAD to the global allele list; for hom-ref calls without a
# usable LAD/LA pair, fall back to putting DP on the reference allele.
mt = mt.annotate_entries(
    AD=hl.if_else(
        hl.is_defined(mt.LAD) & hl.is_defined(mt.LA) & (mt.LAD.length() == mt.LA.length()),
        hl.vds.local_to_global(
            mt.LAD,
            mt.LA,
            n_alleles=mt.alleles.length(),
            fill_value=0,
            number='R',
        ),
        hl.if_else(
            mt.GT.is_hom_ref() & hl.is_defined(mt.DP),
            hl.vds.local_to_global(
                hl.array([mt.DP]),
                hl.array([0]),
                n_alleles=mt.alleles.length(),
                fill_value=0,
                number='R',
            ),
            hl.missing(hl.tarray(hl.tint32)),
        ),
    )
)

mt.write(dir_output + '/data/' + prefix + '.VQ1.GQ.mt', overwrite=True)
But this throws the following error:
Traceback (most recent call last):
  File "/data1/mycho/WGS_AD_QC/01.SMC_250520_AWS/scripts/pipe_02.genotypeQC.py", line 229, in <module>
    mt_genotypeQCed.write(dir_output + '/data/' + prefix + '.VQ1.GQ.mt', overwrite=True)
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/matrixtable.py", line 2808, in write
    Env.backend().execute(ir.MatrixWrite(self._mir, writer))
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/backend/spark_backend.py", line 217, in execute
    raise err
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/backend/spark_backend.py", line 209, in execute
    return super().execute(ir, timed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/backend/backend.py", line 181, in execute
    raise e.maybe_user_error(ir) from None
hail.utils.java.HailUserError: Error summary: HailException: local_to_global: local allele of 2 out of bounds given n_total_alleles of 2
------------
------------
Hail stack trace:
  File "/data1/mycho/WGS_AD_QC/01.SMC_250520_AWS/scripts/pipe_02.genotypeQC.py", line 152, in <module>
    hl.vds.local_to_global(
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/vds/functions.py", line 94, in local_to_global
    return _func("local_to_global_a_r", array.dtype, array, local_alleles, n_alleles, fill_value, omit_first)
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/expr/functions.py", line 117, in _func
    ir.Apply(name, ret_type, *(a._ir for a in args), type_args=type_args), ret_type, indices, aggregations
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/ir/ir.py", line 3195, in __init__
    self.save_error_info()
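For what it's worth, the same class of failure seems reproducible in isolation with literal arrays (a minimal sketch, assuming I understand local_to_global's semantics correctly, i.e. that it rejects any local index >= n_alleles):

import hail as hl

# With consistent indices, a 2-element LAD over 2 alleles round-trips cleanly.
print(hl.eval(hl.vds.local_to_global(
    hl.array([0, 5]), hl.array([0, 1]), n_alleles=2, fill_value=0, number='R'
)))  # [0, 5]

# A local index of 3 against n_alleles=2, as in my data below, raises an
# out-of-bounds HailException at evaluation time.
hl.eval(hl.vds.local_to_global(
    hl.array([0, 5]), hl.array([0, 3]), n_alleles=2, fill_value=0, number='R'
))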
To investigate, I checked for entries where an allele index in LA is out of bounds for the alleles array (i.e., x >= len(alleles)), and indeed about 1.29% of entries are invalid. Here’s how I explored it:
et = mt.entries()

# Entries whose LA contains an index outside this variant's allele list.
et_invalid = et.filter(
    hl.is_defined(et.LA)
    & hl.any(lambda x: x >= hl.len(et.alleles), et.LA)
)

et_invalid.key_by().select('locus', 'alleles', 'LA', 'LAD', 'LAA').show(10)

invalid_entries = et_invalid.count()
total_entries = et.count()
print(f"Invalid entries: {invalid_entries:,} / {total_entries:,} "
      f"({invalid_entries / total_entries * 100:.4f}%)")
Invalid entries (LA out of bounds):
+---------------+-------------+--------------+--------------+--------------+
| locus | alleles | LA | LAD | LAA |
+---------------+-------------+--------------+--------------+--------------+
| locus<GRCh38> | array<str> | array<int32> | array<int32> | array<int32> |
+---------------+-------------+--------------+--------------+--------------+
| chr1:10815 | ["T","TAA"] | [0,3] | [0,5] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,5] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,5] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,11] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,4] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,7] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,9] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,5] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,4] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [1,7] | [3] |
+---------------+-------------+--------------+--------------+--------------+
showing top 10 rows
Invalid entries: 1,526,641,810 / 118,375,165,780 (1.2897%)
For reference, my script’s step logs at this point report the dataset dimensions:
LA : After filter, 2337 samples and 57093957 variants remain.
LA_filtering : After filter, 2337 samples and 57093957 variants remain.
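Every row in the preview above is the same locus (chr1:10815), so I also want to know whether the invalid entries cluster at a handful of sites. A per-variant count along these lines should answer that (a sketch I haven’t run genome-wide yet; mt_check and bad_sites are just illustrative names):

# Count, per variant, the entries whose LA contains an out-of-bound index.
mt_check = mt.annotate_rows(
    n_invalid=hl.agg.count_where(
        hl.is_defined(mt.LA) & mt.LA.any(lambda x: x >= hl.len(mt.alleles))
    )
)
rows = mt_check.rows()
bad_sites = rows.filter(rows.n_invalid > 0)
bad_sites.select('n_invalid').show(20)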
Since this seems to be due to malformed or mismatched allele indices in LAA or LA, I filtered these entries out using:
mt = mt.filter_entries(
    ~(hl.is_defined(mt.LA) & hl.any(lambda x: x >= hl.len(mt.alleles), mt.LA))
)
This resolves the issue and allows .write() to proceed.
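As an alternative to dropping the entries outright (which also discards GT, DP, and the other entry fields for those genotypes), I considered blanking only the local fields before the AD annotation, so the hom-ref DP fallback above can still populate AD. A sketch of that idea, assuming only LA/LAA/LAD are affected:

# Null out just the inconsistent local fields; the rest of the entry survives,
# and the hom-ref branch of the AD annotation can still fill AD from DP.
is_invalid = hl.is_defined(mt.LA) & mt.LA.any(lambda x: x >= hl.len(mt.alleles))
mt = mt.annotate_entries(
    LA=hl.if_else(is_invalid, hl.missing(mt.LA.dtype), mt.LA),
    LAA=hl.if_else(is_invalid, hl.missing(mt.LAA.dtype), mt.LAA),
    LAD=hl.if_else(is_invalid, hl.missing(mt.LAD.dtype), mt.LAD),
)

I’m not sure which of the two directions is preferred, which leads to my questions.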
Questions:
- What is the best practice for handling out-of-bound local allele indices from LAA?
  – Is filtering those entries the right approach, or is there a recommended way to validate/fix them before annotation?
- Is this a known issue when importing DRAGEN joint-called VCFs into Hail?
  – Should I be preprocessing or sanitizing the VCF before loading?
- Could this be caused by inconsistent allele indexing between DRAGEN and Hail?
  – If so, is there a way to map or reindex LAA to match Hail’s expectations?
Any advice or guidance would be much appreciated! I’m happy to share additional metadata or specific rows if needed.
Thanks in advance!
—
Hail version: 0.2.135
Spark version: 3.5.6
VCF source: DRAGEN pipeline (joint-called)