Hi Hail team,
Thanks as always for the great tool and support! I’m currently working with a joint VCF generated by the DRAGEN pipeline and performing genotype QC in Hail (v0.2.135). While annotating a global AD entry field from the local LAD field, I hit an error during .write(): local_to_global reports an out-of-bounds allele index.
Here’s the relevant part of my pipeline:
import hail as hl

mt = hl.read_matrix_table(dir_output + '/data/' + prefix + '.VQ1.mt')

# LA: build the full local-to-global allele map by prepending the
# reference allele (index 0) to LAA, which lists only local ALT indices.
mt = mt.annotate_entries(
    LA=hl.if_else(
        hl.is_defined(mt.LAA),
        hl.array([0]).extend(mt.LAA),
        hl.missing(hl.tarray(hl.tint32)),
    )
)

# AD: expand the local LAD depths to global AD; for hom-ref calls without
# a usable LAD, fall back to putting all of DP on the reference allele.
mt = mt.annotate_entries(
    AD=hl.if_else(
        hl.is_defined(mt.LAD) & hl.is_defined(mt.LA) & (mt.LAD.length() == mt.LA.length()),
        hl.vds.local_to_global(
            mt.LAD,
            mt.LA,
            n_alleles=mt.alleles.length(),
            fill_value=0,
            number='R',
        ),
        hl.if_else(
            mt.GT.is_hom_ref() & hl.is_defined(mt.DP),
            hl.vds.local_to_global(
                hl.array([mt.DP]),
                hl.array([0]),
                n_alleles=mt.alleles.length(),
                fill_value=0,
                number='R',
            ),
            hl.missing(hl.tarray(hl.tint32)),
        ),
    )
)

mt.write(dir_output + '/data/' + prefix + '.VQ1.GQ.mt', overwrite=True)
But this throws the following error:
Traceback (most recent call last):
File "/data1/mycho/WGS_AD_QC/01.SMC_250520_AWS/scripts/pipe_02.genotypeQC.py", line 229, in <module>
mt_genotypeQCed.write(dir_output +'/data/' + prefix +'.VQ1.GQ.mt', overwrite=True)
File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/decorator.py", line 235, in fun
return caller(func, *(extras + args), **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/typecheck/check.py", line 584, in wrapper
return __original_func(*args_, **kwargs_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/matrixtable.py", line 2808, in write
Env.backend().execute(ir.MatrixWrite(self._mir, writer))
File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/backend/spark_backend.py", line 217, in execute
raise err
File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/backend/spark_backend.py", line 209, in execute
return super().execute(ir, timed)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/backend/backend.py", line 181, in execute
raise e.maybe_user_error(ir) from None
hail.utils.java.HailUserError: Error summary: HailException: local_to_global: local allele of 2 out of bounds given n_total_alleles of 2
------------
------------
Hail stack trace:
File "/data1/mycho/WGS_AD_QC/01.SMC_250520_AWS/scripts/pipe_02.genotypeQC.py", line 152, in <module>
hl.vds.local_to_global(
File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/vds/functions.py", line 94, in local_to_global
return _func("local_to_global_a_r", array.dtype, array, local_alleles, n_alleles, fill_value, omit_first)
File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/expr/functions.py", line 117, in _func
ir.Apply(name, ret_type, *(a._ir for a in args), type_args=type_args), ret_type, indices, aggregations
File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/ir/ir.py", line 3195, in __init__
self.save_error_info()
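If I understand the semantics correctly, local_to_global simply scatters each local value to the global allele index given in the corresponding LA slot, so any LA entry >= n_alleles has nowhere to go. A minimal toy example (my own, not from the pipeline) illustrates this:

import hail as hl

# Toy illustration (my own example, not from the failing pipeline):
# LAD=[0, 5] with LA=[0, 1] at a biallelic site expands cleanly to AD=[0, 5];
# an LA of [0, 3] would ask for global allele index 3, which cannot exist
# when n_alleles is 2, reproducing the out-of-bounds error above.
ht = hl.utils.range_table(1)
ht = ht.annotate(
    ad=hl.vds.local_to_global(
        hl.literal([0, 5]),  # local depths (LAD)
        hl.literal([0, 1]),  # local-to-global allele map (LA)
        n_alleles=2,
        fill_value=0,
        number='R',
    )
)
ht.show()  # ad = [0, 5]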
To investigate, I checked for entries where an allele index in LA exceeds the number of alleles at the site, and indeed about 1.29% of entries are invalid. Here’s how I explored it:
et = mt.entries()

et_invalid = et.filter(
    hl.is_defined(et.LA)
    & hl.any(lambda x: x >= hl.len(et.alleles), et.LA)
)
et_invalid.key_by().select('locus', 'alleles', 'LA', 'LAD', 'LAA').show(10)

invalid_entries = et_invalid.count()
total_entries = et.count()
print(f"Invalid entries: {invalid_entries:,} / {total_entries:,} "
      f"({invalid_entries / total_entries * 100:.4f}%)")
Invalid entries (LA out of bounds):
+---------------+-------------+--------------+--------------+--------------+
| locus | alleles | LA | LAD | LAA |
+---------------+-------------+--------------+--------------+--------------+
| locus<GRCh38> | array<str> | array<int32> | array<int32> | array<int32> |
+---------------+-------------+--------------+--------------+--------------+
| chr1:10815 | ["T","TAA"] | [0,3] | [0,5] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,5] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,5] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,11] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,4] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,7] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,9] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,5] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [0,4] | [3] |
| chr1:10815 | ["T","TAA"] | [0,3] | [1,7] | [3] |
+---------------+-------------+--------------+--------------+--------------+
showing top 10 rows
Invalid entries: 1,526,641,810 / 118,375,165,780 (1.2897%)
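As a quick follow-up (just a sketch, I haven’t dug into it yet), grouping the invalid entries by site should show whether a handful of loci account for most of that 1.29%:

# Sketch: count out-of-bounds entries per site to see whether a few loci
# (like chr1:10815 above) dominate, or whether the problem is genome-wide.
bad_loci = et_invalid.group_by(et_invalid.locus, et_invalid.alleles).aggregate(
    n_bad=hl.agg.count()
)
bad_loci.order_by(hl.desc(bad_loci.n_bad)).show(20)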
Since this seems to be due to malformed or mismatched allele indices in LAA or LA, I filtered these entries out using:
mt = mt.filter_entries(
    ~(hl.is_defined(mt.LA) & hl.any(lambda x: x >= hl.len(mt.alleles), mt.LA))
)
This resolves the issue and allows .write() to proceed; after the filter, 2,337 samples and 57,093,957 variants remain.
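For what it’s worth, an alternative I considered (I’m not sure it’s semantically sound) is to keep the genotypes and blank out only the malformed local fields before the AD annotation, so the DP-based hom-ref fallback can still fire instead of the whole entry being dropped:

# Alternative sketch (run before the AD annotation step): null out only the
# malformed local-allele fields rather than filtering whole entries, so that
# hom-ref AD can still be rebuilt from DP by the fallback branch above.
bad = hl.is_defined(mt.LA) & mt.LA.any(lambda i: i >= hl.len(mt.alleles))
mt = mt.annotate_entries(
    LA=hl.or_missing(~bad, mt.LA),
    LAA=hl.or_missing(~bad, mt.LAA),
    LAD=hl.or_missing(~bad, mt.LAD),
)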
Questions:

1. What is the best practice for handling out-of-bound local allele indices from LAA? Is filtering those entries out the right approach, or is there a recommended way to validate/fix them before annotation?

2. Is this a known issue when importing DRAGEN joint-called VCFs into Hail? Should I be preprocessing or sanitizing the VCF before loading?

3. Could this be caused by inconsistent allele indexing between DRAGEN and Hail? If so, is there a way to map or reindex LAA to match Hail’s expectations? (A rough sketch of what I had in mind is below.)
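Concretely, the kind of per-entry reindexing I had in mind (purely hypothetical, and I’m unsure it preserves DRAGEN’s intent) would keep only the in-range local alleles together with their matching LAD depths:

# Hypothetical sanitization: drop out-of-range local allele indices and the
# corresponding LAD depths, keeping the rest of the entry intact. Missing
# LA/LAD simply propagate through as missing.
keep = hl.range(hl.len(mt.LA)).filter(lambda i: mt.LA[i] < hl.len(mt.alleles))
mt = mt.annotate_entries(
    LA=keep.map(lambda i: mt.LA[i]),
    LAD=hl.or_missing(
        mt.LAD.length() == mt.LA.length(),
        keep.map(lambda i: mt.LAD[i]),
    ),
)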
Any advice or guidance would be much appreciated! I’m happy to share additional metadata or specific rows if needed.
Thanks in advance!
—
Hail version: 0.2.135
Spark version: 3.5.6
VCF source: DRAGEN pipeline (joint-called)