local_to_global error when annotating AD from a DRAGEN joint VCF: allele index out of bounds

Hi Hail team,

Thanks as always for the great tool and support! I’m currently working with a joint VCF generated by the DRAGEN pipeline and performing genotype QC with Hail (v0.2.135). While annotating a global AD field from the local LAD field, I hit an error at .write() time: local_to_global reports out-of-bounds local allele indices.

Here’s the relevant part of my pipeline:

import hail as hl

mt = hl.read_matrix_table(dir_output + '/data/' + prefix + '.VQ1.mt')

# LA: rebuild the full local-allele index array by prepending the reference (index 0) to LAA
mt = mt.annotate_entries(
    LA = hl.if_else(
        hl.is_defined(mt.LAA),
        hl.array([0]).extend(mt.LAA),
        hl.missing(hl.tarray(hl.tint32))
    )
)

# AD: project local LAD back to global allele indexing; for hom-ref calls without LAD, fall back to [DP, 0, ...]
mt = mt.annotate_entries(
    AD = hl.if_else(
        hl.is_defined(mt.LAD) & hl.is_defined(mt.LA) & (mt.LAD.length() == mt.LA.length()),
        hl.vds.local_to_global(
            mt.LAD,
            mt.LA,
            n_alleles = mt.alleles.length(),
            fill_value = 0,
            number = 'R'
        ),
        hl.if_else(
            mt.GT.is_hom_ref() & hl.is_defined(mt.DP),
            hl.vds.local_to_global(
                hl.array([mt.DP]),
                hl.array([0]),
                n_alleles = mt.alleles.length(),
                fill_value = 0,
                number = 'R'
            ),
            hl.missing(hl.tarray(hl.tint32))
        )
    )
)

mt.write(dir_output +'/data/' + prefix +'.VQ1.GQ.mt', overwrite=True)

But this throws the following error:

Traceback (most recent call last):
  File "/data1/mycho/WGS_AD_QC/01.SMC_250520_AWS/scripts/pipe_02.genotypeQC.py", line 229, in <module>
    mt_genotypeQCed.write(dir_output +'/data/' + prefix +'.VQ1.GQ.mt', overwrite=True)
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/decorator.py", line 235, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/typecheck/check.py", line 584, in wrapper
    return __original_func(*args_, **kwargs_)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/matrixtable.py", line 2808, in write
    Env.backend().execute(ir.MatrixWrite(self._mir, writer))
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/backend/spark_backend.py", line 217, in execute
    raise err
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/backend/spark_backend.py", line 209, in execute
    return super().execute(ir, timed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/backend/backend.py", line 181, in execute
    raise e.maybe_user_error(ir) from None
hail.utils.java.HailUserError: Error summary: HailException: local_to_global: local allele of 2 out of bounds given n_total_alleles of 2
------------
------------
Hail stack trace:
  File "/data1/mycho/WGS_AD_QC/01.SMC_250520_AWS/scripts/pipe_02.genotypeQC.py", line 152, in <module>
    hl.vds.local_to_global(

  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/vds/functions.py", line 94, in local_to_global
    return _func("local_to_global_a_r", array.dtype, array, local_alleles, n_alleles, fill_value, omit_first)

  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/expr/functions.py", line 117, in _func
    ir.Apply(name, ret_type, *(a._ir for a in args), type_args=type_args), ret_type, indices, aggregations

  File "/data1/software/anaconda3/envs/hail_update/lib/python3.11/site-packages/hail/ir/ir.py", line 3195, in __init__
    self.save_error_info()

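For context, here is a minimal sketch that reproduces the same bounds check outside my pipeline (the values are made up to mirror the failing rows shown below):

import hail as hl

# Hypothetical values mirroring the failing rows: the site has 2 alleles,
# but LA contains index 3, and local_to_global requires every local allele
# index to be strictly less than n_alleles.
ht = hl.utils.range_table(1)
ht = ht.annotate(
    LAD=hl.array([0, 5]),  # local allele depths
    LA=hl.array([0, 3]),   # index 3 is out of bounds for 2 alleles
)
ht = ht.annotate(
    AD=hl.vds.local_to_global(ht.LAD, ht.LA, n_alleles=2, fill_value=0, number='R')
)
ht.show()  # fails with the same "local allele ... out of bounds" error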

To investigate, I checked for entries where a local allele index in LA meets or exceeds the number of alleles at the site, and indeed about 1.29% of entries are invalid. Here’s how I explored it:

# Flatten to an entries table so each (variant, sample) pair is one row
et = mt.entries()

# Entries where any local allele index is >= the number of alleles at the site
et_invalid = et.filter(
    hl.is_defined(et.LA) &
    hl.any(lambda x: x >= hl.len(et.alleles), et.LA)
)

et_invalid.key_by().select('locus', 'alleles', 'LA', 'LAD', 'LAA').show(10)

invalid_entries = et_invalid.count()
total_entries = et.count()
print(f"Invalid entries: {invalid_entries:,} / {total_entries:,} "
      f"({invalid_entries / total_entries * 100:.4f}%)")
Invalid entries (LA out of bounds):
+---------------+-------------+--------------+--------------+--------------+
| locus         | alleles     | LA           | LAD          | LAA          |
+---------------+-------------+--------------+--------------+--------------+
| locus<GRCh38> | array<str>  | array<int32> | array<int32> | array<int32> |
+---------------+-------------+--------------+--------------+--------------+
| chr1:10815    | ["T","TAA"] | [0,3]        | [0,5]        | [3]          |
| chr1:10815    | ["T","TAA"] | [0,3]        | [0,5]        | [3]          |
| chr1:10815    | ["T","TAA"] | [0,3]        | [0,5]        | [3]          |
| chr1:10815    | ["T","TAA"] | [0,3]        | [0,11]       | [3]          |
| chr1:10815    | ["T","TAA"] | [0,3]        | [0,4]        | [3]          |
| chr1:10815    | ["T","TAA"] | [0,3]        | [0,7]        | [3]          |
| chr1:10815    | ["T","TAA"] | [0,3]        | [0,9]        | [3]          |
| chr1:10815    | ["T","TAA"] | [0,3]        | [0,5]        | [3]          |
| chr1:10815    | ["T","TAA"] | [0,3]        | [0,4]        | [3]          |
| chr1:10815    | ["T","TAA"] | [0,3]        | [1,7]        | [3]          |
+---------------+-------------+--------------+--------------+--------------+
showing top 10 rows

Invalid entries: 1,526,641,810 / 118,375,165,780 (1.2897%)

Since this seems to be due to malformed or mismatched allele indices in LAA or LA, I filtered these entries out using:

# Keep only entries whose LA indices are all within [0, len(alleles))
mt = mt.filter_entries(
    ~(hl.is_defined(mt.LA) & hl.any(lambda x: x >= hl.len(mt.alleles), mt.LA))
)

This resolves the issue and allows .write() to proceed.
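Filtering does drop the whole entry (GT, DP, and the rest), though. An alternative I sketched but haven’t fully tested, which would have to run before the AD annotation above, is to blank only the local fields on the bad entries:

# Null out only the local fields where LA is out of bounds, preserving GT/DP
is_invalid = hl.is_defined(mt.LA) & hl.any(lambda x: x >= hl.len(mt.alleles), mt.LA)
mt = mt.annotate_entries(
    LA=hl.if_else(is_invalid, hl.missing(mt.LA.dtype), mt.LA),
    LAA=hl.if_else(is_invalid, hl.missing(mt.LAA.dtype), mt.LAA),
    LAD=hl.if_else(is_invalid, hl.missing(mt.LAD.dtype), mt.LAD),
)

With the local fields missing, the hl.if_else guards in the AD annotation would fall through to the hom-ref/missing branch instead of failing.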

Questions:

  1. What is the best practice for handling out-of-bounds local allele indices from LAA?
    – Is filtering those entries the right approach, or is there a recommended way to validate/fix them before annotation?

  2. Is this a known issue when importing DRAGEN joint-called VCFs into Hail?
    – Should I be preprocessing or sanitizing the VCF before loading?

  3. Could this be caused by inconsistent allele indexing between DRAGEN and Hail?
    – If so, is there a way to map or reindex LAA to match Hail’s expectations?

Any advice or guidance would be much appreciated! I’m happy to share additional metadata or specific rows if needed.

Thanks in advance!

Hail version: 0.2.135

Spark version: 3.5.6

VCF source: DRAGEN pipeline (joint-called)

This looks like a data error in the VCF produced by DRAGEN.

  1. Given that we can’t know why the values are out of bounds, filtering them is probably the right call, along with reporting the data error to the upstream producer of the VCF.
  2. We haven’t seen this issue before. We’ll keep it in mind if we start seeing a lot more of it.
  3. I don’t think so, though I’m not familiar with how DRAGEN handles this data. The maximum value in LAA should always be strictly less than the length of the alleles array, so these records just look wrong to me. (A quick one-pass check of that invariant is sketched below.)
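If it helps, here is a rough one-pass check of that invariant, using the field names from your pipeline (untested against your data):

# Count entries violating max(LAA) < len(alleles) without materializing
# the full entries table.
n_bad = mt.aggregate_entries(
    hl.agg.count_where(
        hl.is_defined(mt.LAA) & (hl.max(mt.LAA) >= hl.len(mt.alleles))
    )
)
print(f'entries with out-of-bounds LAA: {n_bad:,}')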

Thanks so much for the clarification!

That makes sense — I’ll proceed with filtering the problematic records and follow up with the data provider about the VCF issue. Really appreciate your insight and confirmation that this isn’t a known behavior.

Thanks again for your help!