HailException: array index out of bounds when converting Hail Table to Pandas DataFrame

Description:
We are processing VCF files stored in S3 using Hail in Python, and encountering the following error when calling .to_pandas() on a selected Hail Table:

hail.utils.java.HailException: array index out of bounds: index=3, length=3

Relevant Code Context:

  • We download the VCF from S3 to a local temp path.

  • We load it using:

mt = hl.import_vcf(temp_file_path, force_bgz=True, reference_genome=reference_clean, skip_invalid_loci=True, min_partitions=300, array_elements_required=False)

  • We split multi-allelic sites:

bi = mt.filter_rows(hl.len(mt.alleles) == 2).annotate_rows(a_index=1, was_split=False)

multi = mt.filter_rows(hl.len(mt.alleles) > 2)

split = hl.split_multi_hts(multi)

mt = split.union_rows(bi)

  • We annotate with RSID and extract the entries table:

row_key_table = mt.rows().key_by('locus', 'alleles').select('rsid')

entries_table = mt.entries()

entries_table = entries_table.annotate(ID=row_key_table[entries_table.locus, entries_table.alleles].rsid)

  • Then we select columns:

selected_table = entries_table.select(
    CHROM=entries_table.locus.contig,
    POS=entries_table.locus.position,
    ID=entries_table.ID,
    REF=entries_table.alleles[0],
    ALT=entries_table.alleles[1:],  # list to preserve multi-ALT
    QUAL=entries_table.qual,
    FILTER=hl.if_else(hl.len(entries_table.filters) == 0, hl.array(['PASS']), hl.array(entries_table.filters)),
    ORIGINAL_LOCUS=entries_table.original_locus,
    ORIGINAL_ALLELES=entries_table.original_alleles,
    **{f'FORMAT.{f}': entries_table[f] for f in format_fields_filtered if f in entries_table.row},
    **{f'INFO.{k}': entries_table.info[k] for k in entries_table.info}
)

  • The error occurs when calling:

df = selected_table.to_pandas()

Additional Notes:

  • Error happens only on certain VCFs, not all.

  • The VCFs are annotated from different tools (snpEff, VEP, BCF).

  • We have confirmed the file has valid variant rows before loading.

  • Hail version: 0.2.78

  • Full error log is attached for reference.

Questions:

  1. What could cause this array index out of bounds in to_pandas() for entries_table/selected_table?

  2. Could it be related to ALT=entries_table.alleles[1:] handling multi-allelic records or missing values after split?

  3. Any best practices to defensively handle such cases before converting to Pandas?
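
One hypothesis that may be worth ruling out for questions 1 and 2 (an assumption, not something confirmed by the Hail team): index=3, length=3 is the kind of message one would expect if a per-allele FORMAT array such as AD holds three values at a site with four alleles, and the split step then looks up the value for the third ALT. Under the VCF specification, for diploid calls, a Number=A field carries one value per ALT allele, a Number=R field one value per allele, and a Number=G field n(n+1)/2 values. The plain-Python sketch below checks those lengths for a single record; the helper names and the sample record are hypothetical and are not part of Hail.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def expected_lengths(n_alleles: int) -> Dict[str, int]:
    """Expected array lengths at a diploid site with n_alleles alleles (REF + ALTs)."""
    return {
        "A": n_alleles - 1,                      # one value per ALT allele
        "R": n_alleles,                          # one value per allele, REF included
        "G": n_alleles * (n_alleles + 1) // 2,   # one value per diploid genotype
    }


def find_length_mismatches(
    n_alleles: int,
    fields: Dict[str, Tuple[str, Optional[Sequence]]],
) -> List[Tuple[str, int, int]]:
    """Return (field, expected_length, actual_length) for every array of the wrong size.

    `fields` maps a field name to (number_code, values); missing values (None) are skipped.
    """
    expected = expected_lengths(n_alleles)
    problems = []
    for name, (code, values) in fields.items():
        if values is None:
            continue
        if len(values) != expected[code]:
            problems.append((name, expected[code], len(values)))
    return problems


# Hypothetical record: 4 alleles, but AD carries only 3 values.
record = {"AD": ("R", [10, 4, 1]), "PL": ("G", [0] * 10)}
print(find_length_mismatches(4, record))
```

If a check like this flags records in the failing VCFs but not in the working ones, that would point at malformed FORMAT arrays in the input rather than at the ALT=entries_table.alleles[1:] slice.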

Hi @tpoterba

Could you please share any insights on this issue, in case you’ve come across something similar before?

Hi @hrithikgupta88,

Sorry for the late response. Is this still a blocking issue for you? It is definitely a bug, but it would be hard for us to debug without access to a reproducing example.
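
One possible way to arrive at a small reproducing example is to bisect the input by position until only a handful of rows still trigger the error. The sketch below is plain Python and only shows the bisection logic; `narrow_failing_range` and the `fails` callback are hypothetical names. In practice `fails(start, end)` would restrict the MatrixTable to that position range (for example with `hl.filter_intervals`), run the pipeline through `to_pandas()`, and return True if the exception is raised.

```python
from typing import Callable, Tuple


def narrow_failing_range(
    start: int,
    end: int,
    fails: Callable[[int, int], bool],
    min_width: int = 1,
) -> Tuple[int, int]:
    """Shrink the half-open range [start, end) while fails() stays True on it."""
    if not fails(start, end):
        raise ValueError("the full range does not fail")
    while end - start > min_width:
        mid = (start + end) // 2
        if fails(start, mid):
            end = mid          # the left half alone reproduces the error
        elif fails(mid, end):
            start = mid        # the right half alone reproduces the error
        else:
            break              # the error needs rows from both halves; stop here
    return start, end


# Stand-in predicate for illustration: pretend one position is the culprit.
BAD_POSITION = 7341


def fake_fails(start: int, end: int) -> bool:
    return start <= BAD_POSITION < end


print(narrow_failing_range(0, 10_000, fake_fails))
```

The rows in the resulting range could then be exported to a small VCF and attached to the thread, which avoids sharing the full dataset.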