Index out of bounds, how to see which row is having it?

NLSVTN · April 13, 2021, 6:32pm

I have two datasets:

Main dataset which I am trying to annotate - grch38_test.vcf
NISC annotation dataset

Here is what I am doing. Loading the main dataset and splitting it:

import hail as hl

mt = hl.import_vcf('file:///grch38_test.vcf', reference_genome='GRCh38', force_bgz=True)
mt = hl.split_multi_hts(mt.annotate_rows(locus_old=mt.locus, alleles_old=mt.alleles), permit_shuffle=True)

Loading the annotation dataset and trying to produce an annotation for each row of the main dataset:

nisc = hl.read_table('file:///NISC.ht')
res = hl.struct(**{'AC': nisc[mt.row_key].info.AC[mt.a_index-1], 
                   'AF': nisc[mt.row_key].info.AF[mt.a_index-1]})

Now, if I trigger actual computation of the struct it fails:

res.collect()

Hail version: 0.2.63-cb767a7507c8
Error summary: HailException: array index out of bounds: index=1, length=1
----------
Python traceback:
  File "<ipython-input-7-b20bb2a7b3af>", line 1, in <module>
    res = hl.struct(**{'AC': nisc[mt.row_key].info.AC[mt.a_index-1],

How can I see which row fails here, and why? I probably need to identify the row in NISC that causes the issue.
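
A note on reading the error: index=1, length=1 suggests that some row had a_index equal to 2 (a_index is 1-based after split_multi_hts) while the matching NISC AC array held only one element. The plain-Python sketch below reproduces that condition without Hail. The function name lookup_ac and the sample values are illustrative assumptions, not part of the original pipeline.

```python
def lookup_ac(ac, a_index):
    """Mimic nisc[...].info.AC[a_index - 1] with an explicit bounds check."""
    index = a_index - 1  # a_index is 1-based after split_multi_hts
    if not 0 <= index < len(ac):
        raise IndexError(
            f"array index out of bounds: index={index}, length={len(ac)}"
        )
    return ac[index]


print(lookup_ac([8], 1))  # first alternate allele, in bounds

try:
    lookup_ac([8], 2)  # second alternate allele, but AC has one entry
except IndexError as err:
    print(err)  # same wording as the Hail error summary
```

So the rows worth hunting for are those where a_index exceeds the length of the AC array found in the annotation table.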

johnc1231 · April 13, 2021, 6:37pm

Maybe something like:

mt.filter_rows(hl.len(nisc[mt.row_key].info.AC) >= mt.a_index - 1)

to get a table of only the problematic ones?

NLSVTN · April 21, 2021, 9:16pm

I ran into an issue, though. I then subset the NISC dataset:

mt_problem = mt.filter_rows(hl.len(nisc[mt.row_key].info.AC) >= mt.a_index - 1)
nisc_problem = nisc[mt_problem.row_key].info.AC

This is what I see when I print it with show():

+---------------+---------------+--------------+
| locus         | alleles       | <expr>       |
+---------------+---------------+--------------+
| locus<GRCh38> | array<str>    | array<int32> |
+---------------+---------------+--------------+
| chr20:87623   | ["T","C"]     | [8]          |
| chr20:96321   | ["T","C"]     | [241]        |
| chr20:96372   | ["C","T"]     | [5]          |
| chr20:145508  | ["T","C"]     | [6]          |
| chr20:145514  | ["GCAAA","G"] | [49]         |
| chr20:145572  | ["T","C"]     | [2]          |

When I take another, non-problematic dataset, cidr, and run the same functions on it, the output is the same (same variant positions). That is wrong, because cidr has no such issue; it runs fine.

I also counted the allegedly problematic rows of mt. With cidr (which in reality has none) the count is 11034, while with nisc it is 4689. So the query mt.filter_rows(hl.len(nisc[mt.row_key].info.AC) >= mt.a_index - 1) does not seem to return correct results.

I verified that AC has exactly one item and a_index is 1 in these cases, so these are not problematic rows.
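
One possible explanation, offered as a sketch rather than a confirmed diagnosis: the condition hl.len(AC) >= mt.a_index - 1 is true whenever the lookup is in bounds (for example, length 1 and a_index 1), so filter_rows keeps the healthy rows that have a match in the annotation table. That would also account for the counts differing between cidr and nisc. The failing rows would instead satisfy hl.len(AC) < mt.a_index. The helper names below (kept_by_suggested_filter, is_out_of_bounds, out_of_bounds_rows) are hypothetical, and the Hail function assumes nisc is keyed by locus and alleles, as the nisc[mt.row_key] lookups above imply.

```python
def kept_by_suggested_filter(ac_len, a_index):
    # Condition used in the suggested filter_rows call.
    return ac_len >= a_index - 1


def is_out_of_bounds(ac_len, a_index):
    # AC[a_index - 1] fails when a_index - 1 >= ac_len (a_index starts at 1).
    return a_index - 1 >= ac_len


def out_of_bounds_rows(mt, nisc):
    """Hail counterpart of is_out_of_bounds; needs Hail installed to run."""
    import hail as hl  # imported here so the checks above run without Hail

    ac = nisc[mt.row_key].info.AC
    # Rows absent from nisc give a missing AC, so require a defined value.
    return mt.filter_rows(hl.is_defined(ac) & (hl.len(ac) < mt.a_index))


# (AC length, a_index) pairs: the first matches the rows shown above,
# the second matches the reported error (index=1, length=1).
for ac_len, a_index in [(1, 1), (1, 2)]:
    print(ac_len, a_index,
          kept_by_suggested_filter(ac_len, a_index),
          is_out_of_bounds(ac_len, a_index))
```

Under these assumptions, out_of_bounds_rows(mt, nisc).rows().show() should list the offending loci, and the locus_old and alleles_old fields on those rows point back to the unsplit variants.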
