Index out of bounds: how can I see which row causes it?

I have two datasets:

  1. Main dataset which I am trying to annotate - grch38_test.vcf
  2. NISC annotation dataset

Here is what I am doing. Loading the main dataset and splitting it:

mt = hl.import_vcf('file:///grch38_test.vcf', reference_genome='GRCh38', force_bgz=True)
mt = hl.split_multi_hts(mt.annotate_rows(locus_old=mt.locus, alleles_old=mt.alleles), permit_shuffle=True)

Loading the annotation dataset and trying to produce an annotation for each row of the main dataset:

nisc = hl.read_table('file:///NISC.ht')
res = hl.struct(**{'AC': nisc[mt.row_key].info.AC[mt.a_index-1], 
                   'AF': nisc[mt.row_key].info.AF[mt.a_index-1]})

Now, if I trigger actual computation of the struct it fails:

res.collect()

Hail version: 0.2.63-cb767a7507c8
Error summary: HailException: array index out of bounds: index=1, length=1
----------
Python traceback:
  File "<ipython-input-7-b20bb2a7b3af>", line 1, in <module>
    res = hl.struct(**{'AC': nisc[mt.row_key].info.AC[mt.a_index-1],

How can I see which row fails here and why? I probably need to identify the row in NISC that causes the issue.

Maybe something like:

mt.filter_rows(hl.len(nisc[mt.row_key].info.AC) >= mt.a_index - 1)

to get a table of only the problematic ones?


I ran into an issue, though. I subset the NISC dataset later on:

mt_problem = mt.filter_rows(hl.len(nisc[mt.row_key].info.AC) >= mt.a_index - 1)
nisc_problem = nisc[mt_problem.row_key].info.AC

When I print it with show(), I see the following:

+---------------+-----------------+--------------+
| locus         | alleles         |              |
+---------------+-----------------+--------------+
| locus         | array           | array        |
+---------------+-----------------+--------------+
| chr20:87623   | ["T","C"]       | [8]          |
| chr20:96321   | ["T","C"]       | [241]        |
| chr20:96372   | ["C","T"]       | [5]          |
| chr20:145508  | ["T","C"]       | [6]          |
| chr20:145514  | ["GCAAA","G"]   | [49]         |
| chr20:145572  | ["T","C"]       | [2]          |

When I take another, non-problematic dataset, cidr, and run the same functions on it, the output is the same (same variant positions). That is wrong, because cidr does not have any issues and runs fine.

I have also counted the allegedly problematic rows of mt: with cidr (which has none in reality) the count is 11034, while with nisc it is 4689. So the query mt.filter_rows(hl.len(nisc[mt.row_key].info.AC) >= mt.a_index - 1) does not seem to return correct results.

I verified that the number of items in AC is 1 and a_index is 1 in these cases, so these are not problematic rows.
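
One possible explanation is that the comparison in the filter points the wrong way. AC[a_index - 1] is in bounds only when a_index - 1 < len(AC), so the condition len(AC) >= a_index - 1 is also true for perfectly valid rows. Below is a minimal sketch of that logic in plain Python rather than Hail. The function names posted_filter and out_of_bounds are hypothetical, and the (length, a_index) pairs are made up for illustration, except that the second pair matches the index=1, length=1 case from the error message.

```python
def posted_filter(ac_len, a_index):
    # Condition used in the posted filter_rows call: len(AC) >= a_index - 1
    return ac_len >= a_index - 1


def out_of_bounds(ac_len, a_index):
    # AC[a_index - 1] fails exactly when the index reaches the array length
    return a_index - 1 >= ac_len


# (len(AC), a_index) pairs; illustrative values only
cases = [(1, 1), (1, 2), (2, 2)]

for ac_len, a_index in cases:
    print(
        f"len(AC)={ac_len} a_index={a_index} "
        f"posted_filter={posted_filter(ac_len, a_index)} "
        f"out_of_bounds={out_of_bounds(ac_len, a_index)}"
    )
```

In this sketch the posted condition keeps all three cases, while only the len(AC)=1, a_index=2 case is actually out of bounds. If that reasoning carries over, the Hail filter would presumably be mt.filter_rows(hl.len(nisc[mt.row_key].info.AC) < mt.a_index), but I have not run that against these datasets.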