Error in Filtering Hail MT intervals

Hi there,

I’m looking to filter my Hail MT down to the intervals for my genes of interest. I have about 3000 intervals, which have been processed and cleaned according to the requirements. However, I keep getting the following error:

ConnectionError: ('Connection aborted.', LineTooLong('got more than 1048576 bytes when reading header line'))

Is there an inherent limit to the number of intervals I can filter on? I’m currently running on 8 CPUs and 52 GB of RAM on the All of Us Researcher Workbench, so I don’t think my environment is the issue.

P.S. I have also reached out to the AoU team, but I wanted to see if anybody from the Hail team could offer some help too :smile:

What is the code that you are currently running?

Sure, below is the relevant code and a bit of the downstream steps I have. I have successfully run this workflow for 3 genes before, but I am struggling with larger numbers.

# intervals has length 3472, to be exact. Each interval corresponds to a particular gene of interest.
intervals = ['chr7:113116717-113118553', 'chr7:113876776-113919008', 'chr16:53703962-54121940', …]

mt = hl.filter_intervals(mt, [hl.parse_locus_interval(x) for x in intervals[:1000]], keep=True)

→ I plan to filter the columns based on a list of samples I already have
mt = mt.semi_join_cols(samples)

→ Next I encode each genotype as its number of alternate alleles
mt_encoded = mt.annotate_entries(n_alt_alleles = mt.GT.n_alt_alleles())

→ Next I flatten the table such that I keep only the number of alternate alleles, variants in rows and my samples in columns
entries_table = mt_encoded.entries()

entries_table = entries_table.key_by()

→ Annotate a unique variant identifier (combining locus and alleles)
entries_table = entries_table.annotate(variant=hl.str(entries_table.locus) + "_" + hl.delimit(entries_table.alleles, ","))
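
To make the identifier format concrete, here is a plain-Python mirror of the expression above. It is only an illustration: `variant_id` is a hypothetical helper, and it assumes a Hail locus is rendered as `contig:position` when converted to a string.

```python
def variant_id(contig, position, alleles):
    """Plain-Python mirror of: hl.str(locus) + "_" + hl.delimit(alleles, ",")."""
    # Assumption: a Hail locus stringifies as "contig:position".
    locus_str = f"{contig}:{position}"
    return locus_str + "_" + ",".join(alleles)


example_id = variant_id("chr7", 113116717, ["A", "T"])
print(example_id)
```

Note that the identifier itself contains a comma, so it would clash with a comma-delimited output file unless the alleles are joined with a different separator.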

→ Select the relevant fields (sample ID, variant, and n_alt_alleles)
entries_table = entries_table.select(entries_table.s, entries_table.variant, entries_table.n_alt_alleles)

→ exporting into workspace bucket
entries_table.export(f'{bucket}/data/n_alt_alleles.csv')

Note: this last step of exporting to my bucket is also taking a very long time (I have been waiting for 1.5 hours now). I am wondering if you have any thoughts on that.
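
For the interval filter above, one thing that might be worth trying is to avoid passing thousands of interval literals to `hl.filter_intervals` and instead write them to a file, import that file as an interval-keyed table, and keep only the rows whose locus falls in it. This is a rough sketch and not a confirmed fix for the `LineTooLong` error. The helper names are hypothetical, `GRCh38` is an assumption based on the `chr` prefixes, and the path has to be somewhere Hail can read (for example the workspace bucket).

```python
import os
import tempfile


def write_interval_file(intervals, path):
    """Write one 'contig:start-end' interval per line; return how many were written."""
    with open(path, "w") as handle:
        for interval in intervals:
            handle.write(interval.strip() + "\n")
    return len(intervals)


def filter_mt_by_interval_file(mt, path, reference_genome="GRCh38"):
    """Keep rows of `mt` whose locus lies in any interval listed in `path`.

    Assumption: the dataset uses GRCh38; change `reference_genome` if not.
    """
    import hail as hl  # imported lazily so the helper above runs without Hail

    interval_table = hl.import_locus_intervals(path, reference_genome=reference_genome)
    # Indexing an interval-keyed table by locus is defined only for loci
    # that fall inside one of the intervals.
    return mt.filter_rows(hl.is_defined(interval_table[mt.locus]))


# Local demonstration of the file format only (no Hail needed).
example = ["chr7:113116717-113118553", "chr7:113876776-113919008"]
demo_path = os.path.join(tempfile.mkdtemp(), "intervals.txt")
n_written = write_interval_file(example, demo_path)
```

In a notebook this would be used as `mt = filter_mt_by_interval_file(mt, path)` in place of the `hl.filter_intervals` call. Endpoint handling may differ slightly between `hl.parse_locus_interval` and `hl.import_locus_intervals` (whether the end position is included), so it is worth checking the documentation if exact boundaries matter.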