Sure, below is the relevant code and a bit of the downstream steps I have. I have successfully run this workflow for 3 genes before, but I'm struggling with larger numbers.
#intervals is of length 3472 to be exact. Each interval corresponds to a particular gene of interest.
intervals = ['chr7:113116717-113118553', 'chr7:113876776-113919008', 'chr16:53703962-54121940', ...]
mt = hl.filter_intervals(mt, [hl.parse_locus_interval(x) for x in intervals[:1000]], keep=True)
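One thing worth double-checking at this step: hl.parse_locus_interval defaults to GRCh37, while the chr-prefixed contigs in your list look like GRCh38. If your current call already runs without contig errors you're fine, but a minimal sketch with the reference genome made explicit (keeping your [:1000] slice; drop it to use all 3,472 intervals):

# Assumption: the data is GRCh38; parse_locus_interval defaults to GRCh37,
# so passing reference_genome explicitly avoids a contig-name mismatch.
parsed = [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in intervals[:1000]]
mt = hl.filter_intervals(mt, parsed, keep=True)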
→ I plan to filter the columns based on a list of samples I already have
mt = mt.semi_join_cols(samples)
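For context, semi_join_cols expects a Hail Table keyed the same way as the MatrixTable's column key (the sample ID field s). A sketch of building that table, assuming a hypothetical one-column TSV of sample IDs with an s header:

# Hypothetical path and header; key by 's' so it matches the column key
samples = hl.import_table(f'{bucket}/data/samples.tsv', key='s')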
→ Next I encode each genotype as its alternate-allele count
mt_encoded = mt.annotate_entries(n_alt_alleles = mt.GT.n_alt_alleles())
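One subtlety worth deciding explicitly here: when GT is missing for an entry, n_alt_alleles will be missing too, and those values come out as NA in the export. If you'd rather encode missing genotypes as 0, a sketch (this assumes missingness means homozygous reference, which may or may not be right for your data):

# Assumption: treat missing GT as 0 alt alleles; keep the default if NA is preferable
mt_encoded = mt.annotate_entries(n_alt_alleles=hl.or_else(mt.GT.n_alt_alleles(), 0))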
→ Next I flatten the MatrixTable into a long-format entries table, keeping one row per sample-variant pair with its number of alternate alleles
entries_table = mt_encoded.entries()
entries_table = entries_table.key_by()
→ Annotate a unique variant identifier (combining locus and alleles)
entries_table = entries_table.annotate(variant=hl.str(entries_table.locus) + "_" + hl.delimit(entries_table.alleles, ","))
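If you'd prefer Hail's canonical colon-delimited variant string (contig:pos:ref:alt) over the underscore form, there is a built-in that should be equivalent for this purpose; a sketch:

# Alternative: built-in colon-delimited variant identifier
entries_table = entries_table.annotate(variant=hl.variant_str(entries_table.locus, entries_table.alleles))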
→ Select the relevant fields (sample ID, variant, and n_alt_alleles)
entries_table = entries_table.select(entries_table.s, entries_table.variant, entries_table.n_alt_alleles)
→ Exporting to the workspace bucket
entries_table.export(f'{bucket}/data/n_alt_alleles.csv')
Note: this last step of exporting to my bucket is also taking a very long time (I've been waiting for 1.5 hours now); wondering if you have any thoughts on that.
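On the slow export: Hail is lazy, so the interval filter, semi-join, and entry annotation all actually run only when export is called, and writing a single output file requires concatenating every partition at the end. Also note that the entries table has one row per sample-variant pair, so across thousands of gene intervals the output can be enormous, which alone may explain the runtime. Two things that often help, sketched below with hypothetical paths (and note export writes tab-delimited by default regardless of the .csv extension, so pass delimiter=',' for a real CSV):

# Materialize the pipeline once; later actions read from this checkpoint
entries_table = entries_table.checkpoint(f'{bucket}/data/n_alt_alleles.ht', overwrite=True)
# Parallel export writes a directory of shards (each with its own header)
# instead of concatenating everything into one file
entries_table.export(f'{bucket}/data/n_alt_alleles.csv', delimiter=',', parallel='header_per_shard')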