mt.key_cols_by().cols().flatten().to_pandas() is too slow

Hello,
Hope all is well.
I’m new to Hail and to Google Cloud, and I’m currently doing some analysis on the All of Us dataset, which is hosted there. I’m doing a simple polygenic risk score (PRS) calculation and need to export the results to a pandas DataFrame, but the .to_pandas() step is very slow even when I allocate the maximum resources available, and doing so is expensive.

I tried mt.key_cols_by().cols().flatten().to_pandas(), but it didn’t make any difference. The resulting table has only 3 columns and 250,000 rows and looks like this:
s    sum_weights    N_variants
0    0.7703         11
1    0.7794         10
2    0.8266         11
3    0.7202         11
4    0.7843         11

I also tried mt.collect(), but it is also very slow.
I need to extract the data to use in an ML model and was wondering if anyone has experienced the same thing. Any advice is much appreciated!
Thank you.

How did you define mt?

Hail is “lazy”: everything you wrote to define mt gets run when you call to_pandas. It looks like you’re probably performing an aggregation to compute N_variants and sum_weights, and that aggregation has to look at all the rows of the MT. If there are a lot of rows (or if this is the result of densifying a VDS), there is a staggering amount of data to process.
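
One thing that can help, if you end up re-using the aggregated column table, is materializing it once with checkpoint before converting to pandas, so the whole pipeline isn’t re-run each time you touch it. A minimal sketch, assuming you have a writable bucket (the gs://my-bucket/... path is just a placeholder):

# Pull out the aggregated per-sample table and drop the column key.
cols_ht = mt.key_cols_by().cols()
# Write it once and read it back; to_pandas then only converts the small
# cached table instead of re-running the full aggregation over the MT.
cols_ht = cols_ht.checkpoint('gs://my-bucket/prs_cols.ht', overwrite=True)  # placeholder path
df = cols_ht.flatten().to_pandas()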

Thank you for the prompt response!
Yes, that makes sense. I filter to keep only the variants needed for the PRS, then aggregate to get both N_variants and sum_weights.

# Per-entry dosage of the effect allele, weighted by the PRS effect size.
mt_prs = mt_prs.annotate_entries(
    effect_allele_count=calculate_effect_allele_count(mt_prs),
    weighted_count=calculate_effect_allele_count(mt_prs) * mt_prs.prs_info['weight'])

# Per-sample PRS: sum of weighted counts and number of contributing variants.
mt_prs = mt_prs.annotate_cols(
    sum_weights=hl.agg.sum(mt_prs.weighted_count),
    N_variants=hl.agg.count_where(hl.is_defined(mt_prs.weighted_count)))

So I guess I’ll need more resources then.

This operation is particularly expensive because you’re streaming through all the sequencing data only to read, I assume, the genotypes. You should identify the superset of variants you care about and extract them. You can either store them as a Hail MatrixTable with just one entry field, GT (a call) or even n_alt_alleles (an int32), or store them in some other format you prefer. PLINK is probably fine too. The important thing is to get rid of the PL, AD, etc. You’re probably bottlenecked on network bandwidth right now.
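
As a rough sketch of that extraction (assuming prs_variants is a Table keyed by locus and alleles containing your PRS variants, and the output path is a placeholder):

# Keep only the PRS variants and a single minimal entry field.
mt_slim = mt.semi_join_rows(prs_variants)
mt_slim = mt_slim.select_entries(n_alt_alleles=mt_slim.GT.n_alt_alleles())  # drops PL, AD, etc.
mt_slim = mt_slim.select_rows().select_cols()  # drop unused row/column annotations
mt_slim.write('gs://my-bucket/prs_subset.mt', overwrite=True)  # placeholder path

Subsequent PRS runs can then read prs_subset.mt, which is far smaller and much faster to scan than the full dataset.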

And, FWIW, we’re working on a columnar storage system so that Hail can avoid reading irrelevant fields, but that’s at least a year away.

Great! Thank you so much for your help!!