Hello,
Hope all is well.
I’m new to Hail and currently doing some analysis on the AllofUs dataset, which is hosted on Google Cloud (I’m also new to using that). I’m doing a simple polygenic risk score (PRS) calculation and need to export the data to a pandas DataFrame, but the .to_pandas() step is very slow even when allocating the maximum resources available, and it’s expensive to do so.
I tried mt.key_cols_by().cols().flatten().to_pandas(), but it didn’t make any difference. The MT has only 3 columns and 250,000 rows and looks like this:
s    sum_weights    N_variants
0    0.7703         11
1    0.7794         10
2    0.8266         11
3    0.7202         11
4    0.7843         11
I also tried mt.collect(), but that is also very slow.
I need to extract the data to use in an ML model and was wondering if anyone has experienced the same thing. Any advice is much appreciated!
Thank you.
How did you define mt?
Hail is “lazy”: everything you wrote to define mt gets run when you call to_pandas. It looks like you’re probably performing an aggregation to compute N_variants and sum_weights, and that aggregation has to look at all the rows of the MT. If there are a lot of rows (or if this is the result of densifying a VDS), there is a staggering amount of data to process.
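To illustrate the laziness, here’s a sketch (the paths and field names are made up, not from your pipeline): every line just extends a query plan, and nothing touches the data until the final export.

```python
import hail as hl

mt = hl.read_matrix_table('gs://my-bucket/data.mt')  # nothing is read yet
mt = mt.filter_rows(mt.locus.contig == 'chr1')       # still just building a plan
cols = mt.key_cols_by().cols()                       # still lazy

# Only here does Hail execute the entire pipeline above:
df = cols.to_pandas()
```

So the cost you see at to_pandas() is really the cost of the whole upstream computation, not of the export itself.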
Thank you for the prompt response!
Yes, that makes sense. I filter to keep only the variants needed for the PRS, then aggregate to get both N_variants and sum_weights.
mt_prs = mt_prs.annotate_entries(
    effect_allele_count=calculate_effect_allele_count(mt_prs),
    weighted_count=calculate_effect_allele_count(mt_prs) * mt_prs.prs_info['weight'])
mt_prs = mt_prs.annotate_cols(
    sum_weights=hl.agg.sum(mt_prs.weighted_count),
    N_variants=hl.agg.count_where(hl.is_defined(mt_prs.weighted_count)))
So I guess I’ll need more resources then.
This operation is particularly expensive because you’re streaming through all the sequencing data only to read, I assume, the genotypes. You should identify the superset of variants you care about and extract them. You can either store it as a Hail MatrixTable with just one entry field, GT (a call) or even n_alt_alleles (an int32), or you can store it in some other format you prefer. PLINK is probably fine too. The important thing is to get rid of the PL, AD, etc. You’re probably bottlenecked on network bandwidth right now.
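A minimal sketch of that extraction (the bucket paths and the variant-table name are assumptions, not from your workspace):

```python
import hail as hl

# Hypothetical paths -- substitute your workspace's actual locations.
mt = hl.read_matrix_table('gs://my-bucket/full_dataset.mt')
prs_variants = hl.read_table('gs://my-bucket/prs_variants.ht')  # keyed by locus, alleles

# Keep only the PRS variants, and drop every entry field except the allele count.
mt = mt.semi_join_rows(prs_variants)
mt = mt.select_entries(n_alt_alleles=mt.GT.n_alt_alleles())

# Write the slim MatrixTable once; later reads stream far less data.
mt.write('gs://my-bucket/prs_subset.mt', overwrite=True)
```

After this one-time write, the PRS aggregation and to_pandas() run against a table that no longer carries PL, AD, and the other per-entry fields.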
And, FWIW, we’re working on a columnar storage system so that Hail can avoid reading irrelevant fields, but that’s at least a year away.
Great! Thank you so much for your help!!