We are having difficulty annotating our UKBB dataset on Hail 0.2 (5/30 commit). The dataset is roughly 800k variants x 500k samples, with 16k phenotypes to annotate.
We’re running Spark 2.2.0 on AWS EMR 5.12.1 with 4 x m4.10xlarge worker nodes and autoscaling turned on. We did notice autoscaling kick in when aggregating genotype data across chromosomes (we were only using m4.4xlarge before bumping up for the tests below), but not during our annotation attempts.
Running into the same issue noted in another post, we tried the pattern:
mt_ukb = mt_ukb.annotate_cols(ukb = ph_table[mt_ukb.s])
rather than `**` expansion. Note: mt_ukb is the matrix table created by aggregating the PLINK-imported data across chromosomes and is stored on S3; ph_table is our phenotype table imported from TSV.
We were unable to load all 16k phenotypes in one call; it failed after more than an hour with “Executor heartbeat timed out after 178235 ms”.
We then backed off and tried annotating with only the first 20, 200, and 2000 phenotypes: 20 and 200 worked fine, but 2000 failed.
Next we cut the phenotype data into smaller tables of 1000 phenotypes each and tried to annotate incrementally, 1000 at a time, using the same pattern as above. The first 1000 took 100 s, the second took 320 s, and the kernel died on the third. Each of these tables is about 1.5 GB of uncompressed TSV. Note that the third set took only 90 s when it was loaded first after a fresh start, so the slowdown appears cumulative rather than tied to any particular table.
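For reference, the batching logic looks roughly like this (a pure-Python sketch; `batch_fields` and the phenotype names are illustrative, not from our actual script — in practice each batch corresponds to one of the smaller tables we join in via `mt_ukb.annotate_cols(... = sub_table[mt_ukb.s])`):

```python
def batch_fields(fields, batch_size):
    """Split a list of phenotype field names into fixed-size batches."""
    return [fields[i:i + batch_size] for i in range(0, len(fields), batch_size)]

# e.g. 16,000 hypothetical phenotype names, annotated 1000 at a time:
phenos = ['pheno_%d' % i for i in range(16000)]
batches = batch_fields(phenos, 1000)
# each element of `batches` then drives one annotate_cols round,
# as in the pattern shown above
```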
Is it realistic to expect to annotate more than ~2000 phenotypes this way, or are we running into a fundamental scaling limitation?
If not, do you have suggestions for another approach?
We were hoping to create one master dataset up front before letting scientists begin QC and more selective analyses.