Difficulty annotating matrix table with phenotypes on Hail 0.2

We are having difficulty annotating our UKBB dataset on Hail 0.2 (5/30 commit). Our dataset is roughly 800k variants x 500k samples x 16k phenotypes.

We’re running on AWS EMR 5.12.1, Spark 2.2.0, with 4 x m4.10xlarge worker nodes and autoscaling enabled. We did notice autoscaling kick in while aggregating genotype data across chromosomes (we were only using m4.4xlarge before bumping up for the tests below), but not during our annotation attempts.

Running into the same issue noted in another post, we tried the pattern:

mt_ukb = mt_ukb.annotate_cols(ukb = ph_table[mt_ukb.s])

rather than ** expansion of the table's row fields. Note: mt_ukb is the matrix table created by aggregating Plink-imported data across chromosomes and is stored on S3; ph_table is our phenotype table imported from TSV format.

We were unsuccessful loading all 16k phenotypes in one call (it failed after over an hour with “Executor heartbeat timed out after 178235 ms”).

Then we backed off and tried annotating with the first 20, 200, and 2000 phenotypes: 20 and 200 worked fine, but 2000 failed.

Then we cut the phenotype data into smaller tables of 1000 phenotypes each and tried to annotate incrementally, 1000 at a time, using the same pattern above. The first 1000 took 100 s, the second took 320 s, and the kernel died on the third. Each of these tables is about 1.5 GB of uncompressed TSV. Note that the third set took only 90 s when it was loaded first after starting over.
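In code, the incremental attempt looked roughly like this (a simplified sketch; the paths, field handling, and batch naming here are illustrative, not our exact script):

```python
def phenotype_batches(field_names, batch_size):
    """Split a list of phenotype column names into fixed-size batches."""
    return [field_names[i:i + batch_size]
            for i in range(0, len(field_names), batch_size)]

def annotate_in_batches(mt_path, pheno_path, batch_size=1000):
    """Sketch of the incremental annotation loop (requires a running
    Hail context; paths are placeholders)."""
    import hail as hl
    mt = hl.read_matrix_table(mt_path)
    ht = hl.import_table(pheno_path, key='s', impute=True)
    # Phenotype column names from the table's row schema.
    fields = list(ht.row_value.dtype.fields)
    for i, batch in enumerate(phenotype_batches(fields, batch_size)):
        sub = ht.select(*batch)  # keep only this batch of phenotypes
        mt = mt.annotate_cols(**{'ukb_%d' % i: sub[mt.s]})
    return mt
```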

Is it realistic to expect to load more than ~2000 phenotypes, or are we running into fundamental scaling limitations?

If not, do you have suggestions on another approach?

We were looking to create one master dataset up front before allowing scientists to start performing QC and do more selective analysis.

Hi David,
The current infrastructure in the 0.2 build doesn’t scale to huge column tables – we store all the column data in local memory. We intend to change this, but there are higher-priority engineering tasks ahead of it. Your errors are probably related to memory, and the rapidly growing annotation time in the incremental approach is explained by GC thrashing.

In the meantime, I would recommend that you store your UKBB MatrixTable without the phenotypes, and annotate them in batches when they’re needed. The UKBB Rapid GWAS effort took this approach, annotating groups of ~100-200 phenotypes at a time for regression.
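A batched run might look something like this (a sketch, not tested at UKBB scale; paths and phenotype names are placeholders, and the regression function may be named `linear_regression` rather than `linear_regression_rows` in older 0.2 builds):

```python
def run_batched_gwas(mt_path, pheno_path, batch, covariates=()):
    """Annotate one batch of phenotypes and regress on just that batch.

    'batch' is a list of phenotype column names; the stored MatrixTable
    itself never carries all 16k phenotypes at once.
    """
    import hail as hl
    mt = hl.read_matrix_table(mt_path)
    ht = hl.import_table(pheno_path, key='s', impute=True)
    # Join only the requested phenotype columns onto the samples.
    mt = mt.annotate_cols(pheno=ht.select(*batch)[mt.s])
    # One regression pass over this batch of response variables.
    results = hl.linear_regression_rows(
        y=[mt.pheno[p] for p in batch],
        x=mt.GT.n_alt_alleles(),
        covariates=[1.0] + list(covariates))
    return results
```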

Thanks for the quick reply and insight into what’s happening on the back end.

Happy to answer any questions about more specific usage patterns as well!

It’s also worth noting that the BGEN format (v1.2) is well supported in Hail. A reasonable approach is to store the variant metadata and sample metadata as two Hail tables and join them against the BGEN genotype data when necessary.
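That pattern might be sketched like this (paths and field names are placeholders; recent Hail versions also require indexing the BGEN first, shown here as a commented step):

```python
def load_ukbb(bgen_path, sample_file, variant_ht, sample_ht):
    """Join genotypes imported from BGEN with separately stored
    variant- and sample-metadata tables on demand."""
    import hail as hl
    # hl.index_bgen(bgen_path)  # needed once, on newer Hail versions
    mt = hl.import_bgen(bgen_path, entry_fields=['GT'],
                        sample_file=sample_file)
    variants = hl.read_table(variant_ht)  # keyed by (locus, alleles)
    samples = hl.read_table(sample_ht)    # keyed by sample id 's'
    # Attach each metadata table as a single struct field to avoid
    # colliding with fields the BGEN import already provides.
    mt = mt.annotate_rows(variant_meta=variants[mt.row_key])
    mt = mt.annotate_cols(sample_meta=samples[mt.s])
    return mt
```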