We are having difficulty annotating our UKBB dataset on Hail 0.2 (5/30 commit). The dataset is roughly 800k variants x 500k samples, with 16k phenotypes to annotate.
We’re running Spark 2.2.0 on AWS EMR 5.12.1 with 4 x m4.10xlarge worker nodes and autoscaling turned on. We did notice autoscaling kick in when aggregating genotype data across chromosomes (we were only using m4.4xlarge before bumping up for the tests below), but not during our annotation attempts.
Running into the same issue noted in another post, we tried the pattern:
mt_ukb = mt_ukb.annotate_cols(ukb = ph_table[mt_ukb.s])
rather than `**` expansion. Note: mt_ukb is the matrix table created by aggregating the PLINK-imported data across chromosomes and is stored on S3; ph_table is our phenotype table imported from TSV.
We were unable to load all 16k phenotypes in one call; it failed after more than an hour with “Executor heartbeat timed out after 178235 ms”.
We then backed off and tried annotating with only the first 20, 200, and 2000 phenotypes: 20 and 200 worked fine, but 2000 failed.
Next we cut the phenotype data into smaller tables of 1000 phenotypes each and tried to annotate incrementally, 1000 at a time, using the same pattern as above. The first 1000 took 100 s, the second took 320 s, and the kernel died on the third. Each of these tables is about 1.5 GB of uncompressed TSV. Note that the third set took only 90 s when it was loaded first after a fresh start, so the slowdown appears cumulative rather than tied to any particular table.
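For reference, the batching logic looks roughly like this (a pure-Python sketch; `batch_fields` and the phenotype names are illustrative, not from our actual script — in practice each batch corresponds to one of the smaller tables we join in via `mt_ukb.annotate_cols(... = sub_table[mt_ukb.s])`):

```python
def batch_fields(fields, batch_size):
    """Split a list of phenotype field names into fixed-size batches."""
    return [fields[i:i + batch_size] for i in range(0, len(fields), batch_size)]

# e.g. 16,000 hypothetical phenotype names, annotated 1000 at a time:
phenos = ['pheno_%d' % i for i in range(16000)]
batches = batch_fields(phenos, 1000)
# each element of `batches` then drives one annotate_cols round,
# as in the pattern shown above
```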
Is it realistic to expect to annotate more than ~2000 phenotypes this way, or are we running into a fundamental scaling limitation?
If not, do you have suggestions for another approach?
We were hoping to create one master dataset up front before letting scientists begin QC and more selective analyses.