Slow Terra Output

beneopp · February 23, 2024, 8:12pm

I am trying to use Hail to load the 1000 Genomes and HGDP data from gnomAD, remove low-quality variants and samples, and subset the variants to those in my exome data. However, it appears to be slow. I tried following the suggestions for Notebook 1 on this page: hgdp_tgp/tutorials at master · atgu/hgdp_tgp · GitHub. However, the final number of partitions is 50,000 and it seems to take too long. Attached are the parameters I used. Does anyone have any advice on making it run more efficiently?

Here is the code I am trying to run:

import hail as hl

# Public gnomAD HGDP + 1kG matrix table (related samples, PCA outliers removed)
relateds_mt_without_outliers_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/pca_results/relateds_without_outliers.mt'

relateds_mt_without_outliers_mt = hl.read_matrix_table(relateds_mt_without_outliers_path)

# samples_vcf_link is defined elsewhere and points to my exome VCF
samples_mt = hl.import_vcf(samples_vcf_link, force_bgz=True, reference_genome='GRCh38')

# Keep only the gnomAD rows whose key (locus, alleles) also appears in my VCF
relateds_mt_without_outliers_intersect_mt = relateds_mt_without_outliers_mt.filter_rows(hl.is_defined(samples_mt.index_rows(relateds_mt_without_outliers_mt.row_key)))

relateds_mt_without_outliers_intersect_mt.write("mt/relateds_mt_without_outliers_intersect.mt")

danking · February 28, 2024, 8:10pm

Hey @beneopp !

I generally prefer using autoscaling clusters but I’m not sure if those are available in Terra. Autoscaling clusters ensure Hail always uses as many cores as it can usefully use in parallel. This keeps runtime down and doesn’t create any added cost for you.

How long is “too long”? In general, a few thousand genomes is quite a lot of data. If the Spark progress bar indicates that you have, say, 10,000 partitions, then your cluster can only process about 20*8 = 160 partitions in parallel. Using 200 VMs will get you your answer ten times faster for the same cost. Note that you’ll need to shrink or shut down the cluster afterwards. This is why I strongly recommend people use autoscaling clusters.

How large (in bytes) is your VCF data?

The code you shared seems fine to me. That’s the fastest way to filter some VCF to the rows it shares with the gnomAD PCA results. The only knob you have left to turn is the number of cores you’re using.
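
As a rough illustration of the parallelism arithmetic in this reply, here is a minimal Python sketch. It assumes that 20*8 means 20 workers with 8 cores each, that each core handles one partition at a time, and that all partitions take about the same time. The function names are hypothetical, and nothing here calls Hail or Spark.

```python
import math


def parallel_slots(n_workers: int, cores_per_worker: int) -> int:
    """Partitions that can run at the same time (assumes one partition per core)."""
    return n_workers * cores_per_worker


def waves(n_partitions: int, n_workers: int, cores_per_worker: int) -> int:
    """Number of sequential rounds needed to get through every partition."""
    return math.ceil(n_partitions / parallel_slots(n_workers, cores_per_worker))


# Assumed cluster shapes: 20 workers vs. 200 workers, 8 cores each.
small_cluster_waves = waves(10_000, 20, 8)   # ceil(10000 / 160)
large_cluster_waves = waves(10_000, 200, 8)  # ceil(10000 / 1600)

# Cost proxy: workers multiplied by rounds, i.e. how long the VMs are kept busy.
small_cluster_cost = 20 * small_cluster_waves
large_cluster_cost = 200 * large_cluster_waves

print(small_cluster_waves, large_cluster_waves)
print(small_cluster_cost, large_cluster_cost)
```

Under these assumptions the larger cluster needs roughly a tenth as many rounds while the cost proxy stays in the same range, which is the reasoning behind "ten times faster for the same cost".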

beneopp · March 11, 2024, 5:23pm

To guess the runtime, I tracked 280 partitions completed in one minute. At that rate it will take me about 178 minutes, or nearly 3 hours. It’s also worth noting I changed the workers to high memory.
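
The estimate above can be reproduced with a short Python sketch. It assumes the total is 50,000 partitions (the figure from the first post) and that the observed rate of 280 partitions per minute stays constant for the whole job, which may not hold in practice. The function name is hypothetical.

```python
def estimated_minutes(total_partitions: int, partitions_per_minute: float) -> float:
    """Linear extrapolation: total work divided by the observed rate."""
    return total_partitions / partitions_per_minute


# Assumed inputs taken from the thread: 50,000 partitions at 280 per minute.
minutes = estimated_minutes(50_000, 280)
whole_hours, leftover_minutes = divmod(minutes, 60)

print(f"{minutes:.0f} minutes, about {int(whole_hours)} h {leftover_minutes:.0f} min")
```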
