Slow Terra Output

I am trying to use Hail to load the 1000 Genomes and HGDP data from gnomAD, remove low-quality variants and samples, and subset the variants to those present in my exome data. However, it appears to be slow. I tried the suggestions for Notebook 1 on this page: hgdp_tgp/tutorials at master · atgu/hgdp_tgp · GitHub, but the job has about 50,000 partitions in total and seems to take too long. Attached are the parameters I used. Does anyone have advice on making it run more efficiently?

Here is the code I am trying to run:

import hail as hl

# gnomAD HGDP + 1KG PCA results for related samples, with outliers removed
relateds_mt_without_outliers_path = 'gs://gcp-public-data--gnomad/release/3.1/secondary_analyses/hgdp_1kg/pca_results/relateds_without_outliers.mt'
relateds_mt_without_outliers_mt = hl.read_matrix_table(relateds_mt_without_outliers_path)

# samples_vcf_link points to my exome VCF (defined earlier)
samples_mt = hl.import_vcf(samples_vcf_link, force_bgz=True, reference_genome='GRCh38')

# keep only the variants that are also present in my exome data
relateds_mt_without_outliers_intersect_mt = relateds_mt_without_outliers_mt.filter_rows(
    hl.is_defined(samples_mt.index_rows(relateds_mt_without_outliers_mt.row_key)))

relateds_mt_without_outliers_intersect_mt.write("mt/relateds_mt_without_outliers_intersect.mt")

Hey @beneopp!

I generally prefer using autoscaling clusters, but I'm not sure if those are available in Terra. An autoscaling cluster ensures Hail always has as many cores as it can usefully use in parallel, which keeps runtime down without adding any cost for you.

How long is "too long"? In general, a few thousand genomes is quite a lot of data. If the Spark progress bar indicates that you have, say, 10,000 partitions, then a cluster of 20 workers with 8 cores each can only process 20 × 8 = 160 partitions in parallel. Using 200 VMs instead would get you your answer ten times faster for the same total cost. Note that you'd need to shrink or shut down the cluster afterwards; this is why I strongly recommend autoscaling clusters.
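For example, here's a minimal sketch of how to check how many partitions you're dealing with and how many "waves" of tasks your cluster will need; the 20 workers × 8 cores below is a hypothetical cluster shape, not your actual configuration, and the path is reused from your snippet:

import hail as hl

mt = hl.read_matrix_table(relateds_mt_without_outliers_path)
n_parts = mt.n_partitions()        # number of tasks the job is split into
workers, cores_per_worker = 20, 8  # hypothetical cluster shape
parallel = workers * cores_per_worker
print(f'{n_parts} partitions, {parallel} in flight at a time, '
      f'~{n_parts / parallel:.0f} waves of tasks')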

How large (in bytes) is your VCF data?
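If you're not sure, you can check from within Hail. A sketch, assuming hl.hadoop_ls's usual output fields and reusing samples_vcf_link from your snippet:

import hail as hl

# hl.hadoop_ls returns one entry per matching file, including a 'size_bytes' field
files = hl.hadoop_ls(samples_vcf_link)
total_bytes = sum(f['size_bytes'] for f in files if not f['is_dir'])
print(f'{total_bytes / 1e9:.1f} GB')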

The code you shared looks fine to me. That's the fastest way to filter a VCF to the rows it shares with the gnomAD PCA results. The only knob you have left to turn is the number of cores you're using.
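Incidentally, the same filter can also be written with semi_join_rows; as far as I know this is purely a readability choice, not a performance one. A sketch reusing your variable names:

# keep rows of the gnomAD MT whose row key also appears in the exome data
intersect_mt = relateds_mt_without_outliers_mt.semi_join_rows(samples_mt.rows())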

To estimate the runtime, I tracked 280 partitions completing in one minute. It's also worth noting that I changed the workers to high-memory machines. At that rate the job would take me about 178 minutes, nearly three hours.
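In case it's useful to anyone else, here's that estimate as a snippet (both numbers read off the Spark progress bar):

partitions_total = 50_000        # total partitions shown on the progress bar
partitions_per_minute = 280      # observed completion rate
print(partitions_total / partitions_per_minute)  # ~178.6 minutes, i.e. nearly 3 hours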