Filtering samples from VDS in Google Cloud

Hi – I’m working in the AoU Google cloud environment. I’m trying to subset about 500 samples from the WGS Hail VDS and I’m not sure if I’m completely off the mark here or not. I’ve successfully worked with the WGS Hail MT for the exome regions, but this is my first time using the WGS VDS.

First I ran this command to filter for my samples:

vds = hl.vds.filter_samples(vds, samples, keep=False, remove_dead_alleles=True)

Then I ran this command to get an updated variant/sample count:

vds.variant_data.count()

However, when running the second command, my progress bar keeps freezing at random points. Today, I’ve been at “(15721 + 192) / 84648” for about the last 30 minutes.

My compute cluster is as follows: 3 workers, 32 CPUs, 28.8 GB RAM + 3 preemptible workers

This cluster runs well, making good progress, until it doesn’t. Does anyone know if I am running into a resource issue? I am wondering if I should up the RAM. I am a bit lost since I haven’t worked with this large of a file previously.

Also, when I stop the command, the progress bar updates, so I assume it is continuing to run. For example, just now when I stopped the command, the progress updated to “(22075 + 191) / 84648”.

Thanks!

Hi @jasmussen !

Hail tables and matrix tables are recipes. Every time you perform an action, like count, it executes the entire recipe from start to finish. It’s hard to diagnose this issue without the full script. What happens before the count? Is it just a read and a filter samples?
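
To make that concrete, here’s a rough sketch of what I mean by a recipe (the path is a placeholder, and samples stands in for whatever sample list you passed):

import hail as hl

# Nothing is computed here: these lines only build up the recipe.
vds = hl.vds.read_vds('gs://placeholder-bucket/wgs.vds')
vds = hl.vds.filter_samples(vds, samples, keep=False, remove_dead_alleles=True)

# Only this line triggers computation, and it re-runs the whole recipe
# (read + filter) from scratch every time it is called.
n_variants, n_samples = vds.variant_data.count()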

It sounds like you’re working with an 84k partition dataset. This is an absolutely staggeringly large dataset. You won’t be able to work with this dataset with 32 cores. When working with this many partitions, it’s common to use thousands of “preemptible” or “spot” cores.

If there was a RAM issue, you’d almost certainly get an error about using too much memory. I can comment more on what’s going on if you share the full script. Can you also share which version of Hail you have installed? hl.version() should do it.

Hi @danking – Thank you for the response.

" What happens before the count? Is it just a read and a filter samples?" – Yes, it is just filtering the samples and removing any variants w/o a genotype in my 500 samples. I ultimately want to compare these samples to an external cohort that I’ve imported into the AoU cloud environment, so all I need are the variants with genotypes.

“It sounds like you’re working with an 84k partition dataset. This is an absolutely staggeringly large dataset. You won’t be able to work with this dataset with 32 cores.” – I know, this dataset is huge! I am working with the 245K release of AoU. And I should have made my compute cluster more clear: I have 32 cores per worker (96 CPUs total) + 3 preemptible workers with 32 cores per worker (96 preemptible CPUs total). I actually had my cluster at 2 workers and 10 preemptible workers, but the progress bar was freezing for the same two commands. I was thinking that Google may have been taking my preemptible cores back?

“If there was a RAM issue, you’d almost certainly get an error about using too much memory” – I haven’t run into this issue for the WGS VDS yet. I did run into it when I first started working with the exome MT and you are right, a specific error about memory popped up. Thanks for the reminder!

" Can you also share which version of Hail you have installed? hl.version() should do it." – I am using: ‘0.2.126-ee77707f4fab’ This version was installed by AoU so I don’t think I can do much about it at this point in time.

Thank you again for responding and please let me know if you have any additional thoughts on it.

Hey @jasmussen !

Thanks for the details. Can you share the code you used to filter the variants and whatever you’re doing with the variants (e.g. write or export)?

The most efficient filtering command for a VDS (in terms of data read and, thus, time) is hl.vds.filter_intervals.

Unfortunately, the VDS kind of took off in popularity much faster than we are able to backfill documentation for it. In the meantime, the examples at hl.filter_intervals (which is the equivalent command but for matrix tables as opposed to VDSes) may be of use.
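
For a VDS it looks something like this (a rough sketch: the intervals are placeholders, and I’m assuming the GRCh38 build that AoU uses):

import hail as hl

# Placeholder intervals; by default, only data overlapping them is kept.
regions = ['chr1:100M-200M', 'chrX:100M-end']
vds = hl.vds.filter_intervals(
    vds,
    [hl.parse_locus_interval(x, reference_genome='GRCh38') for x in regions],
)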

Hi @danking – I haven’t filtered the variants yet. I was going to use the command you suggested though. I was going to do something like this:
vds = hl.vds.filter_intervals(vds, [hl.parse_locus_interval(x) for x in test_intervals])

My goal is to filter for exome regions we previously used for analysis in the UK Biobank. I have a bed file I’ve uploaded into AoU that defines the list of test_intervals.
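
Something like this is how I was planning to build test_intervals from the bed file (the path here is just an example); importing the bed directly should give me interval objects without needing parse_locus_interval:

import hail as hl

# Example path to the bed file in my workspace bucket; AoU WGS is on GRCh38.
bed = hl.import_bed('gs://my-workspace-bucket/ukb_exome_regions.bed',
                    reference_genome='GRCh38')

# Collect the intervals locally and use them to subset the VDS.
test_intervals = bed.interval.collect()
vds = hl.vds.filter_intervals(vds, test_intervals)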

I wasn’t sure which was best to do first - filter variants or samples. My instinct was to filter samples first to reduce the overall size of the VDS, then filter for variant regions. Please let me know if you have a different thought on the order of this approach!

@jasmussen ,

Hail cannot efficiently filter by samples. A Hail matrix table (just like a VCF) is stored “genome-wise”, so we can rapidly extract a set of intervals but not a sample set. As such, filtering by variants first is always preferable.
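
Concretely, I’d order the operations like this (a rough sketch with placeholder paths and names):

import hail as hl

# Placeholder path and names: substitute the AoU VDS path, your intervals, and your sample list.
vds = hl.vds.read_vds('gs://placeholder-bucket/wgs.vds')

# 1. Filter by interval first: this skips whole chunks of the genome and reads far less data.
vds = hl.vds.filter_intervals(vds, test_intervals)

# 2. Then filter samples: this still touches every remaining partition, but on much less data.
vds = hl.vds.filter_samples(vds, my_samples, keep=True)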

Can you share more context on why you only want to look at a 500 sample subset? How did you define this subset?

@danking – Thank you, this is great to know! I am still relatively new to Hail and was wondering if there was a preferred order for subsetting the matrix tables and VDS files.

I selected these samples b/c they are part of a control cohort and, as far as I can tell, are free of most major complex phenotypes.

All I really want to do is some basic per sample QC comparisons between different regions of the genome. I already did this analysis with the exome Hail MT, but my group feels intuitively something is a bit off. So, we want to compare regions in AoU to regions we have used in other cohorts, i.e. UKB, for a better head-to-head QC comparison.

@jasmussen ,

Thanks for the additional context!

Hmm. Yeah, the use-case of extracting a 500 sample subset from the 250k sample VDS is not well supported. Supporting arbitrary subsets along both axes (genomic and sample) is a hard engineering problem. In your case, you’d really benefit from a copy of the data stored “sample-wise”, which sacrifices fast genomic subsetting in favor of fast sample subsetting. Unfortunately, that doesn’t exist! :frowning_face:

If you really want to extract a small sample subset and then do various downstream analyses, this is what I would try:

vds = hl.vds.read_vds(...)
vds = hl.vds.filter_samples(
    vds,
    the_five_hundred_samples,
    keep = True,
)
vds.write('gs://my-bucket/my-subset.vds')

This will, as rapidly as possible given the reality of VDS, extract the samples you want.

Now that you have a substantially reduced amount of data (roughly 1/500th of the original), you can densify your VDS to produce a dense Hail Matrix Table and apply your usual analytical techniques to that much simpler representation:

vds = hl.vds.read_vds('gs://my-bucket/my-subset.vds')
mt = hl.vds.to_dense_mt(vds)
mt.write('gs://my-bucket/my-dense-subset.mt')

Finally, adjust the partitioning, since you now have much less data:

mt = hl.read_matrix_table('gs://my-bucket/my-dense-subset.mt')
mt = mt.repartition(max(1, mt.n_partitions() // 500))  # integer division: repartition needs a whole number of partitions
mt.write('gs://my-bucket/my-dense-subset-repartitioned.mt')

If it is available within AoU, I strongly recommend cluster autoscaling. If that is not available, I would use a relatively large cluster (at least 1,000 cores) for the first step, for a hopefully short period of time. The second and third steps should be able to use a substantially smaller cluster given the reduced data.

The first and second steps can use exclusively spot/preemptible VMs (other than the required 2 non-spot VMs). The third step must use exclusively non-spot/non-preemptible VMs due to the nature of repartition.

@danking – Thank you so very much!! This is extremely helpful. I was going to take a similar approach as the one you suggested, but you added a few steps I would not have thought/known to add.

– I didn’t think to write the VDS to storage, but I will do that! I was going to densify the data then write to VCF to save the smaller file. I’ve had issues writing the Hail MT to the Google workspace bucket, but writing a VCF to the bucket doesn’t seem to be an issue. If I have trouble writing the VDS, I will proceed with my original plan of densifying first.

– I would not have known to repartition the data. What happens with that step and how does it benefit things?

– I will look at cluster autoscaling. I haven’t stumbled across it in the Research Support Hub, but I will intentionally look for it now. If I don’t find anything, I will ask tech support if that is possible. They probably know me by name at this point :laughing: Not sure if that is a good or bad thing!


@jasmussen Hi Jennifer, it seems like we are conducting a similar analysis, but I am filtering for ~60,000 samples. Was wondering if support got back to you about the cluster autoscaling and what compute cluster setup was able to work for you on the AoU workbench. Also saw your other post about issues with genotypes after converting to the VCF and was curious about this.

@sarteaga – I never pursued the autoscaling approach for the VDS. I got side-tracked with a few other issues as you saw in my other post! My concern with the Hail MT to VCF export is all good now, but there are a few other concerns I need to work through related to variant QC and filtering.

Regarding your question, I have been subsetting the AoU Hail exome MT without issue. I haven’t tried to subset 60K samples, but I have had success with 5-6K sized cohorts. I have used about 400-500 CPUs (all but the 2 required workers are preemptible workers) and I usually up the RAM a little bit; it takes about 5-6 hours to subset the 5-6K samples from the Hail MT and write to VCF. This costs about $10/hour. My grant funding is not NIH-based, so it might be cheaper for others with the STRIDES discount.
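
The rough shape of what I run looks like this (paths and variable names are placeholders rather than the exact AoU paths):

import hail as hl

# Placeholder paths; substitute the AoU exome MT path and your own workspace bucket.
mt = hl.read_matrix_table('gs://placeholder-bucket/exome.mt')

# Keep only the samples in my cohort (my_sample_ids is a Python list of sample IDs).
cohort = hl.literal(set(my_sample_ids))
mt = mt.filter_cols(cohort.contains(mt.s))

# Drop variants with no non-reference genotypes left after subsetting.
mt = mt.filter_rows(hl.agg.any(mt.GT.is_non_ref()))

# Export to a block-gzipped VCF in the workspace bucket.
hl.export_vcf(mt, 'gs://my-workspace-bucket/cohort_subset.vcf.bgz')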

And, since my cohorts are smaller, I’ve been using bcftools to do additional filtering once I have the VCF. However, you may want to reimport your VCF to a Hail MT for a cohort of 60K. With that size, bcftools will take forever and a day.

Hope that helps!

Hello @danking, thanks for your response to this thread. I am running a similar analysis, however with over 10x the number of samples (~60K). I have currently subset the VDS by chromosome and then by samples, since this post and others mentioned that “filtering by variants first is always preferable” and I am only looking at variants on 11 chromosomes. I was wondering whether this is what you would recommend, or if I should instead filter the VDS by samples and then by chromosome before converting to a dense MT.