Partition the VDS before querying it

Hi,
I’m trying to run an analysis on my laptop, and I want to partition the VDS and perform an analysis on each partition separately.
I am looking for a method similar to VDS.sample_variants(frac), but one that will allow me to run on a specific part of the data set each time (using a for loop).
I looked through the (excellent) docs thoroughly but I didn’t find exactly what I need. Any help is appreciated :slight_smile:

Interesting use case - can you describe it a little more?

  • Should these partitions be genomic ranges, or random partitions of the dataset?
  • If random, are variants sampled with replacement?

If the partitions should be genomic ranges, then the filter_intervals method is probably what you want – this will let you restrict to a few MB of the dataset without needing to read + filter all the data. If you’re looking for random partitions, then we’ll have to think a little more.
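For example, restricting to a single genomic range looks roughly like this (just a sketch – the interval string uses the same chrom:start-end format shown further down, and the range itself is made up):

region = vds.filter_intervals(Interval.parse('1:1000000-2000000'))  # keep only variants in this range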

I guess I need the partitions to be genomic ranges - I’m just not sure how to set the conditions for the filter_intervals method.

What I’m trying to do is:

  1. Import VCF file
  2. Turn the resulting VDS into a variants table
  3. Turn the table into a pandas dataframe
  4. Perform analysis on the pandas dataframe (using just one column, ‘v.ref’).

Before step #3, I had to limit the size of my data set because of obvious memory limitations.
I tried using the filter_intervals method, but I don’t have a specific condition, I just need to get a limited batch of the data so it could be handled. I’m new to the domain so I’m probably missing some relevant knowledge in that regard.
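To make that concrete, the straightforward version I had in mind is roughly this (a sketch – 'my.vcf.bgz' is just a placeholder for my file):

from hail import *

hc = HailContext()
vds = hc.import_vcf('my.vcf.bgz')  # 1. import the VCF
kt = vds.variants_table()          # 2. turn the VDS into a variants table
df = kt.to_pandas()                # 3. this is the step that runs out of memory on the full data set
# 4. analysis using the 'v.ref' column of df
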
I’ll try to illustrate with a pandas dataframe since I’m more familiar with it - I need to split a dataframe and perform an analysis on the first 10% of data, second 10% of data, etc.
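For example, in pandas I would do something like this (just an illustration of the kind of chunking I mean; do_my_analysis stands in for whatever I’m actually computing):

import numpy as np

# split the rows into 10 roughly equal chunks and analyze each one in turn
for chunk in np.array_split(df, 10):
    do_my_analysis(chunk)
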
I considered using Google Cloud but setting it up seems too complicated for my needs.
I hope that’s clearer :slight_smile:

You’d be surprised how easy it is to set up + run Hail on the cloud, but I think it should work locally too.

Here’s roughly what you’ll want to do:

from hail import *
hc = HailContext()
vds = hc.read('my.vds')
for chrom, start, end in my_ranges:
    region = vds.filter_intervals(Interval.parse('%s:%d-%d' % (chrom, start, end)))
    df = (region.variants_table()
                .annotate('ref = v.ref')  # this line and the next are here for performance reasons
                .select(['ref'])          # since we're only looking at the reference allele
                .to_pandas())
    do_my_analysis(df)

If it’s possible for you to share, what sorts of analysis are you doing with the reference allele? It may be possible to do it on the keytable itself, which will be naturally distributed and possibly faster.
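For example, if all you need is a count of each reference allele, something along these lines might work without pulling anything into pandas (a sketch – I’m assuming the 0.1 expression language’s map/counter aggregators here):

ref_counts = vds.query_variants('variants.map(v => v.ref).counter()')  # dict of ref allele -> count, computed distributedly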

Thank you!
The next question might be basic but… how do I know how to set start and end ranges?

I’ll try to elaborate on my analysis later today. Thank you for your quick replies! :slight_smile:

If you’ve got a file like:

tpoterba$ cat ranges.txt
1:0-10M
1:10M-20M
1:20M-30M
...

Then:

from hail import *
hc = HailContext()
vds = hc.read('my.vds')
with open('ranges.txt') as f:
    for line in f:
        interval = Interval.parse(line.strip())
        region = vds.filter_intervals(interval)
        df = (region.variants_table()
                    .annotate('ref = v.ref')  # this line and the next are here for performance reasons
                    .select(['ref'])          # since we're only looking at the reference allele
                    .to_pandas())
        do_my_analysis(df)

Without more details about what you’re doing, I can’t offer any advice on what range size to pick.