Partition the VDS before querying it

Hi,
I’m trying to run an analysis on my laptop, and I want to partition the VDS and perform an analysis on each partition separately.
I am looking for a method similar to VDS.sample_variants(frac), but one that will allow me to run on a specific part of the data set each time (using a for loop).
I looked through the (excellent) docs thoroughly but I didn’t find exactly what I need. Any help is appreciated :slight_smile:

Interesting use case - can you describe it a little more?

  • Should these partitions be genomic ranges, or random partitions of the dataset?
  • If random, are variants sampled with replacement?

If the partitions should be genomic ranges, then the filter_intervals method is probably what you want – this will let you restrict to a few MB of the dataset without needing to read + filter all the data. If you’re looking for random partitions, then we’ll have to think a little more.
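For example, restricting to a single genomic range looks roughly like this (just a sketch – the interval string uses the same chrom:start-end format shown further down, and the range itself is made up):

region = vds.filter_intervals(Interval.parse('1:1000000-2000000'))  # keep only variants in this range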

I guess I need the partitions to be genomic ranges - I’m just not sure how to set the conditions for the filter_intervals method.

What I’m trying to do is:

  1. Import VCF file
  2. Turn the resulting VDS into a variants table
  3. Turn the table into a pandas dataframe
  4. Perform analysis on the pandas dataframe (using just one column, ‘v.ref’).

Before step #3, I had to limit the size of my data set because of obvious memory limitations.
I tried using the filter_intervals method, but I don’t have a specific condition, I just need to get a limited batch of the data so it could be handled. I’m new to the domain so I’m probably missing some relevant knowledge in that regard.
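To make that concrete, the straightforward version I had in mind is roughly this (a sketch – 'my.vcf.bgz' is just a placeholder for my file):

from hail import *

hc = HailContext()
vds = hc.import_vcf('my.vcf.bgz')  # 1. import the VCF
kt = vds.variants_table()          # 2. turn the VDS into a variants table
df = kt.to_pandas()                # 3. this is the step that runs out of memory on the full data set
# 4. analysis using the 'v.ref' column of df
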
I’ll try to illustrate with a pandas dataframe since I’m more familiar with it - I need to split a dataframe and perform an analysis on the first 10% of data, second 10% of data, etc.
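For example, in pandas I would do something like this (just an illustration of the kind of chunking I mean; do_my_analysis stands in for whatever I’m actually computing):

import numpy as np

# split the rows into 10 roughly equal chunks and analyze each one in turn
for chunk in np.array_split(df, 10):
    do_my_analysis(chunk)
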
I considered using Google Cloud but setting it up seems too complicated for my needs.
I hope that’s clearer :slight_smile:

You’d be surprised how easy it is to set up + run Hail on the cloud, but I think it should work locally too.

Here’s roughly what you’ll want to do:

from hail import *
hc = HailContext()
vds = hc.read('my.vds')
for chrom, start, end in my_ranges:
    region = vds.filter_intervals(Interval.parse('%s:%d-%d' % (chrom, start, end)))
    df = (region.variants_table()
                .annotate('ref = v.ref')  # this line and the next are here for performance reasons
                .select(['ref'])          # since we're only looking at the reference allele
                .to_pandas())
    do_my_analysis(df)

If it’s possible for you to share, what sorts of analysis are you doing with the reference allele? It may be possible to do it on the keytable itself, which will be naturally distributed and possibly faster.
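For example, if all you need is a count of each reference allele, something along these lines might work without pulling anything into pandas (a sketch – I’m assuming the 0.1 expression language’s map/counter aggregators here):

ref_counts = vds.query_variants('variants.map(v => v.ref).counter()')  # dict of ref allele -> count, computed distributedly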

Thank you!
The next question might be basic but… how do I know how to set start and end ranges?

I’ll try to elaborate on my analysis later today. Thank you for your quick replies! :slight_smile:

If you’ve got a file like:

tpoterba$ cat ranges.txt
1:0-10M
1:10M-20M
1:20M-30M
...

Then:

from hail import *
hc = HailContext()
vds = hc.read('my.vds')
with open('ranges.txt') as f:
    for line in f:
        interval = Interval.parse(line.strip())
        region = vds.filter_intervals(interval)
        df = (region.variants_table()
                    .annotate('ref = v.ref')  # this line and the next are here for performance reasons
                    .select(['ref'])          # since we're only looking at the reference allele
                    .to_pandas())
        do_my_analysis(df)

Without more details about what you’re doing, I can’t offer any advice on what range size to pick.