Subsetting a large VCF into multiple VCFs

Hi Hail team,
I have previously been able to subset very small VCFs of around 55 samples, but I have not had to do any subsetting at a larger scale.

I am currently trying to use Hail to subset large ~40k-sample VCFs (one per chromosome) into 48 subset VCFs each. That works out to 22 (chromosomes) * 48 (subset lists) = 1,056 subset VCFs.

Our team was never able to get a Docker image working that could run Hail correctly in the cloud, so we are attempting to do this on-prem using UGER. When I attempted to subset just one of these VCFs, the job ran for over a day and then stopped without completing. Below is the script used for the subsetting. Do you have any suggestions for how this could be completed?

#!/usr/bin/env python

import hail as hl
import sys
import os
import argparse
import logging

logging.basicConfig(
    format='%(levelname)s: %(asctime)s : %(message)s', level=logging.INFO
)

def parse_command_line():
    """Fail or return arguments parsed from the command line."""
    argparser = argparse.ArgumentParser(description=__doc__)
    argparser.add_argument('-i', '--input_vcf', required=True)
    argparser.add_argument('-t', '--tsv', required=True)
    argparser.add_argument('-c', '--chromosome', required=True)
    return argparser.parse_args()


def subset_vcf(mt, subset_list, subset_vcf_path):
    logging.info(f'subsetting {subset_vcf_path} from vcf')
    with open(subset_list) as f:
        # Column one of each line is the sample ID.
        samples = [line.rstrip('\n').split('\t')[0] for line in f]
    mt = mt.filter_cols(hl.literal(samples).contains(mt.s), keep=True)
    # write out a new VCF
    logging.info(f'writing out {subset_vcf_path}')
    hl.export_vcf(mt, subset_vcf_path)


def create_directory(vcf_name):
    vcf_path = os.path.join('/seq/tng/newCloudJointGenotyping/PO-23593', vcf_name)
    os.makedirs(vcf_path, exist_ok=True)
    return vcf_path


if __name__ == "__main__":
    args = parse_command_line()

    hl.init(default_reference='GRCh38')
    # Read in the VCF, replacing 'nul' with '.'.
    logging.info('Loading in vcf')
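    # force=True reads a gzipped VCF serially, so nothing downstream can be parallelized.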
    mt = hl.import_vcf(args.input_vcf, force=True, find_replace=('nul', '.'))

    with open(args.tsv) as f:
        for line in f:
            vcf_name = line.rstrip()
            subset_list = os.path.join(
                '/seq/tng/newCloudJointGenotyping/PO-23593/sampleLists',
                vcf_name
            )
            directory = create_directory(vcf_name)
            vcf_full_name = f'{vcf_name}.chr{args.chromosome}.vcf.bgz'
            vcf_path = os.path.join(directory, vcf_full_name)
            subset_vcf(mt, subset_list, vcf_path)

Thanks,
Sam

When you say it “stopped” without completion, was there some kind of error thrown? Do you have the hail log file?

Also, are you a Broad user / what cloud provider are you using? I suspect the Hail experience on an on-prem cluster isn’t going to be great.

Hi @johnc1231,
I have attached the log file. I did not see any errors thrown in there.

Yes, I am in DSP at Broad and use Google Cloud. Another team member and I previously came to Hail office hours to discuss subsetting VCFs in the cloud using Hail, and we were told it would be easier to do this on-prem. A major factor in that discussion was also that the VCFs were so small.
Thanks,
Sam

hail-20200214-1424-0.2.20-dd6c996e7db5.log (992.2 KB)

Hey Sam, you probably need to allocate more memory to the Java process. How many variants are in the VCFs?
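If you don’t know offhand, Hail can report the dimensions directly; assuming the MatrixTable from your script, something like:

# Forces a pass over the data and returns (n_variants, n_samples).
print(mt.count())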

I do not remember how we arrived at the on-prem solution. It’s definitely a little complicated to securely get your GCP Service Account keys into a Docker container. A VM would not have this issue, and neither would a Dataproc cluster.

Unfortunately, the particular problem you’re trying to solve is not well suited to Hail MatrixTables or VCFs. Either format requires reading a whole row (variant) to extract any subset of samples from that row. @konradjk has some experience using Hail to perform this operation. Our medium-term plans include engineering a format that supports this use case better, but there’s nothing usable yet.
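One thing that may help in the meantime: because Hail is lazy, each export_vcf in your loop re-runs the import, and with force=True that means re-parsing the gzipped text serially, once per subset. Here is a sketch of importing once and checkpointing to Hail’s native format before the exports; the paths and the subsets dict are hypothetical, and force_bgz only applies if the file really is block-gzipped:

import hail as hl

hl.init(default_reference='GRCh38')

# Parse the VCF text once and persist it as a native MatrixTable;
# every later export reads the native format instead of re-parsing text.
mt = hl.import_vcf('chr1.vcf.gz', force_bgz=True)  # use force=True if the file is plain gzip
mt = mt.checkpoint('/hypothetical/path/chr1.mt')

# Hypothetical mapping of subset name -> sample IDs.
subsets = {'groupA': ['sample1', 'sample2']}

for name, samples in subsets.items():
    sub = mt.filter_cols(hl.literal(samples).contains(mt.s))
    hl.export_vcf(sub, f'/hypothetical/path/{name}.chr1.vcf.bgz')

This still reads every row once per subset, but reading the native format is much cheaper than re-parsing VCF text, and those reads can be parallelized.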

Hi @danking, I used to have it on my calendar but can’t seem to find it… When are the next Hail office hours?

It’s at 1pm in Serengetti on Thursday, February 27th.

@danking – any idea on how to raise memory for the on-prem backend of Hail?

Do you mean Hail in local mode? You’ll need to use PYSPARK_SUBMIT_ARGS.
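For reference, a minimal sketch of what that looks like (the 8g value is just an illustration; size it to your machine). PYSPARK_SUBMIT_ARGS has to be set before hl.init() launches the JVM, either exported in the shell before starting Python or set at the very top of the script:

import os

# Must be set before hl.init() starts the JVM; '8g' is illustrative.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 8g pyspark-shell'

import hail as hl
hl.init(default_reference='GRCh38')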