Subsetting a large VCF into multiple VCFs

Hi Hail team,
I have previously been able to subset very small VCFs of around 55 samples, but I have not had to do any subsetting at a larger scale.

I am currently trying to use Hail to subset large ~40k-sample VCFs (one per chromosome) into 48 subset VCFs each. That works out to 22 (chromosomes) * 48 (subset lists) = 1,056 subset VCFs.

Our team was never able to get a Docker image working that could run Hail correctly in the cloud, so we are attempting to do this on-prem using UGER. When I attempted to subset just one of these VCFs, the job ran for over a day and then stopped without completing. Below is the script used for the subsetting. Do you have any suggestions for how this could be completed?

#!/usr/bin/env python

import hail as hl
import sys
import os
import argparse
import logging

logging.basicConfig(
    format='%(levelname)s: %(asctime)s : %(message)s', level=logging.INFO
)

def parse_command_line():
    """Fail or return arguments parsed from the command line."""
    argparser = argparse.ArgumentParser(description=__doc__)
    argparser.add_argument('-i', '--input_vcf', required=True)
    argparser.add_argument('-t', '--tsv', required=True)
    argparser.add_argument('-c', '--chromosome', required=True)
    return argparser.parse_args()


def subset_vcf(mt, subset_list, subset_vcf_path):
    logging.info(f'subsetting {subset_vcf_path} from vcf')
    with open(subset_list) as f:
        # Column one of each line is the sample ID.
        samples = [line.rstrip('\n').split('\t')[0] for line in f]
    mt = mt.filter_cols(hl.literal(samples).contains(mt.s), keep=True)
    # write out a new VCF
    logging.info(f'writing out {subset_vcf_path}')
    hl.export_vcf(mt, subset_vcf_path)


def create_directory(vcf_name):
    vcf_path = os.path.join('/seq/tng/newCloudJointGenotyping/PO-23593', vcf_name)
    os.makedirs(vcf_path, exist_ok=True)
    return vcf_path


if __name__ == "__main__":
    args = parse_command_line()

    hl.init(default_reference='GRCh38')
    # Read in the VCF, replacing 'nul' with '.'.
    logging.info('Loading in vcf')
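    # force=True reads a gzipped VCF serially, so nothing downstream can be parallelized.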
    mt = hl.import_vcf(args.input_vcf, force=True, find_replace=('nul', '.'))

    with open(args.tsv) as f:
        for line in f:
            vcf_name = line.rstrip()
            subset_list = os.path.join(
                '/seq/tng/newCloudJointGenotyping/PO-23593/sampleLists',
                vcf_name
            )
            directory = create_directory(vcf_name)
            vcf_full_name = f'{vcf_name}.chr{args.chromosome}.vcf.bgz'
            vcf_path = os.path.join(directory, vcf_full_name)
            subset_vcf(mt, subset_list, vcf_path)

Thanks,
Sam

When you say it “stopped” without completion, was there some kind of error thrown? Do you have the hail log file?

Also, are you a Broad user / what cloud provider are you using? I suspect the Hail experience on an on-prem cluster isn’t going to be great.

Hi @johnc1231,
I have attached the log file. I did not see any errors thrown in there.

Yes, I am in DSP at Broad and use Google Cloud. Another team member and I previously came to Hail office hours to discuss subsetting VCFs in the cloud using Hail, and we were told it would be easier to do this on-prem. A major factor in that discussion was also that the VCFs were so small.
Thanks,
Sam

hail-20200214-1424-0.2.20-dd6c996e7db5.log (992.2 KB)

Hey Sam, you probably need to allocate more memory to the Java process. How many variants are in the VCFs?
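If you don’t know offhand, Hail can report the dimensions directly; assuming the MatrixTable from your script, something like:

# Forces a pass over the data and returns (n_variants, n_samples).
print(mt.count())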

I do not remember how we arrived at the on-prem solution. It’s definitely a little complicated to securely get your GCP Service Account keys into a Docker container. A VM would not have this issue, and neither would a Dataproc cluster.

Unfortunately, the particular problem you’re trying to solve is not well suited to Hail MatrixTables or VCFs. Either format requires reading a whole row (variant) to extract any subset of samples from that row. @konradjk has some experience using Hail to perform this operation. Our medium-term plans include engineering a format that supports this use case better, but there’s nothing usable yet.
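One thing that may help in the meantime: because Hail is lazy, each export_vcf in your loop re-runs the import, and with force=True that means re-parsing the gzipped text serially, once per subset. Here is a sketch of importing once and checkpointing to Hail’s native format before the exports; the paths and the subsets dict are hypothetical, and force_bgz only applies if the file really is block-gzipped:

import hail as hl

hl.init(default_reference='GRCh38')

# Parse the VCF text once and persist it as a native MatrixTable;
# every later export reads the native format instead of re-parsing text.
mt = hl.import_vcf('chr1.vcf.gz', force_bgz=True)  # use force=True if the file is plain gzip
mt = mt.checkpoint('/hypothetical/path/chr1.mt')

# Hypothetical mapping of subset name -> sample IDs.
subsets = {'groupA': ['sample1', 'sample2']}

for name, samples in subsets.items():
    sub = mt.filter_cols(hl.literal(samples).contains(mt.s))
    hl.export_vcf(sub, f'/hypothetical/path/{name}.chr1.vcf.bgz')

This still reads every row once per subset, but reading the native format is much cheaper than re-parsing VCF text, and those reads can be parallelized.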

Hi @danking, I used to have it on my calendar but can’t seem to find it… When are the next Hail office hours?

It’s at 1pm in Serengetti on Thursday, February 27th.

@danking – any idea on how to raise memory for the on-prem backend of Hail?

Do you mean Hail in local mode? You’ll need to use PYSPARK_SUBMIT_ARGS.
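For reference, a minimal sketch of what that looks like (the 8g value is just an illustration; size it to your machine). PYSPARK_SUBMIT_ARGS has to be set before hl.init() launches the JVM, either exported in the shell before starting Python or set at the very top of the script:

import os

# Must be set before hl.init() starts the JVM; '8g' is illustrative.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 8g pyspark-shell'

import hail as hl
hl.init(default_reference='GRCh38')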