I have a big collection of VCF files, one per individual, sitting in a bunch of S3 folders. Each VCF contains several million SNPs.
Is there a way of loading all of them into one big VDS?
I tried importing them one by one with import_vcf and then joining them with VariantDataset.join(), but join() only accepts a single other VDS, and I need to join a whole list.
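Roughly what I'm doing now (the bucket paths and HailContext setup are just illustrative):

```python
from hail import HailContext

hc = HailContext()

# one single-sample VCF per individual (paths are illustrative)
vcf_paths = [
    's3://my-bucket/cohort/sample_0001.vcf.bgz',
    's3://my-bucket/cohort/sample_0002.vcf.bgz',
    # ... one path per individual
]

# import each VCF into its own VDS
vdses = [hc.import_vcf(path) for path in vcf_paths]

# join() only takes one other dataset, so fold over the list pairwise
combined = vdses[0]
for vds in vdses[1:]:
    combined = combined.join(vds)
```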
Hi Oron,
Unfortunately, this isn’t going to be easy. There are two problems:
The join function (which is named terribly, by the way; it's really a union_samples function) only accepts two datasets, and it does work proportional to the number of samples being joined, so folding N VCFs together with N - 1 pairwise joins does O(N**2) total work (there's a toy calculation of this below). That's obviously unacceptable, and we need to rethink it for 0.2.
Join does an inner join, so most likely you'd end up with no variants at all after doing this*! See the recent thread "Joining Variant Datasets - Missed variant in inner join - Outer join" for a bit of discussion on why it's not trivial to join VCFs with non-overlapping samples and variants.
*if your VCFs are what I think they are, namely gVCFs with the reference-block information stripped out.
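To make the O(N**2) point concrete: under the simplifying assumption that each join does work proportional to the number of samples in its two inputs, folding N single-sample VCFs together one at a time touches roughly 2 + 3 + ... + N, which is about N**2 / 2 samples in total. A toy calculation (plain Python, no Hail involved):

```python
# toy cost model: assume each join() does work proportional to the
# number of samples in the two datasets being joined
def sequential_join_cost(n_vcfs):
    cost = 0
    samples_so_far = 1              # start from the first single-sample VDS
    for _ in range(n_vcfs - 1):
        cost += samples_so_far + 1  # accumulator samples + one new sample
        samples_so_far += 1
    return cost

for n in (10, 100, 1000, 10000):
    print(n, sequential_join_cost(n))  # grows like n**2 / 2
```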
We should really have a way to import a list of gVCFs. I think Cloudera was thinking about contributing this at one point, but I don't know the status there.