Loading many datasets into single VDS?

tpoterba · January 25, 2018, 10:53am

Hi Oron,
Unfortunately, this isn’t going to be easy. There are two problems:

The join function (which is named terribly by the way, it’s really a union_samples function) only accepts two datasets, and does work proportional to the samples joined, meaning that doing N joins for N VCFs does O(N**2) work! This is obviously unacceptable and we need to rethink it for 0.2.
Join does an inner join, so most likely you’d end up with no variants after doing this*! See here for a bit of recent discussion on why it’s not trivial to join VCFs with non-overlapping samples and variants: Joining Variant Datasets - Missed variant in inner join - Outer join

*if your VCFs are what I think they are, which is a gVCF with stripped reference-block information.

We should really have a way to import a list of gVCFs. I think Cloudera was thinking about contributing this at one point but don’t know the status there.

Topic		Replies	Views
Adding samples incrementally Feature Requests	4	2331	February 27, 2025
Join VDS's with same samples? Help [0.1]	1	872	October 23, 2017
Import multiple plink files Help [0.1]	12	1388	October 9, 2017
Combine multiple vcf files Hail Query & hailctl	1	1367	November 30, 2020
Combining VDSs with same samples Development	2	381	October 5, 2022

Loading many datasets into single VDS?

Related topics