Import multi sample vcf

We have 100k single-sample VCFs. These are plain VCFs, not GVCFs, and they are not merged: each file contains exactly one sample. We need an “append” function to build a merged dataset (MatrixTable or VDS) and run quality control on it.
Can you please suggest a solution?

Can Hail handle this volume, 100k or more single-sample VCFs (not GVCFs)? Each VCF is one sample; we do not have merged VCFs.

You can use Hail to do this. You’ll need to:

  1. Calculate a good partitioning for the files. (Look at MatrixTable._calculate_new_partitions)
  2. Import some number of VCFs using that partitioning.
  3. Use MatrixTable.full_outer_join to combine those MTs into one MT.
  4. Write that MT to a file.
  5. Repeat until you have no VCFs remaining.
  6. Recursively read and use full_outer_join until you have one MT.
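
The control flow of steps 2 through 6 can be sketched in plain Python. This is only an illustration of the batching and repeated merging, not Hail code: `merge_in_rounds`, `import_batch`, and `join_batch` are hypothetical names, and in a real pipeline the two callbacks would wrap the Hail import, outer-join, and write steps listed above and return the path of the written MT. The stand-in callbacks here use Python sets so the sketch runs without Hail.

```python
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")  # an input, e.g. a VCF path
B = TypeVar("B")  # an intermediate, e.g. the path of a written MT


def chunks(items: Sequence, size: int) -> List[list]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 2:
        raise ValueError("size must be at least 2 so every round shrinks the list")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def merge_in_rounds(
    inputs: Sequence[A],
    batch_size: int,
    import_batch: Callable[[List[A]], B],  # hypothetical: import a batch, join it, write it
    join_batch: Callable[[List[B]], B],    # hypothetical: read intermediates, join, write
) -> B:
    """Merge inputs batch by batch, then merge the intermediates until one is left."""
    if not inputs:
        raise ValueError("nothing to merge")
    # Steps 2-5: one intermediate per batch of inputs.
    current = [import_batch(batch) for batch in chunks(inputs, batch_size)]
    # Step 6: keep joining intermediates until a single result remains.
    while len(current) > 1:
        current = [join_batch(batch) for batch in chunks(current, batch_size)]
    return current[0]


# Stand-in demo: an "intermediate" is just the set of files it contains.
vcf_paths = [f"sample_{i}.vcf" for i in range(10)]
merged = merge_in_rounds(
    vcf_paths,
    batch_size=3,
    import_batch=lambda batch: set(batch),
    join_batch=lambda batch: set().union(*batch),
)
print(len(merged))
```

With a batch size of b, n files need roughly log_b(n) rounds, so a larger batch size means fewer intermediate writes at the cost of heavier individual joins.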

We don’t have a standalone function for this because there’s no scientifically valid way to decide what to do at variants that are missing from one or more files.

Sounds good. Can it handle large volumes of VCFs, on the order of 100k?

The Hail VDS Combiner has been used to jointly call 955,000 GVCF files. A similar approach (the one described above) should work for project VCF files; however, I have never specifically tried that.

Can it be assumed that no variant call was made at that locus? Having such a function would be beneficial, especially when one is trying to combine several samples with not much history, for example to perform a meta-analysis.

Hey @hail_q !

As I understand it, the best you can do is assign those samples “NA” or some other representation of “missing information”. I think of this as different from “no call” (even though they’re both likely represented with an NA value) because a “no call” means that we did not get enough reads to decide on a particular call. In contrast, these new NAs mean “we have no idea how many reads are here”.

Put another way: a no call comes with overall depth, allele depth, and likelihoods. This new NA comes with no metadata. For all we know, that NA could be a very confident homozygous reference.
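
A toy example may make the distinction concrete. The sketch below is plain Python, not Hail: `Call`, `outer_join`, and the sample data are hypothetical and only model the idea. A true no call still carries metadata (here just depth), while a variant that is absent from a sample's file becomes an entry with nothing attached at all.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Variant = Tuple[str, int, str, str]  # (contig, position, ref, alt)


@dataclass(frozen=True)
class Call:
    genotype: Optional[str]  # None models a no call ("./." in a VCF)
    depth: int               # overall read depth, present even for a no call


def outer_join(
    samples: Dict[str, Dict[Variant, Call]]
) -> Dict[Variant, Dict[str, Optional[Call]]]:
    """Union the variants of all samples; absent entries become None."""
    all_variants = sorted(set().union(*(calls.keys() for calls in samples.values())))
    return {
        variant: {name: calls.get(variant) for name, calls in samples.items()}
        for variant in all_variants
    }


v1 = ("chr1", 1000, "A", "G")
v2 = ("chr1", 2000, "C", "T")
samples = {
    # s1 has a real no call at v2: too few reads, but the depth is recorded.
    "s1": {v1: Call("0/1", 30), v2: Call(None, 2)},
    # s2's file simply has no line for v2.
    "s2": {v1: Call("1/1", 25)},
}
merged = outer_join(samples)

print(merged[v2]["s1"])  # a no call that still carries its depth
print(merged[v2]["s2"])  # None: nothing is known about s2 at this site
```

Nothing in the merged data says whether s2 was a confident homozygous reference at `v2` or had no coverage there, which is why filling such entries with a reference call would be an assumption rather than an observation.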

I’m not a statistical geneticist, but my understanding is that this new kind of NA complicates quality control efforts.