OK, I’ve been meaning to write most of this out anyway, so this is a good chance to do that. I start with some background and definitions before giving you practical advice.
We view sequencing as a series of transformations of data:
DNA ==sequencer=> FASTQ ==alignment=> BAM/CRAM ==variant caller=> GVCF ==joint caller=> PVCF
The FASTQ, BAM/CRAM, and GVCF datatypes all contain one sample per file. A PVCF contains zero or more samples per file.
The conversion from BAM/CRAM to GVCF is lossy. The reads are replaced by a set of statistics at each locus. For example, every locus will have a depth (DP) representing the number of reads that overlapped that locus. Every locus will also have an allele depth (AD) representing, per-allele, the number of reads that were evidence for that allele at that locus. For example, at a heterozygous locus, you might have a depth of 30, a reference allele depth of 14, and an alternate allele depth of 15. Notice that there was one so-called “uninformative” read: it overlapped the locus but did not clearly support either allele.
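To make the DP/AD relationship concrete, here is a toy sketch in plain Python (the read labels and counts are invented for illustration; real callers work from alignments, not strings):

```python
# Toy model of one locus: 30 overlapping reads, one of which supports
# neither allele cleanly ("uninformative").
reads = ["ref"] * 14 + ["alt"] * 15 + ["uninformative"]

dp = len(reads)                        # DP: every read overlapping the locus
ad = {
    "ref": reads.count("ref"),         # reads supporting the reference allele
    "alt": reads.count("alt"),         # reads supporting the alternate allele
}
uninformative = dp - sum(ad.values())  # reads counted in DP but not in AD

print(dp, ad, uninformative)           # 30 {'ref': 14, 'alt': 15} 1
```

This is why the AD values at a locus often sum to slightly less than DP.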
Moreover, the GVCF format is sparse. Instead of representing every locus with one line of the file, intervals of homozygous reference loci are represented as a single line. These intervals are called “reference blocks”. Reference blocks have no alternate alleles, so there is no allele depth (AD) or phred-scaled likelihood (PL). Instead there is usually an overall depth (DP) and a genotype quality (GQ). The latter indicates our confidence in the homozygous reference calls in this interval. For example, if we only see five overlapping reads, we might not be confident that these loci are truly homozygous reference.
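As a concrete illustration, here is roughly what a reference block line and a variant line look like in a GVCF (the coordinates, sample name, and all values are made up; the `<NON_REF>` symbolic allele and the `END` field follow the GATK GVCF convention):

```
#CHROM  POS    ID  REF  ALT          QUAL  FILTER  INFO       FORMAT          SAMPLE1
chr1    10001  .   T    <NON_REF>    .     .       END=10468  GT:DP:GQ        0/0:38:99
chr1    10469  .   C    G,<NON_REF>  45    .       .          GT:AD:DP:GQ:PL  0/1:14,15,0:30:45:45,0,520,87,564,651
```

The first line covers the whole interval chr1:10001-10468 with a single homozygous reference call; the second line is an ordinary variant record with full AD and PL metadata.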
Homozygous reference quality information is critical. If we entirely elided those reference blocks, we would be unable to confidently say that the sample is homozygous reference at that locus. The sample could be either homozygous reference or “no call”. The latter is a form of missing data: we did not reliably observe the genotype at that locus for this sample.
A PVCF is always created from two or more GVCFs. The conversion from GVCF to PVCF is also lossy because the reference blocks are eliminated. This is not a problem for GWAS because the phenotype, which varies per sample, cannot correlate with a locus lacking variation.
The loss of reference blocks prevents adding new samples to an existing PVCF or combining two PVCFs. In either case, there may exist a variant at which one dataset has a variant record and the other dataset has no record at all. In that case, for samples without a record, we know only that they have either a homozygous reference call or “no call” (aka unknown genotype aka missing data).
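The ambiguity above can be sketched with a toy merge of two per-variant genotype maps (this is purely illustrative pseudo-data, not how any real tool represents a PVCF):

```python
# Two "PVCFs", each with one variant record and one sample.
pvcf_a = {("chr1", 100): {"s1": "0/1"}}  # no record for s2 at chr1:100
pvcf_b = {("chr1", 200): {"s2": "1/1"}}  # no record for s1 at chr1:200

# Naively union the records.
merged = {}
for pvcf in (pvcf_a, pvcf_b):
    for variant, calls in pvcf.items():
        merged.setdefault(variant, {}).update(calls)

# Without reference blocks, a sample with no record at a variant could be
# homozygous reference OR "no call" -- we are forced to mark it missing.
for variant, calls in merged.items():
    for sample in ("s1", "s2"):
        calls.setdefault(sample, "./.")
```

After the merge, s2 at chr1:100 and s1 at chr1:200 are both `./.` (missing), even though most such samples are, in reality, homozygous reference.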
In terms of bits-and-bytes, GVCF and PVCF are both valid VCFs though each has a different set of conventions. Most tools cannot process every valid VCF. Instead, they make assumptions about the VCF structure and only work on VCFs that satisfy those assumptions. There is a detailed specification of VCF version 4.3. There are also a number of descriptions of GVCF: Broad 1, Broad 2, and Illumina.
There is more to say about PVCFs, but let me first address PLINK and BGEN. There are actually two PLINK formats: PLINK1 (bed) and PLINK2 (pgen). I am not too familiar with either format, but I understand that they are meant to store “hard calls”, meaning just the genotypes or the dosages. In particular, they cannot store sequencing metadata such as DP, AD, and PL. I believe PLINK2 supports multi-allelic sites. There are several versions of BGEN, but they are all essentially designed to store genotype probabilities. BGEN 1.2 and later support multi-allelic variants. Again, as far as I know, there is no support for arbitrary sequencing metadata fields.
We already noted that two PVCFs cannot be combined, but there is yet another problem: the typical representation of sequencing data in a PVCF grows super-linearly in the number of samples. In particular, at multi-allelic sites, each new sample often adds another alternate allele. As a result, the AD array grows roughly linearly in the number of samples. Moreover, the PL array, which has an entry for every possible pair of alleles, grows roughly quadratically in the number of samples. The true relationship is a bit complicated, but empirically we have observed super-linear scaling of PVCF size in the number of samples.
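The per-genotype array sizes follow directly from the VCF spec: for a diploid genotype over n alleles (REF plus alternates), AD has n entries and PL has one entry per unordered pair of alleles, i.e. n(n+1)/2 entries. A quick sketch:

```python
def pl_length(n_alleles: int) -> int:
    """Number of PL entries for a diploid genotype over n_alleles alleles:
    one per unordered pair of alleles, i.e. n * (n + 1) / 2."""
    return n_alleles * (n_alleles + 1) // 2

# As samples accumulate alternate alleles at a multi-allelic site,
# AD grows linearly and PL quadratically:
for n in [2, 4, 8, 16]:
    print(f"alleles={n:2d}  AD entries={n:2d}  PL entries={pl_length(n):3d}")
```

So a site that accumulates 16 alleles across a large cohort carries 136 PL entries per sample, which is where the quadratic blow-up comes from.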
GLNexus mitigates the size problem by eliding the PL array for homozygous reference calls. I’m not sure what other joint-callers do. In our experience, a PVCF with real PLs of 100,000 samples is prohibitively large. So large that, as far as we know, no tool can produce such a PVCF. There is a stalled discussion about modifying the VCF spec to encode AD and PL in O(1) space in the samples.
Hail has three fundamental data types: the Table, the MatrixTable, and the BlockMatrix. A Table is what it sounds like. A BlockMatrix is for linear algebra. The MatrixTable is, essentially, a totally generic version of a VCF. When you run import_vcf, you get a MatrixTable. MatrixTables are intended for use with “dense” data, such as normal PVCFs.
On top of these fundamental data types, we have built the Variant Dataset or VDS. A VDS is combinable: one or more VDSes may be combined with one or more GVCFs to cheaply produce a new VDS. A VDS succinctly represents AD, PL, and other sequencing metadata. The size of a VDS grows linearly in the samples and variants.
Generating a VDS from exomes costs 0.01 USD per exome on Google Dataproc, using autoscaling policies. Generating a VDS from genomes costs 0.10 USD per genome.
If you do not intend to add new samples to a dataset and your dataset contains fewer than 10,000 samples, you should generate a PVCF. This representation is dense and therefore easier to use.
If someone has already produced a PVCF for you, you should use that. But be aware! Because of the issues stated above, some tools or methods (including those in Hail!) might not work correctly because of missing fields.
If your goal is a GWAS and someone you trust has already produced a set of high-quality variants and samples, I recommend filtering the BGEN or PLINK file to those samples and variants and using that for the GWAS.
If you want to do your own QC or do something different than GWAS, then Hail might be a good tool. You should use the PVCF that already exists.
Working in the cloud is really challenging. I have never used AWS for analysis, so my suggestions are necessarily vague. Make sure to use an auto-scaling cluster. Start small, with just a handful of variants, until you feel confident in your analysis. Do not convert a BGEN file to a Hail MatrixTable: BGEN is a more succinct format than Hail's MatrixTable format, and Hail has great support for BGEN. I wrote more details about using Hail in the cloud here.