VCF parsing: Hail 0.2 vs cyvcf2 vs pysam

Hello guys,

I am developing a module to parse VCF files. Currently, I am using a joint VCF file (which merges the data from all individual VCF files).
After parsing, I push the data to an Elasticsearch database. Ultimately, I want to develop a website similar to an existing cancer genomics portal, with filters over fields such as `genes.is_cancer_gene_census`.
I used cyvcf2 for the parsing module, and I can now extract all the information I need. But the performance of the cyvcf2-based module is not good enough, so I am considering Hail 0.2 as an alternative, because I found out that gnomAD used Hail 0.2 to parse VCF files. I have some questions:

  1. Should I work on the joint VCF file (we have 1000 samples in total), or should I work on the individual VCF files?
  • If I work on the joint file, it is easier to parse all the necessary information, but I have to wait until I have finished collecting and running the analysis pipelines for all 1000 samples, and the joint file may be very big.
  • If I work on individual VCF files, I have to deal with duplicated information, so I have to query the Elasticsearch database to check whether an entry already exists before indexing it, which raises performance issues.
  2. Should I use Hail 0.2, cyvcf2, pysam, or write my own VCF parsing code?

Could you please help me answer the above questions. Thanks in advance!
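On the deduplication concern in (1), one common alternative to querying Elasticsearch before every index operation is to derive a deterministic document `_id` from the variant itself, so re-indexing the same variant from another sample's VCF overwrites the existing document instead of duplicating it. A minimal sketch (the key fields and hashing scheme here are illustrative assumptions, not part of any of the libraries discussed):

```python
import hashlib


def variant_doc_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Derive a stable document ID from the variant key.

    Indexing with a deterministic _id means Elasticsearch replaces the
    existing document on re-index, so no existence query is needed.
    """
    key = f"{chrom}:{pos}:{ref}:{alt}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# The same variant seen in two per-sample VCFs maps to the same ID.
a = variant_doc_id("1", 12345, "A", "T")
b = variant_doc_id("1", 12345, "A", "T")
assert a == b
```

With the bulk indexing API you would pass this value as the `_id` of each action, turning the "check then insert" round trip into a single idempotent write.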

It’s pretty hard for me to answer (1), since it seems that this is a decision about which tradeoffs to choose – using the joined VCF will make things easier for you, as you point out, but will mean that you cannot add samples as they come in. Perhaps this will depend on (2), though.

To your question in (2), I think it’s first important to establish that Hail as a library fits a rather different use case than cyvcf2 and pysam. Those libraries handle only parsing, returning VCF components as Python objects for you to manipulate. Hail, however, is a dataframe library like pandas/dplyr/pyspark, which lazily executes full pipelines. Using Hail just to parse a VCF and return Python objects isn’t going to be easy or perform well.
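To make the contrast concrete: the cyvcf2/pysam style is eager, record-at-a-time parsing into Python objects. A stdlib-only sketch of that style (a hand-rolled toy, not cyvcf2’s actual API, and ignoring multi-allelic and FORMAT subtleties):

```python
import gzip
from typing import Iterator


def parse_vcf_records(path: str) -> Iterator[dict]:
    """Eagerly stream one parsed record per data line, cyvcf2-style."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            chrom, pos, _vid, ref, alt, _qual, _flt, info = (
                line.rstrip("\n").split("\t")[:8]
            )
            yield {
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt.split(","),  # ALT may list multiple alleles
                "info": dict(
                    kv.split("=", 1) if "=" in kv else (kv, True)
                    for kv in info.split(";")
                ),
            }
```

Each record is materialized as a Python object for you to push into Elasticsearch yourself; there is no pipeline for a library to see or optimize, which is exactly the execution-model difference described above.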

However, Hail has a few major benefits allowed by the interface/execution model:

  • Scalability. Hail can take advantage of a cluster, while cyvcf2 and pysam are single-machine tools (single-threaded, from a cursory look at the docs). You may be able to stream through files with these libraries, but trying to put a VCF of 1k WGS in memory isn’t going to go well.
  • Full-pipeline optimization. Because Hail executes pipelines lazily, it can apply transformations/optimizations to the entire pipeline rather than each component. This means Hail can maintain good performance when combining complicated library functions that may be expensive to naively execute in sequence.
  • Hail actually has an experimental export_elasticsearch method that can dump a Hail Table into ES.
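As a toy illustration of the second point: composing lazy stages defers all work until a final consumer runs, so filters and projections fuse into a single pass over the data. This is plain Python generators mimicking the idea; Hail’s actual optimizer rewrites whole query plans, which generators do not, and the record schema below is made up for the example.

```python
def records(n):
    """Pretend source of n parsed variant records."""
    for i in range(n):
        yield {"pos": i, "qual": i % 100}


def high_quality(stream, min_qual):
    """Lazy filter stage: nothing runs until the stream is consumed."""
    return (r for r in stream if r["qual"] >= min_qual)


def positions(stream):
    """Lazy projection stage."""
    return (r["pos"] for r in stream)


# The three stages execute fused, one record at a time, only when
# list() finally pulls on the pipeline; the dataset is never held
# in memory as a whole.
pipeline = positions(high_quality(records(1_000), 90))
first_ten = list(pipeline)[:10]
```

The same shape scales badly on one machine once the source is 1000 whole genomes, which is where Hail’s cluster execution comes in.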

Thanks @tpoterba. Could you please let me know which one was used to parse VCF files and extract the data for the gnomAD website? I found this script; was it used for VCF parsing?

The difficult work for gnomAD was not parsing VCF files, it was performing QC, computing hundreds (thousands?) of summary metrics from the data, and making that data accessible in an intuitive browser.

Hail is an excellent tool for doing work similar to the gnomAD QC and summary metric generation; it was designed in part for this project’s needs.

The browser team had a different challenge: taking the sites VCFs / Hail tables and making that data accessible in the gnomAD browser. Here the data is much smaller and more manageable, so the parsing strategy doesn’t really matter. I think they’re avoiding VCF altogether now, though, in favor of using Hail tables generated by the analysis team and dumping those straight into the browser databases.