VCF parsing: Hail 0.2 vs cyvcf2 vs pysam

Hello guys,

I am developing a module to parse VCF files. Currently, I am using a joint VCF file (which merges the data from all individual VCF files).
After parsing, I push the data to an Elasticsearch database. Ultimately, I want to develop a website similar to an existing cancer genomics portal, with filters over fields such as `genes.is_cancer_gene_census`.
I used cyvcf2 for the parsing module, and I can now extract all the information I need. But the performance of the cyvcf2-based module is not good enough, so I am considering Hail 0.2 as an alternative, because I found out that gnomAD used Hail 0.2 to parse VCF files. I have some questions:

  1. Should I work on the joint VCF file (we have 1000 samples in total), or should I work on the individual VCF files?
  • If I work on the joint file, it is easier to parse all the necessary information, but I have to wait until I have finished collecting and running the analysis pipelines for all 1000 samples, and the joint file may be very big.
  • If I work on individual VCF files, I have to deal with duplicated information, so I have to query the Elasticsearch database to check whether an entry already exists before indexing it, which raises performance issues.
  2. Should I use Hail 0.2, cyvcf2, pysam, or write my own VCF parsing code?

Could you please help me answer the above questions. Thanks in advance!
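On the deduplication concern in (1), one common alternative to querying Elasticsearch before every index operation is to derive a deterministic document `_id` from the variant itself, so re-indexing the same variant from another sample's VCF overwrites the existing document instead of duplicating it. A minimal sketch (the key fields and hashing scheme here are illustrative assumptions, not part of any of the libraries discussed):

```python
import hashlib


def variant_doc_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Derive a stable document ID from the variant key.

    Indexing with a deterministic _id means Elasticsearch replaces the
    existing document on re-index, so no existence query is needed.
    """
    key = f"{chrom}:{pos}:{ref}:{alt}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# The same variant seen in two per-sample VCFs maps to the same ID.
a = variant_doc_id("1", 12345, "A", "T")
b = variant_doc_id("1", 12345, "A", "T")
assert a == b
```

With the bulk indexing API you would pass this value as the `_id` of each action, turning the "check then insert" round trip into a single idempotent write.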

It’s pretty hard for me to answer (1), since it seems that this is a decision about which tradeoffs to choose – using the joined VCF will make things easier for you, as you point out, but will mean that you cannot add samples as they come in. Perhaps this will depend on (2), though.

To your question in (2), I think it’s first important to establish that Hail as a library fits a rather different use case than cyvcf2 and pysam. Those libraries handle only parsing, returning VCF components as Python objects for you to manipulate. Hail, however, is a dataframe library like pandas/dplyr/pyspark, which lazily executes full pipelines. Using Hail just to parse a VCF and return Python objects isn’t going to be easy or perform well.
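To make the contrast concrete: the cyvcf2/pysam style is eager, record-at-a-time parsing into Python objects. A stdlib-only sketch of that style (a hand-rolled toy, not cyvcf2’s actual API, and ignoring multi-allelic and FORMAT subtleties):

```python
import gzip
from typing import Iterator


def parse_vcf_records(path: str) -> Iterator[dict]:
    """Eagerly stream one parsed record per data line, cyvcf2-style."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            chrom, pos, _vid, ref, alt, _qual, _flt, info = (
                line.rstrip("\n").split("\t")[:8]
            )
            yield {
                "chrom": chrom,
                "pos": int(pos),
                "ref": ref,
                "alt": alt.split(","),  # ALT may list multiple alleles
                "info": dict(
                    kv.split("=", 1) if "=" in kv else (kv, True)
                    for kv in info.split(";")
                ),
            }
```

Each record is materialized as a Python object for you to push into Elasticsearch yourself; there is no pipeline for a library to see or optimize, which is exactly the execution-model difference described above.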

However, Hail has a few major benefits allowed by the interface/execution model:

  • Scalability. Hail can take advantage of a cluster, while cyvcf2 and pysam are single-machine tools (single-threaded, from a cursory look at the docs). You may be able to stream through files with these libraries, but trying to put a VCF of 1k WGS in memory isn’t going to go well.
  • Full-pipeline optimization. Because Hail executes pipelines lazily, it can apply transformations/optimizations to the entire pipeline rather than each component. This means Hail can maintain good performance when combining complicated library functions that may be expensive to naively execute in sequence.
  • Hail actually has an experimental export_elasticsearch method that can dump a Hail Table into ES.
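As a toy illustration of the second point: composing lazy stages defers all work until a final consumer runs, so filters and projections fuse into a single pass over the data. This is plain Python generators mimicking the idea; Hail’s actual optimizer rewrites whole query plans, which generators do not, and the record schema below is made up for the example.

```python
def records(n):
    """Pretend source of n parsed variant records."""
    for i in range(n):
        yield {"pos": i, "qual": i % 100}


def high_quality(stream, min_qual):
    """Lazy filter stage: nothing runs until the stream is consumed."""
    return (r for r in stream if r["qual"] >= min_qual)


def positions(stream):
    """Lazy projection stage."""
    return (r["pos"] for r in stream)


# The three stages execute fused, one record at a time, only when
# list() finally pulls on the pipeline; the dataset is never held
# in memory as a whole.
pipeline = positions(high_quality(records(1_000), 90))
first_ten = list(pipeline)[:10]
```

The same shape scales badly on one machine once the source is 1000 whole genomes, which is where Hail’s cluster execution comes in.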

Thanks @tpoterba. Could you please let me know which one was used to parse VCF files and extract the data for the gnomAD website? I found this script; was it used for VCF parsing?

The difficult work for gnomAD was not parsing VCF files, it was performing QC, computing hundreds (thousands?) of summary metrics from the data, and making that data accessible in an intuitive browser.

Hail is an excellent tool for doing work similar to the gnomAD QC and summary metric generation; it was designed in part for this project’s needs.

The browser team had a different challenge: taking the sites VCFs / Hail tables and making that data accessible in the gnomAD browser. Here the data is much smaller and more manageable, so the parsing strategy doesn’t really matter. I think they’re avoiding VCF altogether now, though, in favor of using Hail tables generated by the analysis team and dumping those straight into the browser databases.