Optimise querying code

nonchev · September 14, 2020, 11:28am

Hello,

I tried several different approaches for my project firstly with MatrixTable, but the performance was bad.
Now I am trying to work only with Tables. My pipeline has 2 steps.

step 1: injest vcf, i.e import vcf files into Tables and store them in a folder(table per vcf approach)
step 2: querying Tables to get GT, GQ or DP fields for given set of variants and samples:

code.txt (1.6 KB)
or

Can you look into this code snippet and give me an advice/suggestion how can I improve/optimise the code, because I am disappointed by its performance?
note: I tried to add checkpoints or .persist() but the different wasn’t significant, but maybe they were on the wrong place…
Thanks in advance!

Cheers,

johnc1231 · September 14, 2020, 1:56pm

I don’t understand what you’re doing yet. You make a variant table from pandas, then a sample table from pandas, then you join them together? Are either of these tables keyed? If not, I’m pretty sure join is just going to do a cartesian product, which doesn’t seem like what you’d want?

tpoterba · September 14, 2020, 2:02pm

Knowing the input here would help a lot. What is var_df?

I think one of the core problems, though, is that this pipeline scales with the square of the number of samples. If you find yourself using Python for loops to iteratively filter and collect pieces of a Hail table or matrix table, this is a red flag. the line

res_np.append(np.array(res_table.filter(res_table.s==sample)[field].collect()).reshape(-1, 1))

Requires a full pass over the table to filter to rows that have the sample of interest. You’re doing this once per sample, hence quadratic scaling.

The way you’re iterating to build a union table is also concerning, I think – there may be a better way. What are these ht_vcf inputs? These are from VCFs that represent the same samples, but different variants?

nonchev · September 14, 2020, 2:19pm

var_df is the input from the user: dataframe with variants
Screenshot 2020-09-14 at 17.06.18

samples:
Screenshot 2020-09-14 at 17.07.52

here I am building the cartesian product between between samples and variants:

ht_vcf - a vcf file which can contain variants for single or multiple samples. ht_paths are the different vcf files which contain information for my samples of interest.

ht_n :

nonchev · September 14, 2020, 2:25pm

Story:
A user gives his variants and samples of interest, and I want to query my vcf files(stored as tables) and return the GT,DP or GQ field (0 if the variant is not present for this sample) (get_geneotype calls _get(…, ‘GT’))

danking · September 14, 2020, 2:40pm

Computing the Cartesian product of samples and variants will not perform well. It looks like you’re trying to produce a numpy array of genotypes. You can do that with this:

import hail as hl
import numpy as np

mt = hl.import_vcf(...)
mt = mt.filter_cols(hl.is_defined(ht_samples[mt.s]))
ht = ht.key_by(locus=hl.locus(ht.chrom, ht.pos), alleles=[ht.ref, ht.alt])
mt = mt.filter_rows(hl.is_defined(ht[mt.row_key]))

mt = mt.select_entries(GT=mt.GT.n_alt_alleles())
mt_as_list = mt.localize_entries('entries_as_hail_array').entries_as_hail_array.GT.collect()
numpy_gts = np.array(mt_as_list)

You usually can’t combine multiple VCFs unless they’re gVCFs.

johnc1231 · September 14, 2020, 2:57pm

One more thing though: If you are going to be querying these same files a lot, you’re likely going to want to store them as a MatrixTable. hl.read_matrix_table is faster than import_vcf.

nonchev · September 14, 2020, 3:57pm

So in the end of the day your point is that in my case where I want to query multiple vcfs(They are not gvcfa) in order to combine them and return the GT field, hail is not the optimal solution?

Hails conception is to work and gain maximal performance with a single vcf file or gvcf?

tpoterba · September 14, 2020, 4:06pm

Hail is not designed to have good performance in the particular case your code implements. I think it’s possible to design a fast solution here, however, by combining the VCFs first, and querying a matrix table that. contains all samples.

Topic		Replies	Views
Poor performance for QC filtering on medium sized genotype data Hail Query & hailctl	20	2167	February 8, 2020
Is hail a good option for simple querying tasks on a large dataset (using as a "db")? Hail Query & hailctl	4	360	May 15, 2023
Export VCF taking a long time, even when running in parallel Hail Query & hailctl	3	455	December 5, 2023
Working with large VCFs (e.g. from UK Biobank) is slow Hail Query & hailctl	12	1913	August 23, 2024
Variant Annotation Table Merge? Hail Query & hailctl	2	82	April 15, 2025

Optimise querying code

Related topics