Optimise querying code


I tried several different approaches for my project firstly with MatrixTable, but the performance was bad.
Now I am trying to work only with Tables. My pipeline has 2 steps.

step 1: injest vcf, i.e import vcf files into Tables and store them in a folder(table per vcf approach)
step 2: querying Tables to get GT, GQ or DP fields for given set of variants and samples:

code.txt (1.6 KB)

Can you look into this code snippet and give me an advice/suggestion how can I improve/optimise the code, because I am disappointed by its performance?
note: I tried to add checkpoints or .persist() but the different wasn’t significant, but maybe they were on the wrong place…
Thanks in advance!


I don’t understand what you’re doing yet. You make a variant table from pandas, then a sample table from pandas, then you join them together? Are either of these tables keyed? If not, I’m pretty sure join is just going to do a cartesian product, which doesn’t seem like what you’d want?

Knowing the input here would help a lot. What is var_df?

I think one of the core problems, though, is that this pipeline scales with the square of the number of samples. If you find yourself using Python for loops to iteratively filter and collect pieces of a Hail table or matrix table, this is a red flag. the line

res_np.append(np.array(res_table.filter(res_table.s==sample)[field].collect()).reshape(-1, 1))

Requires a full pass over the table to filter to rows that have the sample of interest. You’re doing this once per sample, hence quadratic scaling.

The way you’re iterating to build a union table is also concerning, I think – there may be a better way. What are these ht_vcf inputs? These are from VCFs that represent the same samples, but different variants?

var_df is the input from the user: dataframe with variants
Screenshot 2020-09-14 at 17.06.18

Screenshot 2020-09-14 at 17.07.52

here I am building the cartesian product between between samples and variants:

ht_vcf - a vcf file which can contain variants for single or multiple samples. ht_paths are the different vcf files which contain information for my samples of interest.

ht_n :

A user gives his variants and samples of interest, and I want to query my vcf files(stored as tables) and return the GT,DP or GQ field (0 if the variant is not present for this sample) (get_geneotype calls _get(…, ‘GT’))

Computing the Cartesian product of samples and variants will not perform well. It looks like you’re trying to produce a numpy array of genotypes. You can do that with this:

import hail as hl
import numpy as np

mt = hl.import_vcf(...)
mt = mt.filter_cols(hl.is_defined(ht_samples[mt.s]))
ht = ht.key_by(locus=hl.locus(ht.chrom, ht.pos), alleles=[ht.ref, ht.alt])
mt = mt.filter_rows(hl.is_defined(ht[mt.row_key]))

mt = mt.select_entries(GT=mt.GT.n_alt_alleles())
mt_as_list = mt.localize_entries('entries_as_hail_array').entries_as_hail_array.GT.collect()
numpy_gts = np.array(mt_as_list)

You usually can’t combine multiple VCFs unless they’re gVCFs.

One more thing though: If you are going to be querying these same files a lot, you’re likely going to want to store them as a MatrixTable. hl.read_matrix_table is faster than import_vcf.

So in the end of the day your point is that in my case where I want to query multiple vcfs(They are not gvcfa) in order to combine them and return the GT field, hail is not the optimal solution?

Hails conception is to work and gain maximal performance with a single vcf file or gvcf?

Hail is not designed to have good performance in the particular case your code implements. I think it’s possible to design a fast solution here, however, by combining the VCFs first, and querying a matrix table that. contains all samples.