Table.annotate takes a while

Hello,

I’m not sure what’s going on, as this is my first time using Hail, but the issue I ran into was trying to annotate a VCF file with some gnomAD data.

gnomAD_raw = hl.read_table('/Users/irenaeuschan/Documents/Irenaeus/Scripts/gnomAD/gnomad.exomes.r2.1.1.sites.ht')
vcf = hl.import_vcf(vcf_file, reference_genome='GRCh37')
vcf.write(base_name + ".mt", overwrite=True)
mt = hl.read_matrix_table(base_name + ".mt")
mt = mt.distinct_by_row()
t = mt.make_table(separator='_')

Then annotating:

ht_annotate = t.annotate(gnomad_age_dist = gnomAD_raw[t.locus, t.alleles]['age_hist_het']['bin_freq'][age_index])
ht_annotate.export("output/annotated.tsv")

The issue here is at the annotate step: my VCF file contains 15 rows of data (I’m testing with a smaller dataset for now), but I have been stuck on this annotation stage for an hour, which doesn’t seem to make sense given the small input. Furthermore, once the annotate finally finishes, the export takes even longer.

Is there a reason this is happening, or am I missing something in how I am writing the program?

Thanks

The reason this is taking a while is that, while your VCF is small, the gnomAD dataset is very large. I think the entire gnomAD table is getting read to execute this join. Put another way, the execution is O(N + M), where N is the size of your VCF and M is the size of gnomAD.

This is actually a tricky use case to optimize properly (you want to read only those 15 rows from the gnomAD table), but we should have some infrastructure improvements to speed up this pipeline at some point in the medium-term future. Hail is designed more for joining two similarly large inputs, so if you run this on a larger VCF you won’t see unexpected time sinks (since you have to read both inputs fully regardless!).

One question - how many samples do you have in that VCF? make_table really doesn’t scale well, and you should try to stay in MatrixTable world if possible.
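
For example, here is a rough sketch (untested, reusing age_index from your snippet) of doing the same annotation directly on the MatrixTable, without make_table:

# Sketch: annotate the rows of the MatrixTable directly instead of flattening it.
mt = mt.annotate_rows(
    gnomad_age_dist = gnomAD_raw[mt.locus, mt.alleles].age_hist_het.bin_freq[age_index]
)
# Export just the row annotations; the samples never need to be flattened out.
mt.rows().select('gnomad_age_dist').export("output/annotated.tsv")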

Thank you for your response,

To optimize this process, would it be useful to pre-filter the gnomAD dataset, creating a smaller filtered subset? Something like…

gnomAD_filtered = gnomAD_raw.filter(hl.is_defined(t[gnomAD_raw.locus, gnomAD_raw.alleles]))

This would restrict my gnomAD dataset to only the loci and alleles found within my table, so the joins would be much faster?

One question - how many samples do you have in that VCF? make_table really doesn’t scale well, and you should try to stay in MatrixTable world if possible.

For this initial test set there are only 40 samples, but I suspect there will be over 1,000 in the actual dataset. I am using make_table in order to create the filtered version of gnomAD, and the only way I was able to make it work was with a table. I tried to filter gnomAD_raw using the MatrixTable and got an ht.distinct().select() error each time.

If you’re going to do this many times, then pre-filtering with the code above, writing to disk, and reading back the subset in the future will save a lot of time. If you’re doing this annotation once, the two are equivalent.
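
Concretely, a sketch of the write-then-read pattern (the .ht path below is a placeholder):

gnomAD_filtered = gnomAD_raw.filter(hl.is_defined(t[gnomAD_raw.locus, gnomAD_raw.alleles]))
# One-time cost: persist the small subset to disk.
gnomAD_filtered.write('/path/to/gnomad.subset.ht', overwrite=True)  # placeholder path
# Later runs read only the small subset instead of all of gnomAD.
gnomAD_subset = hl.read_table('/path/to/gnomad.subset.ht')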

Can you post what you tried on the MatrixTable? Working with an MT should be easier than with a make_table representation.

Writing to disk will be hard in our case because the annotation query will be different each time we use this segment of the code. Additionally, the issue is how long it takes: I am currently running this annotation on one of our clusters, with 20 processors, 240 GB of RAM, and 24 GB of driver memory allocated, and the annotation is taking 8+ hours as it currently stands.

I’m not sure how to make it more efficient, or whether there is a way to avoid waiting hours for the annotation to run and then for the table to export.

As for the MT:

This is what I have now:

gnomAD_filtered = gnomAD_raw.filter(hl.is_defined(t[gnomAD_raw.locus, gnomAD_raw.alleles]))

But previously, I tried to filter gnomAD_raw by doing something like:

gnomAD_filtered = gnomAD_raw.filter(hl.is_defined(gnomAD_raw[mt.locus, mt.alleles]))

which gave this error, because gnomAD_raw is a Hail Table:

Expected an expression from source <hail.table.Table object at 0x7fb8953f7320>
Found expression derived from source <hail.matrixtable.MatrixTable object at 0x7fb8951ce128>
Problematic field(s): ['locus', 'alleles']

This error is commonly caused by chaining methods together:
>>> ht.distinct().select(ht.x)

This is throwing an error because the join is flipped. Should be:

gnomAD_filtered = gnomAD_raw.filter(hl.is_defined(mt.index_rows(gnomAD_raw.key)))

or easier:

gnomAD_filtered = gnomAD_raw.semi_join(mt.rows())

The 8-hour runtime is surprising, though. I would expect this to take less than an hour.
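
As a quick sanity check (a sketch, untested), counting the filtered table forces the join and shows how many gnomAD rows actually match:

# Forces the semi_join above; for a 15-variant VCF this should print at most 15.
print(gnomAD_filtered.count())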