Can't write VEP annotations

Hi!

I am having a problem with hail.methods.vep.
Calling vep does not raise any error, but writing the result to a matrix table fails, maybe because vep was not applied successfully (even though calling hail.methods.vep on my dataset gave no error).
Also, when writing the VEP annotations, it writes only entries, index & rows, but not the other things like globals, cols and references, hence the error:

Hail version: 0.2.23-aaf52cafe5ef
Error summary: IOException: error=2, No such file or directory

One of the important things to know about Hail is that in order to support datasets that cannot fit in memory, most operations are lazy and don’t actually execute code until you write or aggregate (or otherwise push values to disk, or return them to Python, e.g. plot, show, etc).

This means that the error is definitely coming from VEP; it’s just appearing later (when the code to run VEP is executed).
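To make that concrete, here is a toy sketch in plain Python of how a lazy pipeline defers its errors. It is not Hail's implementation; LazyPipeline and fake_vep are made-up names for illustration only. The point is just that queuing the annotation step succeeds and the failure shows up when write() runs everything.

```python
class LazyPipeline:
    """Toy stand-in for a lazy table: steps are recorded, not run."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def annotate(self, func):
        # Nothing runs here; the step is only queued.
        return LazyPipeline(self.rows, self.steps + (func,))

    def write(self):
        # All queued steps finally execute, so their errors surface here.
        out = []
        for row in self.rows:
            for step in self.steps:
                row = step(row)
            out.append(row)
        return out


def fake_vep(row):
    # Hypothetical annotator that fails the way a missing VEP install might.
    raise FileNotFoundError("error=2, No such file or directory")


pipeline = LazyPipeline(["1:139120"]).annotate(fake_vep)  # no error yet

try:
    pipeline.write()
    failed_at_write = False
except FileNotFoundError as err:
    failed_at_write = True
    print("failed at write():", err)
```

The annotate call returns without complaint, which mirrors what you saw: hl.vep looked fine and the write was where the exception appeared.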

Where are you running Hail? If you’re not running on Google Dataproc with hailctl dataproc, did you install VEP yourself, and if so, how?

I am running Hail on Google Dataproc with hailctl dataproc. Is it something to do with CPU quotas? I ask because I am running on very low resources.

It shouldn’t be related to CPU quotas: we make sure VEP runs on a default-sized cluster (16 cores) as part of deployment.

Are you running with the --vep flag on hailctl dataproc start? That’s my best guess of what went wrong.

No, I am not using --vep.
I am trying to run the script below. I copied the code into a .py file and am submitting it as a job.

https://github.com/Nealelab/recessive/blob/master/Hail_%26_Export_Pipeline_Genotyped_dataset.ipynb

Ah, okay, great – that’s the problem. If you don’t start the cluster with --vep, then the necessary files aren’t installed and you can’t run hl.vep.

How can we enable --vep on the cloud when starting the cluster? Can you give an example of it? I am using the hail start cluster command …

hailctl dataproc start CLUSTER_NAME --vep GRCh37, for example

VEP is working now. But .write is taking a very long time: I waited 1 hour, and it neither gave an error nor showed any progress. What should I do now?

Hail is lazy; hl.vep didn’t actually run until write. How big is your data? How many variants/samples, and what are you doing in the pipeline? VEP is very slow, so I’d expect it to take a while on a small cluster.

The dataset is whole exome sequencing data on 100 participants. We are running it on GCP. Importing the VCF and writing it as an mt works quickly, but when we use VEP and write it, it takes forever. We are using the hailctl dataproc command that you mentioned above to launch the cluster.

VEP is able to annotate ~3 variants per second, per core. If you’re running on 12 cores (dataproc default cluster size), then you should be annotating about 36 variants per second, or roughly 129,600 per hour.
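The arithmetic behind that estimate can be sketched in a few lines of Python. The ~3 variants per second per core figure is the rough rate quoted above, not a measured benchmark, and the function names are just for illustration.

```python
VARIANTS_PER_SECOND_PER_CORE = 3  # rough rate quoted in this thread


def vep_variants_per_hour(cores, rate=VARIANTS_PER_SECOND_PER_CORE):
    """Approximate number of variants VEP annotates per hour on `cores` cores."""
    return cores * rate * 3600


def vep_hours(n_variants, cores, rate=VARIANTS_PER_SECOND_PER_CORE):
    """Approximate wall-clock hours to annotate `n_variants` variants."""
    return n_variants / vep_variants_per_hour(cores, rate)


print(vep_variants_per_hour(12))  # 129600
```

Plugging your own variant count (for example from mt.count_rows()) into vep_hours gives a ballpark for how long the write should take on a given cluster size.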

So can we increase the cores and memory? And how many should we use?

It’s hard to make a good guess about this, but I’d generally think that a cluster of ~100 cores (which would be 12 preemptible workers, -p 12 in hailctl dataproc start) would be a good idea. This cluster should cost about $3 per hour.

Also, how are you running on dataproc? Using submit, or connect with a notebook? You should be seeing a progress bar in either mode, which can help tell you how the vep/write is doing.

Can you kindly specify the full command to increase the number of cores, starting with hailctl?
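Putting together the two flags already mentioned in this thread (--vep GRCh37 and -p 12), the full command would presumably look like the one printed below. This is only a sketch that assembles the string in Python; the helper name is made up, CLUSTER_NAME is a placeholder, and it does not start a cluster.

```python
import shlex


def hailctl_start_command(cluster_name, vep_build="GRCh37", preemptible_workers=12):
    """Assemble (but do not run) a hailctl dataproc start command line."""
    args = [
        "hailctl", "dataproc", "start", cluster_name,
        "--vep", vep_build,                # install the VEP files on the cluster
        "-p", str(preemptible_workers),    # number of preemptible workers
    ]
    return " ".join(shlex.quote(a) for a in args)


print(hailctl_start_command("CLUSTER_NAME"))
# hailctl dataproc start CLUSTER_NAME --vep GRCh37 -p 12
```

Use the build that matches your data in place of GRCh37, and check hailctl dataproc start --help on your installed version in case the flag names differ.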

We are using submit. The bar was not moving, so we thought it was just not responding.

Also, this is just the practice dataset. Our real dataset is 40,000 participants with WES data. How many cores do you suggest we use in that scenario?

With 40,000 samples, you could probably use 500 or so cores and see reasonably good performance.

Note that VEP won’t be 400x slower with 40,000 people, because you won’t have 400x more variants.

The reason you’re not seeing progress is probably that with only 100 people, each partition (parallel task) in the pipeline contains many variants, so VEP is taking a long time to finish a single task.

I wanted to filter based on a locus; however it is giving me an error:

data = mt.filter_rows((mt.locus == 1:139120)),keep=True)

I have tried several permutations of the above but the error persists
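One likely cause is that 1:139120 is not valid Python on its own, so the line fails before Hail ever sees it (the parentheses are also unbalanced). A minimal sketch of one way to do it, assuming the contig is named "1" in your reference genome (as in GRCh37): build a locus expression with hl.locus and compare against that. The helper names split_locus and filter_to_locus are made up for illustration.

```python
def split_locus(text):
    """Split a 'contig:position' string into (contig, int position)."""
    contig, position = text.split(":")
    return contig, int(position)


def filter_to_locus(mt, locus_text):
    """Keep only the rows of `mt` whose locus equals `locus_text`."""
    import hail as hl  # imported here so split_locus works without Hail installed

    contig, position = split_locus(locus_text)
    # hl.locus builds a locus expression that can be compared with mt.locus.
    return mt.filter_rows(mt.locus == hl.locus(contig, position), keep=True)


print(split_locus("1:139120"))  # ('1', 139120)
```

With that in place, data = filter_to_locus(mt, "1:139120") should give the filtered matrix table. hl.parse_locus("1:139120") is another way to build the same locus directly from the string.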