Can't write VEP annotations

Haseeb1 · October 1, 2019, 5:10am

Hi!

I am having a problem with hail.methods.vep.
Calling vep does not raise any error, but writing the result to a matrix table fails, perhaps because vep was not applied successfully (even though calling hail.methods.vep on my dataset gave no error).
Also, when writing the VEP annotations, only the entries, index and rows directories are written, not the others such as globals, cols and references, hence the error:

Hail version: 0.2.23-aaf52cafe5ef
Error summary: IOException: error=2, No such file or directory

tpoterba · October 1, 2019, 9:28am

One of the important things to know about Hail is that in order to support datasets that cannot fit in memory, most operations are lazy and don’t actually execute code until you write or aggregate (or otherwise push values to disk, or return them to Python, e.g. plot, show, etc).

This means that the error is definitely coming from VEP, it’s just appearing later (when the code to run VEP is executed).
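
The same deferred-failure pattern can be reproduced in plain Python, as a rough analogy only (this is not how Hail is implemented). In the sketch below, the `annotate` function and the executable name `vep-that-is-not-installed` are hypothetical; the generator stands in for a lazy pipeline step, and `list(...)` stands in for `write`.

```python
import subprocess


def annotate(rows, command):
    """Lazily pair each row with the output of an external command.

    Nothing runs until the generator is consumed.
    """
    for row in rows:
        result = subprocess.run([command, row], capture_output=True, text=True)
        yield row, result.stdout


# Hypothetical executable name standing in for a missing VEP installation.
annotated = annotate(["1:139120"], "vep-that-is-not-installed")  # no error here

try:
    list(annotated)  # forcing evaluation, analogous to calling write
    error = None
except FileNotFoundError as exc:
    error = exc  # the missing executable is only discovered now

print(type(error).__name__)
```

The failure is reported where the result is consumed, not where the step was defined, which is consistent with an `IOException: error=2, No such file or directory` surfacing only at write time.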

Where are you running Hail? If you’re not running on Google Dataproc with hailctl dataproc, did you install VEP yourself, and if so, how?

Haseeb1 · October 1, 2019, 10:19am

I am running Hail on Google Dataproc with hailctl dataproc. Does it have something to do with CPU quotas? I am running on very low resources.

tpoterba · October 1, 2019, 10:21am

It shouldn’t be related to CPU quotas – we make sure VEP runs on a default-sized cluster (16 cores) as part of deployment.

Are you running with the --vep flag on hailctl dataproc start? That’s my best guess of what went wrong.

Haseeb1 · October 1, 2019, 11:02am

No, I am not using --vep.
I am trying to run the script below. I copied the code into a .py file and am submitting it as a job.

https://github.com/Nealelab/recessive/blob/master/Hail_%26_Export_Pipeline_Genotyped_dataset.ipynb

tpoterba · October 1, 2019, 11:04am

Ah, okay, great – that’s the problem. If you don’t start the cluster with --vep, then the necessary files aren’t installed and you can’t run hl.vep.

qurat · October 1, 2019, 11:31am

How can we enable --vep in the cloud when starting a cluster with Hail? Can you give an example? I am using the hail start cluster command …

tpoterba · October 1, 2019, 11:37am

hailctl dataproc start CLUSTER_NAME --vep GRCh37, for example

Haseeb1 · October 2, 2019, 8:12am

VEP is working now. But .write is taking a very long time: I waited 1 hour and it gave no error, nor did it show any progress. What should I do now?

tpoterba · October 2, 2019, 1:51pm

Hail is lazy; hl.vep didn’t actually run until write. How big is your data? How many variants/samples, and what are you doing in the pipeline? VEP is very slow, so I’d expect it to take a while on a small cluster.

Danish436 · October 2, 2019, 3:13pm

The dataset is whole exome sequencing data on 100 participants. We are running it on GCP. Importing the VCF and writing it as an mt works quickly, but when we use VEP and write it, it takes forever. We are using the hailctl dataproc command that you mentioned above to launch the cluster.

tpoterba · October 2, 2019, 3:15pm

VEP is able to annotate ~3 variants per second, per core. If you’re running on 12 cores (dataproc default cluster size), then you should be annotating 36 variants per second, or 129,000 per hour.
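
The arithmetic behind that estimate can be written out as a small helper. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the rate of 3 variants per second per core is the approximate figure quoted in this post, not a measured constant.

```python
def variants_per_hour(cores, variants_per_second_per_core=3.0):
    """Approximate VEP throughput for a cluster with the given core count."""
    return cores * variants_per_second_per_core * 3600


def hours_to_annotate(n_variants, cores, variants_per_second_per_core=3.0):
    """Approximate wall-clock hours to annotate n_variants."""
    return n_variants / variants_per_hour(cores, variants_per_second_per_core)


print(variants_per_hour(12))  # 129600.0, i.e. 36 variants/s for one hour
```

Calling `hours_to_annotate` with your own variant count and core count gives a first guess at how long the write step should take before concluding that the job is stuck.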

Danish436 · October 2, 2019, 3:16pm

So can we increase the cores and memory? And how many should we use?

tpoterba · October 2, 2019, 3:19pm

It’s hard to make a good guess about this, but I’d generally think that a cluster of ~100 cores (which would be 12 preemptible workers, -p 12 in hailctl dataproc start) would be a good idea. This cluster should cost about $3 per hour.

Also, how are you running on Dataproc? Using submit, or connect with a notebook? You should be seeing a progress bar in either mode, which can help tell you how the vep/write is doing.
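
For reference, the two flags mentioned in this thread (`--vep` and `-p`) can be combined in a single start command. The sketch below only assembles and prints the command line; it does not run it. The helper name and the cluster name `my-cluster` are placeholders, and the flags are taken as quoted in this thread, so check them against your installed hailctl version.

```python
import shlex


def start_command(cluster_name, vep_genome="GRCh37", preemptible_workers=12):
    """Assemble the `hailctl dataproc start` arguments mentioned in this thread."""
    return [
        "hailctl", "dataproc", "start", cluster_name,
        "--vep", vep_genome,              # install the VEP files on the cluster
        "-p", str(preemptible_workers),   # number of preemptible workers
    ]


command = start_command("my-cluster")  # "my-cluster" is a placeholder name
print(shlex.join(command))
# hailctl dataproc start my-cluster --vep GRCh37 -p 12
```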

Danish436 · October 2, 2019, 3:23pm

Can you kindly specify the full command to increase the number of cores, starting with hailctl?

We are using submit. The bar was not moving, so we thought it was just not responding.

Also, this is just the practice dataset. Our real dataset is 40,000 participants with WES data. How many cores do you suggest we use in that scenario?

tpoterba · October 2, 2019, 3:25pm

With 40,000 samples, you could probably use 500 or so cores and see reasonably good performance.

Note that VEP won’t be 400x slower with 40,000 people, because you won’t have 400x more variants.

The reason you’re not seeing progress is probably that with only 100 people, each partition (parallel task) in the pipeline contains many variants, so VEP is taking a long time to finish a single task.
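
A rough way to see why the bar can look frozen is to estimate how long a single task takes at the quoted rate of about 3 variants per second per core. The variant and partition counts below are hypothetical, chosen only to illustrate the effect, and the helper name is made up.

```python
def seconds_per_task(n_variants, n_partitions, variants_per_second=3.0):
    """Rough time for one core to finish one partition at the quoted VEP rate."""
    variants_per_partition = n_variants / n_partitions
    return variants_per_partition / variants_per_second


# Hypothetical numbers: 500,000 variants split into 8 vs. 500 partitions.
coarse = seconds_per_task(500_000, 8)
fine = seconds_per_task(500_000, 500)
print(f"{coarse / 3600:.1f} h per task vs {fine / 60:.1f} min per task")
```

If this is the cause, splitting the data into more partitions (for example with `min_partitions` on `hl.import_vcf`, or `MatrixTable.repartition`) should make progress visible sooner, although it does not reduce the total amount of VEP work.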

Danish436 · October 4, 2019, 7:46pm

I wanted to filter based on a locus; however, it is giving me an error:

data = mt.filter_rows((mt.locus == 1:139120)),keep=True)

I have tried several permutations of the above, but the error persists.
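
One likely cause, offered as an assumption since the error message is not shown: `1:139120` is not a valid Python expression, and the parentheses in that line are unbalanced, so the line fails before Hail is involved at all. Hail provides `hl.locus(contig, position)` and `hl.parse_locus('1:139120')` for building a locus to compare against. The sketch below only checks that the two lines parse as Python; it does not import or run Hail, and the contig name `'1'` assumes GRCh37-style naming as used earlier in this thread.

```python
def is_valid_python(source):
    """Return True if `source` parses as Python; nothing is executed."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


original = "data = mt.filter_rows((mt.locus == 1:139120)),keep=True)"
# Assumed fix: build the locus with hl.locus(contig, position).
candidate = "data = mt.filter_rows(mt.locus == hl.locus('1', 139120), keep=True)"

print(is_valid_python(original), is_valid_python(candidate))
# False True
```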
