Run a pipeline using Hail 0.1 for Spark and elastic

hail.java.FatalError: HailException: property `hail.vep.location' required
I am not sure how to configure VEP.
vds = vds.vep(config="/vep/vep-gcloud.properties", root='va.vep', block_size=1000)
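
The error above names a property that Hail expects to find in the file passed as config. As a rough aid, here is a minimal sketch that checks a properties file for that key before calling vds.vep. It assumes the file is a plain key=value properties file; the helper names read_properties and missing_properties are hypothetical, and the only required key listed is the one named in the error message (Hail may require others).

```python
import os


def read_properties(path):
    """Parse a simple key=value properties file into a dict."""
    props = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith(("#", "!")):
                continue
            key, sep, value = line.partition("=")
            if sep:
                props[key.strip()] = value.strip()
    return props


def missing_properties(props, required=("hail.vep.location",)):
    """Return the required keys that are absent or empty in props."""
    return [key for key in required if not props.get(key)]


if __name__ == "__main__":
    # Path taken from the vds.vep call above; adjust to your install.
    config_path = "/vep/vep-gcloud.properties"
    if os.path.exists(config_path):
        print(missing_properties(read_properties(config_path)))
    else:
        print("config file not found: " + config_path)
```

An empty list means the key is present. If the file itself is missing, VEP has probably not been installed on that machine yet.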

Hi Octavio,
You’ll need to install VEP on each node. Here are the contents of the script we use for Google (gs://hail-common/vep/vep/vep85-init.sh):

#!/bin/bash


# Copy VEP
mkdir -p /vep/homo_sapiens
gsutil -m cp -r gs://hail-common/vep/vep/loftee /vep
gsutil -m cp -r gs://hail-common/vep/vep/ensembl-tools-release-85 /vep
gsutil -m cp -r gs://hail-common/vep/vep/loftee_data /vep
gsutil -m cp -r gs://hail-common/vep/vep/Plugins /vep
gsutil -m cp -r gs://hail-common/vep/vep/homo_sapiens/85_GRCh37 /vep/homo_sapiens/
gsutil cp gs://hail-common/vep/vep/vep85-gcloud.properties /vep/vep-gcloud.properties

#Create symlink to vep
ln -s /vep/ensembl-tools-release-85/scripts/variant_effect_predictor /vep

#Give perms
chmod -R 777 /vep

# Copy perl JSON module
gsutil -m cp -r gs://hail-common/vep/perl-JSON/* /usr/share/perl/5.20/

#Copy perl DBD::SQLite module
gsutil -m cp -r gs://hail-common/vep/perl-SQLITE/* /usr/share/perl/5.20/


# Copy htslib and samtools
gsutil cp gs://hail-common/vep/htslib/* /usr/bin/
gsutil cp gs://hail-common/vep/samtools /usr/bin/
chmod a+rx  /usr/bin/tabix
chmod a+rx  /usr/bin/bgzip
chmod a+rx  /usr/bin/htsfile
chmod a+rx  /usr/bin/samtools

#Run VEP on the 1-variant VCF to create fasta.index file -- caution do not make fasta.index file writeable afterwards!
gsutil cp gs://hail-common/vep/vep/1var.vcf /vep
gsutil cp gs://hail-common/vep/vep/run_hail_vep85_vcf.sh /vep
chmod a+rx /vep/run_hail_vep85_vcf.sh

/vep/run_hail_vep85_vcf.sh /vep/1var.vcf

This may be very difficult, since we’ll need to host all the contents on Amazon.

Thanks for your response.
I do not use any cloud. Can I install that on three Linux boxes (nodes)?
Regards,
Octavio

You should be able to, yes. The easiest thing to do will actually be to install gsutil.

Tim, thanks again for your response.
I am not sure why I should use gsutil, since it is a Google tool for their cloud. I must install everything inside our cluster to keep the information private. Could you please tell me why you prefer installing in the cloud rather than on a private cluster?
Thanks,
Octavio

Installing gsutil will let you run that script directly to copy the public files locally. Otherwise you’ll need to rewrite it to download from their HTTPS URLs.
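
For the rewrite route, a minimal sketch of the path translation: publicly readable Cloud Storage objects can be fetched at https://storage.googleapis.com/BUCKET/OBJECT, so each single-file gs:// path in the script maps to one URL. The helper name gs_to_https is hypothetical, and this assumes the hail-common objects are publicly readable. The recursive copies (gsutil -m cp -r of a directory) have no single URL; they would need a listing of the objects under that prefix, which this sketch does not cover.

```python
from urllib.parse import quote


def gs_to_https(uri):
    """Translate a gs://bucket/object path into its public HTTPS URL."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError("not a gs:// URI: {0}".format(uri))
    bucket, _, obj = uri[len(prefix):].partition("/")
    # Percent-encode the object name but keep "/" as the path separator.
    return "https://storage.googleapis.com/{0}/{1}".format(bucket, quote(obj, safe="/"))


if __name__ == "__main__":
    print(gs_to_https("gs://hail-common/vep/vep/1var.vcf"))
```

The resulting URL can then be handed to curl or wget in place of the matching gsutil cp line.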

Hi Tim,

My install was a mess. I was trying on Amazon, but given the dependencies (Spark 2.0) I will try again on my local cluster.
The VEP install crashed because I installed into /VEP (I think that is in the root directory?) and I do not have enough disk space there. Can I install VEP in /opt/VEP? Why does it need to be in /VEP, as in your script (mkdir -p /vep/homo_sapiens)?
I would appreciate your advice.
Thanks,
Octavio

Tim,
Please disregard my previous e-mail. I think I can solve the problem with extra storage space that shares the data with the Spark nodes.
Thanks,
Octavio
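
On the earlier question about installing under a different directory such as /opt/VEP: one possible approach, sketched below, is to rewrite the local /vep paths in the init script while leaving the gs://hail-common/vep/... source paths alone. The helper name relocate is hypothetical. This is only part of the job: the files the script downloads (vep85-gcloud.properties and run_hail_vep85_vcf.sh) may themselves refer to /vep and would need the same treatment, and the config path passed to vds.vep would have to change as well.

```python
import re

# Match "/vep" only where it starts a local path: not preceded by a character
# that would make it part of a longer path or URL, and followed by "/",
# whitespace, or the end of the text.
_LOCAL_VEP = re.compile(r"(?<![\w:/.\-])/vep(?=/|\s|$)")


def relocate(script_text, new_prefix):
    """Return script_text with local /vep paths moved under new_prefix."""
    # A function replacement avoids backslash-escape handling in new_prefix.
    return _LOCAL_VEP.sub(lambda match: new_prefix, script_text)


if __name__ == "__main__":
    line = "gsutil -m cp -r gs://hail-common/vep/vep/loftee /vep"
    print(relocate(line, "/opt/VEP"))
```

Reading the whole script into a string and passing it through relocate gives a version that installs under the new prefix; it is worth diffing the result against the original before running it.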