Filter variants by sample id in gVCF

rulaszek · February 27, 2019, 7:09pm

In Hail is there a way to filter variants by sample id for the 1000 genomes VCF found here?

s3://broad-references/hg38/v0/1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf

tpoterba · February 27, 2019, 7:10pm

filter variants by sample id

Can you give an example?

rulaszek · February 27, 2019, 7:11pm

Get all the variants for sample ‘HG00101’ from this wide gVCF. All the 1000 genomes data is in a sparse matrix. I want to write the data to parquet partitioned on sample id.

rulaszek · February 27, 2019, 7:15pm

Something like this but in Hail. [Screenshot not preserved.]

tpoterba · February 27, 2019, 7:20pm

import_vcf will import a MatrixTable object, which is a nice 2-dimensional representation of the genetic matrix. To select only one sample, you can do:

mt = hl.import_vcf(...)
mt = mt.filter_cols(mt.s == 'HG00101')

However, this will still be O(total samples), so depending on the downstream workflow it may be inefficient.
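
To see why the filter is still O(total samples): in a multi-sample VCF each row carries one genotype column per sample, so picking out a single sample still means reading and splitting every full row. Here is a minimal plain-Python sketch of that cost, using a made-up three-sample VCF snippet. The data and the `genotypes_for_sample` helper are hypothetical and not part of Hail.

```python
# Toy multi-sample VCF; the variants and genotypes are invented for illustration.
TOY_VCF = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00101\tHG00102
chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1
chr1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|1\t0|0\t0|0
chr2\t300\t.\tG\tA\t.\tPASS\t.\tGT\t0|0\t1|1\t0|1
"""


def genotypes_for_sample(vcf_text, sample_id):
    """Return (chrom, pos, ref, alt, genotype) tuples for one sample (hypothetical helper)."""
    sample_col = None
    out = []
    for line in vcf_text.splitlines():
        if line.startswith("##"):
            continue  # skip meta-information lines
        # Every row is split in full, so the cost per row grows with the sample count.
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_col = fields.index(sample_id)  # locate the sample's column once
            continue
        if sample_col is None:
            raise ValueError("header line not found before data lines")
        out.append((fields[0], int(fields[1]), fields[3], fields[4], fields[sample_col]))
    return out


rows = genotypes_for_sample(TOY_VCF, "HG00101")
# Keep only the sites where this sample carries a non-reference allele.
non_ref = [r for r in rows if r[4] not in ("0|0", "0/0")]
```

Even though only one column is kept, the loop touches every field of every row, which is the inefficiency described above.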

tpoterba · February 27, 2019, 7:20pm

If you can share a bit more about your wider use case, we might be able to offer some more ideas.

rulaszek · February 27, 2019, 7:24pm

I want to build a data lake with 1000 genomes data partitioned on sample id and in a parquet format. All the public datasets I’m aware of for 1000 genomes are wide gVCF for the entire dataset or by chromosome. There are fastq files but no VCF for each of the 2500 samples. Trying to use Hail to get variants by sample id and write to the data lake.

rulaszek · February 27, 2019, 7:28pm

mt = hc.import_vcf(gvcf_path, force=True)
mt = mt.filter_cols(mt.s == 'HG00101')

'VariantDataset' object has no attribute 'filter_cols'
Traceback (most recent call last):
AttributeError: 'VariantDataset' object has no attribute 'filter_cols'

tpoterba · February 27, 2019, 7:28pm

OK, so fundamentally this is a transpose - the VCF is variant-major, and you want the data sample-major.

You can do this with:

# keep only row and col key, no other row/col fields
mt = mt.select_rows().select_cols()
entries = mt.entries() # take the coordinate representation
entries = entries.key_by('s') # key by (and sort by) sample ID
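
As a rough illustration of what this transpose does, the plain-Python sketch below mimics it on a tiny made-up matrix: flatten the variant-major rows into one record per (variant, sample) pair, which is the coordinate representation, then sort and group by sample ID. The data and variable names are invented; this is not Hail code and it ignores how Hail distributes the work.

```python
from itertools import groupby
from operator import itemgetter

samples = ["HG00096", "HG00101", "HG00102"]

# Variant-major: one row per variant, one genotype per sample (invented values).
variant_major = [
    (("chr1", 100), ["0|0", "0|1", "1|1"]),
    (("chr1", 200), ["0|1", "0|0", "0|0"]),
    (("chr2", 300), ["0|0", "1|1", "0|1"]),
]

# Coordinate representation: one record per (variant, sample) entry.
entries = [
    {"locus": locus, "s": s, "GT": gt}
    for locus, gts in variant_major
    for s, gt in zip(samples, gts)
]

# "Key by" sample: sort on the sample ID, then group to get sample-major data.
# list.sort is stable, so variants keep their original order within each sample.
entries.sort(key=itemgetter("s"))
sample_major = {
    s: [(e["locus"], e["GT"]) for e in group]
    for s, group in groupby(entries, key=itemgetter("s"))
}
```

Each value in `sample_major` is the kind of per-sample slice that could then be written out as its own partition.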

tpoterba · February 27, 2019, 7:29pm

ah, you’re using 0.1! I missed the forum location.

0.1 is totally deprecated. You should switch to 0.2, especially since it sounds like you haven’t written a huge codebase against Hail yet.

rulaszek · February 27, 2019, 7:29pm

k. Let me get it installed

rulaszek · February 27, 2019, 10:02pm

I don’t think I can build v0.2. The gradle files are missing. I cloned https://github.com/hail-is/hail/tree/0.2.10

This command doesn’t work:

./gradlew -Dspark.version=2.3.0 shadowJar

Any ideas? Docs don’t seem to be updated.

tpoterba · February 27, 2019, 10:09pm

We added another level of nesting. See here:

https://hail.is/docs/0.2/getting_started_developing.html

cd hail/hail

rulaszek · February 27, 2019, 10:50pm

FAILURE: Build failed with an exception.

Where:
Build file '/Users/rulaszek/hail/hail/build.gradle' line: 64
What went wrong:
A problem occurred evaluating root project 'hail'.

Unknown Spark version 2.3.3. Set breeze.version and py4j.version properties for Spark 2.3.3.

Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

SPARK_HOME="/Users/rulaszek/spark/spark-2.3.3-bin-hadoop2.7"

Any ideas?

tpoterba · February 27, 2019, 10:54pm

I think 0.10.4 for py4j and 0.13.2 for breeze.

this pull request will simplify the build system a bit.

rulaszek · February 27, 2019, 11:01pm

OK. I have Spark 2.3.0 installed. What do I do now?

tpoterba · February 27, 2019, 11:03pm

2.3.3 should be fine, you just need to pass the breeze/py4j versions as params in build.

./gradlew -Dspark.version=2.3.3 -Dbreeze.version=0.13.2 -Dpy4j.version=0.10.4 shadowJar

rulaszek · February 27, 2019, 11:04pm

k. Now I got this error

FAILURE: Build failed with an exception.

What went wrong:
Execution failed for task ':nativeLib'.

Process 'command 'make'' finished with non-zero exit value 2

Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

tpoterba · February 27, 2019, 11:06pm

try --info or --stacktrace?

rulaszek · February 27, 2019, 11:10pm

nativeLib FAILED
:nativeLib (Thread[main,5,main]) completed. Took 3.439 secs.

FAILURE: Build failed with an exception.

What went wrong:
Execution failed for task ':nativeLib'.

Process 'command 'make'' finished with non-zero exit value 2

Try:
Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output.
