Filter variants by sample id in gVCF

In Hail, is there a way to filter variants by sample id in the 1000 Genomes VCF found here?

s3://broad-references/hg38/v0/1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf

filter variants by sample id

Can you give an example?

Get all the variants for sample 'HG00101' from this wide gVCF. All the 1000 Genomes data is in a sparse matrix. I want to write the data to Parquet, partitioned on sample id.

Something like this, but in Hail:

import_vcf returns a MatrixTable object, which is a nice 2-dimensional representation of the genetic matrix. To select only one sample, you can do:

mt = hl.import_vcf(...)
mt = mt.filter_cols(mt.s == 'HG00101')

However, this will still be O(total samples), so depending on the downstream workflow it may be inefficient.

If you can share a bit more about your wider use case, we might be able to offer some more ideas.

I want to build a data lake with 1000 genomes data partitioned on sample id and in a parquet format. All the public datasets I’m aware of for 1000 genomes are wide gVCF for the entire dataset or by chromosome. There are fastq files but no VCF for each of the 2500 samples. Trying to use Hail to get variants by sample id and write to the data lake.

mt = hc.import_vcf(gvcf_path, force=True)
mt = mt.filter_cols(mt.s == 'HG00101')

'VariantDataset' object has no attribute 'filter_cols'
Traceback (most recent call last):
AttributeError: 'VariantDataset' object has no attribute 'filter_cols'

OK, so fundamentally this is a transpose - the VCF is variant-major, and you want the data sample-major.

You can do this with:

# keep only row and col key, no other row/col fields
mt = mt.select_rows().select_cols()
entries = mt.entries() # take the coordinate representation
entries = entries.key_by('s') # key by (and sort by) sample ID
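The transpose above can be sketched in plain Python to show what the coordinate ("entries") form looks like and why keying by sample id makes the data sample-major. The variant ids, sample ids, and genotype strings below are illustrative, not taken from the real dataset:

```python
# Variant-major matrix: rows = variants, cols = samples (illustrative data).
variants = ['1:1000:A:T', '1:2000:G:C']
samples = ['HG00101', 'HG00102']
calls = [['0/1', '0/0'],   # genotypes for variant 0
         ['1/1', '0/1']]   # genotypes for variant 1

# Coordinate ("entries") representation: one record per (sample, variant) pair.
entries = [(s, v, calls[i][j])
           for i, v in enumerate(variants)
           for j, s in enumerate(samples)]

# Key by sample id: group entries so each sample's variants are contiguous,
# which is what a Parquet partition on sample id needs.
by_sample = {}
for s, v, gt in entries:
    by_sample.setdefault(s, []).append((v, gt))

print(by_sample['HG00101'])
# [('1:1000:A:T', '0/1'), ('1:2000:G:C', '1/1')]
```

In Hail, `mt.entries()` produces this long-format table and `key_by('s')` does the grouping/sorting at scale.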

Ah, you're using 0.1! I missed the forum location.

0.1 is totally deprecated. You should switch to 0.2, especially since it sounds like you haven’t written a huge codebase against Hail yet.

k. Let me get it installed

Don't think I can build v0.2. The Gradle files are missing. I cloned https://github.com/hail-is/hail/tree/0.2.10

This command doesn't work:

./gradlew -Dspark.version=2.3.0 shadowJar

Any ideas? Docs don’t seem to be updated.

We added another level of nesting. See here:

https://hail.is/docs/0.2/getting_started_developing.html

cd hail/hail
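Putting the steps in this thread together, the build sequence looks roughly like this. It is a sketch under the versions discussed here (Spark 2.3.3, breeze 0.13.2, py4j 0.10.4), not an authoritative set of instructions:

```shell
# Clone the repo and enter the nested hail/hail directory (new layout).
git clone https://github.com/hail-is/hail.git
cd hail/hail

# Build against Spark 2.3.3, passing the matching breeze/py4j versions
# as system properties, since 2.3.3 is not a known version in build.gradle.
./gradlew -Dspark.version=2.3.3 -Dbreeze.version=0.13.2 -Dpy4j.version=0.10.4 shadowJar
```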

FAILURE: Build failed with an exception.

  • Where:
    Build file '/Users/rulaszek/hail/hail/build.gradle' line: 64

  • What went wrong:
    A problem occurred evaluating root project 'hail'.

Unknown Spark version 2.3.3. Set breeze.version and py4j.version properties for Spark 2.3.3.

  • Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

SPARK_HOME="/Users/rulaszek/spark/spark-2.3.3-bin-hadoop2.7"

Any ideas?

I think 0.10.4 for py4j and 0.13.2 for breeze.

This pull request will simplify the build system a bit.

OK, I have Spark 2.3.0 installed. What do I do now?

2.3.3 should be fine; you just need to pass the breeze/py4j versions as parameters to the build:

./gradlew -Dspark.version=2.3.3 -Dbreeze.version=0.13.2 -Dpy4j.version=0.10.4 shadowJar

k. Now I got this error

FAILURE: Build failed with an exception.

  • What went wrong:
    Execution failed for task ':nativeLib'.

Process 'command 'make'' finished with non-zero exit value 2

  • Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

try --info or --stacktrace?

nativeLib FAILED
:nativeLib (Thread[main,5,main]) completed. Took 3.439 secs.

FAILURE: Build failed with an exception.

  • What went wrong:
    Execution failed for task ':nativeLib'.

Process 'command 'make'' finished with non-zero exit value 2

  • Try:
    Run with --stacktrace option to get the stack trace. Run with --debug option to get more log output.