In Hail is there a way to filter variants by sample id for the 1000 genomes VCF found here?
s3://broad-references/hg38/v0/1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf
"filter variants by sample id"
Can you give an example?
Get all the variants for sample 'HG00101' from this wide gVCF. All the 1000 genomes data is in a sparse matrix. I want to write the data to parquet partitioned on sample id.
import_vcf returns a MatrixTable, which is a nice 2-dimensional representation of the genetic matrix. To select only one sample, you can do:
import hail as hl

mt = hl.import_vcf(...)
mt = mt.filter_cols(mt.s == 'HG00101')  # keep only the one sample column
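If you also want to drop variants where that one sample has no called non-reference genotype, a minimal sketch (assuming a standard GT entry field) would be:

mt = mt.filter_rows(hl.agg.any(mt.GT.is_non_ref()))  # keep rows where the remaining sample is non-ref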
However, this column filter will still be O(total samples), so depending on the downstream workflow it may be inefficient.
If you can share a bit more about your wider use case, we might be able to offer some more ideas.
I want to build a data lake with 1000 genomes data partitioned on sample id and in a parquet format. All the public datasets I’m aware of for 1000 genomes are wide gVCF for the entire dataset or by chromosome. There are fastq files but no VCF for each of the 2500 samples. Trying to use Hail to get variants by sample id and write to the data lake.
mt = hc.import_vcf(gvcf_path, force=True)
mt = mt.filter_cols(mt.s == 'HG00101')
Traceback (most recent call last):
AttributeError: 'VariantDataset' object has no attribute 'filter_cols'
OK, so fundamentally this is a transpose - the VCF is variant-major, and you want the data sample-major.
You can do this with:
# keep only row and col key, no other row/col fields
mt = mt.select_rows().select_cols()
entries = mt.entries() # take the coordinate representation
entries = entries.key_by('s') # key by (and sort by) sample ID
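From there, a minimal sketch of writing the coordinate table out as parquet partitioned on sample ID (assuming Hail 0.2 on Spark; the bucket and path are hypothetical):

df = entries.to_spark()  # convert the Hail Table to a Spark DataFrame
df.write.partitionBy('s').parquet('s3://my-bucket/1000g-by-sample')  # hypothetical output location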
ah, you’re using 0.1! I missed the forum location.
0.1 is totally deprecated. You should switch to 0.2, especially since it sounds like you haven’t written a huge codebase against Hail yet.
k. Let me get it installed
Don't think I can build v0.2. The gradle files are missing. I cloned https://github.com/hail-is/hail/tree/0.2.10
This command doesn't work:
./gradlew -Dspark.version=2.3.0 shadowJar
Any ideas? Docs don’t seem to be updated.
We added another level of nesting. See here:
https://hail.is/docs/0.2/getting_started_developing.html
cd hail/hail
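i.e., from a fresh clone, something like this (using the Spark version from your original command):

git clone https://github.com/hail-is/hail.git
cd hail/hail
./gradlew -Dspark.version=2.3.0 shadowJar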
FAILURE: Build failed with an exception.

* Where:
Build file '/Users/rulaszek/hail/hail/build.gradle' line: 64

* What went wrong:
A problem occurred evaluating root project 'hail'.
Unknown Spark version 2.3.3. Set breeze.version and py4j.version properties for Spark 2.3.3.

BUILD FAILED
SPARK_HOME="/Users/rulaszek/spark/spark-2.3.3-bin-hadoop2.7"
Any ideas?
I think 0.10.4 for py4j and 0.13.2 for breeze.
This pull request will simplify the build system a bit.
ok. Have spark 2.3.0 installed. What do I do now?
2.3.3 should be fine, you just need to pass the breeze/py4j versions as params in build.
./gradlew -Dspark.version=2.3.3 -Dbreeze.version=0.13.2 -Dpy4j.version=0.10.4 shadowJar
k. Now I got this error:
FAILURE: Build failed with an exception.
Process 'command 'make'' finished with non-zero exit value 2
BUILD FAILED
try --info or --stacktrace?
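e.g., combining with the version flags from above (just a guess at your invocation):

./gradlew shadowJar -Dbreeze.version=0.13.2 -Dpy4j.version=0.10.4 --stacktrace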
nativeLib FAILED
:nativeLib (Thread[main,5,main]) completed. Took 3.439 secs.
FAILURE: Build failed with an exception.
Process 'command 'make'' finished with non-zero exit value 2