Using VCF non GT column

Hi, I need to read genotypes from a column other than GT in a VCF file. Does anyone know how to do it in Hail? Many thanks for your help!

Hi Tetyana! Sorry for the silence on your other post.

If I understand correctly, I think all you should need to do is

hl.import_vcf(path_to_vcf, call_fields=[genotype_field])

It would even work without the call_fields argument, but it wouldn’t know to import that field as call-typed data, and would just represent the calls as structs (I think).

Hopefully that works for you. If not, please share here what happened.

Hi Patrick, thank you for your response, and apologies for the delayed follow-up. I had issues installing Hail on-prem, but I think I’ve figured out the problem. I tried the command you suggested and got the following error: NameError: name 'GTA' is not defined
GTA is the field from which I’d like to read the genotypes, so the command looked like this:
data=hl.import_vcf(path_to_vcf, call_fields=[GTA]). The VCF file was not compressed (does it need to be compressed?). The GTA field was read with no problems by vcftools.

Ah, the field name needs to be a string. Try

data=hl.import_vcf(path_to_vcf, call_fields=['GTA'])

Thank you for the quick response! I was able to read in the VCF with the field name given as the string 'GTA', but I was not able to do anything with it afterwards. My goal is to read in a field that is not GT and then export the data in either PLINK or VCF format, so that the non-GT field becomes GT (in other words, when PLINK or another program reads the exported file, the data that was in the GTA field will be read as the GT field). When I tried to export what was read in using the command hl.export_vcf(data,'test.vcf'), I got the following error: Error summary: NoClassDefFoundError: Could not initialize class com.github.luben.zstd.ZstdCompressCtx. I’m not sure if this is because the Hail dependencies are not right or if there is something else going on. I’m happy to share the log file from the run.
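For the stated goal (exporting the GTA calls under the name GT), one possible approach is sketched below. It is not taken from the replies in this thread, and it does not address the zstd error, which looks like an environment problem. The function name export_field_as_gt and its parameters are hypothetical; import_vcf, select_entries and export_vcf are existing Hail functions.

```python
def export_field_as_gt(vcf_path, call_field, out_path):
    """Import a VCF, expose `call_field` under the name GT, and write a new VCF."""
    # Imported inside the function so this sketch can be loaded without Hail installed.
    import hail as hl

    # Parse the chosen FORMAT field as call-typed data.
    mt = hl.import_vcf(vcf_path, call_fields=[call_field])
    # Keep a single entry field named GT that holds the calls from `call_field`.
    # Assumption: the other FORMAT fields are not needed in the output.
    mt = mt.select_entries(GT=mt[call_field])
    hl.export_vcf(mt, out_path)
    return mt
```

It would be called as export_field_as_gt('path/file/name', 'GTA', 'test.vcf'). Note that select_entries drops every other FORMAT field; annotate_entries(GT=...) would keep them. For PLINK output, hl.export_plink accepts a call argument, so the GTA field could be passed there directly.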

Not sure what’s going on there. Can you share the log file, as well as the full stack trace that printed with the error? Also, where was this run (e.g. local mac/windows machine, cloud)?

I’m allowed to put only one embedded media item per post, so here is one screenshot of loading Hail (there were some warnings); a screenshot of everything that was printed on the screen after the export_vcf command will follow. Everything was run on a Broad server. Prior to running Hail, Anaconda3 and Java 11 were loaded. Then I used ipython to run Hail.

hail-20240218-2027-0.2.127-bb535cd096c5.log (87.3 KB)

Hi Patrick, I was wondering if you had a chance to look into the issue of outputting a vcf file from Hail?

@Tetyana_Zayats can you share the series of commands you execute after you SSH? That should allow us to reproduce and fix the issue. Thanks!

sure, it should be something like this:
ish -l h_vmem=30G
cd /path/to/the/folder/with/files
use Anaconda3
use Java11
ipython
import hail as hl
data=hl.import_vcf('path/file/name',call_fields=['GTA'])
hl.export_vcf(data,'test.vcf')

This kind of error generally indicates something wrong in your environment. We bundle zstd so I am confused as to what may be happening.

Could you run the following on the cluster, and provide the command’s output:

First, run pip show hail.
There should be a Location line in the output. For example, this is what I get when I run the command:

$ pip show hail
Name: hail
Version: 0.2.128
Summary: Scalable library for exploring and analyzing genomic data.
Home-page: https://hail.is
Author: Hail Team
Author-email: hail@broadinstitute.org
License: 
Location: /Users/cvittal/src/hail/hail/venv/hail-3.10/lib/python3.10/site-packages
Requires: aiodns, aiohttp, ... # output elided
Required-by: benchmark-hail

Change directory to the folder listed in the Location field, so for me that would be /Users/cvittal/src/hail/hail/venv/hail-3.10/lib/python3.10/site-packages.

Then, ensure the existence of hail/backend/hail-all-spark.jar (relative to site-packages).

Finally, run jar tf hail/backend/hail-all-spark.jar | sort | tee $HOME/jar-tf-hail.out, and share that output here. That last command will have saved the output to $HOME/jar-tf-hail.out so you should be able to copy it off of the server.
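If the jar tool is not on the PATH, the same check can be sketched in Python, since a jar is a zip archive. This is only an illustration under that assumption; the helper names hail_jar_path and jar_entries_matching are hypothetical, and the site-packages path must come from the Location line of pip show hail.

```python
import zipfile
from pathlib import Path


def hail_jar_path(site_packages):
    """Build the expected jar path under the Location reported by `pip show hail`."""
    return Path(site_packages) / "hail" / "backend" / "hail-all-spark.jar"


def jar_entries_matching(jar_path, needle):
    """Return the sorted entry names in a jar (a zip archive) that contain `needle`."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(name for name in jar.namelist() if needle in name)
```

For example, jar_entries_matching(hail_jar_path(location), "com/github/luben/zstd/") lists the bundled zstd entries, where location is the Location value. An empty list would suggest that the zstd classes are missing from the jar.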