Hi, I am trying to annotate my VCF file with gnomAD annotations in Hail format. I’ve been able to do this successfully on its own, but my ultimate goal is to incorporate this into a WDL as a task within a larger workflow for use on an HPC cluster. However, when running my WDL with a block of python code as input, I get the error message detailed in the Hail log I’ve linked here. I’ve obtained the same error running this WDL pipeline with Hail versions 0.2.63-cb767a7507c8 and 0.2.107-2387bb00ceee within Docker containers from hailgenetics, while I am successful using Hail version 0.2.61-3c86d3ba497a installed within a conda environment. This also persists despite changing the number of partitions used on import. I’ve included the task definition for my WDL below.
task gnomadAnnotate {
input {
String in_sample_name
File in_normalized_vcf_file
File in_gnomad_annotations
}
command <<<
#Rename tar extension to ht
in_gnomad_annotations_str="~{in_gnomad_annotations}"
in_gnomad_annotations_hl=${in_gnomad_annotations_str%.tar}.ht
in_gnomad_annotations_hl=$(basename $in_gnomad_annotations_hl)
#Extract Hail table tarball
tar -xvf ~{in_gnomad_annotations}
python3 <<CODE
import hail as hl
hl.init(default_reference='GRCh38')
input_vcf = "~{in_normalized_vcf_file}"
export_vcf_name = "./~{in_sample_name}_gnomad_annotation.vcf"
#Import sample VCF
input_vcf_hl = hl.import_vcf(input_vcf, reference_genome='GRCh38', min_partitions=5000)
input_vcf_hl.show()
#Filter for all PASS variants (PASS variants have no filters)
input_vcf_filter_hl = input_vcf_hl.filter_rows(hl.len(input_vcf_hl.filters) == 0)
input_vcf_filter_hl.show()
####Read and format annotated Hail table
gnomad_annotations = hl.read_table("${in_gnomad_annotations_hl}")
#Left-join annotations into sample VCF to select for locus + alleles only found in VCF
input_vcf_gnomad_hl = input_vcf_filter_hl.annotate_rows(
gnomad_freq = gnomad_annotations[input_vcf_filter_hl.locus,input_vcf_filter_hl.alleles].gnomad_freq)
####Extract allele frequencies for specific groups
input_vcf_gnomad_freq_hl = input_vcf_gnomad_hl.select_rows(
gnomad_total_row=input_vcf_gnomad_hl.gnomad_freq[0].AF,
gnomad_nfe_row=input_vcf_gnomad_hl.gnomad_freq[2].AF,
gnomad_fin_row=input_vcf_gnomad_hl.gnomad_freq[3].AF,
gnomad_mid_row=input_vcf_gnomad_hl.gnomad_freq[4].AF,
gnomad_other_row=input_vcf_gnomad_hl.gnomad_freq[5].AF,
gnomad_ami_row=input_vcf_gnomad_hl.gnomad_freq[6].AF,
gnomad_asj_row=input_vcf_gnomad_hl.gnomad_freq[7].AF,
gnomad_afr_row=input_vcf_gnomad_hl.gnomad_freq[8].AF,
gnomad_eas_row=input_vcf_gnomad_hl.gnomad_freq[9].AF,
gnomad_sas_row=input_vcf_gnomad_hl.gnomad_freq[10].AF,
gnomad_amr_row=input_vcf_gnomad_hl.gnomad_freq[11].AF
)
#Replace NA values with 0
input_vcf_gnomad_freq_hl = input_vcf_gnomad_freq_hl.annotate_rows(
gnomad_total_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_total_row, 0),
gnomad_nfe_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_nfe_row, 0),
gnomad_fin_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_fin_row, 0),
gnomad_mid_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_mid_row, 0),
gnomad_other_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_other_row, 0),
gnomad_ami_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_ami_row, 0),
gnomad_asj_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_asj_row, 0),
gnomad_afr_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_afr_row, 0),
gnomad_eas_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_eas_row, 0),
gnomad_sas_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_sas_row, 0),
gnomad_amr_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_amr_row, 0)
)
input_vcf_gnomad_freq_hl = input_vcf_gnomad_freq_hl.annotate_entries(
gnomad_total_af = input_vcf_gnomad_freq_hl.gnomad_total_row,
gnomad_nfe_af = input_vcf_gnomad_freq_hl.gnomad_nfe_row,
gnomad_fin_af = input_vcf_gnomad_freq_hl.gnomad_fin_row,
gnomad_mid_af = input_vcf_gnomad_freq_hl.gnomad_mid_row,
gnomad_other_af = input_vcf_gnomad_freq_hl.gnomad_other_row,
gnomad_ami_af = input_vcf_gnomad_freq_hl.gnomad_ami_row,
gnomad_asj_af = input_vcf_gnomad_freq_hl.gnomad_asj_row,
gnomad_afr_af = input_vcf_gnomad_freq_hl.gnomad_afr_row,
gnomad_eas_af = input_vcf_gnomad_freq_hl.gnomad_eas_row,
gnomad_sas_af = input_vcf_gnomad_freq_hl.gnomad_sas_row,
gnomad_amr_af = input_vcf_gnomad_freq_hl.gnomad_amr_row
)
hl.export_vcf(input_vcf_gnomad_freq_hl, export_vcf_name)
CODE
>>>
output {
File output_gnomad_annotated_vcf = "~{in_sample_name}_gnomad_annotation.vcf"
}
runtime {
docker: "hailgenetics/hail:0.2.107"
}
}
It seems that the error arises when I use export_vcf - however, I see that when I use .show on my imported VCF matrix table, it only shows the locus and alleles without the entry fields present in the VCF, while when running this in python outside of the WDL, the entry fields appear normally alongside locus and alleles. However, .describe does still show the entry fields as it normally would. Additionally, I know the filter row field is also seems to have the appropriate information as I see with .show that I’m able to successfully choose those variants which PASS the filters, though still cannot see any of the entry fields. This suggests to me that the issue may be resulting from a problem with import_vcf, but am unclear how to resolve it. I’d appreciate any insights anyone may have!