Issues Importing/Exporting VCF within WDL Workflow

chiggsk · January 13, 2023, 4:38am

Hi, I am trying to annotate my VCF file with gnomAD annotations in Hail format. I’ve been able to do this successfully on its own, but my ultimate goal is to incorporate this into a WDL as a task within a larger workflow for use on an HPC cluster. However, when running my WDL with a block of python code as input, I get the error message detailed in the Hail log I’ve linked here. I’ve obtained the same error running this WDL pipeline with Hail versions 0.2.63-cb767a7507c8 and 0.2.107-2387bb00ceee within Docker containers from hailgenetics, while I am successful using Hail version 0.2.61-3c86d3ba497a installed within a conda environment. This also persists despite changing the number of partitions used on import. I’ve included the task definition for my WDL below.

task gnomadAnnotate {
    input {
        String in_sample_name
        File in_normalized_vcf_file
        File in_gnomad_annotations
    }

    command <<<
   
        #Rename tar extension to ht
        in_gnomad_annotations_str="~{in_gnomad_annotations}"
        in_gnomad_annotations_hl=${in_gnomad_annotations_str%.tar}.ht
        in_gnomad_annotations_hl=$(basename $in_gnomad_annotations_hl)

        #Extract Hail table tarball
        tar -xvf ~{in_gnomad_annotations}

        python3 <<CODE

        import hail as hl
        
        hl.init(default_reference='GRCh38')

        input_vcf = "~{in_normalized_vcf_file}"
        export_vcf_name = "./~{in_sample_name}_gnomad_annotation.vcf"

        #Import sample VCF
        input_vcf_hl = hl.import_vcf(input_vcf, reference_genome='GRCh38', min_partitions=5000)
        input_vcf_hl.show()

        #Filter for all PASS variants (PASS variants have no filters)
        input_vcf_filter_hl = input_vcf_hl.filter_rows(hl.len(input_vcf_hl.filters) == 0)
        input_vcf_filter_hl.show()

        ####Read and format annotated Hail table

        gnomad_annotations = hl.read_table("${in_gnomad_annotations_hl}")

        #Left-join annotations into sample VCF to select for locus + alleles only found in VCF
        input_vcf_gnomad_hl = input_vcf_filter_hl.annotate_rows(
            gnomad_freq = gnomad_annotations[input_vcf_filter_hl.locus,input_vcf_filter_hl.alleles].gnomad_freq)

        ####Extract allele frequencies for specific groups

        input_vcf_gnomad_freq_hl = input_vcf_gnomad_hl.select_rows(
            gnomad_total_row=input_vcf_gnomad_hl.gnomad_freq[0].AF,
            gnomad_nfe_row=input_vcf_gnomad_hl.gnomad_freq[2].AF,
            gnomad_fin_row=input_vcf_gnomad_hl.gnomad_freq[3].AF,
            gnomad_mid_row=input_vcf_gnomad_hl.gnomad_freq[4].AF,
            gnomad_other_row=input_vcf_gnomad_hl.gnomad_freq[5].AF,
            gnomad_ami_row=input_vcf_gnomad_hl.gnomad_freq[6].AF,
            gnomad_asj_row=input_vcf_gnomad_hl.gnomad_freq[7].AF,
            gnomad_afr_row=input_vcf_gnomad_hl.gnomad_freq[8].AF,
            gnomad_eas_row=input_vcf_gnomad_hl.gnomad_freq[9].AF,
            gnomad_sas_row=input_vcf_gnomad_hl.gnomad_freq[10].AF,
            gnomad_amr_row=input_vcf_gnomad_hl.gnomad_freq[11].AF
        )

        #Replace NA values with 0
        input_vcf_gnomad_freq_hl = input_vcf_gnomad_freq_hl.annotate_rows(
            gnomad_total_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_total_row, 0),
            gnomad_nfe_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_nfe_row, 0),
            gnomad_fin_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_fin_row, 0),
            gnomad_mid_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_mid_row, 0),
            gnomad_other_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_other_row, 0),
            gnomad_ami_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_ami_row, 0),
            gnomad_asj_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_asj_row, 0),
            gnomad_afr_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_afr_row, 0),
            gnomad_eas_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_eas_row, 0),
            gnomad_sas_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_sas_row, 0),
            gnomad_amr_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_amr_row, 0)
        )

        input_vcf_gnomad_freq_hl = input_vcf_gnomad_freq_hl.annotate_entries(
            gnomad_total_af = input_vcf_gnomad_freq_hl.gnomad_total_row,
            gnomad_nfe_af = input_vcf_gnomad_freq_hl.gnomad_nfe_row,
            gnomad_fin_af = input_vcf_gnomad_freq_hl.gnomad_fin_row,
            gnomad_mid_af = input_vcf_gnomad_freq_hl.gnomad_mid_row,
            gnomad_other_af = input_vcf_gnomad_freq_hl.gnomad_other_row,
            gnomad_ami_af = input_vcf_gnomad_freq_hl.gnomad_ami_row,
            gnomad_asj_af = input_vcf_gnomad_freq_hl.gnomad_asj_row,
            gnomad_afr_af = input_vcf_gnomad_freq_hl.gnomad_afr_row,
            gnomad_eas_af = input_vcf_gnomad_freq_hl.gnomad_eas_row,
            gnomad_sas_af = input_vcf_gnomad_freq_hl.gnomad_sas_row,
            gnomad_amr_af = input_vcf_gnomad_freq_hl.gnomad_amr_row
        )
        
        hl.export_vcf(input_vcf_gnomad_freq_hl, export_vcf_name)

        CODE

        >>>
        
    output {
        File output_gnomad_annotated_vcf = "~{in_sample_name}_gnomad_annotation.vcf"
    }
    runtime {
        docker: "hailgenetics/hail:0.2.107"
    }
}

It seems that the error arises when I use export_vcf - however, I see that when I use .show on my imported VCF matrix table, it only shows the locus and alleles without the entry fields present in the VCF, while when running this in python outside of the WDL, the entry fields appear normally alongside locus and alleles. However, .describe does still show the entry fields as it normally would. Additionally, I know the filter row field is also seems to have the appropriate information as I see with .show that I’m able to successfully choose those variants which PASS the filters, though still cannot see any of the entry fields. This suggests to me that the issue may be resulting from a problem with import_vcf, but am unclear how to resolve it. I’d appreciate any insights anyone may have!

danking · February 6, 2023, 10:33pm

Hey @chiggsk !

The error in the log suggests that some Hail file is corrupted. We had an issue where Hail was generating corrupt files a while ago. Where did you get the gnomad Table from? You mention this works locally. Are you using the exact same table?

show only shows you as much data as it thinks will fit in your terminal. It probably determined that your “terminal” in the WDL pipeline is too skinny to show anything but locus and alleles. You can explicitly demand more rows and columns with show(n_rows=10, n_cols=10).

describe just shows you statically known information. It doesn’t execute the pipeline.

chiggsk · February 17, 2023, 2:52pm

Hi @danking, thank you for your response.

I believe the issue resulted from me initially attempting to filter my original Hail table into a smaller table, which I would then use to annotate my VCF file. It is still unclear to me whether the filtered Hail table was just not filtered correctly, or whether the filtering process corrupted the Hail table, but the issue seemed restricted to the filtered Hail table. However, I was able to find a work around so this is no longer an issue - thank you!

Topic		Replies	Views
Export data after annotation Hail Query & hailctl	3	407	June 22, 2020
Extracting gnomad counts into a vcf file Hail Query & hailctl	7	1173	March 12, 2021
Table.annotate takes a while Hail Query & hailctl	6	405	March 15, 2021
How to add annotations to a vcf file? Hail Query & hailctl	1	624	October 31, 2018
How to speed up export_vcf? Hail Query & hailctl	6	493	September 24, 2021

Issues Importing/Exporting VCF within WDL Workflow

Related topics