Issues Importing/Exporting VCF within WDL Workflow

Hi, I am trying to annotate my VCF file with gnomAD annotations in Hail format. I’ve been able to do this successfully on its own, but my ultimate goal is to incorporate this into a WDL as a task within a larger workflow for use on an HPC cluster. However, when running my WDL with a block of python code as input, I get the error message detailed in the Hail log I’ve linked here. I’ve obtained the same error running this WDL pipeline with Hail versions 0.2.63-cb767a7507c8 and 0.2.107-2387bb00ceee within Docker containers from hailgenetics, while I am successful using Hail version 0.2.61-3c86d3ba497a installed within a conda environment. This also persists despite changing the number of partitions used on import. I’ve included the task definition for my WDL below.

task gnomadAnnotate {
    input {
        String in_sample_name
        File in_normalized_vcf_file
        File in_gnomad_annotations

    command <<<
        #Rename tar extension to ht
        in_gnomad_annotations_hl=$(basename $in_gnomad_annotations_hl)

        #Extract Hail table tarball
        tar -xvf ~{in_gnomad_annotations}

        python3 <<CODE

        import hail as hl

        input_vcf = "~{in_normalized_vcf_file}"
        export_vcf_name = "./~{in_sample_name}_gnomad_annotation.vcf"

        #Import sample VCF
        input_vcf_hl = hl.import_vcf(input_vcf, reference_genome='GRCh38', min_partitions=5000)

        #Filter for all PASS variants (PASS variants have no filters)
        input_vcf_filter_hl = input_vcf_hl.filter_rows(hl.len(input_vcf_hl.filters) == 0)

        ####Read and format annotated Hail table

        gnomad_annotations = hl.read_table("${in_gnomad_annotations_hl}")

        #Left-join annotations into sample VCF to select for locus + alleles only found in VCF
        input_vcf_gnomad_hl = input_vcf_filter_hl.annotate_rows(
            gnomad_freq = gnomad_annotations[,input_vcf_filter_hl.alleles].gnomad_freq)

        ####Extract allele frequencies for specific groups

        input_vcf_gnomad_freq_hl = input_vcf_gnomad_hl.select_rows(

        #Replace NA values with 0
        input_vcf_gnomad_freq_hl = input_vcf_gnomad_freq_hl.annotate_rows(
            gnomad_total_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_total_row, 0),
            gnomad_nfe_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_nfe_row, 0),
            gnomad_fin_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_fin_row, 0),
            gnomad_mid_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_mid_row, 0),
            gnomad_other_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_other_row, 0),
            gnomad_ami_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_ami_row, 0),
            gnomad_asj_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_asj_row, 0),
            gnomad_afr_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_afr_row, 0),
            gnomad_eas_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_eas_row, 0),
            gnomad_sas_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_sas_row, 0),
            gnomad_amr_row = hl.coalesce(input_vcf_gnomad_freq_hl.gnomad_amr_row, 0)

        input_vcf_gnomad_freq_hl = input_vcf_gnomad_freq_hl.annotate_entries(
            gnomad_total_af = input_vcf_gnomad_freq_hl.gnomad_total_row,
            gnomad_nfe_af = input_vcf_gnomad_freq_hl.gnomad_nfe_row,
            gnomad_fin_af = input_vcf_gnomad_freq_hl.gnomad_fin_row,
            gnomad_mid_af = input_vcf_gnomad_freq_hl.gnomad_mid_row,
            gnomad_other_af = input_vcf_gnomad_freq_hl.gnomad_other_row,
            gnomad_ami_af = input_vcf_gnomad_freq_hl.gnomad_ami_row,
            gnomad_asj_af = input_vcf_gnomad_freq_hl.gnomad_asj_row,
            gnomad_afr_af = input_vcf_gnomad_freq_hl.gnomad_afr_row,
            gnomad_eas_af = input_vcf_gnomad_freq_hl.gnomad_eas_row,
            gnomad_sas_af = input_vcf_gnomad_freq_hl.gnomad_sas_row,
            gnomad_amr_af = input_vcf_gnomad_freq_hl.gnomad_amr_row
        hl.export_vcf(input_vcf_gnomad_freq_hl, export_vcf_name)


    output {
        File output_gnomad_annotated_vcf = "~{in_sample_name}_gnomad_annotation.vcf"
    runtime {
        docker: "hailgenetics/hail:0.2.107"

It seems that the error arises when I use export_vcf - however, I see that when I use .show on my imported VCF matrix table, it only shows the locus and alleles without the entry fields present in the VCF, while when running this in python outside of the WDL, the entry fields appear normally alongside locus and alleles. However, .describe does still show the entry fields as it normally would. Additionally, I know the filter row field is also seems to have the appropriate information as I see with .show that I’m able to successfully choose those variants which PASS the filters, though still cannot see any of the entry fields. This suggests to me that the issue may be resulting from a problem with import_vcf, but am unclear how to resolve it. I’d appreciate any insights anyone may have!

On further investigation, it doesn’t seem to be an issue with running this within a WDL workflow, but rather is an issue which seems to arise only in the analysis of certain VCF files but not others. I’ve linked the Hail log for the processing of a VCF which worked without issue, and one which did not. Both VCF files were processed exactly the same way and are the same format, and it is not apparent to me the difference between the two that would be causing this error.

I’m working on a Slurm HPC cluster in case that is playing any role here.