Combine GVCF to VDS (variant dataset in Hail) and convert VDS to VCF

Hi all

I combined GVCF 5K files to one VDS by using Hail library

And I get vds (variant dataset) directory

So I want to read vds data and convert MT (matrix table including VCF data in Hail)

But I can not convert MT to VCF format files with INFO field

Below code is my code that combine multiple GVCF to one vds and convert it to vcf)

import glob
import hail as hl
hl.init()

from gnomad.utils.annotations import (
    get_adj_expr,
    get_lowqual_expr,
)
from gnomad.utils.sparse_mt import (
    get_as_info_expr,
    get_site_info_expr,
    INFO_INT32_SUM_AGG_FIELDS,
    INFO_SUM_AGG_FIELDS,
    split_info_annotation,
    split_lowqual_annotation,
)
    
# combine multiple GVCF to VDS
gvcf_list = glob.glob("*.gvcf.gz")
combiner = hl.vds.new_combiner(
    gvcf_paths = gvcf_list, 
    output_path = output.vds,
    use_genome_default_intervals=True,
    reference_genome = "GRCh38"
)
combiner.run()
    
# read vds data and make mt
vds = hl.vds.read_vds(output_vds)
mt = hl.vds.to_dense_mt(vds)
mt = mt.filter_rows((hl.len(mt.alleles) > 1))
mt = mt.transmute_entries(**mt.gvcf_info)
mt = mt.annotate_rows(alt_alleles_range_array=hl.range(1, hl.len(mt.alleles)))

## make INFO expression
info_expr = get_site_info_expr(
    mt,
    sum_agg_fields=[],
    int32_sum_agg_fields=[ ],
    array_sum_agg_fields=['SB']
)

info_expr = info_expr.annotate(
    **get_as_info_expr(
        mt,
        sum_agg_fields=[],
        int32_sum_agg_fields=[],
        array_sum_agg_fields=['SB']
        )
)
 ## Add AC and AC_raw
## First compute ACs for each non-ref allele, grouped by adj
grp_ac_expr = hl.agg.array_agg(
    lambda ai: hl.agg.filter(
        mt.LA.contains(ai),
        hl.agg.group_by(
            get_adj_expr(mt.LGT, mt.GQ, mt.DP, mt.LAD),
            hl.agg.sum(
                 mt.LGT.one_hot_alleles(mt.LA.map(hl.str))[mt.LA.index(ai)]
            ),
        ),
    ),
    mt.alt_alleles_range_array,
)
## Then, for each non-ref allele, compute
## AC as the adj group
## AC_raw as the sum of adj and non-adj groups
info_expr = info_expr.annotate(
    AC_raw=grp_ac_expr.map(
        lambda i: hl.int32(i.get(True, 0) + i.get(False, 0))
    ),
    AC=grp_ac_expr.map(lambda i: hl.int32(i.get(True, 0))),
)


## Annotating raw MT with pab max
info_expr = info_expr.annotate(
    AS_pab_max=hl.agg.array_agg(
        lambda ai: hl.agg.filter(
            mt.LA.contains(ai) & mt.LGT.is_het(),
            hl.agg.max(
                hl.binom_test(
                    mt.LAD[mt.LA.index(ai)], hl.sum(mt.LAD), 0.5, 'two-sided'
                )
            ),
        ),
        mt.alt_alleles_range_array,
    )
)

info_ht = mt.select_rows(info=info_expr).rows()

mt = mt.annotate_rows(info=info_ht[mt.row_key].info)
mt = mt.annotate_entries(
    GT = hl.vds.lgt_to_gt(mt.LGT, mt.LA)
)

drop_to_field = ['BaseQRankSum',
'ExcessHet',
'InbreedingCoeff',
'MLEAC',
'MLEAF',
'MQRankSum',
'RAW_MQandDP',
'ReadPosRankSum',
"LGT"]

mt = mt.drop(*drop_to_field)
hl.export_vcf(mt, output.vcf.bgz)

I got error message
“Error summary: HailException: VCF does not support SIndexablePointer(PCArray[PInt32])”

What is my fault?

Thank you.

Hey @Dong !

My apologies, this is our fault. You should never see this kind of error message. I’ve asked a member of the development team to figure out why this is happening and to fix it for you. We’ll respond with a quick-fix if we have one.

@Dong , it looks likely that you’re trying to export to VCF a field which is an array of arrays of integers. Can you share the output of mt.describe() just before hl.export_vcf?

VCF doesn’t permit arrays of arrays, so you’ll need to either use a TSV/CSV or change that field to not be an array of arrays.

Hi @danking

it’s my mt.describe() result

>>> mt.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'alt_alleles_range_array': array<int32>
    'info': struct {
        ReadPosRankSum: float64, 
        MQRankSum: float64, 
        SB: array<int32>, 
        FS: float64, 
        SOR: float64, 
        AS_ReadPosRankSum: array<float64>, 
        AS_MQRankSum: array<float64>, 
        AS_SB_TABLE: array<array<int32>>, 
        AS_FS: array<float64>, 
        AS_SOR: array<float64>, 
        AC_raw: array<int32>, 
        AC: array<int32>, 
        AS_pab_max: array<float64>
    }
----------------------------------------
Entry fields:
    'DP': int32
    'GQ': int32
    'MIN_DP': int32
    'LA': array<int32>
    'LAD': array<int32>
    'LPGT': call
    'LPL': array<int32>
    'RGQ': int32
    'PID': str
    'PS': int32
    'SB': array<int32>
    'GT': call
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------

Thank you

Yeah, @Dong, the issue is AS_SB_TABLE. VCF doesn’t support arrays of arrays. You have to choose a different representation. For example, you might do this:

mt = mt.annotate(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            lambda inner_array: hl.str('|').join(inner_array)
        )
    )
)

But be aware! Whoever consumes this VCF will need to know how to handle AS_SB_TABLE’s |-delimited nature.

To @danking
Thank you for your reply

But I got error message.
I changed some code

mt.annotate → mt.annotate_rows

hl.str(‘|’).join(inner_array) → hl.str(‘|’),join(list(map(lambda x : str(x), inner_array)))
because i got a error “TypeError: Expected str collection, int32 found”

so I run below code

mt = mt.annotate_rows(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            lambda inner_array : hl.str('|').join(
                list(map(lambda x : str(x), inner_array))
            )
        )
    )
)

Error message …

hail.expr.expressions.base_expression.ExpressionException: <ArrayNumericExpression of type array> object is not iterabl

How can i fix it?

Thank you!!

@Dong ,

You have to use the Hail versions of all these methods:

mt = mt.annotate_rows(
    info = mt.info.annotate(
        AS_SB_TABLE = mt.info.AS_SB_TABLE.map(
            lambda inner_array : hl.str('|').join(
                hl.map(lambda x: hl.str(x), inner_array)
            )
        )
    )
)

hl.list is not necessary in this case.