Generic genotype schemas in Hail

We have added new functionality to Hail that allows you to store any data type in each cell (g) of the VariantDataset (not just Genotype).

The performance of Hail in the generic case is significantly slower! Not all VariantDataset methods will work with a generic genotype schema such as pca(), ibd(), linreg(), etc.

Call Type

There is now a Call type in Hail which represents the actual genotype call without additional metadata like DP, AD, GQ, and PL that a Genotype stores. Call types have almost identical methods to Genotype. Genotype also has a new method call() which returns a Call.

Import VCF

By default, when using import_vcf() each cell (g) is imported as a Genotype object with various methods available such as g.isHet() and attributes such as g.gt, g.dp, etc. If you print out the genotype schema, you’d see the following:

>>> vds = hc.import_vcf("sample.vcf.bgz")
>>> print(vds.genotype_schema)
Genotype

However, now you can set generic=True when importing data from a VCF file. This will import all FORMAT fields in the VCF as a Struct with the field names equal to the IDs in the FORMAT fields. For example, the genotype schema would be:

>>> vds_generic = hc.import_vcf("sample.vcf.bgz", generic=True)
>>> print(vds_generic.genotype_schema)
Struct {
    GT: Call,
    AD: Array[Int],
    DP: Int,
    GQ: Int,
    PL: Array[Int]
}

The GT field is automatically imported as a Call type. How you’d access a method such as isHet() is different. In the default Genotype case, you’d say g.isHet(). In the generic case, you’d say g.GT.isHet() because you first have to access the GT field from the Struct with annotation path equal to g.

If you have additional FORMAT fields that should be treated as a genotype call, use the call_fields parameter to import_vcf() to specify which fields to import as a Call type.

Annotate Genotypes

We have added a new method annotate_genotypes_expr() that allows you to annotate each cell in the variant dataset. This method is analogous to annotate_variants_expr() and annotate_samples_expr() except the annotation root path is g instead of va and sa. The variables accessible in the right hand side of the expression are g, v, va, s, sa, and global.

Examples

Convert Genotype to Struct and add new fields

If the genotype schema has type Genotype, we can convert g to a Struct with three fields DP, GT, and foo. Note that the type of g on the right-hand side of the equation is a Genotype because the VDS was imported with generic=False by default.

>>> vds = hc.import_vcf('sample.vcf.bgz')
>>> vds_result = vds.annotate_genotypes_expr('g.DP = g.dp, g.GT = g.gt, g.foo.bar = "hello"')
>>> print(vds_result.genotype_schema)
Struct {
    DP: Int,
    GT: Int,
    foo: Struct {
        bar: String
    }
}

Modify genotype fields

We can use the Genotype() constructor in the expression language to update genotype fields. In this example, we keep the gt, ad, and pl fields the same and then update the dp and gq fields with new values:

>>> vds_result = vds.annotate_genotypes_expr("g = Genotype(v, g.gt, g.ad, g.dp * 5, g.gq + 3, g.pl)")
>>> print(vds_result.genotype_schema)
Genotype

Convert a Call to a Genotype

If the genotype schema has type Struct with a field GT with type Call, we can convert the Call to a Genotype with the toGenotype() method:

>>> vds_result = vds_generic.annotate_genotypes_expr("g = g.GT.toGenotype()")
>>> print(vds_result.genotype_schema)
Genotype

By converting g to a Genotype from a Struct, the new variant dataset will have a genotype schema equal to Genotype with the faster performance and all available methods such as pca(), ibd(), linreg(), etc.