We have added new functionality to Hail that allows you to store any data type in each cell (g
) of the VariantDataset (not just Genotype
).
The performance of Hail in the generic case is significantly slower! Not all VariantDataset methods will work with a generic genotype schema such as
pca()
,ibd()
,linreg()
, etc.
Call Type
There is now a Call
type in Hail which represents the actual genotype call without additional metadata like DP, AD, GQ, and PL that a Genotype
stores. Call
types have almost identical methods to Genotype
. Genotype
also has a new method call()
which returns a Call
.
Import VCF
By default, when using import_vcf()
each cell (g
) is imported as a Genotype
object with various methods available such as g.isHet()
and attributes such as g.gt
, g.dp
, etc. If you print out the genotype schema, you’d see the following:
>>> vds = hc.import_vcf("sample.vcf.bgz")
>>> print(vds.genotype_schema)
Genotype
However, now you can set generic=True
when importing data from a VCF file. This will import all FORMAT fields in the VCF as a Struct
with the field names equal to the IDs in the FORMAT fields. For example, the genotype schema would be:
>>> vds_generic = hc.import_vcf("sample.vcf.bgz", generic=True)
>>> print(vds_generic.genotype_schema)
Struct {
GT: Call,
AD: Array[Int],
DP: Int,
GQ: Int,
PL: Array[Int]
}
The GT field is automatically imported as a Call
type. How you’d access a method such as isHet()
is different. In the default Genotype case, you’d say g.isHet()
. In the generic case, you’d say g.GT.isHet()
because you first have to access the GT
field from the Struct with annotation path equal to g
.
If you have additional FORMAT fields that should be treated as a genotype call, use the call_fields
parameter to import_vcf()
to specify which fields to import as a Call
type.
Annotate Genotypes
We have added a new method annotate_genotypes_expr()
that allows you to annotate each cell in the variant dataset. This method is analogous to annotate_variants_expr()
and annotate_samples_expr()
except the annotation root path is g
instead of va
and sa
. The variables accessible in the right hand side of the expression are g
, v
, va
, s
, sa
, and global
.
Examples
Convert Genotype
to Struct
and add new fields
If the genotype schema has type Genotype
, we can convert g
to a Struct
with three fields DP
, GT
, and foo
. Note that the type of g
on the right-hand side of the equation is a Genotype
because the VDS was imported with generic=False
by default.
>>> vds = hc.import_vcf('sample.vcf.bgz')
>>> vds_result = vds.annotate_genotypes_expr('g.DP = g.dp, g.GT = g.gt, g.foo.bar = "hello"')
>>> print(vds_result.genotype_schema)
Struct {
DP: Int,
GT: Int,
foo: Struct {
bar: String
}
}
Modify genotype fields
We can use the Genotype()
constructor in the expression language to update genotype fields. In this example, we keep the gt
, ad
, and pl
fields the same and then update the dp
and gq
fields with new values:
>>> vds_result = vds.annotate_genotypes_expr("g = Genotype(v, g.gt, g.ad, g.dp * 5, g.gq + 3, g.pl)")
>>> print(vds_result.genotype_schema)
Genotype
Convert a Call
to a Genotype
If the genotype schema has type Struct
with a field GT
with type Call
, we can convert the Call to a Genotype with the toGenotype()
method:
>>> vds_result = vds_generic.annotate_genotypes_expr("g = g.GT.toGenotype()")
>>> print(vds_result.genotype_schema)
Genotype
By converting g
to a Genotype
from a Struct
, the new variant dataset will have a genotype schema equal to Genotype
with the faster performance and all available methods such as pca()
, ibd()
, linreg()
, etc.