Error when trying to annotate a new row with a genotypes of the sample

gil · June 18, 2023, 11:44am

Hello,
I’m trying to group by a callset with two samples by the variant type, and the genotype of each sample, and then count the number of variants in each category. To that end, I tried to annotate the rows the the genotype of the first sample:
both_hcr_hl.annotate_cols( gt0 = both_hcr_hl.GT[0] )

I am getting the following error:

2023-06-18 11:44:06 Hail: ERROR: scope violation: ‘MatrixTable.annotate_cols: field ‘gt0’’ expects an expression indexed by [‘column’]
Found indices [‘row’, ‘column’], with unexpected indices [‘row’]. Invalid fields:
‘GT’ (indices [‘row’, ‘column’])
‘MatrixTable.annotate_cols: field ‘gt0’’ supports aggregation over axes [‘row’], so these fields may appear inside an aggregator function.

Any help would be greatly appreciated,

gil · July 13, 2023, 5:51pm

This is how I solved my problem:

annotate each variant with its type

both_hcr_mt = both_hcr_mt.annotate_rows(
var_type=(hl.expr.functions.case()
.when((both_hcr_mt.alleles.length()==2) & (hl.is_snp(both_hcr_mt.alleles[0],both_hcr_mt.alleles[1])), ‘snp’)
.when((both_hcr_mt.alleles.length()==2) & (~hl.is_snp(both_hcr_mt.alleles[0],both_hcr_mt.alleles[1])), ‘non-snp’)
.when((both_hcr_mt.alleles.length()>2), ‘multi’)
.default(‘other’)
)
)

Extract the genotypes as a hail matrix table

both_hcr_gt_mt = both_hcr_mt.select_entries(both_hcr_mt.GT)
both_hcr_gt_mt = both_hcr_gt_mt.select_rows(both_hcr_gt_mt.var_type)

gt_ht = both_hcr_gt_mt.entries()
gt_ht = gt_ht.key_by(‘locus’, ‘alleles’)

Covert the genotype to a string

gt_ht = gt_ht.annotate(str_GT = hl.str(gt_ht.GT[0])+hl.literal(“/”)+hl.str(gt_ht.GT[1]))
gt_ht = gt_ht.drop(‘GT’)

Generate a table with the genotype for each sample

gt_spread_ht = hl.experimental.spread(gt_ht, ‘s’, ‘str_GT’, key=‘var_type’)

Count the number of each genotype combination between the two samples

count_ht = gt_spread_ht
.group_by(gt_spread_ht.var_type, gt_spread_ht.dv, gt_spread_ht.dallv)
.aggregate(num = hl.agg.count())

And off to Pandas!

count_df = count_ht.to_pandas()

danking · July 13, 2023, 6:09pm

EDIT: oops! I missed your reply. Glad you find something that works!

GT is an entry field. That means there’s a value of GT for each sample at each variant. When you index a GT like GT[0] you’re asking for the first allele of the genotype.

Hail is designed for datasets too large to fit in memory, so it doesn’t let you easily do things like get the genotype from a specific variant for every sample.

You can group the variants into variant types with something like:

mt = mt.group_rows_by(
    mt.variant_type
).aggregate(
    n_alt_alleles = hl.agg.sum(mt.GT.n_alt_alleles)
)

If you want to count the number of hets and homs etc you could use a grouped aggregation.

mt = mt.group_rows_by(
    mt.variant_type
).aggregate(
    n_alt_alleles = hl.agg.group_by(
        mt.GT,
        hl.agg.count()
    )
)

Topic		Replies	Views
Aggregate GT over rows Hail Query & hailctl	2	206	January 9, 2024
Annotate row with sample information based on entry criteria Hail Query & hailctl	2	444	November 13, 2022
Add row annotation with label based on entry field of one sample Hail Query & hailctl	2	487	February 11, 2022
Aggregating over genotypes Hail Query & hailctl	3	367	August 26, 2022
Gene-based sample statistics Hail Query & hailctl	6	736	January 27, 2020