I’m trying to group by a callset with two samples by the variant type, and the genotype of each sample, and then count the number of variants in each category. To that end, I tried to annotate the rows the the genotype of the first sample:
both_hcr_hl.annotate_cols( gt0 = both_hcr_hl.GT )
I am getting the following error:
2023-06-18 11:44:06 Hail: ERROR: scope violation: ‘MatrixTable.annotate_cols: field ‘gt0’’ expects an expression indexed by [‘column’]
Found indices [‘row’, ‘column’], with unexpected indices [‘row’]. Invalid fields:
‘GT’ (indices [‘row’, ‘column’])
‘MatrixTable.annotate_cols: field ‘gt0’’ supports aggregation over axes [‘row’], so these fields may appear inside an aggregator function.
Any help would be greatly appreciated,
This is how I solved my problem:
annotate each variant with its type
both_hcr_mt = both_hcr_mt.annotate_rows(
.when((both_hcr_mt.alleles.length()==2) & (hl.is_snp(both_hcr_mt.alleles,both_hcr_mt.alleles)), ‘snp’)
.when((both_hcr_mt.alleles.length()==2) & (~hl.is_snp(both_hcr_mt.alleles,both_hcr_mt.alleles)), ‘non-snp’)
Extract the genotypes as a hail matrix table
both_hcr_gt_mt = both_hcr_mt.select_entries(both_hcr_mt.GT)
both_hcr_gt_mt = both_hcr_gt_mt.select_rows(both_hcr_gt_mt.var_type)
gt_ht = both_hcr_gt_mt.entries()
gt_ht = gt_ht.key_by(‘locus’, ‘alleles’)
Covert the genotype to a string
gt_ht = gt_ht.annotate(str_GT = hl.str(gt_ht.GT)+hl.literal(“/”)+hl.str(gt_ht.GT))
gt_ht = gt_ht.drop(‘GT’)
Generate a table with the genotype for each sample
gt_spread_ht = hl.experimental.spread(gt_ht, ‘s’, ‘str_GT’, key=‘var_type’)
Count the number of each genotype combination between the two samples
count_ht = gt_spread_ht
.group_by(gt_spread_ht.var_type, gt_spread_ht.dv, gt_spread_ht.dallv)
.aggregate(num = hl.agg.count())
And off to Pandas!
count_df = count_ht.to_pandas()
EDIT: oops! I missed your reply. Glad you find something that works!
GT is an entry field. That means there’s a value of GT for each sample at each variant. When you index a GT like GT you’re asking for the first allele of the genotype.
Hail is designed for datasets too large to fit in memory, so it doesn’t let you easily do things like get the genotype from a specific variant for every sample.
You can group the variants into variant types with something like:
mt = mt.group_rows_by(
n_alt_alleles = hl.agg.sum(mt.GT.n_alt_alleles)
If you want to count the number of hets and homs etc you could use a grouped aggregation.
mt = mt.group_rows_by(
n_alt_alleles = hl.agg.group_by(