Aggregating over genotypes

Hi,
I have two basic questions that came up while trying to sum alleles across different samples.

  1. How can I convert VCF-like genotypes such as 0/0, 0/1, 1/1 to 0, 1, 2? Then aggregating would be straightforward.

Alternatively:

  2. Let's consider a phenotype table in which I assign my samples to different groups. How could I aggregate (sum) the genotypes per group?

Thanks for your help.

  1. There’s a method on Call fields called n_alt_alleles(). For example, mt.GT.n_alt_alleles(). This is equal to the integer number of non-reference alleles in the call, so 0,1,2 for your example.

  2. Do you want to compute a statistic per variant or per sample? There’s a group_by aggregator that can compute any aggregation for each member of a grouping value.
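To make the two points above concrete, here is a minimal plain-Python sketch of the semantics, not Hail code and not how Hail implements it. It counts non-reference alleles in a VCF-style GT string (what n_alt_alleles() returns for a call) and then sums those counts per variant within each sample group. The helper names, the toy genotypes, and the group labels are all hypothetical.

```python
import re
from collections import defaultdict


def n_alt_alleles(gt):
    """Count non-reference alleles in a VCF-style GT string.

    Accepts unphased ("0/1") and phased ("0|1") calls.
    Returns None if any allele is missing (".").
    """
    alleles = re.split(r"[/|]", gt)
    if any(a == "." for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def sum_by_group(genotypes, groups):
    """For each variant, sum alt-allele counts within each sample group."""
    result = {}
    for variant, calls in genotypes.items():
        sums = defaultdict(int)
        for sample, gt in calls.items():
            n = n_alt_alleles(gt)
            if n is not None:  # missing calls are skipped in this sketch
                sums[groups[sample]] += n
        result[variant] = dict(sums)
    return result


# Hypothetical toy data: variant -> sample -> genotype string.
genotypes = {
    "var1": {"s1": "0/0", "s2": "0/1", "s3": "1/1", "s4": "0/1"},
    "var2": {"s1": "1/1", "s2": "./.", "s3": "0|1", "s4": "0/0"},
}
# Hypothetical phenotype table: sample -> group.
groups = {"s1": "case", "s2": "case", "s3": "control", "s4": "control"}

allele_sums = sum_by_group(genotypes, groups)
```

Here allele_sums holds one dictionary per variant, keyed by group, which is the per-variant, per-group layout that grouping the columns and aggregating with a sum would give.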

Thanks, that already helped. For 1, I could do:

mtf = mt.select_entries(GT = mt.GT.n_alt_alleles())
mtf.make_table().export("data.tsv")

which works out fine.

Trying something similar for 2, I want to compute the statistic per variant. In principle, I get what I want with:

mtf = mtf.annotate_entries(sumof_allels = mtf.GT.n_alt_alleles())
mtf=mtf.group_cols_by(mtf.pheno.ID).aggregate(allele_sum=hl.agg.sum(mtf.sumof_allels))
mtf.show()

However, when I try to generate the ht and export it as above, I get the following error:

ValueError                                Traceback (most recent call last)
/tmp/ipykernel_1422147/3525226286.py in <module>
      4 mtf=mtf.group_cols_by(mtf.pheno.ID).aggregate(allele_sum=hl.agg.sum(mtf.sumof_allels))
      5 mtf.show()
----> 6 ht=mtf.make_table()

<decorator-gen-1322> in make_table(self, separator)

~/anaconda3/lib/python3.9/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

~/anaconda3/lib/python3.9/site-packages/hail/matrixtable.py in make_table(self, separator)
   4096         counts = Counter(col_keys)
   4097         if counts[None] > 0:
-> 4098             raise ValueError("'make_table' encountered a missing column key; ensure all identifiers are defined.\n"
   4099                              "  To fill in key index, run:\n"
   4100                              "    mt = mt.key_cols_by(ck = hl.coalesce(mt.COL_KEY_NAME, 'missing_' + hl.str(hl.scan.count())))")

ValueError: 'make_table' encountered a missing column key; ensure all identifiers are defined.
  To fill in key index, run:
    mt = mt.key_cols_by(ck = hl.coalesce(mt.COL_KEY_NAME, 'missing_' + hl.str(hl.scan.count())))

I have already tried to solve this using key_by, but have not succeeded yet.
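A likely cause, judging from the traceback, is that at least one sample has no value for pheno.ID, so after grouping one of the new column keys is missing, and make_table cannot build column names from a missing key. The plain-Python sketch below only models that check and the fix suggested in the error message (replace a missing key with a placeholder built from its position); it is not Hail code, and the function names and the example keys are hypothetical. In Hail itself, the equivalent would presumably be either dropping those samples before grouping with filter_cols and hl.is_defined, or filling the key with hl.coalesce as the message proposes.

```python
from collections import Counter


def check_col_keys(col_keys):
    """Mimic the check in the traceback: fail if any column key is missing."""
    counts = Counter(col_keys)
    if counts[None] > 0:
        raise ValueError("encountered a missing column key")
    return True


def fill_missing_keys(col_keys, prefix="missing_"):
    """Replace each missing key with a unique placeholder based on its index."""
    return [
        key if key is not None else f"{prefix}{index}"
        for index, key in enumerate(col_keys)
    ]


# Hypothetical group keys after grouping; one group has no label.
group_keys = ["case", "control", None]

fixed_keys = fill_missing_keys(group_keys)
keys_ok = check_col_keys(fixed_keys)
```

With the placeholder in place every column has a defined name, so the check passes; whether to keep or drop the unlabeled group is a design choice that depends on the analysis.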

You could also try:

mtf.GT.export('data.tsv')