Annotating samples with a specific genotype dosage

Hi,

I’m trying to find the right way to annotate samples with a specific genotype dosage, to be used later as a genetic covariate in regressions, but can’t find the right way of doing it. At the moment I’m exploring something like this and then exporting it to a table to annotate the vds:

vds.filter_variants_list([Variant.parse(‘1:123456:G:A’)]).annotate_samples_expr(‘sa.gCOV1 = g.dosage()’)

Help please.

g isn’t in scope in annotate_samples_expr, you’ll need to use gs:


vds2 = vds.filter_variants_list([Variant.parse('1:123456:G:A')])\
          .annotate_samples_expr('sa.gCOV1 = gs.collect()[0].dosage()')

vds = vds.annotate_samples_table(vds2.samples_table(), expr='va.gCOV1 = table.gCOV1')

I could be slightly wrong with syntax, but this model will definitely work.

1 Like

gs I see… didn’t think of it. Thanks a lot :slight_smile:

One more thing I’m just thinking of (sorry), would that work if the genotypes are probabilities and not hard calls?

I don’t think this would work with anything other than BGEN data, actually, since we call .dosage() after collecting the genotype.

Wouldn’t dosage be calculated on the fly in other cases? If not, is there a way in Hail to populate the dosage field using calculation from the hard calls?

In the next version of Hail, as it will be possible to import the GP field from vcf files, will this be stored in Hail as dosage or will you keep the actual probabilities? (which would make more sense, and that’s also true of probabilities in bgens for that matter).

I think dosage() will be missing if called on a hard-call. This probably isn’t ideal!

In 0.2, the dependence on the 5 standard GATK fields (GT, AD, DP, GQ, PL) goes away. It will be super easy to use the GP field, which will be kept as the actual probabilities until you explicitly transform it to allele dosage.

That’s great, thanks.