I want to remove a snp from the regression equation

I want to remove a specific snp or snps from the regression equation and add it as a covarite

Thank you

Here’s code that will take SNP IDs and run regression using them as a covariate:

my_snps = ['1:1:A:T', '16:29102912:T:TTT']

mt_filt = mt.filter_rows(mt_snps_lit.contains(mt.row_key))
mt_filt = mt.annotate_globals(
    snps = hl.set([hl.parse_variant(x) for x in my_snps]))
mt_filt = mt_filt.filter_rows(mt_filt.snps.contains(mt_filt.row_key))

sample_genos = mt_filt.annotate_cols(
    genotypes = hl.agg.collect(mt_filt.GT.n_alt_alleles()))
mt = mt.annotate_cols(
    snp_covs = sample_genos.cols()[mt.x].genotypes

gwas = hl.linear_regression_rows(
    covariates=[1, mt.age, mt.PC1, mt.PC2, *(mt.snp_covs[i] for i in range(len(my_snps)))])
1 Like

Thank you so much


one follow-up question, if one wants to use GP instead of GT, all there is to do is change sample_genos like this:

sample_genos = mt_filt.annotate_cols(genotypes = hl.agg.collect(hl.gp_dosage(mt_filt.GP)))

Is that correct?

yup, exactly.

Also note that the second line (mt_snps_lit = hl.literal([hl.parse_variant(x) for x in my_snps])) doesn’t actually work due to a Python limitation. I’ll think about a fix

OK, edited. should work now

Great, thank you.

Sorry, one more question. In the UK Biobank, X and XY genotypes are given only for a subset of the samples, so hl.MatrixTable.union_rows doesn’t work. That’s fine, X and XY can be analysed separately, but what to do if one needs to use the genotype of a variant located on an autosome as a covariate?

mt = mt.annotate_cols(snp_covs = sample_genos.cols()[mt.x].genotypes)

applied on X or XY only annotate the samples in common or will one run into trouble?

if you’ve got autosomes, x, and xy as separate matrix tables, and you have snps on each that you want to use as covariates, then you should build one sample_genos for each, and annotate each matrix table with all 3 results before regression, I think.