I want to remove a snp from the regression equation


#1

Hi
I want to remove a specific snp or snps from the regression equation and add it as a covarite

Thank you


#2

Here’s code that will take SNP IDs and run regression using them as a covariate:


my_snps = ['1:1:A:T', '16:29102912:T:TTT']

mt_filt = mt.filter_rows(mt_snps_lit.contains(mt.row_key))
mt_filt = mt.annotate_globals(
    snps = hl.set([hl.parse_variant(x) for x in my_snps]))
mt_filt = mt_filt.filter_rows(mt_filt.snps.contains(mt_filt.row_key))

sample_genos = mt_filt.annotate_cols(
    genotypes = hl.agg.collect(mt_filt.GT.n_alt_alleles()))
mt = mt.annotate_cols(
    snp_covs = sample_genos.cols()[mt.x].genotypes
)

gwas = hl.linear_regression_rows(
    mt.GT.n_alt_alleles(),
    mt.pheno,
    covariates=[1, mt.age, mt.PC1, mt.PC2, *(mt.snp_covs[i] for i in range(len(my_snps)))])

#3

Thank you so much


#4

@tpoterba

one follow-up question, if one wants to use GP instead of GT, all there is to do is change sample_genos like this:

sample_genos = mt_filt.annotate_cols(genotypes = hl.agg.collect(hl.gp_dosage(mt_filt.GP)))

Is that correct?


#5

yup, exactly.

Also note that the second line (mt_snps_lit = hl.literal([hl.parse_variant(x) for x in my_snps])) doesn’t actually work due to a Python limitation. I’ll think about a fix


#6

OK, edited. should work now


#7

Great, thank you.


#8

Sorry, one more question. In the UK Biobank, X and XY genotypes are given only for a subset of the samples, so hl.MatrixTable.union_rows doesn’t work. That’s fine, X and XY can be analysed separately, but what to do if one needs to use the genotype of a variant located on an autosome as a covariate?
Will

mt = mt.annotate_cols(snp_covs = sample_genos.cols()[mt.x].genotypes)

applied on X or XY only annotate the samples in common or will one run into trouble?


#9

if you’ve got autosomes, x, and xy as separate matrix tables, and you have snps on each that you want to use as covariates, then you should build one sample_genos for each, and annotate each matrix table with all 3 results before regression, I think.