I don’t recall seeing anything about this, should we index UKBiobank chromosome XY in a special way (like recoding contig to X?), or we just process it like we would any other chromosome?
Also, for studies for which X chromosomes haven’t been split into X and XY, does Hail recognise non-pseudo-autosomal regions of the X and Y based on the genome build and adjust regressions accordingly?
should we index UKBiobank chromosome XY in a special way
I think you’ll need to – XY isn’t a valid contig in GRCh37 or 38. I think mapping it to X will be fine.
Also, for studies for which X chromosomes haven’t been split into X and XY, does Hail recognise non-pseudo-autosomal regions of the X and Y based on the genome build and adjust regressions accordingly?
Hail doesn’t have any logic for sex chromosomes – the regression methods like linear_regression_rows require explicit formulation of the dependent and independent variables by the user. If you simply do:
The reality is that it depends on your data – some formats / datasets encode sex chromosome genotypes as proper haploid genotypes, and some encode them as diploid (males would either be 0/0 or 1/1).
If your males are encoded as diploid, then linear_regression_rows(x=mt.GT.n_alt_alleles(), ...) is probably fine.
If they’re haploid, then I’m not sure what the right thing scientifically to do is. You could multiply the haploid allele dosage by 2 to make it comparable with female diploid dosages, but this may lead to problems.
By the way, I think a “genome build” option should be added to functions such as in_x_nonpar, to prevent issues with newer builds.
As I’m not using the GT field, but the genotype probability, am I right to think Hail will not check how GT is coded, and that gp_dosage(ds.GP) will output something equivalent to diploid, and thus regressions should be ok?
Humm, no, the males’ GT are coded 0 or 1, and the GP is 1,0 (so only two numbers instead of three).
I guess I’m gonna have to recode everything before the imputation, and see what happens, thanks.
Hopefully handling of haploid data is in the works for a future version?
Hopefully handling of haploid data is in the works for a future version
You mean GWAS wrapper methods? That’s something I’d not be surprised to see. It would just be a couple of lines of code at the moment:
rg = mt.locus.dtype.reference_genome
x_contigs = set(rg.x_contigs)
y_contigs = set(rg.y_contigs)
autosomes = [c for c in r.contigs if c not in x_contigs and c not in y_contigs]
mt_auto = hl.filter_intervals(mt, [hl.parse_locus_interval(c, rg) for c in autosomes])
mt_x = hl.filter_intervals(mt, [hl.parse_locus_interval(c, rg) for c in x_contigs])
mt_y = hl.filter_intervals(mt, [hl.parse_locus_interval(c, rg) for c in y_contigs])
gwas_auto = hl.linear_regression_rows(x=hl.gp_dosage(mt_auto.GP),
y=mt_auto.pheno,
covariates=...)
x_chr_var = hl.case().when(mt_x.is_female, hl.gp_dosage(mt_x.GP)).default(hl.sum(mt_x.GP * [0, 2]))
gwas_x = hl.linear_regression_rows(x=x_chr_var,
y=mt_x.pheno,
covariates=...)
# only do regression on males in y
mt_y = mt_y.filter_cols(~mt_y.is_female)
gwas_y = hl.linear_regression_rows(x=mt_y.GP[1], # maybe?
y=mt_y.pheno,
covariates=...)
all_gwas = gwas_auto.union(gwas_x, gwas_y)
hl.gp_dosage(GP) is the same as hl.sum(GP * [0, 1, 2]). This is correct for diploid genotypes.
For haploid genotypes, hl.sum(GP * [0, 2]) will weight the alt allele 2x, so a [0.1, 0.9] would have dosage 1.8. I think this is what you want for comparing males and females on X, but not sure about the statistical validity.
When coding the PAR regions as PAR1/2 instead of X, it’s straightforward to group them with the autosomes, otherwise it does complicate things a bit; in this situation I think dealing with the X as a whole, using this when() function, is a nice touch.
Yet, when I try to run the logistic regression, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<decorator-gen-1048>", line 2, in logistic_regression_rows
File "/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
File "/usr/local/hail/build/distributions/hail-python.zip/hail/methods/statgen.py", line 672, in logistic_regression_rows
File "<decorator-gen-782>", line 2, in _select_all
File "/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
File "/usr/local/hail/build/distributions/hail-python.zip/hail/matrixtable.py", line 2581, in _select_all
File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/local/hail/build/distributions/hail-python.zip/hail/utils/java.py", line 210, in deco
hail.utils.java.FatalError: AssertionError: assertion failed: is_female not in struct{__y: float64, __cov0: float64, __cov1: float64, __cov2: float64, __cov3: float64, __cov4: float64, __cov5: float64, __cov6: float64, __cov7: float64, __cov8: float64}