UKBiobank chromosome XY

Hi,

I don’t recall seeing anything about this, should we index UKBiobank chromosome XY in a special way (like recoding contig to X?), or we just process it like we would any other chromosome?

Also, for studies for which X chromosomes haven’t been split into X and XY, does Hail recognise non-pseudo-autosomal regions of the X and Y based on the genome build and adjust regressions accordingly?

1 Like

should we index UKBiobank chromosome XY in a special way

I think you’ll need to – XY isn’t a valid contig in GRCh37 or 38. I think mapping it to X will be fine.

Also, for studies for which X chromosomes haven’t been split into X and XY, does Hail recognise non-pseudo-autosomal regions of the X and Y based on the genome build and adjust regressions accordingly?

Hail doesn’t have any logic for sex chromosomes – the regression methods like linear_regression_rows require explicit formulation of the dependent and independent variables by the user. If you simply do:

hl.linear_regression_rows(y=mt.pheno, x=mt.GT.n_alt_alleles(), covariates=[1])

then the sex chromosomes won’t be handled properly.

Hi Tim,

oh, I had not realised that… is there any tutorial or help anywhere showing how to handle sex chromosomes (i.e. haploid regions) properly?

I made an issue to add a how-to here.

The reality is that it depends on your data – some formats / datasets encode sex chromosome genotypes as proper haploid genotypes, and some encode them as diploid (males would either be 0/0 or 1/1).

If your males are encoded as diploid, then linear_regression_rows(x=mt.GT.n_alt_alleles(), ...) is probably fine.

If they’re haploid, then I’m not sure what the right thing scientifically to do is. You could multiply the haploid allele dosage by 2 to make it comparable with female diploid dosages, but this may lead to problems.

I see… is there a way in Hail to check how the male genotypes have been encoded for a small non-PAR region of the X?

I think something like the following:

# filter_intervals makes the downstream filter faster
mt = hl.filter_intervals(mt, [hl.parse_locus_interval('X')])
mt = mt.filter_rows(mt.locus.in_x_nonpar())
print(mt.aggregate_entries(hl.agg.counter(mt.GT.ploidy)))

Hi @tpoterba,

Apologies for the delay, teaching and marking…

Simpler than what I had in mind, thank you.

By the way, I think a “genome build” option should be added to functions such as in_x_nonpar, to prevent issues with newer builds.

As I’m not using the GT field, but the genotype probability, am I right to think Hail will not check how GT is coded, and that gp_dosage(ds.GP) will output something equivalent to diploid, and thus regressions should be ok?

By the way, I think a “genome build” option should be added to functions such as in_x_nonpar, to prevent issues with newer builds.

We already have something better – the reference genome is part of the locus type:

In [6]: hl.parse_locus('X:154931045', reference_genome='GRCh37').in_x_par().show()
+--------+
| <expr> |
+--------+
| bool   |
+--------+
| true   |
+--------+

In [8]: hl.parse_locus('chrX:154931045', reference_genome='GRCh38').in_x_par().show()
+--------+
| <expr> |
+--------+
| bool   |
+--------+
| false  |
+--------+

am I right to think Hail will not check how GT is coded

If your GT is coded as a proper haploid, then gp_dosage(ds.GP) will be incorrect. If males are coded as 0/0 or 1/1 diploid, then it will be correct.

looking back, I think you don’t actually have GT. I have no idea how your GP is coded, though.

Directly in the locus, I see, great.

Humm, no, the males’ GT are coded 0 or 1, and the GP is 1,0 (so only two numbers instead of three).
I guess I’m gonna have to recode everything before the imputation, and see what happens, thanks.

Hopefully handling of haploid data is in the works for a future version?

Hopefully handling of haploid data is in the works for a future version

You mean GWAS wrapper methods? That’s something I’d not be surprised to see. It would just be a couple of lines of code at the moment:

rg = mt.locus.dtype.reference_genome
x_contigs = set(rg.x_contigs)
y_contigs = set(rg.y_contigs)
autosomes = [c for c in r.contigs if c not in x_contigs and c not in y_contigs]
mt_auto = hl.filter_intervals(mt, [hl.parse_locus_interval(c, rg) for c in autosomes])
mt_x = hl.filter_intervals(mt, [hl.parse_locus_interval(c, rg) for c in x_contigs])
mt_y = hl.filter_intervals(mt, [hl.parse_locus_interval(c, rg) for c in y_contigs])

gwas_auto = hl.linear_regression_rows(x=hl.gp_dosage(mt_auto.GP),
    y=mt_auto.pheno,
    covariates=...)

x_chr_var = hl.case().when(mt_x.is_female, hl.gp_dosage(mt_x.GP)).default(hl.sum(mt_x.GP * [0, 2]))
gwas_x = hl.linear_regression_rows(x=x_chr_var,
    y=mt_x.pheno,
    covariates=...)

# only do regression on males in y
mt_y = mt_y.filter_cols(~mt_y.is_female)
gwas_y = hl.linear_regression_rows(x=mt_y.GP[1], # maybe?
    y=mt_y.pheno,
    covariates=...)

all_gwas = gwas_auto.union(gwas_x, gwas_y)
1 Like

no!!!

Awesome, that’s great news! Thanks a lot for the code, I definitely wouldn’t have managed that by myself.

I’m not sure I get the .default(hl.sum(mt_x.GP * [0, 2])) part though, could you tell me what it does?

there are only two cases there, that’s equivalent to

hl.cond(
    mt_x.is_female, 
    hl.gp_dosage(mt_x.GP),
    hl.sum(mt_x.GP * [0, 2]))

hl.gp_dosage(GP) is the same as hl.sum(GP * [0, 1, 2]). This is correct for diploid genotypes.

For haploid genotypes, hl.sum(GP * [0, 2]) will weight the alt allele 2x, so a [0.1, 0.9] would have dosage 1.8. I think this is what you want for comparing males and females on X, but not sure about the statistical validity.

1 Like

Oh, I see how it works now…

Thanks so much for your help.

Sorry, one more thing, is it possible to add an “or” condition for the PAR regions in the .when() statement?
Something like this maybe?

x_chr_var = hl.case().when(mt_x.is_female | hl.parse_locus(mt.rows().locus, reference_genome=‘GRCh37’).in_x_par() , hl.gp_dosage(mt_x.GP)).default(hl.sum(mt_x.GP * [0, 2]))

yes, totally. this should be mt_x.is_female | mt_x.locus.in_x_par() though

1 Like

although really the best thing would be to include the PAR with the autosomes, actually

Neat, thanks.

When coding the PAR regions as PAR1/2 instead of X, it’s straightforward to group them with the autosomes, otherwise it does complicate things a bit; in this situation I think dealing with the X as a whole, using this when() function, is a nice touch. :slight_smile:

I’m getting an error regarding is_female, and I don’t get why.

I definitely have a Boolean column field named is_female:

>>> mt_auto.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'sex': int32
    'PC1': float64
    'PC2': float64
    'PC3': float64
    'PC4': float64
    'PC5': float64
    'age': float64
    'weight': float64
    'height': float64
    'is_female': bool
    'pheno_case': bool

Yet, when I try to run the logistic regression, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-1048>", line 2, in logistic_regression_rows
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/methods/statgen.py", line 672, in logistic_regression_rows
  File "<decorator-gen-782>", line 2, in _select_all
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/matrixtable.py", line 2581, in _select_all
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/utils/java.py", line 210, in deco
hail.utils.java.FatalError: AssertionError: assertion failed: is_female not in struct{__y: float64, __cov0: float64, __cov1: float64, __cov2: float64, __cov3: float64, __cov4: float64, __cov5: float64, __cov6: float64, __cov7: float64, __cov8: float64}

Any idea?