UKBiobank chromosome XY


#1

Hi,

I don’t recall seeing anything about this, should we index UKBiobank chromosome XY in a special way (like recoding contig to X?), or we just process it like we would any other chromosome?

Also, for studies for which X chromosomes haven’t been split into X and XY, does Hail recognise non-pseudo-autosomal regions of the X and Y based on the genome build and adjust regressions accordingly?


#2

should we index UKBiobank chromosome XY in a special way

I think you’ll need to – XY isn’t a valid contig in GRCh37 or 38. I think mapping it to X will be fine.

Also, for studies for which X chromosomes haven’t been split into X and XY, does Hail recognise non-pseudo-autosomal regions of the X and Y based on the genome build and adjust regressions accordingly?

Hail doesn’t have any logic for sex chromosomes – the regression methods like linear_regression_rows require explicit formulation of the dependent and independent variables by the user. If you simply do:

hl.linear_regression_rows(y=mt.pheno, x=mt.GT.n_alt_alleles(), covariates=[1])

then the sex chromosomes won’t be handled properly.


#3

Hi Tim,

oh, I had not realised that… is there any tutorial or help anywhere showing how to handle sex chromosomes (i.e. haploid regions) properly?


#4

I made an issue to add a how-to here.

The reality is that it depends on your data – some formats / datasets encode sex chromosome genotypes as proper haploid genotypes, and some encode them as diploid (males would either be 0/0 or 1/1).

If your males are encoded as diploid, then linear_regression_rows(x=mt.GT.n_alt_alleles(), ...) is probably fine.

If they’re haploid, then I’m not sure what the right thing scientifically to do is. You could multiply the haploid allele dosage by 2 to make it comparable with female diploid dosages, but this may lead to problems.


#5

I see… is there a way in Hail to check how the male genotypes have been encoded for a small non-PAR region of the X?


#6

I think something like the following:

# filter_intervals makes the downstream filter faster
mt = hl.filter_intervals(mt, [hl.parse_locus_interval('X')])
mt = mt.filter_rows(mt.locus.in_x_nonpar())
print(mt.aggregate_entries(hl.agg.counter(mt.GT.ploidy)))


#7

Hi @tpoterba,

Apologies for the delay, teaching and marking…

Simpler than what I had in mind, thank you.

By the way, I think a “genome build” option should be added to functions such as in_x_nonpar, to prevent issues with newer builds.

As I’m not using the GT field, but the genotype probability, am I right to think Hail will not check how GT is coded, and that gp_dosage(ds.GP) will output something equivalent to diploid, and thus regressions should be ok?


#8

By the way, I think a “genome build” option should be added to functions such as in_x_nonpar, to prevent issues with newer builds.

We already have something better – the reference genome is part of the locus type:

In [6]: hl.parse_locus('X:154931045', reference_genome='GRCh37').in_x_par().show()
+--------+
| <expr> |
+--------+
| bool   |
+--------+
| true   |
+--------+

In [8]: hl.parse_locus('chrX:154931045', reference_genome='GRCh38').in_x_par().show()
+--------+
| <expr> |
+--------+
| bool   |
+--------+
| false  |
+--------+

am I right to think Hail will not check how GT is coded

If your GT is coded as a proper haploid, then gp_dosage(ds.GP) will be incorrect. If males are coded as 0/0 or 1/1 diploid, then it will be correct.


#9

looking back, I think you don’t actually have GT. I have no idea how your GP is coded, though.


#10

Directly in the locus, I see, great.

Humm, no, the males’ GT are coded 0 or 1, and the GP is 1,0 (so only two numbers instead of three).
I guess I’m gonna have to recode everything before the imputation, and see what happens, thanks.

Hopefully handling of haploid data is in the works for a future version?


#11

Hopefully handling of haploid data is in the works for a future version

You mean GWAS wrapper methods? That’s something I’d not be surprised to see. It would just be a couple of lines of code at the moment:

rg = mt.locus.dtype.reference_genome
x_contigs = set(rg.x_contigs)
y_contigs = set(rg.y_contigs)
autosomes = [c for c in r.contigs if c not in x_contigs and c not in y_contigs]
mt_auto = hl.filter_intervals(mt, [hl.parse_locus_interval(c, rg) for c in autosomes])
mt_x = hl.filter_intervals(mt, [hl.parse_locus_interval(c, rg) for c in x_contigs])
mt_y = hl.filter_intervals(mt, [hl.parse_locus_interval(c, rg) for c in y_contigs])

gwas_auto = hl.linear_regression_rows(x=hl.gp_dosage(mt_auto.GP),
    y=mt_auto.pheno,
    covariates=...)

x_chr_var = hl.case().when(mt_x.is_female, hl.gp_dosage(mt_x.GP)).default(hl.sum(mt_x.GP * [0, 2]))
gwas_x = hl.linear_regression_rows(x=x_chr_var,
    y=mt_x.pheno,
    covariates=...)

# only do regression on males in y
mt_y = mt_y.filter_cols(~mt_y.is_female)
gwas_y = hl.linear_regression_rows(x=mt_y.GP[1], # maybe?
    y=mt_y.pheno,
    covariates=...)

all_gwas = gwas_auto.union(gwas_x, gwas_y)

#12

no!!!


#13

Awesome, that’s great news! Thanks a lot for the code, I definitely wouldn’t have managed that by myself.

I’m not sure I get the .default(hl.sum(mt_x.GP * [0, 2])) part though, could you tell me what it does?


#14

there are only two cases there, that’s equivalent to

hl.cond(
    mt_x.is_female, 
    hl.gp_dosage(mt_x.GP),
    hl.sum(mt_x.GP * [0, 2]))

hl.gp_dosage(GP) is the same as hl.sum(GP * [0, 1, 2]). This is correct for diploid genotypes.

For haploid genotypes, hl.sum(GP * [0, 2]) will weight the alt allele 2x, so a [0.1, 0.9] would have dosage 1.8. I think this is what you want for comparing males and females on X, but not sure about the statistical validity.


#15

Oh, I see how it works now…

Thanks so much for your help.


#16

Sorry, one more thing, is it possible to add an “or” condition for the PAR regions in the .when() statement?
Something like this maybe?

x_chr_var = hl.case().when(mt_x.is_female | hl.parse_locus(mt.rows().locus, reference_genome=‘GRCh37’).in_x_par() , hl.gp_dosage(mt_x.GP)).default(hl.sum(mt_x.GP * [0, 2]))


#17

yes, totally. this should be mt_x.is_female | mt_x.locus.in_x_par() though


#18

although really the best thing would be to include the PAR with the autosomes, actually


#19

Neat, thanks.

When coding the PAR regions as PAR1/2 instead of X, it’s straightforward to group them with the autosomes, otherwise it does complicate things a bit; in this situation I think dealing with the X as a whole, using this when() function, is a nice touch. :slight_smile:


#20

I’m getting an error regarding is_female, and I don’t get why.

I definitely have a Boolean column field named is_female:

>>> mt_auto.describe()
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'sex': int32
    'PC1': float64
    'PC2': float64
    'PC3': float64
    'PC4': float64
    'PC5': float64
    'age': float64
    'weight': float64
    'height': float64
    'is_female': bool
    'pheno_case': bool

Yet, when I try to run the logistic regression, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<decorator-gen-1048>", line 2, in logistic_regression_rows
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/methods/statgen.py", line 672, in logistic_regression_rows
  File "<decorator-gen-782>", line 2, in _select_all
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/typecheck/check.py", line 560, in wrapper
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/matrixtable.py", line 2581, in _select_all
  File "/usr/local/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/local/hail/build/distributions/hail-python.zip/hail/utils/java.py", line 210, in deco
hail.utils.java.FatalError: AssertionError: assertion failed: is_female not in struct{__y: float64, __cov0: float64, __cov1: float64, __cov2: float64, __cov3: float64, __cov4: float64, __cov5: float64, __cov6: float64, __cov7: float64, __cov8: float64}

Any idea?