Linear regression define subsets of phenotypes

pavlos · December 12, 2019, 4:52pm

Hi hail team,
I have a very big list of phenotypes in my matrixtable and want to run linear regression in a subset of these at a time.
I cannot seem to be able to define these subsets by by the name of the phenotype in the matrixtable when defining the y list of phenotypes. Any help please towards the right direction?

mt.phenotype.describe()
--------------------------------------------------------
Type:
        struct {
        ...
        fbc_neut_p: float64, 
        fbc_eo_p: float64, 
        fbc_mono_p: float64, 
        fbc_lymph_p: float64, 
        fbc_baso_p: float64, 
        fbc_ret_p: float64, 
        fbc_hlr_p: float64, 
        fbc_hct: float64, 
        fbc_pct: float64, 
        fbc_hgb: float64, 
        fbc_rbc: float64, 
        fbc_wbc: float64, 
        fbc_mpv: float64, 
        fbc_plt: float64, 
        fbc_rdw: float64, 
        fbc_pdw: float64, 
        fbc_mcv: float64, 
        fbc_mch: float64, 
        fbc_mchc: float64, ...

This works:

phens=[mt.phenotype[i] for i in range(10,100)]

gwas = hl.linear_regression_rows(
        y=phens,
        x=mt.GT.n_alt_alleles(), covariates=[1.0, pcas1[0:3][0]], pass_through=[mt.rsid])
for i in range(0, 100):
        print(f"Plotting {i}:{phens[i]}")
        p = hl.plot.manhattan(gwas.p_value[i], title=f"Interval WGS GWAS Manhattan Plot: {phens[i]}")
        output_file(f"{i}.WGS-manhattan-{phens[i]}.html")
        save(p)
        #p.show()

What I am trying to do is select the phenotypes based on their name:

for pheno in p:
    if pheno.startswith('fbc'):
        fbc.append(pheno)

fbc1=mt.phenotype.select(*fbc)
#THIS does not work:
gwas = hl.linear_regression_rows(
        y=[fbc1],
        x=mt.GT.n_alt_alleles(), covariates=[1.0, pcas1[0:10][0]], pass_through=[mt.rsid])

I get this error:

TypeError: linear_regression_rows: parameter 'y': expected (expression of type float64 
or Sequence[expression of type float64] or Sequence[Sequence[expression of type float64]]), 
found list: 
['fbc_neut_p', 'fbc_eo_p', 'fbc_mono_p', 'fbc_lymph_p', 'fbc_baso_p', 'fbc_ret_p', 'fbc_hlr_p', 'fbc_hct', 'fbc_pct', 'fbc_hgb', 'fbc_rbc', 'fbc_wbc', 'fbc_mpv', 'fbc_plt', 'fbc_rdw', 'fbc_pdw', 'fbc_mcv', 'fbc_mch', 'fbc_mchc', 'fbc_ret', 'fbc_hlr', 'fbc_neut', 'fbc_mono', 'fbc_baso', 'fbc_eo', 'fbc_lymph', 'fbc_irf', 'fbc_myeloid_wbc', 'fbc_gran', 'fbc_eo_baso_sum', 'fbc_neut_eo_sum', 'fbc_baso_neut_sum', 'fbc_gran_p_myeloid_wbc', 'fbc_eo_p_gran', 'fbc_neut_p_gran', 'fbc_baso_p_gran']

I want to use this fbc list to select the phenotypes from the matrixtable and use them as the y variable instead of explicitly defining them by typing them one by one. That will let me print their names in my plots too.

Any help please? Thank you

tpoterba · December 12, 2019, 6:40pm

I think you can do:


for pheno in p:
    if pheno.startswith('fbc'):
        fbc.append(mt.phenotype[pheno])

gwas = hl.linear_regression_rows(
        y=fbc,
        x=mt.GT.n_alt_alleles(), covariates=[1.0, pcas1[0:10][0]], pass_through=[mt.rsid])

pavlos · December 13, 2019, 2:19pm

Thank you very much once again Tim, that’s worked.

pavlos · December 18, 2019, 9:36am

Hi Hail team,
on the same gwas topic please, do you have any recommendations on reducing the size of the gwas_table.export() output table when we have a lot of phenotypes? I have 13 million variants with 4000 phenotypes and although the hail table is saved to disk quickly the export .tsv table becomes 2TB in size and it has not finished exporting after 6 hours.

Thank you,
Pavlos

tpoterba · December 18, 2019, 1:01pm

I would recommend not creating an all-phenotypes TSV file, but rather leaving things as a Hail table until specific phenotypes need to be distributed.

pavlos · December 18, 2019, 3:46pm

Thanks Tim.
Is there a way to run gwas_linear_regression_rows and include samples that do not have a value for all measurements?
I am running the following command on a set of 230 phenotypes:

        y=[nmr],
        x=mt.GT.n_alt_alleles(), covariates=[1.0]+pcas, pass_through=[mt.rsid])

WARN: 10956 of 11863 samples have a missing phenotype or covariate.
2019-12-18 16:14:34 Hail: INFO: linear_regression_rows[0]: running on 907 samples for 230 response variables y,

The above runs only on 907 samples which contain NO missing values in all 230 measurements of the nmr phenotypes.
But we have 11400 samples that have measurements in subsets of this nmr list of phenotypes. Is there a way I can keep them in the analysis?

Thanks,
Pavlos

tpoterba · December 18, 2019, 4:32pm

Hi Pavlos, take a look at the docs for linear_regression_rows (especially the orange note). We subset to defined samples within each phenotype group so that we can use a very efficient algorithm to compute the solution, but if you pass a list of lists, then each inner list is treated as a separate group:

y = [[pheno] for pheno in nmr]

pavlos · December 18, 2019, 4:36pm

Hi Tim,
thanks, I have seen the docs. But this requires me if I understand correctly to know in advance the list of phenotypes for which i have no missing data so I can use them as one individual list. And make additional lists for subsets of phenotypes to include more samples that are not missing any measurements in those sublists?

tpoterba · December 18, 2019, 4:52pm

Yes, I think that assessment is correct.

We’ve been discussing possible algorithms to automatically compute roughly optimal phenotype groupings – I’ll keep you posted!

Topic		Replies	Views
Regression with multiple phenotypes with varying degrees of missingness Hail Query & hailctl	5	673	April 27, 2020
GWAS on subset of UKBioBank Hail Query & hailctl	26	1573	July 13, 2021
[Feature] Chained linear regression Updates	0	982	October 26, 2018
PheWAS on DNAnexus UKB RAP Hail Query & hailctl	13	853	December 21, 2022
Functions similar to linreg3 and hl.map in HAIL 0.2 Help [0.1]	6	894	July 17, 2018

Linear regression define subsets of phenotypes

Related topics