Linear regression define subsets of phenotypes

Hi hail team,
I have a very big list of phenotypes in my matrixtable and want to run linear regression in a subset of these at a time.
I cannot seem to be able to define these subsets by by the name of the phenotype in the matrixtable when defining the y list of phenotypes. Any help please towards the right direction?

mt.phenotype.describe()
--------------------------------------------------------
Type:
        struct {
        ...
        fbc_neut_p: float64, 
        fbc_eo_p: float64, 
        fbc_mono_p: float64, 
        fbc_lymph_p: float64, 
        fbc_baso_p: float64, 
        fbc_ret_p: float64, 
        fbc_hlr_p: float64, 
        fbc_hct: float64, 
        fbc_pct: float64, 
        fbc_hgb: float64, 
        fbc_rbc: float64, 
        fbc_wbc: float64, 
        fbc_mpv: float64, 
        fbc_plt: float64, 
        fbc_rdw: float64, 
        fbc_pdw: float64, 
        fbc_mcv: float64, 
        fbc_mch: float64, 
        fbc_mchc: float64, ...
   

This works:

phens=[mt.phenotype[i] for i in range(10,100)]

gwas = hl.linear_regression_rows(
        y=phens,
        x=mt.GT.n_alt_alleles(), covariates=[1.0, pcas1[0:3][0]], pass_through=[mt.rsid])
for i in range(0, 100):
        print(f"Plotting {i}:{phens[i]}")
        p = hl.plot.manhattan(gwas.p_value[i], title=f"Interval WGS GWAS Manhattan Plot: {phens[i]}")
        output_file(f"{i}.WGS-manhattan-{phens[i]}.html")
        save(p)
        #p.show()

What I am trying to do is select the phenotypes based on their name:

for pheno in p:
    if pheno.startswith('fbc'):
        fbc.append(pheno)

fbc1=mt.phenotype.select(*fbc)
#THIS does not work:
gwas = hl.linear_regression_rows(
        y=[fbc1],
        x=mt.GT.n_alt_alleles(), covariates=[1.0, pcas1[0:10][0]], pass_through=[mt.rsid])

I get this error:

TypeError: linear_regression_rows: parameter 'y': expected (expression of type float64 
or Sequence[expression of type float64] or Sequence[Sequence[expression of type float64]]), 
found list: 
['fbc_neut_p', 'fbc_eo_p', 'fbc_mono_p', 'fbc_lymph_p', 'fbc_baso_p', 'fbc_ret_p', 'fbc_hlr_p', 'fbc_hct', 'fbc_pct', 'fbc_hgb', 'fbc_rbc', 'fbc_wbc', 'fbc_mpv', 'fbc_plt', 'fbc_rdw', 'fbc_pdw', 'fbc_mcv', 'fbc_mch', 'fbc_mchc', 'fbc_ret', 'fbc_hlr', 'fbc_neut', 'fbc_mono', 'fbc_baso', 'fbc_eo', 'fbc_lymph', 'fbc_irf', 'fbc_myeloid_wbc', 'fbc_gran', 'fbc_eo_baso_sum', 'fbc_neut_eo_sum', 'fbc_baso_neut_sum', 'fbc_gran_p_myeloid_wbc', 'fbc_eo_p_gran', 'fbc_neut_p_gran', 'fbc_baso_p_gran']

I want to use this fbc list to select the phenotypes from the matrixtable and use them as the y variable instead of explicitly defining them by typing them one by one. That will let me print their names in my plots too.

Any help please? Thank you

I think you can do:


for pheno in p:
    if pheno.startswith('fbc'):
        fbc.append(mt.phenotype[pheno])

gwas = hl.linear_regression_rows(
        y=fbc,
        x=mt.GT.n_alt_alleles(), covariates=[1.0, pcas1[0:10][0]], pass_through=[mt.rsid])

Thank you very much once again Tim, that’s worked.

Hi Hail team,
on the same gwas topic please, do you have any recommendations on reducing the size of the gwas_table.export() output table when we have a lot of phenotypes? I have 13 million variants with 4000 phenotypes and although the hail table is saved to disk quickly the export .tsv table becomes 2TB in size and it has not finished exporting after 6 hours.

Thank you,
Pavlos

I would recommend not creating an all-phenotypes TSV file, but rather leaving things as a Hail table until specific phenotypes need to be distributed.

Thanks Tim.
Is there a way to run gwas_linear_regression_rows and include samples that do not have a value for all measurements?
I am running the following command on a set of 230 phenotypes:

        y=[nmr],
        x=mt.GT.n_alt_alleles(), covariates=[1.0]+pcas, pass_through=[mt.rsid])
WARN: 10956 of 11863 samples have a missing phenotype or covariate.
2019-12-18 16:14:34 Hail: INFO: linear_regression_rows[0]: running on 907 samples for 230 response variables y,

The above runs only on 907 samples which contain NO missing values in all 230 measurements of the nmr phenotypes.
But we have 11400 samples that have measurements in subsets of this nmr list of phenotypes. Is there a way I can keep them in the analysis?

Thanks,
Pavlos

Hi Pavlos, take a look at the docs for linear_regression_rows (especially the orange note). We subset to defined samples within each phenotype group so that we can use a very efficient algorithm to compute the solution, but if you pass a list of lists, then each inner list is treated as a separate group:

y = [[pheno] for pheno in nmr]

Hi Tim,
thanks, I have seen the docs. But this requires me if I understand correctly to know in advance the list of phenotypes for which i have no missing data so I can use them as one individual list. And make additional lists for subsets of phenotypes to include more samples that are not missing any measurements in those sublists?

Yes, I think that assessment is correct.

We’ve been discussing possible algorithms to automatically compute roughly optimal phenotype groupings – I’ll keep you posted!