Empty logistic regression

Hi,

I ran a logistic regression as follows:

gwas = vds.logreg('wald', 'sa.pheno.E', covariates=['sa.pheno.age', 'sa.pheno.weight', 'sa.pheno.sex', 'sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.pca.PC4', 'sa.pca.PC5'], root='va.logreg', use_dosages=True)
gwas.write('/mnt/volume/regressions/E.regression1.vds')


2017-10-23 18:14:42 Hail: INFO: Reading table to impute column types
2017-10-23 18:14:42 Hail: INFO: Finished type imputation
Loading column `SID' as type String (imputed)
Loading column `PC1' as type Double (imputed)
Loading column `PC2' as type Double (imputed)
Loading column `PC3' as type Double (imputed)
Loading column `PC4' as type Double (imputed)
Loading column `PC5' as type Double (imputed)
[Stage 10:===================================================>(1794 + 1) / 1795]2017-10-23 19:05:17 Hail: WARN: 154 of 7135 samples have a missing phenotype or covariate.
2017-10-23 19:05:17 Hail: INFO: Running wald logistic regression on 6981 samples with 9 covariates including intercept…
[Stage 11:===================================================>(1794 + 1) / 1795]Read/write took: 7:27:35.357489
GWAS took: 2 days, 19:36:55.387864

gwas = hc.read('/mnt/volume/regressions/E.regression1.vds')
gwas.filter_variants_expr('isnan(va.logreg.pval)', keep=False).export_variants('/mnt/volume/plots/E.regression1.4manhattan', 'CHR = v.contig, POS = v.start, P = va.logreg.pval')

But E.regression1.4manhattan contains only the header. What could have happened here? What should I check?

What happens if you just do:

print(gwas.filter_variants_expr('isnan(va.logreg.pval)', keep=False).count())

Could be an export problem, an issue with isnan (I remember this has been a problem in the past) or something else entirely, so this will help us figure that out.

Hmm, this seems empty:

print(gwas.filter_variants_expr('isnan(va.logreg.pval)', keep=False).count())
[Stage 19:===================================================>(1784 + 8) / 1795](7135L, 0L)

What about pre-filter?

Next thing I’d do is

gwas.variants_table().annotate('logreg = va.logreg')\
    .select(['v', 'logreg'])\
    .show(25)

just to see if all the pvals are actually nan.

Only NAs:

gwas.variants_table().annotate('logreg = va.logreg').select(['v', 'logreg']).show(25)
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| v            | logreg.beta | logreg.se | logreg.zstat | logreg.pval | logreg.fit.nIter | logreg.fit.converged | logreg.fit.exploded |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| Variant      | Double      | Double    | Double       | Double      | Int              | Boolean              | Boolean             |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| 1:768448:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:768819:C:T | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:769223:C:G | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:769885:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:770085:G:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:770181:A:G | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:770377:A:T | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:771823:T:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:771967:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:772755:A:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:772927:C:T | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:773106:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:773124:T:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+

It took almost three days to run, so I need to find out why before trying again. The phenotype is coded as 0/1.

Hail: WARN: 154 of 7135 samples have a missing phenotype or covariate.
2017-10-23 19:05:17 Hail: INFO: Running wald logistic regression on 6981 samples with 9 covariates including intercept…

so 6981 samples have no missing phenotype or covariates. Any ideas?

What do the genotypes look like?

The best way to debug is probably to filter_variants_list([Variant.parse('1:768448:G:A')]) before rerunning logreg.
That should take a second, not two days (ouch!)

The converged column is false for every variant there. That could be the problem. Summoning @jbloom!

test = gwas.filter_variants_list([Variant.parse('1:768448:G:A')]).logreg('wald', 'sa.pheno.E', covariates=['sa.pheno.age', 'sa.pheno.weight', 'sa.pheno.sex', 'sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.pca.PC4', 'sa.pca.PC5'], root='va.test', use_dosages=True)

test.variants_table().annotate('logreg = va.test').select(['v', 'logreg']).show(25)

+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| v            | logreg.beta | logreg.se | logreg.zstat | logreg.pval | logreg.fit.nIter | logreg.fit.converged | logreg.fit.exploded |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| Variant      | Double      | Double    | Double       | Double      | Int              | Boolean              | Boolean             |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| 1:768448:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+

Same thing… is there a way to check the genotypes? They were imported from imputed VCFs.

And ouch indeed… I'm using 8 cores and 64 GB of RAM. The next step will be UK Biobank, so I'm wondering what kind of resources I'll need to do that in a decent time, like a week.

Genotypes were imputed at Sanger. According to the website:

Genotype or haplotype specific metrics are encoded in the FORMAT columns of the VCF. Depending on the pipeline choice, these may include:

FORMAT/GT: the genotype or haplotype call. Original genotype calls will not be altered.
FORMAT/ADS: the per-haplotype allele dosage (discarded in SHAPEIT2+PBWT pipeline as it can be used to reconstruct the phase).
FORMAT/DS: genotype dosage (calculated from FORMAT/ADS).
FORMAT/GP: the posterior genotype probabilities (calculated from FORMAT/ADS).

Here is what it looks like in my files:

22 16050435 . T C . PASS RefPanelAF=0.000323375;AN=14270;AC=0;INFO=1.24985e-09 GT:ADS:DS:GP 0|0:0.05,0.05:0.1:0.9025,0.095,0.0025 0|0:0.05,0.05:0.1:0.9025,0.095,0.0025
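
As a plain-Python aside (no Hail involved), the DS and GP values in that record are at least self-consistent: DS is the expected alternate-allele count implied by GP, i.e. 0·P(0/0) + 1·P(0/1) + 2·P(1/1).

```python
# Sanity check on the record above: DS should be the expected alternate-allele
# dosage implied by the genotype probabilities GP.
gp = [0.9025, 0.095, 0.0025]   # GP field from the VCF record above
ds = 0.1                        # DS field from the same record

expected = sum(n_alt * p for n_alt, p in enumerate(gp))
assert abs(expected - ds) < 1e-9   # 0.095 + 2 * 0.0025 == 0.1
```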

Does Hail use the GP field when use_dosages=True?

Ah, this explains it. Hail doesn't really support any VCF FORMAT fields other than GT, AD, DP, GQ, and PL. This is changing in 0.2!

To peek at the genotypes, you can do something like

vds.genotypes_table().select(['v', 's', 'g']).show()

I assume that this will show missing values or hard calls.

Note that Hail does support BGEN files. I'm not sure how long logistic regression will take on 8 cores, though; a week may be a good estimate.


It seems to show hard calls only:

test.genotypes_table().select(['v', 's', 'g']).show()
+--------------+-----------+----------------+
| v            | s         | g              |
+--------------+-----------+----------------+
| Variant      | String    | Genotype       |
+--------------+-----------+----------------+
| 1:768448:G:A | TG1100002 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100021 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100031 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100041 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100062 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100082 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100092 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100101 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100111 | 0/1:.:.:.:PL=. |
| 1:768448:G:A | TG1100121 | 0/0:.:.:.:PL=. |
+--------------+-----------+----------------+
showing top 10 rows

So, if I convert my VCFs to BGEN and then import them into Hail, that should work?

Yes, it should! @jbloom why did it take so long to run? I’d have thought that it would have short-circuited in the entirely-missing case.

bcftools converts into .gen.gz. Will I need to uncompress the files, or can import_gen deal with the .gen.gz format?

If that .gz is actually a block-gzipped file, rename the file to .bgz and try that. Otherwise, you shouldn't use the .gz: while Hail can read it, normal gzipped files are not splittable, so all of the work will happen in one Spark task, probably OOMing somewhere and losing all fault tolerance and parallelism.
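
If you're not sure which kind of gzip bcftools gave you, you can check the header yourself: per the BGZF spec (the block-gzip format used by bgzip/tabix), a BGZF file is an ordinary gzip file with an extra header subfield whose ID bytes, at file offsets 12–13, are 'BC'. A minimal sketch (the filename here is just a placeholder):

```python
import os

def is_bgzf(path):
    """True if the file starts with a BGZF (block-gzip) header: the gzip
    magic bytes followed by an extra subfield with ID bytes b'BC' at
    offsets 12-13. Plain gzip files fail this check."""
    with open(path, 'rb') as f:
        head = f.read(14)
    return len(head) == 14 and head[:2] == b'\x1f\x8b' and head[12:14] == b'BC'

# Placeholder filename; point this at your own bcftools output.
path = 'chr22.gen.gz'
if os.path.exists(path) and is_bgzf(path):
    # Block-gzipped: renaming to .bgz is enough for Hail to read it in parallel.
    os.rename(path, path[:-3] + '.bgz')
```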

In 0.1, if all hard calls are missing, it short-circuits. But if all dosages are missing, you get a vector of NAs, which I think results in the math going through but the convergence criterion testing Double.NaN < tol, which is false. In other words, instead of taking, say, 4 iterations, it runs to the limit of 25 iterations and gives up. Hopefully that means it will run faster on non-missing data.
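
That failure mode is easy to mimic outside Hail. This toy loop (purely illustrative, not Hail's actual fitter) stops when the step size drops below tol; since NaN < tol is always False, a NaN step can never satisfy the test, so the loop burns through every allowed iteration:

```python
def toy_fitter(step_fn, tol=1e-6, max_iter=25):
    """Minimal stand-in for an iterative fitter: converges when the step
    size drops below tol, otherwise gives up after max_iter iterations."""
    for i in range(1, max_iter + 1):
        step = step_fn(i)
        if step < tol:            # NaN < tol is False, so NaN never converges
            return i, True
    return max_iter, False

# Well-behaved case: step size shrinks geometrically, converges early.
n_iter, converged = toy_fitter(lambda i: 0.5 ** i)
assert converged and n_iter < 25

# All-missing-dosage case: every step is NaN, so it runs to the limit.
n_iter, converged = toy_fitter(lambda i: float('nan'))
assert not converged and n_iter == 25
```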

I'll update here on whether this is what's actually happening, and if so, I'll fix it so that in the future any NaN in the genotypes causes a short circuit.