Empty logistic regression

Hi,

I ran a logistic regression as follows:

gwas = vds.logreg('wald', 'sa.pheno.E', covariates=['sa.pheno.age', 'sa.pheno.weight', 'sa.pheno.sex', 'sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.pca.PC4', 'sa.pca.PC5'], root='va.logreg', use_dosages=True)
gwas.write('/mnt/volume/regressions/E.regression1.vds')


2017-10-23 18:14:42 Hail: INFO: Reading table to impute column types
2017-10-23 18:14:42 Hail: INFO: Finished type imputation
Loading column `SID' as type String (imputed)
Loading column `PC1' as type Double (imputed)
Loading column `PC2' as type Double (imputed)
Loading column `PC3' as type Double (imputed)
Loading column `PC4' as type Double (imputed)
Loading column `PC5' as type Double (imputed)
[Stage 10:===================================================>(1794 + 1) / 1795]2017-10-23 19:05:17 Hail: WARN: 154 of 7135 samples have a missing phenotype or covariate.
2017-10-23 19:05:17 Hail: INFO: Running wald logistic regression on 6981 samples with 9 covariates including intercept…
[Stage 11:===================================================>(1794 + 1) / 1795]Read/write took: 7:27:35.357489
GWAS took: 2 days, 19:36:55.387864

gwas = hc.read('/mnt/volume/regressions/E.regression1.vds')
gwas.filter_variants_expr('isnan(va.logreg.pval)', keep=False).export_variants('/mnt/volume/plots/E.regression1.4manhattan', 'CHR = v.contig, POS = v.start, P = va.logreg.pval')

But E.regression1.4manhattan contains only the header. What could have happened here? What should I check?

What happens if you just do:

print(gwas.filter_variants_expr('isnan(va.logreg.pval)', keep=False).count())

Could be an export problem, an issue with isnan (I remember this has been a problem in the past) or something else entirely, so this will help us figure that out.

Hmm, this seems empty:

print(gwas.filter_variants_expr('isnan(va.logreg.pval)', keep=False).count())
[Stage 19:===================================================>(1784 + 8) / 1795](7135L, 0L)

What about pre-filter?

Next thing I’d do is

gwas.variants_table().annotate('logreg = va.logreg')\
    .select(['v', 'logreg'])\
    .show(25)

just to see if all the pvals are actually nan.

Only NAs:

gwas.variants_table().annotate('logreg = va.logreg').select(['v', 'logreg']).show(25)
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| v            | logreg.beta | logreg.se | logreg.zstat | logreg.pval | logreg.fit.nIter | logreg.fit.converged | logreg.fit.exploded |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| Variant      | Double      | Double    | Double       | Double      | Int              | Boolean              | Boolean             |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| 1:768448:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:768819:C:T | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:769223:C:G | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:769885:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:770085:G:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:770181:A:G | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:770377:A:T | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:771823:T:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:771967:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:772755:A:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:772927:C:T | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:773106:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:773124:T:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+

It took almost three days to run, so I need to find out why before trying again. The phenotype is coded as 0/1.

Hail: WARN: 154 of 7135 samples have a missing phenotype or covariate.
2017-10-23 19:05:17 Hail: INFO: Running wald logistic regression on 6981 samples with 9 covariates including intercept…

so 6981 samples have no missing phenotype or covariates. Any ideas?

What do the genotypes look like?

The best way to debug is probably to filter_variants_list([Variant.parse('1:768448:G:A')]) before rerunning logreg.
That should take a second, not two days (ouch!)

The converged column is false for every variant there. That could be the problem. Summoning @jbloom!

test = gwas.filter_variants_list([Variant.parse('1:768448:G:A')]).logreg('wald', 'sa.pheno.E', covariates=['sa.pheno.age', 'sa.pheno.weight', 'sa.pheno.sex', 'sa.pca.PC1', 'sa.pca.PC2', 'sa.pca.PC3', 'sa.pca.PC4', 'sa.pca.PC5'], root='va.test', use_dosages=True)

test.variants_table().annotate('logreg = va.test').select(['v', 'logreg']).show(25)

+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| v            | logreg.beta | logreg.se | logreg.zstat | logreg.pval | logreg.fit.nIter | logreg.fit.converged | logreg.fit.exploded |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| Variant      | Double      | Double    | Double       | Double      | Int              | Boolean              | Boolean             |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| 1:768448:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+

Same thing… is there a way to check the genotypes? They were imported from imputed VCFs.

And ouch indeed… I'm using 8 cores and 64 GB of RAM. The next step will be UK Biobank, so I'm wondering what kind of resources I'll need to do that in a decent time, like a week.

Genotypes were imputed at Sanger. According to the website:

Genotype or haplotype specific metrics are encoded in the FORMAT columns of the VCF. Depending on the pipeline choice, these may include:

FORMAT/GT: the genotype or haplotype call. Original genotype calls will not be altered.
FORMAT/ADS: the per-haplotype allele dosage (discarded in SHAPEIT2+PBWT pipeline as it can be used to reconstruct the phase).
FORMAT/DS: genotype dosage (calculated from FORMAT/ADS).
FORMAT/GP: the posterior genotype probabilities (calculated from FORMAT/ADS).

Here is what it looks like in my files:

22 16050435 . T C . PASS RefPanelAF=0.000323375;AN=14270;AC=0;INFO=1.24985e-09 GT:ADS:DS:GP 0|0:0.05,0.05:0.1:0.9025,0.095,0.0025 0|0:0.05,0.05:0.1:0.9025,0.095,0.0025
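
As a plain-Python aside (no Hail involved), the DS and GP values in that record are at least self-consistent: DS is the expected alternate-allele count implied by GP, i.e. 0·P(0/0) + 1·P(0/1) + 2·P(1/1).

```python
# Sanity check on the record above: DS should be the expected alternate-allele
# dosage implied by the genotype probabilities GP.
gp = [0.9025, 0.095, 0.0025]   # GP field from the VCF record above
ds = 0.1                        # DS field from the same record

expected = sum(n_alt * p for n_alt, p in enumerate(gp))
assert abs(expected - ds) < 1e-9   # 0.095 + 2 * 0.0025 == 0.1
```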

Does Hail use the GP field when use_dosages=True?

Ah, this explains it. Hail doesn't really support any VCF FORMAT fields other than GT, AD, DP, GQ, and PL. This is changing in 0.2!

To peek at the genotypes, you can do something like

vds.genotypes_table().select(['v', 's', 'g']).show()

I assume that this will show missing values or hard calls.

Note that Hail does support BGEN files. I'm not sure how long logistic regression will take on 8 cores, though; a week may be a good estimate.


It seems to show hard calls only:

test.genotypes_table().select(['v', 's', 'g']).show()
+--------------+-----------+----------------+
| v            | s         | g              |
+--------------+-----------+----------------+
| Variant      | String    | Genotype       |
+--------------+-----------+----------------+
| 1:768448:G:A | TG1100002 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100021 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100031 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100041 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100062 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100082 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100092 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100101 | 0/0:.:.:.:PL=. |
| 1:768448:G:A | TG1100111 | 0/1:.:.:.:PL=. |
| 1:768448:G:A | TG1100121 | 0/0:.:.:.:PL=. |
+--------------+-----------+----------------+
showing top 10 rows

So, if I convert my VCFs to BGEN and then import them into Hail, that should work?

Yes, it should! @jbloom why did it take so long to run? I’d have thought that it would have short-circuited in the entirely-missing case.

bcftools converts into .gen.gz. Will I need to uncompress the files, or can import_gen deal with the .gen.gz format?

If that .gz is actually a block-gzipped file, rename the file to .bgz and try that. Otherwise, you shouldn't use the .gz: while Hail can read it, normal gzipped files are not splittable, so all of the work will happen in one Spark task, probably OOMing somewhere and losing all fault tolerance and parallelism.
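
If you're not sure which kind of gzip bcftools gave you, you can check the header yourself: per the BGZF spec (the block-gzip format used by bgzip/tabix), a BGZF file is an ordinary gzip file with an extra header subfield whose ID bytes, at file offsets 12–13, are 'BC'. A minimal sketch (the filename here is just a placeholder):

```python
import os

def is_bgzf(path):
    """True if the file starts with a BGZF (block-gzip) header: the gzip
    magic bytes followed by an extra subfield with ID bytes b'BC' at
    offsets 12-13. Plain gzip files fail this check."""
    with open(path, 'rb') as f:
        head = f.read(14)
    return len(head) == 14 and head[:2] == b'\x1f\x8b' and head[12:14] == b'BC'

# Placeholder filename; point this at your own bcftools output.
path = 'chr22.gen.gz'
if os.path.exists(path) and is_bgzf(path):
    # Block-gzipped: renaming to .bgz is enough for Hail to read it in parallel.
    os.rename(path, path[:-3] + '.bgz')
```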

In 0.1, if all hard calls are missing, it short-circuits. But if all dosages are missing, you get a vector of NAs, which I think results in the math going through but the convergence criterion testing Double.NaN < tol, which is false. In other words, instead of taking, say, 4 iterations, it runs to the limit of 25 iterations and gives up. Hopefully that means it will run faster on non-missing data.
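
That failure mode is easy to mimic outside Hail. This toy loop (purely illustrative, not Hail's actual fitter) stops when the step size drops below tol; since NaN < tol is always False, a NaN step can never satisfy the test, so the loop burns through every allowed iteration:

```python
def toy_fitter(step_fn, tol=1e-6, max_iter=25):
    """Minimal stand-in for an iterative fitter: converges when the step
    size drops below tol, otherwise gives up after max_iter iterations."""
    for i in range(1, max_iter + 1):
        step = step_fn(i)
        if step < tol:            # NaN < tol is False, so NaN never converges
            return i, True
    return max_iter, False

# Well-behaved case: step size shrinks geometrically, converges early.
n_iter, converged = toy_fitter(lambda i: 0.5 ** i)
assert converged and n_iter < 25

# All-missing-dosage case: every step is NaN, so it runs to the limit.
n_iter, converged = toy_fitter(lambda i: float('nan'))
assert not converged and n_iter == 25
```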

I'll update here on whether this is what's actually happening, and if so, I'll fix it so that in the future any NaN in the genotypes causes a short circuit.