Could be an export problem, an issue with isnan (I remember this has been a problem in the past), or something else entirely, so this will help us figure that out.
gwas.variants_table().annotate('logreg = va.logreg').select(['v', 'logreg']).show(25)
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| v            | logreg.beta | logreg.se | logreg.zstat | logreg.pval | logreg.fit.nIter | logreg.fit.converged | logreg.fit.exploded |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| Variant      | Double      | Double    | Double       | Double      | Int              | Boolean              | Boolean             |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| 1:768448:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:768819:C:T | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:769223:C:G | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:769885:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:770085:G:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:770181:A:G | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:770377:A:T | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:771823:T:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:771967:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:772755:A:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:772927:C:T | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:773106:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
| 1:773124:T:C | NA          | NA        | NA           | NA          | 26               | false                | false               |
It took almost three days to run, so I need to find out why before trying again. The phenotype is coded as 0/1:
Hail: WARN: 154 of 7135 samples have a missing phenotype or covariate.
2017-10-23 19:05:17 Hail: INFO: Running wald logistic regression on 6981 samples with 9 covariates including intercept…
so 6981 samples have no missing phenotype or covariates. Any ideas?
The best way to debug is probably to filter_variants_list([Variant.parse('1:768448:G:A')]) before rerunning logreg.
That should take a second, not two days (ouch!)
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| v            | logreg.beta | logreg.se | logreg.zstat | logreg.pval | logreg.fit.nIter | logreg.fit.converged | logreg.fit.exploded |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| Variant      | Double      | Double    | Double       | Double      | Int              | Boolean              | Boolean             |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
| 1:768448:G:A | NA          | NA        | NA           | NA          | 26               | false                | false               |
+--------------+-------------+-----------+--------------+-------------+------------------+----------------------+---------------------+
Same thing... is there a way to check the genotypes? They were imported from imputed VCFs.
And ouch indeed... I'm using 8 cores and 64 GB of RAM. The next step will be UK Biobank, so I'm wondering what kind of resources I'll need to do that in a decent time, say a week.
Genotypes were imputed at Sanger. According to the website:
Genotype or haplotype specific metrics are encoded in the FORMAT columns of the VCF. Depending on the pipeline choice, these may include:
FORMAT/GT: the genotype or haplotype call. Original genotype calls will not be altered.
FORMAT/ADS: the per-haplotype allele dosage (discarded in SHAPEIT2+PBWT pipeline as it can be used to reconstruct the phase).
FORMAT/DS: genotype dosage (calculated from FORMAT/ADS).
FORMAT/GP: the posterior genotype probabilities (calculated from FORMAT/ADS).
Here is how it looks in my files:
22 16050435 . T C . PASS RefPanelAF=0.000323375;AN=14270;AC=0;INFO=1.24985e-09 GT:ADS:DS:GP 0|0:0.05,0.05:0.1:0.9025,0.095,0.0025 0|0:0.05,0.05:0.1:0.9025,0.095,0.0025
Does Hail use the GP field when use_dosages=True ?
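Whether Hail reads GP or DS when use_dosages=True is one for the devs to confirm, but the arithmetic relationship between the two fields in the file can be checked directly: the expected alternate-allele dosage is P(het) + 2·P(hom-alt). A minimal sketch using the first sample's GP values from the VCF line above (dosage_from_gp is a hypothetical helper, not a Hail function):

```python
# Expected alternate-allele dosage from genotype probabilities:
# DS = P(het) * 1 + P(hom-alt) * 2
def dosage_from_gp(gp):
    p_hom_ref, p_het, p_hom_alt = gp
    return p_het + 2 * p_hom_alt

# GP values from the first sample in the VCF line above
gp = [0.9025, 0.095, 0.0025]
print(round(dosage_from_gp(gp), 6))  # 0.1, matching that sample's DS field
```

So at least in this record, DS and GP are mutually consistent, as expected since Sanger derives both from FORMAT/ADS.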
If that .gz is actually a block-gzipped file, rename the file to .bgz and try that. Otherwise, you shouldn't use the .gz at all: while Hail can read it, normal gzipped files are not splittable, so all of the work will happen in one Spark task, probably OOMing somewhere and losing all fault tolerance and parallelism.
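One quick way to tell the two apart is to inspect the header bytes: per the SAM/BAM spec, a BGZF (block-gzipped) file is a gzip file whose header sets the FEXTRA flag and carries a 'BC' extra subfield. A small sketch (is_bgzf is an illustrative helper, not part of Hail; it assumes the 'BC' subfield comes first in the extra field, as htslib writes it):

```python
def is_bgzf(path):
    """Heuristic BGZF check: gzip magic, FEXTRA flag set, 'BC' extra subfield."""
    with open(path, 'rb') as f:
        header = f.read(18)          # fixed-size BGZF block header
    if len(header) < 18:
        return False
    magic = header[:2]               # gzip files start with 1f 8b
    flg = header[3]                  # bit 2 (0x04) is FEXTRA
    subfield_id = header[12:14]      # BGZF uses subfield identifier b'BC'
    return magic == b'\x1f\x8b' and (flg & 4) != 0 and subfield_id == b'BC'
```

A plain `gzip`-produced file fails the FEXTRA check, while files written by bgzip pass it, so this distinguishes the splittable case from the single-task one.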
In 0.1, if all hardcalls are missing, logreg short-circuits. But if all dosages are missing, you get a vector of NA, which I think results in the math going through but the convergence criterion testing Double.NaN < tol, which is false. In other words, instead of taking, say, 4 iterations, it runs to the limit of 25 iterations and gives up. Hopefully that means it will run faster on non-missing data.
I'll update here on whether this is what's actually happening, and if so, I'll fix it so that in the future any NaN in the genotypes causes a short circuit.