Missing value and logistic regression

Hello!
I’m trying to use Hail to run a logistic regression GWAS on a case-control (0 vs 1) phenotype. Here is my code:

import hail as hl
hl.init()
mt = hl.import_plink(bed='chr22.bed', bim='chr22.bim', fam='chr22.fam', missing='.', quant_pheno=False)
covar = (hl.import_table('pheno_covariate.txt', types={'IID': hl.tstr, 'mypheno': hl.tfloat64}, impute=True, missing='.').key_by('IID'))
mt = mt.annotate_cols(covar=covar[mt.s])
mt_logistic = hl.logistic_regression_rows(test='wald', y=mt.covar.mypheno, x=mt.GT.n_alt_alleles(),
    covariates=[1, mt.covar.IsMale, mt.covar.PC1, mt.covar.PC2, mt.covar.PC3, mt.covar.PC4, mt.covar.PC5, mt.covar.PC6, mt.covar.PC7, mt.covar.PC8, mt.covar.PC9, mt.covar.PC10])
mt_logistic.export('chr22_hail.txt')

I have two questions:

  1. How does Hail handle missing values read by import_table()? Are rows with missing values dropped, or are the values imputed in some way?

  2. When I ran this code, I got the error “OutOfMemoryError: GC overhead limit exceeded”. Are there any mistakes in my code?

Thanks so much!

Hi @Stephen,

You probably need to increase the memory available to Spark, a library Hail uses.

Hail treats missing data as missing. If you execute

covar.filter(hl.is_missing(covar.mypheno)).show()

you’ll see the missing data represented by the special value NA.

If you are curious how logistic_regression_rows deals with missing y-values, check out the docs for logistic_regression_rows, specifically the warning box.
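As I read that warning box, the regression uses the same set of samples for every variant, namely those where the phenotype and all covariates are defined, and a missing genotype is replaced by the mean genotype over those samples. The plain-Python sketch below only illustrates that rule on toy data; it is not Hail's implementation, and the function name `prepare_for_regression` and the toy values are made up for the illustration.

```python
from statistics import mean

def prepare_for_regression(y, covariates, x):
    """Illustrative only: y, x are per-sample lists with None for missing;
    covariates is a per-sample list of covariate lists."""
    # Keep samples whose phenotype and every covariate are defined.
    keep = [
        i for i in range(len(y))
        if y[i] is not None and all(c is not None for c in covariates[i])
    ]
    x_kept = [x[i] for i in keep]
    observed = [v for v in x_kept if v is not None]
    if not observed:
        raise ValueError("no observed genotypes among the kept samples")
    # Mean-impute missing genotypes over the kept samples only.
    x_mean = mean(observed)
    x_imputed = [x_mean if v is None else v for v in x_kept]
    return keep, x_imputed

# Toy data: sample 1 lacks a covariate, sample 2 lacks a phenotype,
# sample 3 lacks a genotype call.
y = [1, 0, None, 1, 0]
covs = [[1, 0.1], [0, None], [1, 0.3], [0, 0.2], [1, 0.5]]
x = [2, 1, 0, None, 0]

keep, x_imputed = prepare_for_regression(y, covs, x)
print(keep)
print(x_imputed)
```

So a missing phenotype or covariate removes the sample from the analysis, while a missing genotype does not.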

Hi @danking ,

Thank you for your reply. If I use Hail by running a .py file on a Linux server, where should I add this command? Is it right to put it on the first line of my .py file?

export PYSPARK_SUBMIT_ARGS="--driver-memory 8g --executor-memory 8g pyspark-shell"

That code sets an environment variable in a shell like bash or zsh. You have to run that line in your shell before you run Python, pyspark, or spark-submit.
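If you would rather keep everything in the .py file, a possible alternative is to set the same variable from Python through `os.environ`, as long as that happens before `hl.init()` launches the JVM. This is a minimal sketch; the helper name `set_spark_memory` is hypothetical, and 8g is just the value from the command in question.

```python
import os

def set_spark_memory(driver="8g", executor="8g"):
    """Build PYSPARK_SUBMIT_ARGS and store it in this process's environment."""
    args = f"--driver-memory {driver} --executor-memory {executor} pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args

# Must run before Hail starts Spark, so put it at the top of the script.
set_spark_memory("8g", "8g")
print(os.environ["PYSPARK_SUBMIT_ARGS"])

# Only after the variable is set:
# import hail as hl
# hl.init()
```

Note the plain ASCII double hyphens and the trailing `pyspark-shell`; both matter to PySpark.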

Hi @danking,

Thank you for your reply. Now I have another error:

Traceback (most recent call last):
  File "/z/Comp/logi/1.py", line 3, in <module>
    hl.init()
  File "", line 2, in init
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return original_func(*args, **kwargs)
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/hail/context.py", line 231, in init
    skip_logging_configuration, optimizer_iterations)
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 165, in init
    pyspark.SparkContext._ensure_initialized(conf=conf)
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

Do you know how to fix it?

Look for a hail log file. There’s more detail there.
This almost certainly means you have an error in PYSPARK_SUBMIT_ARGS. Make sure it’s exactly as described in the other post, and make sure the machine has enough memory for the amounts you request.
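One way this variable often goes wrong is copying the command from a web page that has turned the straight quotes into curly quotes or the double hyphen into a dash. The sketch below is a rough heuristic check you could run before `hl.init()`; it is not part of Hail or Spark, and the function name `check_submit_args` is made up for this example.

```python
import os

def check_submit_args(value):
    """Return a list of likely problems with a PYSPARK_SUBMIT_ARGS string."""
    if value is None:
        return ["PYSPARK_SUBMIT_ARGS is not set"]
    problems = []
    # Curly quotes and en/em dashes are non-ASCII; real Spark flags are ASCII.
    bad = sorted({ch for ch in value if ord(ch) > 127})
    if bad:
        problems.append("non-ASCII characters found: " + " ".join(repr(c) for c in bad))
    # When set by hand, the value is expected to end with "pyspark-shell".
    tokens = value.split()
    if not tokens or tokens[-1] != "pyspark-shell":
        problems.append("value does not end with 'pyspark-shell'")
    return problems

good = "--driver-memory 8g --executor-memory 8g pyspark-shell"
garbled = "\u2013driver-memory 8g --executor-memory 8g pyspark-shell"

print(check_submit_args(good))
print(check_submit_args(garbled))
print(check_submit_args(os.environ.get("PYSPARK_SUBMIT_ARGS")))
```

An empty list means neither of these two problems was detected; the Hail log file is still the place to look for the actual cause.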