Missing value and logistic regression

I’m trying to use Hail to run a logistic regression GWAS on a case/control (0 vs. 1) phenotype. Here is my code:

import hail as hl
mt = hl.import_plink(bed='chr22.bed', bim='chr22.bim', fam='chr22.fam',
                     missing='.', quant_pheno=False)
covar = (hl.import_table('pheno_covariate.txt',
                         types={'IID': hl.tstr, 'mypheno': hl.tfloat64},
                         impute=True, missing='.')
         .key_by('IID'))
mt = mt.annotate_cols(covar=covar[mt.s])
mt_logistic = hl.logistic_regression_rows(test='wald',
                                          y=mt.covar.mypheno,
                                          x=mt.GT.n_alt_alleles(),
                                          covariates=[1.0])

I have two questions:

  1. How does Hail deal with missing values in the import_table() function? Are missing values deleted or imputed in some way?

  2. When I ran this code, I got the error “OutOfMemoryError: GC overhead limit exceeded”. Are there any mistakes in my code?

Thanks so much!

Hi @Stephen,

You probably need to increase the memory available to Spark, a library Hail uses.

Hail treats missing data as missing. If you inspect the table, for example with `covar.show()`, you’ll see missing values displayed as the special value NA.

If you are curious how logistic_regression_rows deals with missing y-values, check out the docs for logistic_regression_rows, specifically the warning box.
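To make that dropping behavior concrete, here is a plain-Python sketch (an illustration of the behavior described in the docs, not Hail code itself; the sample names are hypothetical): samples whose phenotype is missing simply do not contribute to the fit.

```python
# Hypothetical (sample, phenotype) pairs; None stands in for Hail's NA.
samples = [("s1", 1.0), ("s2", None), ("s3", 0.0)]

# Samples with a missing phenotype are excluded before fitting,
# mirroring the warning box in the logistic_regression_rows docs.
kept = [(s, y) for s, y in samples if y is not None]
print(kept)  # [('s1', 1.0), ('s3', 0.0)]
```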

Hi @danking ,

Thank you for your reply. If I use Hail by running a .py file on a Linux server, where should I add this command? Is it right to add it as the first line of my .py file?

export PYSPARK_SUBMIT_ARGS="--driver-memory 8g --executor-memory 8g pyspark-shell"

That command sets an environment variable in a shell such as bash or zsh. You have to run that line in your shell before you run python, pyspark, or spark-submit — not inside your Python script.
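Concretely, the sequence in your shell looks like this (the script name is a placeholder):

```shell
# Run these in bash/zsh before starting Python. `export` makes the
# setting visible to child processes, including the JVM Spark launches.
export PYSPARK_SUBMIT_ARGS="--driver-memory 8g --executor-memory 8g pyspark-shell"

# Sanity check: the same shell should print the value back.
echo "$PYSPARK_SUBMIT_ARGS"

# Then launch your script from that same shell:
# python my_hail_script.py
```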

Hi @danking,

Thank you for your reply. Now I have another error:

Traceback (most recent call last):
  File "/z/Comp/logi/1.py", line 3, in
  File "", line 2, in init
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return original_func(*args, **kwargs)
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/hail/context.py", line 231, in init
    skip_logging_configuration, optimizer_iterations)
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 165, in init
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/pyspark/context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/pyspark/java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "/ua/zwang2547/.local/lib/python3.6/site-packages/pyspark/java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

Do you know how to fix it?

Look for the Hail log file; there’s more detail there.
This error almost certainly means there is a mistake in PYSPARK_SUBMIT_ARGS. Make sure it’s exactly as described above, and make sure your machine has enough memory.