BGEN filter_cols memory leak

Hi!
I am trying to run GWAS using multiple cores on the broad hpc. I have had success following instructions from this forum and have seen similar direction for using many cores on the google cloud, but am hoping to run this on prem and am not sure if this is possible, or if I am making a simple mistake.

I believe the relevant errors are one of two, either
[Stage 3:> (0 + 4) / 293]BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
or the immediately following
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00002b4dc2e41928, pid=8835, tid=0x00002b4da530d700
#
# JRE version: Java™ SE Runtime Environment (8.0_181-b13) (build 1.8.0_181-b13)
# Java VM: Java HotSpot™ 64-Bit Server VM (25.181-b13 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libblas.so.3+0x2e0928] dgemm_incopy+0x88
#
#
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.

My setup is as follows:
ish -l h_vmem=8G -pe smp 4 -binding linear:4 (I have also tried submitting this as a job with more memory, but I don’t believe I’ve requested more than 8G per core successfully yet)

use Anaconda3
use Java-1.8
## use -v OpenBLAS ## I’ve tried with and without
use .openblas_for_hail-0.2.20
use -v .mkl-2019.3.199
## conda create -n hail python==3.7 ## already run
source activate hail
conda install pip
pip install -U hail

export PYSPARK_SUBMIT_ARGS="–driver-memory 8G pyspark-shell --executor-memory 8G" ## and with 4G as well. Without this line, I run out of memory a la “java.lang.OutOfMemoryError: Java heap space, in Stage 2.”

Thank you for your help!

Hey @ebartell !

Sorry to hear you’re having trouble with the Broad cluster. I just ran this and it worked for me. Can you share what you’re doing differently?

use UGER
ish -l os=RedHat7
use Anaconda3
use Java-1.8
use .openblas_for_hail-0.2.20
yes | conda env remove -n hail
yes | conda create -n hail python==3.7
conda activate hail
yes | pip3 install ipython hail
python3 -c '
import hail as hl
mt = hl.balding_nichols_model(3, n_variants=1000, n_samples=10)
t = hl.linear_regression_rows(x=mt.GT.n_alt_alleles(), y=mt.pop, covariates=[1])
t.show()'

After running the above sequence of commands at least once, this should suffice:

use UGER
ish -l os=RedHat7
use Anaconda3
use Java-1.8
use .openblas_for_hail-0.2.20
conda activate hail
python3 -c '
import hail as hl
mt = hl.balding_nichols_model(3, n_variants=1000, n_samples=10)
t = hl.linear_regression_rows(x=mt.GT.n_alt_alleles(), y=mt.pop, covariates=[1])
t.show()'

Thank you for your help.
I started an interactive session with 2 cores and 4G memory, and running your code works. Certainly I am not running
yes | conda env remove -n hail
every time, and likely I had an older version of the hail library installed; this could be the fix I was missing.
When I try my python script, I need the additional
export PYSPARK_SUBMIT_ARGS="–driver-memory 4G pyspark-shell --executor-memory 4G"
or mine dies on memory; it is currently running and seems to have already passed its usual break point. I can update here if the errors I was seeing reoccur.
Thank you!

1 Like

I don’t believe it needs it, but should I include source activate hail after creating the environment but prior to the pip3 install? When I do, in the future all I seem to need is the source activate hail (and the use lines) to run GWAS.
Thanks!

Ah, yes, I think that’s right. I’m pretty sure anaconda also now recommends conda activate ...

1 Like

Thank you for your help, and apologies to re-open this, I am running into a new error.

I am able to have this run

python3 -c '
import hail as hl
mt = hl.balding_nichols_model(3, n_variants=1000, n_samples=10)
t = hl.linear_regression_rows(x=mt.GT.n_alt_alleles(), y=mt.pop, covariates=[1])
t.show()'

But when I run my larger script, I come upon a new error (progress!):

ERROR:root:Exception while sending command.
...    
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
...
py4j.protocol.Py4JError: An error occurred while calling o1.pyPersistTable
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (IP address:port)

In order for this to happen, I first launch an interactive session with 2 cores.
ish -pe smp 2 -binding linear:2 -l h_vmem=8G
and then

use Anaconda3
use Java-1.8
use .openblas_for_hail-0.2.20
yes | conda env remove -n hail
yes | conda create -n hail python==3.7
source activate hail
# yes | pip3 install matplotlib
yes | pip3 install ipython hail

python3 -c '
import hail as hl
mt = hl.balding_nichols_model(3, n_variants=1000, n_samples=10)
t = hl.linear_regression_rows(x=mt.GT.n_alt_alleles(), y=mt.pop, covariates=[1])
t.show()'

export PYSPARK_SUBMIT_ARGS="--driver-memory 8G pyspark-shell --executor-memory 8G"

python z030_script_UKB_runHail.py --chrom ${SGE_TASK_ID} --analysisName ${filename}

I am happy to share my hail python script if it will help.

Hey @ebartell ,

Sorry you’re running into trouble again! Do you have the hail .log file generated by the failing run? Can you share that here or upload it to gist.github.com and share that here?

Thanks again for your help. I’ve uploaded it here:

Can you share your python script?

Yes, and thank you for your help! This is the full script, including pieces unrelated to Hail.


print("Running gwas. using HAIL")


print("importing libraries")
import hail as hl
hl.init()

import pandas as pd
#import matplotlib as mpl
import numpy as np
# mpl.use('Agg')
# from matplotlib import pyplot as plt
import scipy.stats as ss
import sys
import os

acceptableSampleSizes = ["all","halfGwas1","halfGwas2","ckbN_halfGwas1","ckbN_halfGwas2","Q1_CKB-N-.5unrel","Q2_CKB-N-.5unrel","Q3_CKB-N-.5unrel","Q4_CKB-N-.5unrel",
                        "1of10unrel","2of10unrel","3of10unrel","4of10unrel"] + [str(x) for x in range(1,51)]

import argparse
parser = argparse.ArgumentParser(description='Run sibship for given list of SNPs (or summstats file).')
# parser.add_argument('--lociFile', dest='lociFile', type=str,
#                    help='File location for SNP   chr list of SNPs (tab separated). If none provided, will run on all SNPs.',
#                    required=False, default="allSnps")
parser.add_argument('--analysisName', dest='analysisName', type=str,
                   help='Analysis name, used to name output.',required=True)
parser.add_argument('--chrom', dest='chrom', type=int,required=True,
                   help='Chrom of analysis (in UKB). Required, used to parallelize with multiple jobs over chroms.')
parser.add_argument('--relatedness', metavar='str', dest='relatedness', type=str, action='store', choices=["all","unrelated","WB"], default="WB",
                   help="Define which set of individuals to use, unrelateds or all. (default: WB for white british).")
parser.add_argument('--sampleSize', metavar='str', dest='sampleSize', type=str, action='store', choices=acceptableSampleSizes, default="all",
                   help="Define which sample size to use, all or predefined subset. (default: all).")
# parser.add_argument('--out', dest='out', type=str,
#                    help='Output location; UNUSED.',required=True)

# parser.add_argument('--lociFileRsidCol', dest='lociFileRsidCol', type=str,
#                    help='RSID column (to ID snps for analysis). default SNP',required=False,default="SNP")
# parser.add_argument('--lociFileChromCol', dest='lociFileChromCol', type=str,
#                    help='Chrom column (used to subset loci file). default CHR',required=False,default="CHR")
# parser.add_argument('--lociFileEffectAlleleCol', dest='lociFileEffectAlleleCol', type=str,
#                    help='Effect allele column (used to match alleles). default ALLELE1',required=False,default="ALLELE1")
# parser.add_argument('--summStatsDelim', metavar='str', dest='summStatsDelim', type=str, action='store', choices=["tab","comma","space","\\s"," ",",","\\t","\t"], default="tab",
#                    help="Use this delimiter when parsing summStats (default: tab).")

parser.add_argument('--verbose', '-v', action='count', default=0)
parser.add_argument('--quiet', '-q', action='count', default=0)

# parser.add_argument('integers', metavar='N', type=int, nargs='+',
#                    help='an integer for the accumulator')

# inArgs = '--chrom 22 --analysisName boltLMM_unrelateds_noPCsLinreg --lociFile ../ht_shr_leg_upperbody/z040_out_bgen_Height_all_unrel_noPCs_filtered_v2_prunedByDist.tsv'.split(" ")

inArgs = '--chrom 22 --analysisName Height_all --relatedness WB --sampleSize all'.split(" ") ##################
args = parser.parse_args(inArgs) ############################### testing ##################################


args = parser.parse_args()
print("args: %s " % args)


chrom = args.chrom
# prunedHeightSnps = pd.read_csv("../ht_shr_leg_upperbody/z040_out_bgen_Height_all_filtered_v2_prunedByDist.tsv",sep="\t")
# inputSSname = "UKB_all"
# prunedHeightSnps = pd.read_csv("/cvar/jhlab/ebartell/Public_GWAS_Summstats/sig_wood_height_pruned_loci_checkTurchinAnalysis.tsv",sep="\t")
# inputSSname = "Wood2014"
# if args.lociFile != "allSnps":
#     print("loading summstats.")
#     prunedHeightSnps = pd.read_csv(args.lociFile,sep="\t")
#     inputSSname = args.analysisName
#     if args.lociFileRsidCol != "SNP":
#         prunedHeightSnps["SNP"] = prunedHeightSnps[args.lociFileRsidCol]
#     if args.lociFileChromCol != "CHR":
#         prunedHeightSnps["CHR"] = prunedHeightSnps[args.lociFileChromCol]
#     if args.lociFileEffectAlleleCol != "ALLELE1":
#         prunedHeightSnps["ALLELE1"] = prunedHeightSnps[args.lociFileEffectAlleleCol]
#     prunedHeightSnps_thisChrom = prunedHeightSnps[prunedHeightSnps.CHR == chrom]
#     prunedHeightSnps_thisChrom_rsids = list(prunedHeightSnps_thisChrom.SNP)
#     print("%s gws variants." % len(prunedHeightSnps_thisChrom_rsids))


print("loading bgen.")
indexMapper = {}
contigRecoding = {}
chromString = "{"
for i in range(1,23):
    indexMapper["/broad/ukbb/imputed_v3/ukb_imp_chr%s_v3.bgen" % i] = "/broad/jhlab_ukbiobank/users/ebartell/qfam/subsetBgens/ukb_imp_chr%s_v3_hail.bgen.idx2" % i
    if i < 10:
        contigRecoding["0%s" % i] = str(i)
        # contigRecoding["%s" % i] = "0" + str(i)
    else:
        contigRecoding["%s" % i] = str(i)
    chromString += "%s," % str(i)

chromString = chromString[:-1] + "}"
### very slow, only runs once. DONE ALREADY
# bgen_idx = hl.index_bgen("/broad/ukbb/imputed_v3/ukb_imp_chr%s_v3.bgen" % chromString, indexMapper,contig_recoding=contigRecoding)


### very fast
bgen = hl.import_bgen("/broad/ukbb/imputed_v3/ukb_imp_chr%s_v3.bgen" % chrom,
                            entry_fields=['GT', 'GP','dosage'], sample_file="/broad/jhlab_ukbiobank/PhenoData/LinkIds_ukb_500K_ImpData_v3_app11898_chrAUT.sample",
                            index_file_map=indexMapper)

print("load phenotypes")
#Load phenotype file, thru hail and also normally in python (easier to use, cuz I don't yet fully know all hail fxns)
phenoFile = "/broad/jhlab_ukbiobank/users/aravikum/prs_sim/prsice_outputs/" + args.analysisName
phenos = pd.read_csv(phenoFile,sep="\t")

colTypes = {"FID": hl.tstr,"IID":hl.tstr,"Noise_added":hl.tfloat64}
phenosHailTable = hl.import_table(phenoFile, impute=True, 
    types=colTypes).key_by('IID')


print("load covariates (including PCs)")
covariatesFile = "/broad/jhlab_ukbiobank/users/ebartell/z020_out_covariates.txt"
colTypesCov = {"FID": hl.tstr,"IID":hl.tstr,"ArrayNum":hl.tint32}
for i in range(1,11):
    colTypesCov["PC%s" % i] = hl.tfloat64

covariatesHailTable = hl.import_table(covariatesFile, 
    types=colTypesCov).key_by('IID')


###################### Load dropped samples and drop them HERE
# print("load sample exclusions and drop them UNIMPLIMENTED")
print("loading individuals always removed.")

individualsToRemove = set()

drop1 = pd.read_csv("/broad/jhlab_ukbiobank/users/jcole/Pheno/bolt.in_plink_but_not_imputed.FID_IID.968.txt",sep=" ",header=None)
individualsToRemove |= set([str(x) for x in drop1[0]])
if args.relatedness in ["WB","unrelated"]:
    drop2 = pd.read_csv("/broad/jhlab_ukbiobank/users/ebartell/bolt.in_v3_sample_but_not_UKBEURN455K.N32265_obtained3.12.19.txt",sep=" ",header=None)
    individualsToRemove |= set([str(x) for x in drop2[0]])

if args.relatedness == "unrelated":
    drop3 = pd.read_csv("/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/samplesNotUnrelated_remove.csv",sep=" ",header=None)
    individualsToRemove |= set([str(x) for x in drop3[0]])

if args.sampleSize != "all":
    print("sample size argument %s is not implemented yet. running for all individuals.")
    subsetSamplesFileRemove = ""
    if args.sampleSize == "halfGwas1":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove_firstHalfComplement.csv"
    elif args.sampleSize == "halfGwas2":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove_secondHalfComplement.csv"
    elif args.sampleSize == "ckbN_halfGwas1":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove_ckb1_complement.csv"
    elif args.sampleSize == "ckbN_halfGwas2":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove_ckb2_complement.csv"
    elif args.sampleSize == "Q1_CKB-N-.5unrel":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove-complement_CKB-N.5_1Of4Pieces.csv"
    elif args.sampleSize == "Q2_CKB-N-.5unrel":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove-complement_CKB-N.5_2Of4Pieces.csv"
    elif args.sampleSize == "Q3_CKB-N-.5unrel":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove-complement_CKB-N.5_3Of4Pieces.csv"
    elif args.sampleSize == "Q4_CKB-N-.5unrel":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove-complement_CKB-N.5_4Of4Pieces.csv"
    elif args.sampleSize == "1of10unrel":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove-complement_1Of10Pieces.csv"
    elif args.sampleSize == "2of10unrel":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove-complement_2Of10Pieces.csv"
    elif args.sampleSize == "3of10unrel":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove-complement_3Of10Pieces.csv"
    elif args.sampleSize == "4of10unrel":
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/included_individuals_unrel-sample-fam-remove-complement_4Of10Pieces.csv"
    elif args.sampleSize in set([str(x) in range(1,51)]): ##### 50 subsets
        subsetSamplesFileRemove = "/broad/jhlab_ukbiobank/users/ebartell/ht_shr_leg_upperbody/subsetGwas/subsetSamples/included_individuals_unrel-sample-fam-remove-complement_%sOf50Pieces.csv" % args.sampleSize 
        analysisDirec="unrel50/"
    else:
        print("sample size argument %s is not implemented yet. running for all individuals.")
    if subsetSamplesFileRemove != "":
        drop3 = pd.read_csv(subsetSamplesFileRemove,sep=" ",header=None)
        individualsToRemove |= set([str(x) for x in drop3[0]])



#Subset bgen to only related individuals.
remove_individuals_hl = hl.literal(set(individualsToRemove))
print("subset to relateds")
bgen = bgen.filter_cols(~remove_individuals_hl.contains(bgen.s))


bgen = bgen.annotate_cols(pheno = phenosHailTable[bgen.s])
bgen = bgen.annotate_cols(covariates = covariatesHailTable[bgen.s])


gwas = hl.linear_regression_rows(
    y=bgen.pheno.Noise_added,
    x=bgen.GT.n_alt_alleles(),
    covariates=[1.0, bgen.covariates.PC1, bgen.covariates.PC2, bgen.covariates.PC3,bgen.covariates.PC4, bgen.covariates.PC5, bgen.covariates.PC6, bgen.covariates.PC7,
    bgen.covariates.PC8, bgen.covariates.PC9, bgen.covariates.PC10, bgen.covariates.ArrayNum])




# p = hl.plot.manhattan(gwas.p_value)
# p.
# show(p)

gwas_pd = gwas.to_pandas()
def makeMarkerName(row):
    alleles = row["alleles"]
    alleles.sort()
    return "%s:%s:%s" % (row["locus.contig"],row["locus.position"],":".join(alleles))

gwas_pd["SNP"] = gwas_pd.apply(makeMarkerName,axis=1)
gwas_pd.to_csv("hail/hailGWAS_%s_chrom%s_%s_%s_1core.tsv" % (args.analysisName,chrom,args.relatedness,args.sampleSize),sep='\t',index=False)



To help debug, can you define the following function:

def force_count(mt):
    print(mt.head(1)._force_count_rows())

and call it after these lines:
bgen = bgen.filter_cols(~remove_...
bgen = bgen.annotate_cols(pheno=...
bgen = bgen.annotate_cols(covariates=...

If it fails after one of those, that would be good to know.

Thank you for this suggestion, I believe this will help a lot with debugging. I pop the same error in the above code here at the first of the 3 steps:

#testing
def force_count(mt):
    print(mt.head(1)._force_count_rows())

#Subset bgen to only related individuals.
remove_individuals_hl = hl.literal(set(individualsToRemove))
print("subset to relateds")
bgen = bgen.filter_cols(~remove_individuals_hl.contains(bgen.s))
force_count(bgen)
print("filter done")

I get the “subset to relateds” but not the results from

force_count(bgen)
print("filter done")

OK, great. Could you possibly upload the log for that failing run as well? That should hopefully be enough for us to replicate/fix.

Yep, this is the link:

Apologies, I am not well versed in debugging hail; I looked through the log but do not yet see the issue.

Still investigating – I can replicate locally though, so nothing more necessary from your end.

1 Like

Fixed here:

We’ll release a new version when that goes in.

1 Like

0.2.67 is up, that should solve this problem.

1 Like

Thank you so much for your help! The filter step succeeds in interactive multi-core.