I have a large mt containing data from a single chromosome, and I've been running into memory issues when doing basic operations. Primarily I'd like to find the min and max position from the mt's locus field, but even simple operations such as mt.head(1).locus.take(1)
and mt.count_rows()
yield memory errors like Exception in thread "Executor task launch worker for task 31.0 in stage 0.0 (TID 31)" java.lang.OutOfMemoryError: Java heap space
Is there a better way to accomplish what I'm trying to do, or is this a larger issue with my memory configuration?
Hey @jimmy_1!
Sorry to hear you're having trouble! Can you share your whole script? It's hard to diagnose these kinds of issues without knowing what went into the definition of mt.
Thanks for the reply, here's how it looks:
import hail as hl

vcf_path = "c1_b2_v1.vcf.bgz"
mt = hl.import_vcf(vcf_path)
mt.describe()

small_mt = mt.head(10)
r = mt.rows()
c = mt.cols()

# the following commands raise memory errors
r.count()
r.locus.take(1)
mt.head(1).rows().locus.take(1)[0].position
Output of mt.describe():
Global fields:
    None
Column fields:
    's': str
Row fields:
    'locus': locus
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AF: array<float64>,
        AQ: array<int32>,
        AC: array<int32>,
        AN: int32
    }
Entry fields:
    'GT': call
    'RNC': array<str>
    'DP': int32
    'AD': array<int32>
    'GQ': int32
    'PL': array<int32>
Column key: ['s']
Row key: ['locus', 'alleles']
Are you running this on a Spark cluster or something else? If you're not on a Spark cluster, you probably need to set PYSPARK_SUBMIT_ARGS to permit Hail to use more memory.
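For example, on a single machine you can set it in the environment before Hail (and therefore Spark) initializes; the 8g values below are just placeholders, size them to your machine's RAM:

import os

# Must be set before hail/pyspark starts up; adjust the memory sizes to your machine.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 8g --executor-memory 8g pyspark-shell'

import hail as hl
hl.init()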
I also strongly, strongly recommend you convert your data into a Hail native format before continuing with any serious analysis. You can do that like this:
import hail as hl
hl.import_vcf('c1_b2_v1.vcf.bgz').write('c1_b2_v1.mt')
Then you can load that dataset with:
mt = hl.read_matrix_table('c1_b2_v1.mt')
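Once it's in native format, the min/max positions you mentioned can be computed with a row aggregation; something along these lines should work:

# Min and max locus position across the (single-chromosome) dataset.
pos_range = mt.aggregate_rows(hl.struct(
    min_pos=hl.agg.min(mt.locus.position),
    max_pos=hl.agg.max(mt.locus.position)))
print(pos_range.min_pos, pos_range.max_pos)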
Thanks, it seems like changing the PySpark arguments worked!
One other quick question related to your response: what is the difference between loading an mt from the native format vs. loading from a VCF? I think it may be related, since I've only ever loaded from the native format and never had an issue until now, when I've been loading the VCF file(s).
A compressed VCF is, at its core, still a text file, and the process of converting the string "1/1" into a genotype call record (an integer) is expensive in terms of memory and CPU time. Not tremendously so, but when you do that n_variants * n_samples times, it adds up!
We also haven’t aggressively optimized the VCF path because our local users generally work with native Matrix Tables.
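If you're curious what that per-entry conversion involves, hl.parse_call is roughly the expression-level version of what the importer does for each GT string (illustrative only):

import hail as hl

# Parse a VCF genotype string into Hail's compact call representation.
call = hl.eval(hl.parse_call('1/1'))
print(call)         # 1/1
print(call.ploidy)  # 2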