Efficient way to get the maximum and minimum locus position from an mt?

I have a large mt containing data from a single chromosome, and I’ve been running into memory issues when doing basic operations. Primarily I’d like to find the min and max position from the mt’s locus, but even operations such as mt.head(1).locus.take(1), and others like mt.count_rows(), fail with memory errors such as Exception in thread "Executor task launch worker for task 31.0 in stage 0.0 (TID 31)" java.lang.OutOfMemoryError: Java heap space. Is there a better way to accomplish what I’m trying to do, or is this a larger issue with the memory configuration?

Hey @jimmy_1!

Sorry to hear you’re having trouble! Can you share your whole script? It’s hard to diagnose these kinds of issues without knowing what went into the definition of mt.

Thanks for the reply! Here’s how it looks:

vcf_path = "c1_b2_v1.vcf.bgz"
mt = hl.import_vcf(vcf_path)
mt.describe()
small_mt = mt.head(10)
r = mt.rows()
c = mt.cols()
# following commands raise memory errors 
r.count()
r.locus.take(1)
mt.head(1).rows().locus.take(1)[0].position

Output of mt.describe():


Global fields:
    None

Column fields:
    's': str

Row fields:
    'locus': locus
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AF: array<float64>,
        AQ: array<int32>,
        AC: array<int32>,
        AN: int32
    }

Entry fields:
    'GT': call
    'RNC': array<str>
    'DP': int32
    'AD': array<int32>
    'GQ': int32
    'PL': array<int32>

Column key: ['s']
Row key: ['locus', 'alleles']

Are you running this on a Spark cluster or something else? If you’re not on a Spark cluster, you probably need to set PYSPARK_SUBMIT_ARGS to permit Hail to use more memory.
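
If it helps, here is a minimal sketch of doing that from Python before Hail starts up; the 8g value is just a placeholder, so size it for your machine:

import os

# Must be set before Hail/Spark starts; the trailing "pyspark-shell" token is required.
# "8g" is only an example value, not a recommendation.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 8g pyspark-shell"

import hail as hl
hl.init()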

I also strongly, strongly recommend you convert your data into a Hail native format before continuing with any serious analysis. You can do that like this:

import hail as hl
hl.import_vcf('c1_b2_v1.vcf.bgz').write('c1_b2_v1.mt')

Then you can load that dataset with:

mt = hl.read_matrix_table('c1_b2_v1.mt')
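
And circling back to your original question: once you have the native Matrix Table, something like the following should get the min and max positions by aggregating over the rows, so only two integers ever come back to the driver. This is a rough sketch using the field names from your describe() output:

import hail as hl

mt = hl.read_matrix_table('c1_b2_v1.mt')

# Aggregate over row fields only; entries are never touched.
pos_range = mt.aggregate_rows(hl.struct(
    min_pos=hl.agg.min(mt.locus.position),
    max_pos=hl.agg.max(mt.locus.position)))

print(pos_range.min_pos, pos_range.max_pos)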

Thanks, it seems like changing the PySpark arguments worked!

One other quick question related to your response: what is the difference between loading an mt from the native format vs. loading from a VCF? I suspect it may be related, since I’ve only ever loaded from the native format and never had an issue until now, when I’ve been loading the VCF file(s).

A compressed VCF is, at its core, still a text file, and the process of converting the string “1/1” into a genotype call record (an integer) is expensive in terms of memory and CPU time. Not tremendously so, but when you do that n_variants * n_samples times, it adds up!
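
To make that concrete, here is roughly the conversion that has to happen for every single entry; this is just a small illustration using hl.parse_call, not how the importer is actually implemented:

import hail as hl

# The VCF stores the genotype as text, which must be parsed into a typed call value.
gt = hl.eval(hl.parse_call("1/1"))
print(gt)  # a Call representing an unphased homozygous-alt genotype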

We also haven’t aggressively optimized the VCF path because our local users generally work with native Matrix Tables.