Error summary: HailException: invalid interval expression: '17:8002615-8020342'

I am loading a coordinate file:
Chr Start End
17 8002615 8020342
1 9943428 9985501

into a table, and then run:

gns = hl.import_table('gene_coord.tsv')
iv = gns.collect()
ivL = [hl.parse_locus_interval(x.Chr + ':' + x.Start + '-' + x.End, reference_genome='GRCh38') for x in iv]
x = hl.filter_intervals(ht, ivL)

This raises a

FatalError: HailException: invalid interval expression: '17:8002615-8020342'

although the format is CHROM:POS-POS.

jrs

Error messages are hard with intervals, since contigs are allowed to contain "-". The core issue here is that contigs in GRCh38 have a "chr" prefix, so "17" needs to be "chr17". Also, you don't need the parser here:

gns = hl.import_table('gene_coord.tsv')
gns = gns.annotate(interval = hl.locus_interval(hl.str('chr') + gns.Chr, hl.int(gns.Start), hl.int(gns.End), reference_genome='GRCh38'))
ivL = gns.interval.collect()
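For completeness, a minimal sketch of applying the collected intervals, assuming ht is a locus-keyed table that has already been read (as in your first snippet):

import hail as hl

# Sketch only: assumes ht is a locus-keyed Hail Table read earlier,
# e.g. ht = hl.read_table(...) for the gnomAD sites table mentioned below.
x = hl.filter_intervals(ht, ivL)  # keep only rows whose locus falls in one of the intervals
print(x.count())                  # sanity check: how many variants the gene regions contain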

Thank you.

I am using the intervals (see above) to filter gene regions from the gnomad.genomes.v3.1.sites.ht table with
x = hl.filter_intervals(ht, [interval]).select('freq')

then converting to a data frame with

x.to_pandas()
That mostly works, but for some region intervals I run into an OutOfMemoryError, even with the options

PYSPARK_SUBMIT_ARGS="--driver-memory 200g --executor-memory 200g pyspark-shell"

This is a large overhead for a relatively small table selected on 'freq'.

jrs

The gnomAD sites table is pretty huge… even a 20 kb region produces a large array in memory. What's the full stack trace for the OutOfMemoryError?

Can you paste exactly the command you used to start Python? In particular, the PYSPARK_SUBMIT_ARGS... setting needs to be on the same line as, and come before, python or ipython or jupyter notebook. If you want to place it on a separate line, you have to use (note the export):

export PYSPARK_SUBMIT_ARGS="--driver-memory 200g --executor-memory 200g pyspark-shell"
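For reference, the same-line form looks like this (script.py here is just a placeholder for your own script):

PYSPARK_SUBMIT_ARGS="--driver-memory 200g --executor-memory 200g pyspark-shell" python script.py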

This is my single command line:

PYSPARK_SUBMIT_ARGS="--driver-memory 64g --executor-memory 64g pyspark-shell" python <*.py>

err=FatalError('OutOfMemoryError: Requested array size exceeds VM limit

Java stack trace:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
	at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
	at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
	at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
	at java.base/java.io.OutputStream.write(OutputStream.java:122)
	at is.hail.io.StreamOutputBuffer.writeByte(OutputBuffers.scala:74)
	at __C1429etypeEncode.__m1440ENCODE_SBaseStructPointer_TO_o_struct_of_o_int32ANDo_float64ANDo_int32ANDo_int32END(Unknown Source)
	at __C1429etypeEncode.__m1439ENCODE_SIndexablePointer_TO_o_array_of_o_struct_of_o_int32ANDo_float64ANDo_int32ANDo_int32END(Unknown Source)
	at __C1429etypeEncode.__m1438ENCODE_SIndexablePointer_TO_o_array_of_o_array_of_o_struct_of_o_int32ANDo_float64ANDo_int32ANDo_int32END(Unknown Source)
	at __C1429etypeEncode.__m1431ENCODE_SBaseStructPointer_TO_o_struct_of_o_array_of_o_struct_of_o_binaryANDo_int32ENDANDo_array_of_o_array_of_o_binaryANDo_array_of_o_array_of_o_struct_of_o_int32ANDo_float64ANDo_int32ANDo_int32ENDEND(Unknown Source)
	at __C1429etypeEncode.__m1430ENCODE_SBaseStructPointer_TO_o_struct_of_o_struct_of_o_array_of_o_struct_of_o_binaryANDo_int32ENDANDo_array_of_o_array_of_o_binaryANDo_array_of_o_array_of_o_struct_of_o_int32ANDo_float64ANDo_int32ANDo_int32ENDENDEND(Unknown Source)
	at __C1429etypeEncode.apply(Unknown Source)
	at is.hail.io.CompiledEncoder.writeRegionValue(Encoder.scala:32)
	at is.hail.io.AbstractTypedCodecSpec.$anonfun$encode$1(CodecSpec.scala:38)
	at is.hail.io.AbstractTypedCodecSpec.$anonfun$encode$1$adapted(CodecSpec.scala:38)
	at is.hail.io.AbstractTypedCodecSpec$$Lambda$3545/0x00002aec6d0f9040.apply(Unknown Source)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.io.AbstractTypedCodecSpec.encode(CodecSpec.scala:38)
	at is.hail.io.AbstractTypedCodecSpec.encode$(CodecSpec.scala:36)
	at is.hail.io.TypedCodecSpec.encode(TypedCodecSpec.scala:19)
	at is.hail.backend.spark.SparkBackend.encodeToBytes(SparkBackend.scala:505)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:488)
	at is.hail.backend.spark.SparkBackend$$Lambda$2077/0x00002aec1762e040.apply(Unknown Source)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
	at is.hail.backend.ExecuteContext$$$Lambda$1178/0x00002aec14619040.apply(Unknown Source)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
	at is.hail.backend.ExecuteContext$$$Lambda$1173/0x00002aec14813040.apply(Unknown Source)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:339)

Hail version: 0.2.105-acd89e80c345
Error summary: OutOfMemoryError: Requested array size exceeds VM limit'), type(err)=<class 'hail.utils.java.FatalError'>

Aha! As I suspected. There's a 2G limit on the serialized size of a query result right now. We can fix this, but it's unlikely we can get to it before the new year.
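In the meantime, a possible workaround is to keep each collected result under that limit, for example by converting one interval at a time and concatenating in pandas. A minimal sketch, assuming ht and ivL are defined as above:

import hail as hl
import pandas as pd

# Sketch only: collect each gene region separately so no single serialized
# result hits the ~2G limit, then stitch the pieces together in pandas.
dfs = []
for iv in ivL:
    part = hl.filter_intervals(ht, [iv]).select('freq')
    dfs.append(part.to_pandas())
df = pd.concat(dfs, ignore_index=True)

If a single region is still too large on its own, the export approach suggested below avoids collecting the result through the driver entirely.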

Thank you both for a prompt and competent response.

jrs

You might try exporting to a file and loading that with pandas:

import pandas as pd

x.export("gs://foo/bar.tsv")
df = pd.read_csv(hl.hadoop_open("gs://foo/bar.tsv"), sep="\t")  # export writes a TSV, so set the separator