NegativeArraySizeException running hwe_normalized_pca

Hi,

I’m running part of the gnomad_qc pipeline (v3), specifically ‘run_pc_relate’, which relies on Hail’s ‘hwe_normalized_pca’. I’m running the PCA with 2 components on ~2k samples. Tests on small subsets ran without issue, but on scaling up, the job consumes around 1 TB of RAM (with 2+ TB available) before crashing with:

hail.utils.java.FatalError: NegativeArraySizeException: null

With the full trace:

hail.utils.java.FatalError: NegativeArraySizeException: null

Java stack trace:
com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException
Serialization trace:
values (org.apache.spark.sql.catalyst.expressions.GenericRow)
	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:259)
	at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$4(TorrentBroadcast.scala:321)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:323)
	at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
	at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
	at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:35)
	at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:77)
	at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1509)
	at is.hail.backend.spark.SparkBackend.broadcast(SparkBackend.scala:342)
	at is.hail.methods.PCA.execute(PCA.scala:60)
	at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:52)
	at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:3365)
	at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:61)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:864)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:58)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:451)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:487)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:339)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:484)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:483)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

java.lang.NegativeArraySizeException: null
	at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:542)
	at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:306)
	at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:300)
	at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:162)
	at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:307)
	at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:300)
	at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:162)
	at com.esotericsoftware.kryo.util.MapReferenceResolver.addWrittenObject(MapReferenceResolver.java:41)
	at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:681)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:646)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.twitter.chill.WrappedArraySerializer.write(WrappedArraySerializer.scala:28)
	at com.twitter.chill.WrappedArraySerializer.write(WrappedArraySerializer.scala:23)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
	at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
	at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
	at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
	at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
	at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:259)
	at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$4(TorrentBroadcast.scala:321)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:323)
	at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:140)
	at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:95)
	at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:35)
	at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:77)
	at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1509)
	at is.hail.backend.spark.SparkBackend.broadcast(SparkBackend.scala:342)
	at is.hail.methods.PCA.execute(PCA.scala:60)
	at is.hail.expr.ir.functions.WrappedMatrixToTableFunction.execute(RelationalFunctions.scala:52)
	at is.hail.expr.ir.TableToTableApply.execute(TableIR.scala:3365)
	at is.hail.expr.ir.TableIR.analyzeAndExecute(TableIR.scala:61)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:864)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:58)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.evaluate$1(LowerOrInterpretNonCompilable.scala:20)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.rewrite$1(LowerOrInterpretNonCompilable.scala:67)
	at is.hail.expr.ir.LowerOrInterpretNonCompilable$.apply(LowerOrInterpretNonCompilable.scala:72)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.transform(LoweringPass.scala:69)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.LowerOrInterpretNonCompilablePass$.apply(LoweringPass.scala:64)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:451)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:487)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
	at is.hail.utils.package$.using(package.scala:635)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:339)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:484)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:483)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.108-fc03e9d5dc08
Error summary: NegativeArraySizeException: null

I found the same error referenced in an old post (2016) about the same issue, though there weren’t any details available beyond the statement of the problem:

Another person encountered the same error (2018) in a different process and reported it on the GitHub page, where it was identified as a memory leak. That ticket was closed when it was ‘fixed’, though again without further details:

I’m setting up some tests with intermediate sets of 100–500 samples to see whether there’s simply a threshold where it fails, but any insight into what might be the cause would be appreciated. Cheers,

Rohan

Hey @rabraham!

I can’t help you without seeing the code you’re running. Can you share the full script you’re executing? It also looks like you’re on a slightly old version of Hail. It’s difficult for us to support old versions because we’re just a small open source team. Can you update to the latest version of Hail, 0.2.119?

I also need to know more about your environment. What kind of Spark cluster are you using? What environment variables are set? How did you initialize Hail? What’s the output of

cat $(find_spark_home.py)/conf/spark-defaults.conf

Issue 3583 is unrelated to your exception; that one comes from memory management code. Your error is coming from Kryo, a serialization library used by Spark. This usually means Kryo is running out of memory.

This error, plus your report of 1 TB of RAM use, suggests that you’re trying to run PCA on an enormous number of variants. How are you filtering your variants before PCA? PCA should only be run on a handful of common variants, say 5,000 to 100,000. Even 100,000 is probably unnecessary unless you have a lot of fine-grained structure or a really large dataset.
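For concreteness, here’s a minimal sketch (not the gnomad_qc code; the AF and call-rate cutoffs and the 10k target are illustrative, not recommendations) of cutting a MatrixTable with a GT entry field down to a PCA-sized set of common variants:

    import hail as hl

    mt = hl.variant_qc(mt)  # adds variant_qc.AF, variant_qc.call_rate, etc.
    mt = mt.filter_rows(
        (hl.len(mt.alleles) == 2)            # bi-allelic only
        & (mt.variant_qc.AF[1] > 0.01)       # common ...
        & (mt.variant_qc.AF[1] < 0.99)       # ... minor allele
        & (mt.variant_qc.call_rate > 0.99)   # well-genotyped sites
    )
    # randomly thin to roughly 10k variants before PCA
    p = min(1.0, 10_000 / mt.count_rows())
    mt = mt.sample_rows(p)
    eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=2)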

Hi @danking,

I see. I’ve allocated up to 2 TB in tests so far, though the error generally hits at around 1 TB, as I mentioned. As for the number of variants, I can give you the exact count:

hwe_normalize: found 53344425 variants after filtering out monomorphic sites.

So yes, somewhat higher than you recommend… (I have no prior experience running this, unfortunately). Given the numbers you quote, I’m guessing it will come down to that, but to answer the rest: the code is from the gnomad_qc (v3) pipeline, modified slightly to run locally rather than on GCP servers. Specifically, the error occurs when running hwe_normalized_pca on a matrix table generated with the following code:

    import os
    import hail as hl
    # (get_gnomad_v3_mt, get_adj_expr, densify_sites, and get_qc_mt are
    # imported from the gnomad / gnomad_qc packages)

    old_qc_sites = hl.read_table(os.path.join(basepath, 'gnomad.genomes.v3.1.sites.ht'))
    old_qc_sites = old_qc_sites.key_by("locus")
    purcell5k = hl.read_table(os.path.join(basepath, 'purcell5k_hg38.ht'))
    qc_sites = old_qc_sites.union(purcell5k, unify=True)
    lcr_intervals = hl.read_table(os.path.join(basepath, 'LCRFromHengHg38.ht'))
    # filter low-complexity-region sites out of the combined site list
    qc_sites = qc_sites.filter(hl.is_missing(lcr_intervals[qc_sites.key]))

    mt = get_gnomad_v3_mt(
        key_by_locus_and_alleles=True,
        remove_hard_filtered_samples=False,
    )

    mt = mt.select_entries(
        "END", GT=mt.LGT, adj=get_adj_expr(mt.LGT, mt.GQ, mt.DP, mt.LAD)
    )
    mt = densify_sites(mt, qc_sites, hl.read_table(os.path.join(basepath, 'HostSeq_last_END_positions.ht')))

    # Filter MT to bi-allelic SNVs that are found in the v2 QC and p5k HT
    mt = mt.filter_rows(
        (hl.len(mt.alleles) == 2)
        & hl.is_snp(mt.alleles[0], mt.alleles[1])
        & (qc_sites[mt.locus].alleles == mt.alleles)
    )
    mt = mt.checkpoint(
        os.path.join(basepath, 'qc_mt_checkpoint.mt'), overwrite=True
    )
    mt = mt.naive_coalesce(5000)
    mt = mt.checkpoint(
        os.path.join(basepath, 'qc_mt_checkpoint_repartitioned.mt'),
        overwrite=True,
    )

    qc_mt = get_qc_mt(
        mt,
        min_af=min_af,
        min_inbreeding_coeff_threshold=min_inbreeding_coeff_threshold,
        min_hardy_weinberg_threshold=None,
        ld_r2=None,
        filter_lcr=False,
        filter_decoy=False,
        filter_segdup=False,
    )

Here ‘basepath’ just points at my local environment, and ‘gnomad.genomes.v3.1.sites.ht’, ‘purcell5k_hg38.ht’, and ‘LCRFromHengHg38.ht’ were downloaded from the following GCS buckets:

gs://gcp-public-data--gnomad/release/3.1/ht/genomes/gnomad.genomes.v3.1.sites.ht
gs://gcp-public-data--gnomad/resources/grch38/purcell_5k_intervals/purcell5k.ht
gs://gcp-public-data--gnomad/resources/grch38/lcr_intervals/LCRFromHengHg38.ht

The data loaded by mt = get_gnomad_v3_mt() have been replaced with our own results (currently some 2k+ WGS gVCFs incorporated into a VDS, ultimately scaling up to 10k).

I will update my version of Hail. As for the remaining information you asked for, in the order asked:

  1. We are running Spark in local mode on a large machine we have available
  2. Environment variables I’m setting include:
export JAVA_OPTS="-Xmx2048g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -XX:+UseParallelGC -XX:ParallelGCThreads=8 -Dsamjdk.buffer_size=4194304"
export PYSPARK_SUBMIT_ARGS="--driver-memory 2048g --executor-memory 2048g pyspark-shell"
  3. Hail is initialized by running:
hl.init(log=os.path.join(basepath, 'sample_qc.log'), spark_conf={'spark.driver.memory': '2048g'}, default_reference="GRCh38")
  4. And finally:
cat $(find_spark_home.py)/conf/spark-defaults.conf

Gives:

bash: find_spark_home.py: command not found
cat: /conf/spark-defaults.conf: No such file or directory

I assume that’s because I’m not running on an actual Spark cluster (though I’m also new to Spark, so I may just be misunderstanding).
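If it helps, I believe the effective configuration can also be dumped from inside the running Hail session (a sketch I haven’t verified, using hl.spark_context() and SparkConf.getAll()):

    import hail as hl

    # after hl.init(): print the Spark configuration actually in effect
    sc = hl.spark_context()
    for key, value in sorted(sc.getConf().getAll()):
        print(f'{key}={value}')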

  5. As for filtering, I think the answer is that it isn’t working as intended, given the variant count quoted above versus your guidelines. There is a filtering stage in the code above, taken directly from the gnomad_qc code. Thanks very much for your help! Cheers,

Rohan

Thanks for the detailed information!

My gut reaction is to use sample_rows to get down to around 10k loci. The exact number is a science question I’m not able to advise on; somewhere between 5k and 100k is reasonable.

After that, you’ll want to take rows(), keep just the key (‘locus’, ‘alleles’), and save it to a well-known location so that you know exactly which variants you’re using for PCA. Then load that back and use semi_join_rows to filter to just the rows in the saved table, run PCA on that, save the PCA results, and use them as the input to PC-Relate. Does run_pc_relate accept a PC table or field as an argument?
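Roughly, something like this sketch (the paths and the 10k target are placeholders; select() with no arguments keeps only the key fields, locus and alleles):

    import hail as hl

    # 1. randomly thin to ~10k loci and freeze the variant list
    p = min(1.0, 10_000 / qc_mt.count_rows())
    variants_ht = qc_mt.sample_rows(p).rows().select()  # key: locus, alleles
    variants_ht = variants_ht.checkpoint('/well/known/pca_variants.ht', overwrite=True)

    # 2. filter to exactly those variants, run PCA once, and save the scores
    pca_mt = qc_mt.semi_join_rows(variants_ht)
    eigenvalues, scores_ht, _ = hl.hwe_normalized_pca(pca_mt.GT, k=2)
    scores_ht = scores_ht.checkpoint('/well/known/pca_scores.ht', overwrite=True)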

Hi @danking,

Thanks for the advice; I’ll do some testing along those lines. I’m not sure which conditions would be best to filter on to get down to the requisite number, but there are people on my end I can ask about that, though if you have thoughts there as well, feel free to add them. (Edit: disregard that; I hadn’t realized sample_rows is just random sub-sampling, and I imagine that should work well enough.)

There are a couple of stages to the PCA calculations. The main ones are run_pc_relate (runs PCA to determine relatedness between sample pairs) and run_pca (genotype PCA on unrelated samples). The run_pc_relate part uses the ‘qc_mt’ matrix table generated by the code pasted above and returns a Hail table that is ultimately used to identify related samples for run_pca to exclude.

For run_pc_relate, the only arguments are those fed to the hwe_normalized_pca function, currently:

eig, scores, _ = hl.hwe_normalized_pca(qc_mt.GT, k=2, compute_loadings=False)

with k = 2 for 2-component PCA. run_pca is then called with the following signature:

pop_pca_eigenvalues, pop_pca_scores_ht, pop_pca_loadings_ht = run_pca(
    include_unreleasable_samples,  # bool
    number_of_PCA_components,      # int
    samples_to_drop,               # hl.Table
)
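To your question: I’m not yet sure whether run_pc_relate itself accepts a precomputed PC table, but the underlying hl.pc_relate does, via its scores_expr argument. So presumably something like this sketch (not the gnomad_qc wrapper; the MAF threshold is illustrative) would let the saved scores be reused:

    # reuse precomputed PCA scores rather than letting pc_relate recompute them
    eig, scores, _ = hl.hwe_normalized_pca(qc_mt.GT, k=2, compute_loadings=False)
    relatedness_ht = hl.pc_relate(
        qc_mt.GT,
        min_individual_maf=0.01,                   # illustrative threshold
        scores_expr=scores[qc_mt.col_key].scores,  # precomputed PCs per sample
        statistics='kin',                          # kinship estimates only
    )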

Thanks for the insights! I will confer with others on my end and see where we get. Cheers,

Rohan
