PCA halts with NegativeArraySizeException


#1

Hi, I have a single Ubuntu 16.04 system with plenty of RAM, running Spark in standalone mode. I can run PCA on individual WGS GVCF files, but after combining all chromosomes into one dataset, running PCA fails with the error below. I've tried to work out which limit I'm hitting, but I'd appreciate any advice on how to avoid it:

Python 2.7.12 (default, Nov 19 2016, 06:48:10)
Type "copyright", "credits" or "license" for more information.

IPython 5.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from hail import *

In [2]: hc = HailContext(tmp_dir='/mnt/adsp/results/VCF/test_tileDB/1023_samples/ready/tmp')
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 2.1.0
SparkUI available at http://10.10.5.50:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.1-d506d25

In [3]: vds1 = hc.read("pre-qc.section????.vds")
[Stage 0:=====================>                              (509 + 686) / 1247]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 2:=========>                                          (221 + 982) / 1247]2017-10-17 17:48:14 Hail: INFO: Using sample and global annotations from file:/mnt/adsp/results/VCF/test_tileDB/otto-flagged/pre-qc.section1319.vds
[Stage 6:===================================================>(1564 + 24) / 1588]2017-10-17 17:48:24 Hail: INFO: Coerced sorted dataset

In [4]: pca = vds1.pca('sa.pca', k=5, eigenvalues='global.eigen')
2017-10-17 17:48:42 Hail: INFO: Running PCA with 5 components...
[Stage 7:==================================================> (1554 + 34) / 1588]---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-4-7e9abc7e215f> in <module>()
----> 1 pca = vds1.pca('sa.pca', k=5, eigenvalues='global.eigen')

<decorator-gen-494> in pca(self, scores, loadings, eigenvalues, k, as_array)

/opt/hail-git/python/hail/java.pyc in handle_py4j(func, *args, **kwargs)
    119         raise FatalError('%s\n\nJava stack trace:\n%s\n'
    120                          'Hail version: %s\n'
--> 121                          'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
    122     except py4j.protocol.Py4JError as e:
    123         if e.args[0].startswith('An error occurred while calling'):

FatalError: NegativeArraySizeException: null

Java stack trace:
com.esotericsoftware.kryo.KryoException: java.lang.NegativeArraySizeException
Serialization trace:
altAlleles (is.hail.variant.Variant)
        at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
        at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:36)
        at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
        at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
        at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
        at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
        at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:207)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:268)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:268)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:269)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1411)
        at is.hail.stats.ToHWENormalizedIndexedRowMatrix$.apply(ComputeRRM.scala:105)
        at is.hail.methods.SamplePCA$.variantsSvdAndScores(SamplePCA.scala:51)
        at is.hail.methods.SamplePCA$.apply(SamplePCA.scala:31)
        at is.hail.variant.VariantDatasetFunctions$.pca$extension(VariantDataset.scala:655)
        at is.hail.variant.VariantDatasetFunctions.pca(VariantDataset.scala:642)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)java.lang.NegativeArraySizeException: null
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.resize(IdentityObjectIntMap.java:447)
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.putStash(IdentityObjectIntMap.java:245)
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.push(IdentityObjectIntMap.java:239)
        at com.esotericsoftware.kryo.util.IdentityObjectIntMap.put(IdentityObjectIntMap.java:135)
        at com.esotericsoftware.kryo.util.MapReferenceResolver.addWrittenObject(MapReferenceResolver.java:41)
        at com.esotericsoftware.kryo.Kryo.writeReferenceOrNull(Kryo.java:658)
        at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:547)
        at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
        at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
        at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:36)
        at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
        at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
        at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
        at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
        at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
        at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
        at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:207)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:268)
        at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:268)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1303)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:269)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:126)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:56)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1411)
        at is.hail.stats.ToHWENormalizedIndexedRowMatrix$.apply(ComputeRRM.scala:105)
        at is.hail.methods.SamplePCA$.variantsSvdAndScores(SamplePCA.scala:51)
        at is.hail.methods.SamplePCA$.apply(SamplePCA.scala:31)
        at is.hail.variant.VariantDatasetFunctions$.pca$extension(VariantDataset.scala:655)
        at is.hail.variant.VariantDatasetFunctions.pca(VariantDataset.scala:642)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)


Hail version: 0.1-d506d25
Error summary: NegativeArraySizeException: null
In [5]: vds1.summarize().report()
[Stage 8:===================================================>(1566 + 22) / 1588]
         Samples: 18
        Variants: 8800678
       Call Rate: 0.999652
         Contigs: ['chr22', 'chr19', 'chr15', 'chr18', 'chr20', 'chr13', 'chr14', 'chr17', 'chr21', 'chr16']
   Multiallelics: 0
            SNPs: 8800678
            MNPs: 0
      Insertions: 0
       Deletions: 0
 Complex Alleles: 0
    Star Alleles: 0
     Max Alleles: 2

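In case it's relevant: the trace shows the failure happening inside Kryo's IdentityObjectIntMap while Spark broadcasts the variant data. I was wondering whether a Spark-level Kryo setting might sidestep it, along these lines (just a guess on my part; I haven't confirmed these settings help with Hail, and the values are placeholders):

```
# spark-defaults.conf (speculative workaround, untested with Hail)
# Disable Kryo reference tracking, which is what backs the
# IdentityObjectIntMap that appears in the stack trace:
spark.kryo.referenceTracking     false
# And/or raise the Kryo serialization buffer ceiling from its default:
spark.kryoserializer.buffer.max  1g
```

Would either of these be safe to use with Hail, or is there a better way to handle a combined multi-chromosome dataset of this size?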

Thanks in advance for any help you can provide.