I want to compute the concordance of the Platinum Genomes NA12878 calls against 100 of my own NA12878 sequences. The hl.concordance() function seems like a good choice. However, it requires two MatrixTables with matching sample names (it does not support a 1:many comparison).
To get around this requirement, I thought I would:
- create a MatrixTable with a single sample of PG NA12878 (called pg_biallelic_vars_chr22)
- iteratively union_cols() this to itself 100x (na12878_pg_100x)
- duplicate the my_na12878 sample ids, and key off these in na12878_pg_100x
- save the 100x file so I could do other things with it (such as concordance)
This is the code I came up with (note that I filtered the PG calls down to biallelic and chr22 only):
n_samples = 100
na12878_pg_100x = pg_biallelic_vars_chr22
for n in range(1, n_samples):
    na12878_pg_100x = na12878_pg_100x.union_cols(pg_biallelic_vars_chr22)
na12878_pg_100x = na12878_pg_100x.add_col_index()
# we need the keyed sample name to match for the "concordance" calculation, so take it from the real na12878 samples
na12878_pg_100x = na12878_pg_100x.annotate_cols(s2=hl.array(my_na12878.s.take(n_samples))[hl.int(na12878_pg_100x.col_idx)])
na12878_pg_100x = na12878_pg_100x.key_cols_by('s2')
na12878_pg_100x.cols().show(25)
fname = 's3://my-s3-path/samples_one_by_one.mt'
na12878_pg_100x.write(fname, overwrite=True) # <-- this is where it fails
na12878_pg_100x = hl.read_matrix_table(fname)
(Note that the write fails regardless of whether I set stage_locally to True or False.)
I am able to run the above code with n_samples=25, but it fails at 50 or more with the following error:
FatalError: StackOverflowError: null
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 821, ip-172-19-23-116.ec2.internal, executor 153): java.lang.StackOverflowError
at org.apache.spark.util.ByteBufferInputStream.read(ByteBufferInputStream.scala:49)
at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2662)
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2678)
at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:3179)
at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1683)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2286)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2210)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2068)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
...
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2286)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2210)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2068)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479)
at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2177)
Hail version: 0.2-961f76d14f1e
Error summary: StackOverflowError: null
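My guess (not confirmed) is that the 99 chained union_cols() calls build a query plan roughly 100 levels deep, and Spark's recursive Java deserialization of that nested structure is what overflows the stack. The depth difference between chaining a pairwise combine in a loop versus combining in balanced rounds can be illustrated in plain Python (the combine function here is just a stand-in for union_cols, not Hail code):

```python
def linear_combine(items, combine):
    """Fold left, as in the loop above: depth grows linearly with len(items)."""
    acc = items[0]
    for x in items[1:]:
        acc = combine(acc, x)  # each iteration nests one level deeper
    return acc

def tree_combine(items, combine):
    """Combine adjacent pairs in rounds: depth grows ~log2(len(items))."""
    while len(items) > 1:
        items = [combine(items[i], items[i + 1]) if i + 1 < len(items) else items[i]
                 for i in range(0, len(items), 2)]
    return items[0]

def depth(node):
    """Nesting depth of the resulting 'plan' (nested tuples)."""
    return 1 + max(map(depth, node[1:])) if isinstance(node, tuple) else 0

combine = lambda a, b: ("union", a, b)  # stand-in for union_cols
leaves = list(range(100))
print(depth(linear_combine(leaves, combine)))  # 99
print(depth(tree_combine(leaves, combine)))    # 7
```

If that diagnosis is right, pairing up intermediate MatrixTables with union_cols in rounds (or writing/checkpointing the accumulator to disk every few iterations to flatten the plan) might get past the write failure, though I have not verified this.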
So there are two issues to address here:
- why is this having trouble writing the file?
- is there a better way to do the 1:many calculation for the NA12878 comparisons?