Errors after iterative calls to union_cols


#1

I want to compute the concordance of the Platinum Genomes NA12878 against 100 of my own NA12878 sequences.
The hl.concordance() function seems like a good choice. However, it requires two MatrixTables with matching sample names (it does not support a 1:many comparison).

To get around this requirement, I thought I would:

  1. create a MatrixTable with a single sample of PG NA12878 (called pg_biallelic_vars_chr22)
  2. iteratively union_cols() this to itself 100x (na12878_pg_100x)
  3. duplicate the my_na12878 sample ids, and key na12878_pg_100x by them
  4. save the 100x file so I could do other things with it (such as concordance)

This is the code I came up with (note that I first filtered the PG calls down to biallelic variants on chr22 only):

n_samples = 100
na12878_pg_100x = pg_biallelic_vars_chr22
for n in range(1, n_samples):
    na12878_pg_100x = na12878_pg_100x.union_cols(pg_biallelic_vars_chr22)

na12878_pg_100x = na12878_pg_100x.add_col_index()
# we need the keyed sample name to match for the "concordance" calculation, so take it from the real na12878 samples
na12878_pg_100x = na12878_pg_100x.annotate_cols(s2=hl.array(my_na12878.s.take(n_samples))[hl.int(na12878_pg_100x.col_idx)])
na12878_pg_100x = na12878_pg_100x.key_cols_by('s2')
na12878_pg_100x.cols().show(25)
fname = 's3://my-s3-path/samples_one_by_one.mt'
na12878_pg_100x.write(fname, overwrite=True)   # <-- this is where it fails
na12878_pg_100x = hl.read_matrix_table(fname)

(Note that the write fails regardless of whether I set stage_locally to True or False.)
I am able to run the above code with n_samples=25, but it fails at 50 or more with the following error:

FatalError: StackOverflowError: null

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 821, ip-172-19-23-116.ec2.internal, executor 153): java.lang.StackOverflowError
	at org.apache.spark.util.ByteBufferInputStream.read(ByteBufferInputStream.scala:49)
	at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2662)
	at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2678)
	at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:3179)
	at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1683)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1552)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2286)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2210)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2068)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
...
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2286)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2210)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2068)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
	at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:479)
	at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1170)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2177)
Hail version: 0.2-961f76d14f1e
Error summary: StackOverflowError: null

So there are two issues to address here:

  1. why is this having trouble writing to the file?
  2. is there a better way to do the 1:many calculation for the na12878 comparisons?

#2
  1. why is this having trouble writing to the file?

It’s important to note that almost every operation in Hail is lazy – that is, everything upstream is only executed when you call write, which is why the error from the hundred union_cols calls surfaces at the write.
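As a toy illustration (plain Python, not Hail's actual implementation), a lazy pipeline just records transformations and only runs the accumulated plan when write is called, so any problem built up by earlier steps surfaces there:

```python
# Toy sketch of a lazy pipeline (illustrative only, not Hail's implementation).
class LazyTable:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = tuple(ops)          # transformations are recorded, not run

    def map_values(self, fn):
        # Record the transformation; nothing executes yet.
        return LazyTable(self.data, self.ops + (fn,))

    def write(self):
        # Only here does the accumulated plan actually execute, so an
        # error in any upstream op surfaces at write time.
        result = self.data
        for op in self.ops:
            result = [op(x) for x in result]
        return result

t = LazyTable([1, 2, 3]).map_values(lambda x: x * 10)
print(t.write())  # [10, 20, 30]
```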

  2. is there a better way to do the 1:many calculation for the na12878 comparisons?

I think the iterative union_cols is quadratic in the number of samples – see this thread involving the same issue: “StackOverflow with concordance function”.

The issue is that Hail doesn’t currently have a great way to load a batch of single-sample VCFs into a single joint MatrixTable. We’re working on a gVCF merging algorithm, which we expect to come online in the next six months.

For the time being, things may work if you use my N * log(N) solution in the linked comment.
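For readers unfamiliar with the tree-merge pattern, here is a minimal pure-Python sketch (not Hail code) of the idea: instead of folding tables in one at a time, which builds a left-deep plan of depth N, repeatedly merge adjacent pairs so the plan depth is only log2(N). With MatrixTables the merge function would be `lambda a, b: a.union_cols(b)`; here we just concatenate lists.

```python
# Minimal sketch (plain Python, not Hail) of the tree-merge idea:
# balanced pairwise merging keeps the plan depth at log2(N) instead of N.

def tree_merge(items, merge):
    """Merge a list of items with a balanced binary tree of merge calls."""
    while len(items) > 1:
        merged = []
        for i in range(0, len(items) - 1, 2):
            merged.append(merge(items[i], items[i + 1]))
        if len(items) % 2 == 1:        # carry the odd item into the next round
            merged.append(items[-1])
        items = merged
    return items[0]

# With Hail this would be tree_merge(tables, lambda a, b: a.union_cols(b)).
result = tree_merge([[i] for i in range(8)], lambda a, b: a + b)
print(result)  # [0, 1, 2, 3, 4, 5, 6, 7]
```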


#3

Would this work?

In [4]: mt = hl.balding_nichols_model(1, 1, 100)
2018-12-03 11:33:00 Hail: INFO: balding_nichols_model: generating genotypes for 1 populations, 1 samples, and 100 variants...

In [5]: mt.count_cols()
Out[5]: 1

In [15]: mt2 = mt.annotate_cols(s = hl.range(0, 100).map(lambda i: hl.str(mt.sample_idx + i)))

In [16]: mt2 = mt2.explode_cols('s').key_cols_by('s')

In [17]: mt2.count_cols()
2018-12-03 11:37:52 Hail: INFO: Coerced sorted dataset
2018-12-03 11:37:53 Hail: INFO: Coerced sorted dataset
Out[17]: 100

In [18]: mt2.GT.show()
2018-12-03 11:37:57 Hail: INFO: Coerced sorted dataset
+---------------+------------+-----+------+
| locus         | alleles    | s   | GT   |
+---------------+------------+-----+------+
| locus<GRCh37> | array<str> | str | call |
+---------------+------------+-----+------+
| 1:1           | ["A","C"]  | "0" | 0/0  |
| 1:1           | ["A","C"]  | "1" | 0/0  |
| 1:1           | ["A","C"]  | "2" | 0/0  |
| 1:1           | ["A","C"]  | "3" | 0/0  |
| 1:1           | ["A","C"]  | "4" | 0/0  |
| 1:1           | ["A","C"]  | "5" | 0/0  |
| 1:1           | ["A","C"]  | "6" | 0/0  |
| 1:1           | ["A","C"]  | "7" | 0/0  |
| 1:1           | ["A","C"]  | "8" | 0/0  |
| 1:1           | ["A","C"]  | "9" | 0/0  |
+---------------+------------+-----+------+
showing top 10 rows

Not sure why we coerce the sorted dataset twice for count_cols.
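To make the trick above concrete, here is a plain-Python sketch (hypothetical, not Hail internals) of what explode_cols does to the columns: each column whose field holds an array is replaced by one column per element, which is how a single sample becomes 100 identically-genotyped columns:

```python
# Plain-Python sketch (not Hail internals) of the annotate-then-explode pattern:
# a column carrying a list value becomes one column per element of the list.

def explode_cols(cols, field):
    """Expand each column dict into one copy per element of cols[field]."""
    out = []
    for col in cols:
        for value in col[field]:
            new_col = dict(col)        # duplicate the column's other fields
            new_col[field] = value     # one element of the array per copy
            out.append(new_col)
    return out

# One column annotated with 100 sample names, as in the session above.
cols = [{"sample_idx": 0, "s": [str(i) for i in range(100)]}]
exploded = explode_cols(cols, "s")
print(len(exploded))      # 100
print(exploded[0]["s"])   # '0'
```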


#4

[for dev use]

Not sure why we coerce the sorted dataset twice for count_cols.

The first one is the key_rows_by in balding_nichols_model; the second is the MatrixColsTable in countCols.