I’m trying to use Hail to compute linkage disequilibrium (LD) between SNPs and structural variants (SVs), given a joint-called SNP VCF and a joint-called structural variant VCF. I’ve imported both VCFs into MatrixTables and then produced a merged MatrixTable using union_rows. I’d like to compute the top linked SNPs for each SV within a given window, and therefore would like to reproduce the behavior of ld_matrix with a radius parameter. It would also be nice to avoid computing LD between pairs of SNPs, since I only need the correlations for SV/SNP pairs.
A strategy I thought might work would be to mimic the source of the ld_matrix function but to modify the starts and stops of the windows tuple that correspond to the SNP rows in the merged matrix table – for example by setting them to zero:
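Roughly the kind of thing I tried (a sketch – merged is my union_rows MatrixTable, is_sv is a boolean row flag I added when merging to mark the SV rows, and the 1 Mb radius is just a placeholder; this is the version that produces the error below):

import numpy as np
import hail as hl

# per-row window starts/stops over the merged MT, returned as numpy arrays (the localized form)
starts, stops = hl.linalg.utils.locus_windows(merged.locus, radius=1e6)

# collapse the windows of the SNP rows to empty intervals so only SV rows keep entries
is_sv = np.array(merged.is_sv.collect())
starts[~is_sv] = 0
stops[~is_sv] = 0

# same pattern as the body of ld_matrix: row correlation, then sparsify by row intervals
ld = hl.row_correlation(merged.GT.n_alt_alleles())
ld = ld._sparsify_row_intervals_expr(hl.literal((starts, stops)), blocks_only=False)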
First of all, does this seem like a reasonable approach to the problem?
When I try to run this, Hail doesn’t like the type of the windows parameter:
TypeError: _sparsify_row_intervals_expr: parameter 'intervals': expected expression of type tuple(array<int64>, array<int64>), found <TupleExpression of type tuple(ndarray<int64, 1>, ndarray<int64, 1>)>
I assume this is because I called the version of locus_windows that localizes the window indices (as opposed to the source code of ld_matrix, which calls locus_windows with _localize=False). I’d thought that calling hl.literal on the modified tuple would distribute it again, but clearly I’m missing something. How can I pass my modified window starts and stops back to the sparsify method? Or, if I’m going about this the wrong way, any other thoughts on how I can achieve what I want?
Hail performs these operations block-wise, so I fear this will do a lot of unnecessary computation. I think you might have a better time reading both MTs, converting to block matrices, and computing A @ B.T, i.e. treat each SNP and SV as a vector in sample space and compute the correlation. You probably need to mean center and variance normalize before doing that though.
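Something like this sketch (sv_mt and snp_mt stand for your two imported MatrixTables; it assumes they contain the same samples in the same column order and have a GT entry field):

from hail.linalg import BlockMatrix

# one row per variant, one column per sample; centering and normalizing each row
# makes the inner product of two rows their Pearson correlation
sv_bm = BlockMatrix.from_entry_expr(sv_mt.GT.n_alt_alleles(),
                                    mean_impute=True, center=True, normalize=True)
snp_bm = BlockMatrix.from_entry_expr(snp_mt.GT.n_alt_alleles(),
                                      mean_impute=True, center=True, normalize=True)

# (n_SV x n_samples) @ (n_samples x n_SNP) -> SV-by-SNP correlation matrix
cor = sv_bm @ snp_bm.T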
@danking In the solution where I write this as an SV @ SNP.T operation, do you have any ideas about how I can sparsify the matrix so that correlations are only computed between SVs and SNPs within a certain distance of one another? This seemed straightforward to do using locus_windows and sparsify_row_intervals when the sites were merged into one matrix, but I’m having trouble figuring out how to do it for the rectangular matrix, with its two separate sets of loci.
I think you’ll have an easier time using sparsify_rectangles or sparsify_band. If you can live with some unnecessary extra elements, then sparsify_band is the easiest to use. The band is specified in terms of matrix coordinates, not base pairs.
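Concretely, the band keeps the entries (i, j) of the product whose index difference satisfies lower <= j - i <= upper, for example (the bounds here are just placeholders):

cor = sv_bm @ snp_bm.T
# indices here are matrix row/column positions, not base-pair coordinates
cor = cor.sparsify_band(lower=-5000, upper=5000)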
I agree it’s not straightforward to calculate the right rectangles to keep in this setting. I think you might need to collect all the loci and their indices within the original MTs, then slide across them to produce groups that you want the correlations of, then, for each group, find the minimum and maximum index along the rows and same for the columns. Some vague pseudocode:
sv_loci = SV.add_row_index('index').rows().key_by().select('locus', 'index').collect()
snp_loci = SNP.add_row_index('index').rows().key_by().select('locus', 'index').collect()
groups = []  # maybe this is just a list of lists of SNP loci and indices?
for sv in sv_loci:
    for snp in snp_loci:
        # ... track whether this sv should be compared to this snp
rectangles = []
for g in groups:
    # ... calculate the min and max SV and SNP indices for this group
    # ... append [sv_min_idx, sv_max_idx + 1, snp_min_idx, snp_max_idx + 1] (stops are exclusive)
# sv_bm / snp_bm are the block matrices built from SV and SNP
(sv_bm @ snp_bm.T).sparsify_rectangles(rectangles).write(...)
Thanks again for your help. I’m still trying to work out the details of the algorithm to compute the exact rectangles for sparsify_rectangles, but in the meantime I thought I might try an approximation with sparsify_band. To do so, I’m grabbing the maximum window size (in index space) among the SNPs, and then using that to set the upper and lower bounds of the band:
windows = hl.linalg.utils.locus_windows(snps.locus, radius=1e6)
max_window = max([windows[1][x] - windows[0][x] for x in range(len(windows[0]))])
cor_mat = sv_bm @ snp_bm.T
cor_mat = cor_mat.sparsify_band(lower=-max_window.item(), upper=max_window.item())
cor_mat.write("cor_mat.bm", overwrite=True, force_row_major=True)
When I try to run this, I get an error on the last task (out of 996 tasks):
FatalError: IllegalArgumentException: requirement failed: rows.last.<(m.rows)
Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 995 in stage 34.0 failed 4 times, most recent failure: Lost task 995.3 in stage 34.0 (TID 2310) (saturn-0ccce2a6-f64e-407a-88b3-165036d77d61-w-0.c.terra-c752dedc.internal executor 11): java.lang.IllegalArgumentException: requirement failed: rows.last.<(m.rows)
at is.hail.relocated.breeze.linalg.DenseMatrix$$anon$4.apply(DenseMatrix.scala:535)
at is.hail.relocated.breeze.linalg.DenseMatrix$$anon$4.apply(DenseMatrix.scala:526)
at is.hail.relocated.breeze.linalg.TensorLike.apply(Tensor.scala:142)
at is.hail.relocated.breeze.linalg.TensorLike.apply$(Tensor.scala:140)
at is.hail.relocated.breeze.linalg.DenseMatrix.apply(DenseMatrix.scala:52)
at is.hail.linalg.BlockMatrix.$anonfun$zeroBand$1(BlockMatrix.scala:529)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
Am I doing something wrong in how I’m computing and writing the matrix? I’m just trying to figure out whether there’s something I need to fix in my matrices before optimizing everything.
Can you share max_window.item()? Did you override block_size when you created these block matrices (meaning did you provide a block_size argument to from_entry_expr)?
I created the block matrices with block_size set to None (matching the default):
snp_bm = BlockMatrix.from_entry_expr(snp_chr21_samplefilter.GT.n_alt_alleles(),
mean_impute=True,
center=True,
normalize=True,
block_size=None)
…
2023-01-09 21:25:36.792 Hail: INFO: Wrote all 24 blocks of 95906 x 312 matrix with block size 4096.
An update and a question – I implemented a solution based on running sliding windows over the SNP and SV loci and passing the resulting rectangles to sparsify_rectangles (actually, I think I could have used sparsify_row_intervals instead), and was able to successfully compute the block matrix. So maybe the problem I reported above is just related to how I was using sparsify_band?
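For reference, the core of what I ended up doing looks roughly like this (a simplified, single-contig sketch; sv_mt/snp_mt are the per-type MatrixTables and sv_bm/snp_bm the block matrices built from them as above):

import bisect

radius = 1_000_000  # the same 1 Mb radius I used for locus_windows earlier

# collect loci and row indices for each matrix; rows are already sorted by locus
sv_rows = sv_mt.add_row_index('idx').rows().select('idx').collect()
snp_rows = snp_mt.add_row_index('idx').rows().select('idx').collect()
snp_pos = [r.locus.position for r in snp_rows]

# one rectangle per SV row covering the SNP columns within `radius` of that SV
rectangles = []
for r in sv_rows:
    lo = bisect.bisect_left(snp_pos, r.locus.position - radius)
    hi = bisect.bisect_right(snp_pos, r.locus.position + radius)
    if lo < hi:
        rectangles.append([r.idx, r.idx + 1, lo, hi])  # stops are exclusive

cor_mat = (sv_bm @ snp_bm.T).sparsify_rectangles(rectangles)
cor_mat.write("cor_mat.bm", overwrite=True, force_row_major=True)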
My quick question is: now that I have my LD matrix, I’d like to find the SNPs with the top LD (r^2) for each SV – in other words, I’d like to get the column indices and entry values for the top 10 entries in each row. What’s the right approach to do that? My first attempt did a full shuffle and didn’t return the column indices along with the values. I was also looking into using to_matrix_table_row_major and running take as a row aggregation, but noticed that to_matrix_table_row_major doesn’t support block-sparse matrices.
Unfortunately, we don’t currently have any support for that. I raised it with Tim. Better BlockMatrix functionality is important to the long-term viability of Hail as a scientific computing tool.
I think your best bet is:
# transpose is nearly free but produces an entry table keyed the right way
bm = bm.T
ht = bm.entries()
# negate the ordering so take keeps the 10 largest entries (use -ht.entry ** 2 if you want top r^2)
ht = ht.group_by(ht.i).aggregate(
    top=hl.agg.take(hl.struct(row=ht.j, entry=ht.entry), 10, ordering=-ht.entry))
ht = ht.rename({'i': 'col'})
That should avoid the shuffle. I also kept the row index, in case that’s useful to you.
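If one row per (col, row) pair is more convenient downstream, the aggregated array can be exploded afterwards (assuming the aggregate field is named top as above):

ht = ht.explode(ht.top)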