About sparse matrix table

Hi, I would like to try out sparse matrix tables, but I am a bit confused about the workflow for loading the data properly.

1> I noticed that transform_gvcf should be run on an mt with only 1 column. That means that if I have a single-individual gvcf I am good to go, but if I have a gvcf with multiple individuals, I need to select each column and transform it into its own sparse mt… am I right?

2> If I understand correctly, combine_gvcf takes a list of sparse mts. That means that if I get a sparse mt for each of my individuals as in Q1, I can combine them into a single sparse mt using that function… am I right?

3> Can I run combine_gvcf incrementally? Let's say I have 3 individuals. First I use transform_gvcf to turn individual A into A.smt and B into B.smt. Then I use combine_gvcf to combine [A.smt, B.smt] into all.smt. Then I use transform_gvcf to turn individual C into C.smt. Can I then use combine_gvcf to combine [all.smt, C.smt] into all.smt?

Thanks

Thank you for your interest! To answer your questions:

  1. Sort of. The issue here is the correctness of the INFO fields. The transform_gvcf method copies every INFO field (except DP and END) into an entry field of the matrix table called gvcf_info. Recomputing those INFO fields later requires aggregating over those entries. If you split a multi-sample gvcf into one column per individual, a field such as VAR_DP, which would be aggregated back with a sum, ends up too large because the original VAR_DP has been copied multiple times.
  2. Correct.
  3. Yes! Both the inputs and the output of combine_gvcfs are sparse matrix tables, so samples can be added incrementally. Be aware that each combine reads and writes all of the data, so it can be costly in terms of compute time.
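
To make the over-counting in point 1 concrete, here is a toy sketch in plain Python. It is not Hail code and does not call transform_gvcf; the sample names and the VAR_DP value are made up for illustration, and the dictionaries only mimic the idea of copying a site-level INFO field into each sample's gvcf_info entry.

```python
# Toy model of one site from a hypothetical 3-sample gvcf.
# VAR_DP here stands for a site-level INFO value computed over all samples.
site_info = {"VAR_DP": 30}  # made-up value
samples = ["A", "B", "C"]   # made-up sample names

# Splitting the gvcf into one column per sample and copying the INFO
# fields into each entry (mimicking the gvcf_info entry field).
entries = {s: {"gvcf_info": dict(site_info)} for s in samples}

# Recomputing the site-level field by summing over the entries.
recomputed = sum(e["gvcf_info"]["VAR_DP"] for e in entries.values())

print(recomputed)  # 90, i.e. 3 copies of the original 30
```

The sum counts the original value once per sample, which is why a multi-sample gvcf that is split column by column would need its INFO fields handled with care.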