Start and end position per partition

Hi hail team! I have a question about extracting chromosomal start and end positions from MatrixTable partitions. I need to export a large callset to VCF (sharded) and report the first and last locus of each VCF shard. Is there a fast way to get this information from the MatrixTable partitions being exported? Or would it be faster to query each VCF shard?

I’d appreciate any ideas – thank you!!

Hi @ch-kr, there’s no way to determine ahead of time what the partitioning might be for an arbitrary pipeline on a matrix table. If you happen to have a matrix table already that you are exporting (and not doing complex operations or filters that could change the partitioning), then the partitioning should stay the same and you can find that information in the matrix table metadata. Otherwise, you’ll have to examine the output VCF shards. This is something we can incorporate in the future by outputting the partitioning information as part of export_vcf, but it does not exist at the moment. If this would be a valuable feature to you, please let us know!
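If you do end up examining the output shards, here is a minimal sketch of that fallback. It assumes each shard is a gzip- or bgzip-compressed VCF on local disk (bgzip output is a series of gzip members, so Python's gzip module can read it). The function name `shard_first_last_locus` is made up for illustration and is not part of Hail. It scans the whole shard, so it is simple rather than fast.

```python
import gzip


def shard_first_last_locus(path):
    """Return ((contig, pos), (contig, pos)) for the first and last
    data record in a compressed VCF shard, or None if it has no records."""
    first = last = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip header lines and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            # CHROM and POS are the first two tab-separated columns.
            contig, pos = line.split("\t", 2)[:2]
            locus = (contig, int(pos))
            if first is None:
                first = locus
            last = locus
    return None if first is None else (first, last)
```

Calling it once per shard file gives the start and end locus for each shard, at the cost of decompressing every shard.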


thanks for responding! sorry, I should have clarified – I’m planning to checkpoint the MatrixTable just before exporting and was wondering if there was a fast way to check the loci in the partitions once they are written. basic question, where is the MT metadata stored?

re: adding the partitioning: I’m not sure how many pipelines need this information, but it might be helpful to add if it isn’t too complicated!

That’s great. If you look inside of a saved matrix table, you should see a metadata.json.gz in basically every directory. In particular, if you look inside my_matrix_table.mt/rows/rows you should find a parts directory, which holds the partition data, and a metadata.json.gz file. That JSON file includes the start and end keys for the row partitions, e.g.

  "_jRangeBounds": [
    {
      "start": {
        "locus": {
          "contig": "1",
          "position": 904165
        },
        "alleles": [
          "G",
          "A"
        ]
      },
      "end": {
        "locus": {
          "contig": "X",
          "position": 154087368
        },
        "alleles": [
          "T",
          "A"
        ]
      },
      "includeStart": true,
      "includeEnd": true
    }
  ],

for a matrix table with a single partition.
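As a rough sketch, the bounds can be pulled out of that file with the Python standard library. This assumes the matrix table sits on a local filesystem, that `_jRangeBounds` is a top-level key of the JSON (as the excerpt above suggests), and that the row key is locus plus alleles. The function name `read_partition_bounds` is hypothetical, and the file layout is an internal detail that may change between Hail versions.

```python
import gzip
import json


def read_partition_bounds(metadata_path):
    """Return one ((contig, pos), (contig, pos)) pair per row partition,
    read from a rows/rows/metadata.json.gz file."""
    with gzip.open(metadata_path, "rt") as handle:
        metadata = json.load(handle)
    bounds = []
    # Assumption: "_jRangeBounds" is a top-level list with one interval per partition.
    for interval in metadata["_jRangeBounds"]:
        start = interval["start"]["locus"]
        end = interval["end"]["locus"]
        bounds.append(
            ((start["contig"], start["position"]),
             (end["contig"], end["position"]))
        )
    return bounds
```

For the example above, `read_partition_bounds("my_matrix_table.mt/rows/rows/metadata.json.gz")` would give a single pair running from 1:904165 to X:154087368. Note that these are interval bounds, so check `includeStart` and `includeEnd` if you need to know whether the endpoints themselves belong to the partition.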


Hi again! Thanks for the tip about the metadata file. I just realized I have another question related to this topic. I forgot that I have a sparse MT that gets densified prior to export, and I was checking the positions in the partitions of the sparse MT to get the dense VCF shard start/stops. Is there still a fast/easy way to get the positions without having to checkpoint the full dense MT?

We plan to write something that doesn’t rely on internal details, but for now this should work:

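import hail as hl

# `ht` is assumed here to be the rows table of the densified MT, e.g. ht = mt.rows()
def part_min_and_max(part):
    # keep only the row key fields, then take the first and last key in the partition
    keys = part.map(lambda x: x.select('locus', 'alleles'))
    return hl.struct(start=keys[0], end=keys[-1])

# _map_partitions is a private method; this collects one start/end struct per partition
ht._map_partitions(lambda p: hl.array([part_min_and_max(p)])).collect()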

thank you!!!