Start and end position per partition

Hi hail team! I have a question about extracting chromosomal start and end positions from MatrixTable partitions. I need to export a large callset to VCF (sharded) and report the first and last locus of each VCF shard. Is there a fast way to get this information from the MatrixTable partitions being exported? Or would it be faster to query each VCF shard?

I’d appreciate any ideas – thank you!!

Hi @ch-kr, there’s no way to determine ahead of time what the partitioning might be for an arbitrary pipeline on a matrix table. If you happen to have a matrix table already that you are exporting (and not doing complex operations or filters that could change the partitioning), then the partitioning should stay the same and you can find that information in the matrix table metadata. Otherwise, you’ll have to examine the output VCF shards. This is something we can incorporate in the future by outputting the partitioning information as part of export_vcf, but it does not exist at the moment. If this would be a valuable feature to you, please let us know!


thanks for responding! sorry, I should have clarified – I’m planning to checkpoint the MatrixTable just before exporting and was wondering if there is a fast way to check the loci in the partitions once they are written. Basic question: where is the MT metadata stored?

re: adding the partitioning: I’m not sure how many pipelines need this information, but it might be helpful to add if it isn’t too complicated!

That’s great. If you look inside a saved matrix table, you should see a metadata.json.gz in basically every directory. In particular, if you look inside my_matrix_table.mt/rows/rows you should find a parts directory, which holds the partition data, and a metadata.json.gz. That JSON file includes the start and end keys for the row partitions, e.g.

  "_jRangeBounds": [
    {
      "start": {
        "locus": {
          "contig": "1",
          "position": 904165
        },
        "alleles": [
          "G",
          "A"
        ]
      },
      "end": {
        "locus": {
          "contig": "X",
          "position": 154087368
        },
        "alleles": [
          "T",
          "A"
        ]
      },
      "includeStart": true,
      "includeEnd": true
    }
  ],

for a matrix table with a single partition.
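
In case it’s useful, here’s a rough sketch of pulling those bounds out of the metadata with plain Python (assuming the MatrixTable path is locally accessible; for a cloud path you could copy the metadata file down first, e.g. with hl.hadoop_copy). The path and the _jRangeBounds field come from the example above:

import gzip
import json

# Hypothetical path; point this at your own MatrixTable.
meta_path = "my_matrix_table.mt/rows/rows/metadata.json.gz"

with gzip.open(meta_path, "rt") as f:
    metadata = json.load(f)

# Each entry of _jRangeBounds describes one row partition's start and end keys.
for i, bounds in enumerate(metadata["_jRangeBounds"]):
    start = bounds["start"]["locus"]
    end = bounds["end"]["locus"]
    print(f"partition {i}: {start['contig']}:{start['position']} - {end['contig']}:{end['position']}")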


Hi again! Thanks for the tip about the metadata file. I just realized I have another question related to this topic. I forgot that I have a sparse MT that gets densified prior to export, and I was checking the positions in the partitions of the sparse MT to get the dense VCF shard start/stops. Is there still a fast/easy way to get the positions without having to checkpoint the full dense MT?

We plan to write something that doesn’t rely on internal details, but for now this should work:

import hail as hl

def part_min_and_max(part):
    # `part` is the stream of rows in one partition; keep just the key fields.
    keys = part.map(lambda x: x.select('locus', 'alleles'))
    # First and last key of the partition.
    return hl.struct(start=keys[0], end=keys[-1])

ht._map_partitions(lambda p: hl.array([part_min_and_max(p)])).collect()
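
Each collected element corresponds to one partition, so the result is the first and last (locus, alleles) of every partition, in partition order.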

thank you!!!

I forgot to add this update – posting here in case anyone else also has this issue.

I ran into this error with the part_min_and_max function:

ValueError: Table._map_partitions must preserve key fields

unkeying the HT before calling part_min_and_max fixed this for me:

# This throws an error
# `ht` is keyed by locus and alleles
ht = hl.read_table(path)
ht._map_partitions(lambda p: hl.array([part_min_and_max(p)])).collect()

# This fixes the issue
ht = ht.key_by()
ht._map_partitions(lambda p: hl.array([part_min_and_max(p)])).collect()

looking back on this, just curious – is there a better way to get around the “must preserve key fields” error?

The rule is that the return type of map_partitions has to be an array whose entries are structs with the same key as the original ht. So if ht is keyed by locus and alleles, then to make this work part_min_and_max would have to return a struct with locus and alleles fields. We’d also expect that they are still in sorted order.
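
Something along these lines might work (an untested sketch, a variant of part_min_and_max here called part_bounds): keep the start key as the row’s actual key fields and carry the end key along as extra non-key fields, so the returned rows still have locus and alleles and stay in sorted order:

def part_bounds(part):
    keys = part.map(lambda x: x.select('locus', 'alleles'))
    first, last = keys[0], keys[-1]
    # The start key stays as the key fields; the end key rides along
    # as ordinary (non-key) fields.
    return hl.struct(locus=first.locus, alleles=first.alleles,
                     end_locus=last.locus, end_alleles=last.alleles)

ht._map_partitions(lambda p: hl.array([part_bounds(p)])).collect()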


I see, thanks!!

I have another basic question – I just ran into this error with this function (part_min_and_max):

  File "/tmp/25aa70a4f6c747838d0efc284b465c92/prepare_vcf_data_release.py", line 974, in <module>
    main(args)
  File "/tmp/25aa70a4f6c747838d0efc284b465c92/prepare_vcf_data_release.py", line 845, in main
    start_stop_list = ht._map_partitions(
  File "<decorator-gen-1127>", line 2, in collect
  File "/opt/conda/default/lib/python3.8/site-packages/hail/typecheck/check.py", line 577, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/table.py", line 1920, in collect
    return Env.backend().execute(e._ir)
  File "/opt/conda/default/lib/python3.8/site-packages/hail/backend/py4j_backend.py", line 108, in execute
    raise HailUserError(message_and_trace) from None
hail.utils.java.HailUserError: Error summary: HailException: array index out of bounds: index=0, length=0
------------
Hail stack trace:
  File "/tmp/25aa70a4f6c747838d0efc284b465c92/prepare_vcf_data_release.py", line 974, in <module>
    main(args)

  File "/tmp/25aa70a4f6c747838d0efc284b465c92/prepare_vcf_data_release.py", line 845, in main
    start_stop_list = ht._map_partitions(

  File "/opt/conda/default/lib/python3.8/site-packages/hail/table.py", line 3518, in _map_partitions
    body = f(expr)

  File "/tmp/25aa70a4f6c747838d0efc284b465c92/prepare_vcf_data_release.py", line 846, in <lambda>
    lambda p: hl.array([part_min_and_max(p)])

  File "/tmp/25aa70a4f6c747838d0efc284b465c92/prepare_vcf_data_release.py", line 843, in part_min_and_max
    return hl.struct(start=keys[0], end=keys[-1])

  File "/opt/conda/default/lib/python3.8/site-packages/hail/expr/expressions/typed_expressions.py", line 834, in __getitem__
    return super().__getitem__(item)

  File "/opt/conda/default/lib/python3.8/site-packages/hail/expr/expressions/typed_expressions.py", line 481, in __getitem__
    return self._method("indexArray", self.dtype.element_type, item)

  File "/opt/conda/default/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py", line 596, in _method
    x = ir.Apply(name, ret_type, self._ir, *(a._ir for a in args))

  File "/opt/conda/default/lib/python3.8/site-packages/hail/ir/ir.py", line 2263, in __init__
    self.save_error_info()

ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [25aa70a4f6c747838d0efc284b465c92] failed with error:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at:
https://console.cloud.google.com/dataproc/jobs/25aa70a4f6c747838d0efc284b465c92?project=maclab-ukbb&region=us-central1
gcloud dataproc jobs wait '25aa70a4f6c747838d0efc284b465c92' --region 'us-central1' --project 'maclab-ukbb'
https://console.cloud.google.com/storage/browser/dataproc-1aca38e4-67fe-4b64-b451-258ef1aea4d1-us-central1/google-cloud-dataproc-metainfo/3bfd808f-a289-49a2-9626-f1ead2ba216c/jobs/25aa70a4f6c747838d0efc284b465c92/
gs://dataproc-1aca38e4-67fe-4b64-b451-258ef1aea4d1-us-central1/google-cloud-dataproc-metainfo/3bfd808f-a289-49a2-9626-f1ead2ba216c/jobs/25aa70a4f6c747838d0efc284b465c92/driveroutput
Traceback (most recent call last):
  File "/Users/kchao/anaconda3/envs/hail/bin/hailctl", line 8, in <module>
    sys.exit(main())
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/__main__.py", line 100, in main
    cli.main(args)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/cli.py", line 122, in main
    jmp[args.module].main(args, pass_through_args)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/submit.py", line 78, in main
    gcloud.run(cmd)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/site-packages/hailtop/hailctl/dataproc/gcloud.py", line 9, in run
    return subprocess.check_call(["gcloud"] + command)
  File "/Users/kchao/anaconda3/envs/hail/lib/python3.7/subprocess.py", line 328, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['gcloud', 'dataproc', 'jobs', 'submit', 'pyspark', 'prepare_vcf_data_release.py', '--cluster=chr9', '--files=', '--py-files=/var/folders/xq/8jnhrt2s2h58ts2v0br5g8gm0000gp/T/pyscripts_y547yk4z.zip', '--properties=', '--', '--prepare_release_vcf', '--slack_channel', '@kc (she/her)', '--contig', 'chr9']' returned non-zero exit status 1.

What would cause this to throw an array index out of bounds error? Here’s the code: ukbb_qc/prepare_vcf_data_release.py at freeze_7 · broadinstitute/ukbb_qc · GitHub (will email the log)