Export VCF produced empty shard (header only)

Hi hail team!

I was looking into why I got the error below and noticed that hl.export_vcf had exported an empty VCF shard (header only, no data).

For more context, I was trying to check the first and last position of every partition in a table so that I could report the start and end position of each VCF shard. When I ran the code to get the positions in each table partition, it crashed with an array index out of bounds error:

hail.utils.java.HailUserError: Error summary: HailException: array index out of bounds: index=0, length=0

I was stumped as to why I saw this error, so I opened the first VCF shard (part-00000.bgz) and noticed that it only had a header. I’ve checked a few other shards, and they all appear to have data. Is it expected that export_vcf might export an empty shard?

Here is the code: ukbb_qc/prepare_vcf_data_release.py at freeze_7 · broadinstitute/ukbb_qc · GitHub (I also emailed the log associated with this job/error on 11/3).

> When I ran the code to get the positions in each table partition, it crashed with an array index out of bounds error:

This means there’s an empty partition somewhere. That’s probably why you have a header-only VCF too.

Did you just read/export a matrixtable? If so, the partition counts in the matrix metadata could help here.
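If digging into the metadata isn't convenient, here is a minimal sketch for counting rows per partition to find the empty ones. It leans on the same private _map_partitions API as the code pasted further down, so treat it as unofficial; mt_path is a placeholder for the exported MatrixTable's path.

    import hail as hl

    # Sketch: count rows in each partition of the row table to locate empty partitions.
    # _map_partitions is a private API; mt_path is a placeholder path.
    mt = hl.read_matrix_table(mt_path)
    ht = mt.rows().key_by()  # unkey, since the returned struct does not preserve key fields
    counts = ht._map_partitions(
        lambda part: hl.array([hl.struct(n_rows=hl.len(part))])
    ).collect()
    empty = [i for i, s in enumerate(counts) if s.n_rows == 0]
    print(f"{len(empty)} empty partition(s): {empty}")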

Yup, read/exported a matrixtable.

It’s weird that there’s an empty partition; I ran a couple of naive_coalesce calls before exporting. Maybe I missed something?

            # Read in MT and filter to contig
            # Repartition to a large number of partitions here so that the chromosomes have closer to the desired number of shards
            mt = hl.read_matrix_table(
                ukb_vcf_mt_path, _n_partitions=40000,
            ).select_rows()

            if args.test:
                mt = mt._filter_partitions(range(2))
            mt = hl.filter_intervals(mt, [hl.parse_locus_interval(contig)])
            mt = mt.annotate_rows(**ht[mt.row_key])

            logger.info("Adjusting partitions...")
            mt = mt.naive_coalesce(n_partitions)
            logger.info("%s has %i partitions", contig, mt.n_partitions())

            logger.info("Densifying...")
            mt = hl.experimental.densify(mt)
            # Drop END and het non ref to avoid exporting
            mt = mt.drop("het_non_ref")

            logger.info("Removing low QUAL variants and * alleles...")
            info_ht = hl.read_table(info_ht_path(data_source, freeze))
            mt = mt.filter_rows(
                (~info_ht[mt.row_key].AS_lowqual)
                & ((hl.len(mt.alleles) > 1) & (mt.alleles[1] != "*"))
            )

            logger.info("Adjusting partitions again post-densify...")
            mt = mt.naive_coalesce(n_partitions)
            logger.info("%s has %i partitions post-densify", contig, mt.n_partitions())
            ht = mt.rows()
            # Unkey HT to avoid this error with map_partitions:
            # ValueError: Table._map_partitions must preserve key fields
            ht = ht.key_by()

            logger.info("Adjusting sex ploidy...")
            mt = adjust_sex_ploidy(mt, mt.sex_karyotype, male_str="XY", female_str="XX")
            mt = mt.select_cols()

            logger.info("Exporting VCF...")
            hl.export_vcf(
                mt,
                release_vcf_path(*tranche_data, contig=contig),
                metadata=header_dict,
                append_to_header=append_to_vcf_header_path(*tranche_data),
                parallel="header_per_shard",
                # Removed tabix here because the repackaging code produces an index
            )

            logger.info("Getting start and stops per shard...")

            def part_min_and_max(part):
                keys = part.map(lambda x: x.select("locus", "alleles"))
                return hl.struct(start=keys[0], end=keys[-1])

            start_stop_list = ht._map_partitions(
                lambda p: hl.array([part_min_and_max(p)])
            ).collect()
            print(start_stop_list)
            with hl.hadoop_open(
                f"gs://broad-ukbb/broad.freeze_7/release/vcf_positions_tsvs/{contig}_start_end_pos.tsv",
                "w",
            ) as o:
                o.write("vcf_shard_name\tstart_pos\tend_pos\n")
                for count, struct in enumerate(start_stop_list):
                    # Format shard name to have leading zeros
                    # e.g., 0 should be part-00000.bgz
                    shard_name = f"part-{count:05d}.bgz"
                    o.write(
                        f"{shard_name}\t{struct.start.locus.position}\t{struct.end.locus.position}\n"
                    )
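For what it’s worth, a small defensive tweak to part_min_and_max (a sketch, not tested against this pipeline, reusing the names from the snippet above) would sidestep the index-out-of-bounds by returning a missing start/end for an empty partition instead of indexing into an empty array:

    # Sketch: guard the keys[0] / keys[-1] lookups so an empty partition
    # yields missing start/end values rather than an index-out-of-bounds error.
    def part_min_and_max(part):
        keys = part.map(lambda x: x.select("locus", "alleles"))
        return hl.struct(
            start=hl.or_missing(hl.len(keys) > 0, keys[0]),
            end=hl.or_missing(hl.len(keys) > 0, keys[-1]),
        )

The TSV-writing loop would then need to skip (or specially mark) entries whose start comes back as None after collect().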

naive_coalesce just groups adjacent partitions; it doesn’t try to balance them by row count. I think there could certainly be runs of empty partitions before/after these filters.
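If the goal is to guarantee that no shard comes out empty, one option is a shuffle-based repartition right before the export. This is only a sketch (reusing mt, n_partitions, contig, and logger from the snippet above): repartition with shuffle=True is considerably more expensive than naive_coalesce, but it rebalances rows across partitions, so no partition (and hence no VCF shard) should be empty as long as the row count comfortably exceeds the partition count.

    # Sketch: a full shuffle rebalances rows across partitions instead of
    # merging adjacent ones, so empty partitions should not survive it
    # (assuming there are far more rows than partitions).
    mt = mt.repartition(n_partitions, shuffle=True)
    logger.info("%s has %i partitions after shuffle repartition", contig, mt.n_partitions())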

Ooh, thanks!