I was looking into why I got this error and noticed that hl.export_vcf exported an empty VCF shard (header only, no data).
For more context, I was trying to check the first and last position of every partition in a table so that I could report the start and end position of each VCF shard. When I ran the code to get the positions in each table partition, it crashed with an array index out of bounds error:
hail.utils.java.HailUserError: Error summary: HailException: array index out of bounds: index=0, length=0
I was stumped as to why I saw this error but opened the first VCF shard (part-00000.bgz) and noticed that it only had a header. I’ve checked a few other shards, and they all appear to have data. Is it expected that export_vcf might export an empty shard?
It’s weird that there’s an empty partition – I ran a couple naive_coalesces before exporting. Maybe I missed something?
# Read in MT and filter to contig
# Repartition to a large number of partitions here so that the chromosomes have closer to the desired number of shards
mt = hl.read_matrix_table(
ukb_vcf_mt_path, _n_partitions=40000,
).select_rows()
if args.test:
mt = mt._filter_partitions(range(2))
mt = hl.filter_intervals(mt, [hl.parse_locus_interval(contig)])
mt = mt.annotate_rows(**ht[mt.row_key])
logger.info("Adjusting partitions...")
mt = mt.naive_coalesce(n_partitions)
logger.info("%s has %i partitions", contig, mt.n_partitions())
logger.info("Densifying...")
mt = hl.experimental.densify(mt)
# Drop END and het non ref to avoid exporting
mt = mt.drop("het_non_ref")
logger.info("Removing low QUAL variants and * alleles...")
info_ht = hl.read_table(info_ht_path(data_source, freeze))
mt = mt.filter_rows(
(~info_ht[mt.row_key].AS_lowqual)
& ((hl.len(mt.alleles) > 1) & (mt.alleles[1] != "*"))
)
logger.info("Adjusting partitions again post-densify...")
mt = mt.naive_coalesce(n_partitions)
logger.info("%s has %i partitions post-densify", contig, mt.n_partitions())
ht = mt.rows()
# Unkey HT to avoid this error with map_partitions:
# ValueError: Table._map_partitions must preserve key fields
ht = ht.key_by()
logger.info("Adjusting sex ploidy...")
mt = adjust_sex_ploidy(mt, mt.sex_karyotype, male_str="XY", female_str="XX")
mt = mt.select_cols()
logger.info("Exporting VCF...")
hl.export_vcf(
mt,
release_vcf_path(*tranche_data, contig=contig),
metadata=header_dict,
append_to_header=append_to_vcf_header_path(*tranche_data),
parallel="header_per_shard",
# Removed tabix here because the repackaging code produces an index
)
logger.info("Getting start and stops per shard...")
def part_min_and_max(part):
keys = part.map(lambda x: x.select("locus", "alleles"))
return hl.struct(start=keys[0], end=keys[-1])
start_stop_list = ht._map_partitions(
lambda p: hl.array([part_min_and_max(p)])
).collect()
print(start_stop_list)
with hl.hadoop_open(
f"gs://broad-ukbb/broad.freeze_7/release/vcf_positions_tsvs/{contig}_start_end_pos.tsv",
"w",
) as o:
o.write("vcf_shard_name\tstart_pos\tend_pos\n")
for count, struct in enumerate(start_stop_list):
# Format shard name to have leading zeros
# e.g., 0 should be part-000000.bgz
shard_name = f"part-{count:05d}.bgz"
o.write(
f"{shard_name}\t{struct.start.locus.position}\t{struct.end.locus.position}\n"
)
Naive_coalesce just groups adjacent partitions, it doesn’t try to balance them by row count. I think there could certainly be runs of empty partitions before/after these filters.