Hi
I am currently experimenting with Hail and have read up quite a bit here and there to get it running, solving issues ranging from files not being found to how to store the result.
The issue I face now is that the VCF is apparently being re-sorted before I can run combiner.run().
How can I avoid this step? The gVCF should already be sorted, and the tabix file is generated and located next to it.
Could the problem be that the files on UKB do not end in .vcf[.bgz, .gz]?
Any clue or tip is highly appreciated.
I am currently testing with only one sample and one chromosome on 16 CPUs, to keep the cost low.
combiner = hl.vds.new_combiner(
    output_path=f'dnax://{my_database}/vds_result',
    temp_path=f'dnax://{my_database}/tmp/combiner',
    gvcf_paths=test_vcf[0],
    use_genome_default_intervals=True,
    reference_genome='GRCh38'
)
No, this is fine.
I would recommend testing the combiner with at least 2 inputs.
The message you’re seeing indicates that the VCF is sorted.
What’s going on here is that we scan a single GVCF to determine which fields should be in the reference data. If you set gvcf_reference_entry_fields_to_keep in new_combiner, these messages should disappear.
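For illustration, here is a minimal sketch of passing gvcf_reference_entry_fields_to_keep (and at least two inputs) to new_combiner. The specific field names GQ, DP and MIN_DP are assumptions; use whatever FORMAT fields your GVCFs actually carry in reference blocks.

import hail as hl

combiner = hl.vds.new_combiner(
    output_path=f'dnax://{my_database}/vds_result',
    temp_path=f'dnax://{my_database}/tmp/combiner',
    gvcf_paths=test_vcf[:2],  # at least two gVCF inputs, as recommended above
    # Explicitly list the reference-block entry fields to keep (names assumed here),
    # so the combiner does not need to scan a GVCF to discover them.
    gvcf_reference_entry_fields_to_keep=['GQ', 'DP', 'MIN_DP'],
    use_genome_default_intervals=True,
    reference_genome='GRCh38',
)
combiner.run()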
@chrisvittal thanks a lot for the clarifications!
By setting the gvcf_reference_entry_fields_to_keep field, would that also reduce the overall run time, or is that more bound to the number of CPUs available for the initial scan?
The runtime is dominated by the actual combine work. Anything you see before running combiner.run() is setup/metadata collection. Setting gvcf_reference_entry_fields_to_keep just avoids the scan that determines the defined reference fields, by explicitly requesting the fields you want to keep.
Thank you again @chrisvittal.
I know this is now going beyond the title; I will change it if I can. But would you be so kind as to explain a bit further here? My question is: how could I break down this preparation phase, if that is actually possible?
Currently the setup/metadata collection took 3.1 h on 16 CPUs, and I am not sure whether that is an expected runtime.
Any tip or recommendation on where to look is highly appreciated.
What are you seeing? It sounds like the actual import and combine work is proceeding rather than the small metadata gathering that the combiner performs.
Sorry for the late reply, we had some public holidays here.
Anyway, here is what I have observed:
when I was running
combiner = hl.vds.new_combiner(
    output_path=f'dnax://{my_database}/vds_result',
    temp_path=f'dnax://{my_database}/tmp/combiner',
    gvcf_paths=test_vcf[0],
    use_genome_default_intervals=True,
    reference_genome='GRCh38'
)
I saw on the control panel for the Spark cluster that around 52 jobs were submitted and each took about 1 h of running time. I didn’t get any log output during this time.
Afterwards I executed combiner.run(), which sent out around 12,000 jobs.
Based on the message regarding sorting, I actually thought that during those previous 3 h the function was trying to re-sort the already sorted gVCF.
Is there a way to see what is ongoing during the initial phase?
There isn’t really. I’ve added a bit more logging so that in the future, there’s more visibility into what new_combiner is doing. Thank you for the feedback.
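In the meantime, one partial workaround (a sketch, not from this thread, assuming a standard Spark-backed Hail session) is to point the Hail driver log at a known path when initializing and tail that file alongside the Spark UI while new_combiner builds its plan:

import hail as hl

# Assumption: write the driver log to a path you can tail from the notebook node.
hl.init(log='/tmp/hail-combiner.log')

# ... build the combiner as above; Hail's driver messages (including those about
# GVCF metadata scanning and sorting) are written to this log file.
# In a separate shell: tail -f /tmp/hail-combiner.log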