Hi
I am currently experimenting with Hail and have read up quite a bit here and there to get it running, solving issues ranging from files not being found to how to store the result.
The issue I face now is that the VCF is apparently being re-sorted before I can run combiner.run().
How can I avoid this step? The gVCF should already be sorted, and the tabix file is generated and located next to it.
Could the problem be that the files on UKB do not end in .vcf[.bgz, .gz]?
Any clue or tip is highly appreciated.
I am currently testing with only one sample and one chromosome on 16 CPUs, to keep the cost low.
combiner = hl.vds.new_combiner(
    output_path=f'dnax://{my_database}/vds_result',
    temp_path=f'dnax://{my_database}/tmp/combiner',
    gvcf_paths=test_vcf[0],
    use_genome_default_intervals=True,
    reference_genome='GRCh38'
)
No, this is fine.
I would recommend testing the combiner with at least 2 inputs.
The message you’re seeing indicates that the VCF is sorted.
What’s going on here is that we scan a single GVCF to determine which fields should be in the reference data. If you set gvcf_reference_entry_fields_to_keep in new_combiner, these messages should disappear.
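For illustration, here is a minimal sketch of passing gvcf_reference_entry_fields_to_keep (and at least two inputs) to new_combiner. The specific field names GQ, DP and MIN_DP are assumptions; use whatever FORMAT fields your GVCFs actually carry in reference blocks.

import hail as hl

combiner = hl.vds.new_combiner(
    output_path=f'dnax://{my_database}/vds_result',
    temp_path=f'dnax://{my_database}/tmp/combiner',
    gvcf_paths=test_vcf[:2],  # at least two gVCF inputs, as recommended above
    # Explicitly list the reference-block entry fields to keep (names assumed here),
    # so the combiner does not need to scan a GVCF to discover them.
    gvcf_reference_entry_fields_to_keep=['GQ', 'DP', 'MIN_DP'],
    use_genome_default_intervals=True,
    reference_genome='GRCh38',
)
combiner.run()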
@chrisvittal thanks a lot for the clarifications!
By setting the gvcf_reference_entry_fields_to_keep field, would that also reduce the overall run time, or is that more bound to the number of CPUs available for the initial scan?
The runtime is dominated by the actual combine work. Anything you see before running combiner.run() is setup/metadata collection. Setting gvcf_reference_entry_fields_to_keep just avoids the scan that determines the defined reference fields, by explicitly requesting the fields you want to keep.
Thank you again @chrisvittal.
I know this is now going beyond the title; I will change it if I can. But would you be so kind as to explain a bit further here? My question is: how could I break down this preparation phase, if that is actually possible?
Currently the setup/metadata collection took 3.1 h on 16 CPUs, and I am not sure whether that is an expected runtime.
Any tip or recommendation on where to look is highly appreciated.
What are you seeing? It sounds like the actual import and combine work is proceeding rather than the small metadata gathering that the combiner performs.
Sorry for the late reply, we had some public holidays here.
Anyway, here is what I have observed:
when I was running
combiner = hl.vds.new_combiner(
    output_path=f'dnax://{my_database}/vds_result',
    temp_path=f'dnax://{my_database}/tmp/combiner',
    gvcf_paths=test_vcf[0],
    use_genome_default_intervals=True,
    reference_genome='GRCh38'
)
I saw on the control panel for the Spark cluster that around 52 jobs were submitted and each took about 1 h of running time. I didn’t get any log output during this time.
Afterwards I executed combiner.run(), which sent out around 12,000 jobs.
Based on the message regarding sorting, I actually thought that during those previous 3 h the function was trying to re-sort the already sorted gVCF.
Is there a way to see what is ongoing during the initial phase?
There isn’t really. I’ve added a bit more logging so that in the future, there’s more visibility into what new_combiner is doing. Thank you for the feedback.
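In the meantime, one partial workaround (a sketch, not from this thread, assuming a standard Spark-backed Hail session) is to point the Hail driver log at a known path when initializing and tail that file alongside the Spark UI while new_combiner builds its plan:

import hail as hl

# Assumption: write the driver log to a path you can tail from the notebook node.
hl.init(log='/tmp/hail-combiner.log')

# ... build the combiner as above; Hail's driver messages (including those about
# GVCF metadata scanning and sorting) are written to this log file.
# In a separate shell: tail -f /tmp/hail-combiner.log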