Hi,
I’m running a small Hail program that loads a VCF file, annotates it with GnomAD 4.1 allele frequency data, and then exports to a CSV file. I can provide the code if needed, but have omitted it here for brevity.
The VCF is pretty tiny; it contains only a couple of variants that I’m using for development purposes before running this on WGS data.
To facilitate local development, I’ve got Hail installed in Docker along with my script, and I run everything in one container. The GnomAD reference data is mounted from a network share onto my local machine, and that mount is then passed into the Docker container.
When I run it, everything works fine sometimes, but I frequently see hangs around the point where I think Hail is starting tasks. I see a bunch of logs scroll past in the Hail log file and then it hangs around lines like:
2024-09-25 12:19:06.367 : INFO: instruction count: 3: __C1174staticWrapperClass_1.
2024-09-25 12:19:06.368 : INFO: instruction count: 3: __C1184RGContainer_GRCh38.
2024-09-25 12:19:06.369 : INFO: instruction count: 3: __C1184RGContainer_GRCh38.
2024-09-25 12:19:06.801 : INFO: encoder cache hit
2024-09-25 12:19:07.250 : INFO: RegionPool: initialized for thread 65: Executor task launch worker for task 0.0 in stage 2.0 (TID 2)
2024-09-25 12:19:07.270 : INFO: RegionPool: REPORT_THRESHOLD: 320.0K allocated (192.0K blocks / 128.0K chunks), regions.size = 3, 0 current java objects, thread 65: Executor task launch worker for task 0.0 in stage 2.0 (TID 2)
2024-09-25 12:19:07.276 : INFO: RegionPool: REPORT_THRESHOLD: 512.0K allocated (320.0K blocks / 192.0K chunks), regions.size = 3, 0 current java objects, thread 65: Executor task launch worker for task 0.0 in stage 2.0 (TID 2)
2024-09-25 12:19:07.283 : INFO: RegionPool: REPORT_THRESHOLD: 1.0M allocated (832.0K blocks / 192.0K chunks), regions.size = 3, 0 current java objects, thread 65: Executor task launch worker for task 0.0 in stage 2.0 (TID 2)
2024-09-25 12:19:07.311 : INFO: RegionPool: REPORT_THRESHOLD: 2.0M allocated (1.8M blocks / 192.0K chunks), regions.size = 5, 0 current java objects, thread 65: Executor task launch worker for task 0.0 in stage 2.0 (TID 2)
CPU drops to minimal consumption and nothing seems to happen.
There doesn’t seem to be much in the log files about what’s going wrong, and with CPU dropping to <1% utilisation I don’t think it’s just running slowly and still making progress.
Hi Peter,
I’m not sure what’s happening here. I do have one theory though. Even if the VCF is tiny, you’re joining it with a large table. To do this on a single machine, Hail needs to be able to avoid reading the entire GnomAD table. Due to the way Hail works, it can’t do this when directly annotating after importing the VCF. But if you write the imported VCF as a Hail MatrixTable first, then read it back in and annotate, it should be able to read only the relevant parts of the GnomAD table (this is because we create an index when writing a MatrixTable).
That’s the first thing I would try. Let us know if you still see the same behavior.
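Roughly, the pattern I have in mind is something like the following (a minimal sketch with placeholder paths, and assuming a freq field on the GnomAD table):

import hail as hl

# Import the VCF and write it straight out as a MatrixTable.
# Writing creates an index, which lets the later join read only the
# relevant partitions of the GnomAD table instead of scanning all of it.
mt = hl.import_vcf('input.vcf.bgz', reference_genome='GRCh38', force_bgz=True)
mt.write('checkpoint.mt', overwrite=True)

# Read the written MatrixTable back and annotate against GnomAD.
mt = hl.read_matrix_table('checkpoint.mt')
gnomad = hl.read_table('gnomad.ht')
mt = mt.annotate_rows(FREQ=gnomad[mt.row_key].freq)

mt.checkpoint(path) does the write-then-read in a single call, if you prefer that.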
Hi Patrick, thanks for your response. I seem to have overcome my immediate problems.
We’re using containers to run Hail and its embedded Spark backend, but we run them on AWS Batch, so the underlying machines are potentially much larger than the size of the container.
With the Java 11 API reporting the CPU count and memory of the entire machine rather than just what’s available to the container, I think Spark was picking completely unrealistic defaults for the number of cores and amount of memory to use.
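For anyone else hitting this, a quick way to see the mismatch from inside the container is something like the following (a rough sketch; the cgroup file paths depend on whether the host uses cgroup v1 or v2, so treat them as an example rather than gospel):

import os
from pathlib import Path

# os.cpu_count() reports the host's logical CPUs, not the container's quota.
print('host logical CPUs:', os.cpu_count())

# What the container is actually allowed to use (cgroup v2 paths shown;
# v1 uses cpu/cpu.cfs_quota_us and memory/memory.limit_in_bytes instead).
cpu_max = Path('/sys/fs/cgroup/cpu.max')
mem_max = Path('/sys/fs/cgroup/memory.max')
if cpu_max.exists():
    print('container cpu.max:', cpu_max.read_text().strip())
if mem_max.exists():
    print('container memory.max:', mem_max.read_text().strip())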
I can reliably run things now that I’ve added
hl.init(global_seed=0, master="local[1]", spark_conf={'spark.driver.memory': '4g'})
Now that it’s running I can tweak the number of CPUs and amount of memory in the container to see what scale I can run at. I’ll get round to that; at the moment I’m busy fiddling with IO throughput on the AWS Batch system. It looks like Hail is fairly efficient and generates heavier IO than our other processing pipelines, which, given we’re annotating VCF files with GnomAD 4.1 data, is about what I would expect from a well constructed system.
I’m interested in your comment about importing the VCF file and then converting it to a Hail MatrixTable before annotating. It sounds like this could reduce our IO load and generally be more efficient. At the moment I’m processing WGS, but I’d certainly expect to process WES or panels as well, so this looks like a good optimisation.
My code currently looks something like the following (I’m omitting the code that finds the relevant paths and have just left in the Hail calls):
hl.init(global_seed=0, master="local[1]", spark_conf={'spark.driver.memory': '4g'})
…
af_values = hl.read_table(hail_table_path)
af_values = af_values.filter(af_values.fail_hard_filters == False)
sample = hl.import_vcf(vcf_path, reference_genome=assembly, force_bgz=True)
allelics = hl.split_multi(sample)
results = allelics
results = results.annotate_rows(FREQ=af_values[results.row_key].freq)
data_to_export = results.key_rows_by(*results.row)
data_to_export.entry.export(str(csv_output))
So the split_multi call does return a VariantDataset containing a MatrixTable, but I’m not sure whether you’re suggesting I should export the allelics object and then re-import it?
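For concreteness, this is how I’ve interpreted the suggestion, though I may have it wrong (the checkpoint path is a placeholder, and I’m assuming split_multi hands me back something I can checkpoint like a MatrixTable):

sample = hl.import_vcf(vcf_path, reference_genome=assembly, force_bgz=True)
allelics = hl.split_multi(sample)

# Write the split data out and read it back so the join below can use the
# index on disk rather than pulling in the whole GnomAD table.
allelics = allelics.checkpoint('allelics_checkpoint.mt', overwrite=True)

results = allelics.annotate_rows(FREQ=af_values[allelics.row_key].freq)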
Thanks
Pete