Hi, I was trying to troubleshoot why the variant_qc and to_pandas steps take a very long time (I was trying to filter variants to 5 positions -> filter samples to a list of 20K -> variant qc -> export). There’s a warning message I don’t quite understand:
2019-03-14 20:29:58 Hail: WARN: Found differences between requested and initialized parameters. Ignoring requested parameters.
Param: tmpDir, Provided value: /tmp, Existing value: dbfs:/tmp/hail.MfvZ1EpfplvG
Param: minBlockSize, Provided value: 1, Existing value: 0
2019-03-14 20:30:00 Hail: INFO: Number of BGEN files parsed: 1
2019-03-14 20:30:00 Hail: INFO: Number of samples in BGEN files: 487409
2019-03-14 20:30:00 Hail: INFO: Number of variants across all BGEN files: 5751712
2019-03-14 20:30:39 Hail: INFO: interval filter loaded 8 of 18000 partitions
2019-03-14 20:30:41 Hail: INFO: Coerced sorted dataset
vds_result = hl.filter_intervals(mt, target_intervals_parsed).cache()
merged = vds_result.filter_cols(hl.literal(subject_array).contains(vds_result.s))
merged_variant_qc = hl.variant_qc(merged)
merged_variant_qc_pd = merged_variant_qc.rows().to_pandas()
Where are you running Hail? You can see this warning when
hl.init() is called with the
idempotent=True argument after Hail has already been initialized with different parameters.
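As a rough illustration (plain Python, not Hail's actual implementation — the names and behavior here are simplified assumptions), idempotent initialization keeps the first configuration and merely warns about any differing parameters on later calls:

```python
# Toy sketch of idempotent initialization; not Hail's real code.
_state = {}

def init(idempotent=False, **params):
    if _state:
        if not idempotent:
            raise RuntimeError("already initialized")
        ignored = sorted(k for k, v in params.items() if _state.get(k) != v)
        if ignored:
            # Mirrors the spirit of the warning above: requested values
            # are ignored in favor of the existing ones.
            print("WARN: ignoring requested parameters:", ignored)
        return dict(_state)
    _state.update(params)
    return dict(_state)

first = init(tmpDir="dbfs:/tmp/hail.MfvZ1EpfplvG", minBlockSize=0)
second = init(idempotent=True, tmpDir="/tmp", minBlockSize=1)  # warns; keeps first values
```

This matches the log above: the `tmpDir` and `minBlockSize` values requested by the second call are reported and discarded.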
How long is this taking? I’d expect it to take a minute or so.
The filter_cols line should probably use a set rather than an array in the literal:
merged = vds_result.filter_cols(hl.literal(set(subject_array)).contains(vds_result.s))
I am running Hail on Databricks (runtime HLS 5.3 beta). The filter_cols step is fast; it’s the next step, variant_qc, that took forever to complete (5 variants, 20K subjects). Thanks.
What Hail version are you using?
I need the minor version – before we dig in at all, I want to make sure it’s recent.
Did you try my change to the literal?
Note that even if the filter_cols line is fast, that could still be the problem – Hail is lazy and executes the entire pipeline together at the end.
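That laziness can be sketched in plain Python (conceptually similar to how Hail defers work; this toy class is not Hail's implementation): transformations return immediately, and all the recorded work runs at the final action.

```python
# Toy illustration of a lazy pipeline: filter() only records work;
# collect() is the action that actually executes everything.
class LazyTable:
    def __init__(self, rows, ops=()):
        self._rows = rows
        self._ops = ops

    def filter(self, pred):
        # Returns instantly: just records the operation.
        return LazyTable(self._rows, self._ops + (pred,))

    def collect(self):
        # The entire pipeline runs here, at the action.
        out = list(self._rows)
        for pred in self._ops:
            out = [r for r in out if pred(r)]
        return out

t = LazyTable(range(10)).filter(lambda r: r % 2 == 0).filter(lambda r: r > 2)
# No filtering has happened yet; it all runs inside collect():
result = t.collect()  # [4, 6, 8]
```

So a timing measurement taken at the filter step says nothing – the cost only shows up at whichever step forces execution.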
The array only contains unique elements. I understand it’s the variant_qc step that triggers execution, but is it expected to be this slow (hours without completing) for 5 variants x 20K samples? I just need a feasible way to perform the variant QC and export the result. Thanks.
It needs to be a set not for uniqueness, but for performance –
.contains on an array is O(n), while .contains on a set is O(log(n)).
See the warning here:
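The effect is easy to demonstrate in plain Python (illustration only, not Hail code; note that Python's set is hash-based, so lookups are O(1) on average rather than O(log n), but the contrast with a linear scan is the same):

```python
import timeit

# Compare worst-case membership tests on a 20K-element list vs. the
# same elements stored in a set.
items = list(range(20_000))
as_list = items
as_set = set(items)

needle = 19_999  # worst case for the linear list scan
t_list = timeit.timeit(lambda: needle in as_list, number=1_000)
t_set = timeit.timeit(lambda: needle in as_set, number=1_000)

print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
set_is_faster = t_set < t_list
```

With 20K subjects, that per-sample scan is repeated for every column, which is why the literal's type matters.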
Still very slow with the set(). And if I break the steps down, step (B) is the bottleneck – is there a more efficient way to look up the variant QC results than rows().to_pandas()? Thanks.
A. merged_variant_qc = hl.variant_qc(merged)
B. merged_variant_qc_pd = merged_variant_qc.rows().to_pandas()
to_pandas uses a Spark protocol that is known to be very slow. If you try
merged_variant_qc.write('...') (and then read and to_pandas that result), that’s the best measurement.
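The measurement idea can be sketched in plain Python (an analogy, not Hail code – in Hail the steps would be MatrixTable.write followed by hl.read_matrix_table): persist the computed result once, then run the conversion against the saved copy, so the pipeline cost and the to_pandas cost can be timed separately.

```python
import os
import pickle
import tempfile

# Stand-in for the expensive pipeline (filter -> variant_qc); the field
# names here are made up for illustration.
def run_pipeline():
    return [{"variant": i, "call_rate": 0.99} for i in range(5)]

# Step 1: run the pipeline once and write the result to disk
# (the Hail analogue is MatrixTable.write).
path = os.path.join(tempfile.mkdtemp(), "qc_result.pkl")
with open(path, "wb") as f:
    pickle.dump(run_pipeline(), f)

# Step 2: read the saved result back (Hail analogue: hl.read_matrix_table)
# and convert it; any remaining slowness now belongs to the conversion
# step, not to the pipeline itself.
with open(path, "rb") as f:
    rows = pickle.load(f)
```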
Can you run again and give us the Hail log file? You don’t need to let it run to completion, just let it get to the slow part.
Where can I find the Hail log?
It should be echoed in initialization:
In : hl.init()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 2.2.0
SparkUI available at http://10.1.0.166:4040
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.11-e8bbc49d0ae2
LOGGING: writing to /Users/tpoterba/hail/hail/hail-20190321-1134-0.2.11-e8bbc49d0ae2.log
Oh, wait – you’re using Databricks, and I think they do some weird stuff related to setup. Try:
In : hl.utils.java.Env.hc()._log
Actually, both ways work for identifying the log file path. Can I send you the log directly (it’s too large to copy-paste)? Thanks.