Hi, I was trying to troubleshoot why the variant_qc and to_pandas steps take a very long time (I was trying to filter variants to 5 positions -> filter samples to a list of 20K -> variant qc -> export). There’s a warning message I don’t quite understand:
2019-03-14 20:29:58 Hail: WARN: Found differences between requested and initialized parameters. Ignoring requested parameters.
Param: tmpDir, Provided value: /tmp, Existing value: dbfs:/tmp/hail.MfvZ1EpfplvG
Param: minBlockSize, Provided value: 1, Existing value: 0
2019-03-14 20:30:00 Hail: INFO: Number of BGEN files parsed: 1
2019-03-14 20:30:00 Hail: INFO: Number of samples in BGEN files: 487409
2019-03-14 20:30:00 Hail: INFO: Number of variants across all BGEN files: 5751712
2019-03-14 20:30:39 Hail: INFO: interval filter loaded 8 of 18000 partitions
2019-03-14 20:30:41 Hail: INFO: Coerced sorted dataset
vds_result = hl.filter_intervals(mt, target_intervals_parsed).cache()
merged = vds_result.filter_cols(hl.literal(subject_array).contains(vds_result.s))
merged_variant_qc = hl.variant_qc(merged)
merged_variant_qc_pd = merged_variant_qc.rows().to_pandas()
Where are you running Hail? You can see this warning when
hl.init() is called with the
idempotent=True argument after Hail has already been initialized with different parameters.
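As a rough illustration (plain Python, not Hail's actual implementation — the names and behavior here are simplified assumptions), idempotent initialization keeps the first configuration and merely warns about any differing parameters on later calls:

```python
# Toy sketch of idempotent initialization; not Hail's real code.
_state = {}

def init(idempotent=False, **params):
    if _state:
        if not idempotent:
            raise RuntimeError("already initialized")
        ignored = sorted(k for k, v in params.items() if _state.get(k) != v)
        if ignored:
            # Mirrors the spirit of the warning above: requested values
            # are ignored in favor of the existing ones.
            print("WARN: ignoring requested parameters:", ignored)
        return dict(_state)
    _state.update(params)
    return dict(_state)

first = init(tmpDir="dbfs:/tmp/hail.MfvZ1EpfplvG", minBlockSize=0)
second = init(idempotent=True, tmpDir="/tmp", minBlockSize=1)  # warns; keeps first values
```

This matches the log above: the `tmpDir` and `minBlockSize` values requested by the second call are reported and discarded.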
How long is this taking? I’d expect it to take a minute or so.
The filter_cols line should probably use a set rather than an array in the literal:
merged = vds_result.filter_cols(hl.literal(set(subject_array)).contains(vds_result.s))
I am running Hail on Databricks (runtime HLS 5.3 beta). The filter_cols step is fast; it’s the next step, variant_qc, that took forever to complete (5 variants, 20K subjects). Thanks.
What Hail version are you using?
I need the minor version – before we dig in at all, I want to make sure it’s recent.
Did you try my change to the literal?
Note that even if the filter_cols line is fast, that could still be the problem – Hail is lazy and executes the entire pipeline together at the end.
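That laziness can be sketched in plain Python (conceptually similar to how Hail defers work; this toy class is not Hail's implementation): transformations return immediately, and all the recorded work runs at the final action.

```python
# Toy illustration of a lazy pipeline: filter() only records work;
# collect() is the action that actually executes everything.
class LazyTable:
    def __init__(self, rows, ops=()):
        self._rows = rows
        self._ops = ops

    def filter(self, pred):
        # Returns instantly: just records the operation.
        return LazyTable(self._rows, self._ops + (pred,))

    def collect(self):
        # The entire pipeline runs here, at the action.
        out = list(self._rows)
        for pred in self._ops:
            out = [r for r in out if pred(r)]
        return out

t = LazyTable(range(10)).filter(lambda r: r % 2 == 0).filter(lambda r: r > 2)
# No filtering has happened yet; it all runs inside collect():
result = t.collect()  # [4, 6, 8]
```

So a timing measurement taken at the filter step says nothing – the cost only shows up at whichever step forces execution.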
The array only contains unique elements. I understand it’s the variant_qc step that triggers execution, but is it expected to be this slow (hours without completing) for 5 variants x 20K samples? I just need a feasible way to perform the variant QC and export the result. Thanks.
It needs to be a set not for uniqueness, but for performance –
.contains on an array is O(n), while .contains on a set is O(log(n)).
See the warning here:
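The effect is easy to demonstrate in plain Python (illustration only, not Hail code; note that Python's set is hash-based, so lookups are O(1) on average rather than O(log n), but the contrast with a linear scan is the same):

```python
import timeit

# Compare worst-case membership tests on a 20K-element list vs. the
# same elements stored in a set.
items = list(range(20_000))
as_list = items
as_set = set(items)

needle = 19_999  # worst case for the linear list scan
t_list = timeit.timeit(lambda: needle in as_list, number=1_000)
t_set = timeit.timeit(lambda: needle in as_set, number=1_000)

print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
set_is_faster = t_set < t_list
```

With 20K subjects, that per-sample scan is repeated for every column, which is why the literal's type matters.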
Still very slow with the set(). And if I break the steps down, step (B) is the bottleneck – is there a more efficient way to look up the variant QC results than rows().to_pandas()? Thanks.
A. merged_variant_qc = hl.variant_qc(merged)
B. merged_variant_qc_pd = merged_variant_qc.rows().to_pandas()
to_pandas uses a Spark protocol that is known to be very slow. If you try
merged_variant_qc.write('...') (and then read and to_pandas that result), that’s the best measurement.
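The measurement idea can be sketched in plain Python (an analogy, not Hail code – in Hail the steps would be MatrixTable.write followed by hl.read_matrix_table): persist the computed result once, then run the conversion against the saved copy, so the pipeline cost and the to_pandas cost can be timed separately.

```python
import os
import pickle
import tempfile

# Stand-in for the expensive pipeline (filter -> variant_qc); the field
# names here are made up for illustration.
def run_pipeline():
    return [{"variant": i, "call_rate": 0.99} for i in range(5)]

# Step 1: run the pipeline once and write the result to disk
# (the Hail analogue is MatrixTable.write).
path = os.path.join(tempfile.mkdtemp(), "qc_result.pkl")
with open(path, "wb") as f:
    pickle.dump(run_pipeline(), f)

# Step 2: read the saved result back (Hail analogue: hl.read_matrix_table)
# and convert it; any remaining slowness now belongs to the conversion
# step, not to the pipeline itself.
with open(path, "rb") as f:
    rows = pickle.load(f)
```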
Can you run again and give us the Hail log file? You don’t need to let it run to completion, just let it get to the slow part.
Where can I find the Hail log?
It should be echoed in initialization:
In : hl.init()
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 2.2.0
SparkUI available at http://10.1.0.166:4040
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.11-e8bbc49d0ae2
LOGGING: writing to /Users/tpoterba/hail/hail/hail-20190321-1134-0.2.11-e8bbc49d0ae2.log
Oh, wait – you’re using Databricks, and I think they do some weird stuff related to setup. Try:
In : hl.utils.java.Env.hc()._log
Actually, both ways work for identifying the log file path. Can I send you the log directly (it’s too large to copy-paste)? Thanks.