Yep, thank you! I am running some other analyses right now but I’ll post a log tomorrow if I still have issues upon rerunning.
@tpoterba it looks like upload_log was removed from the codebase
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-93dcd1342068> in <module>
----> 1 hl.upload_log(bucket+"/pca_run.log")

AttributeError: module 'hail' has no attribute 'upload_log'
```
I found this issue that was closed in April of this year: https://github.com/hail-is/hail/issues/7392
Notably, since switching to 0.2.49 I was able to run hl.ld_prune() successfully (runtime approx. 4 hrs), which brought my variant row count down from 580k to 340k with r2=0.2. Previously this function produced a SparkException, so we are getting somewhere!
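For anyone following along, the pruning step was along these lines (a sketch; the `mt` / `GT` names are assumptions, and r2=0.2 is the only parameter taken from my actual run):

```python
# keep a Table of variants that survive LD pruning at r^2 = 0.2
pruned = hl.ld_prune(mt.GT, r2=0.2)

# subset the MatrixTable to the pruned variants (580k -> ~340k rows here)
mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))
```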
Even with another overnight run, the hl.hwe_normalized_pca() job still looks stuck.
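The call in question is roughly this (a sketch; k=10 is a placeholder, not my exact setting):

```python
# PCA on HWE-normalized genotype calls: returns eigenvalues,
# a Table of sample scores, and (optionally) variant loadings
eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=10, compute_loadings=False)
```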
Happy to generate a log file for this if there is a way.
Is the log from the hl.init() statement the same log? I can access it from the terminal of the Terra virtual machine, and it looks quite large now.
Oops, sorry, got confused – hl.copy_log is what I meant to point you to. It’s a convenience wrapper for copying the log from the local disk of the driver machine to a Google bucket / remote filesystem.
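Something like this, with your own bucket path:

```python
# copies the current session's Hail log from the driver's local disk
hl.copy_log('gs://my-bucket/pca_run.log')  # destination path is just an example
```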
Yeah, the log on that machine is what we want – I do expect it to be somewhat large.
Should I post in here?
Probably won’t fit. Can you email it as an attachment or Drive link to hail-team@broadinstitute.org?
For anyone else following, I wanted to post an update on this. I was finally able to run the PCA job after some help from Tim.
We found that my MatrixTable, assembled via import_plink(), was not partitioned very efficiently (you can check this via mt.n_partitions()).
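For example (using the same `gen` MatrixTable as in the snippet below):

```python
# a small count of very large partitions can mean poor parallelism
print(gen.n_partitions())
```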
We were able both to repartition the data and to save/checkpoint it as a MatrixTable for added efficiency:

```python
gen = gen.repartition(250).checkpoint('gs://some/path.mt', overwrite=True)
```
I did have some trouble with the shuffling step of the repartition due to using preemptible nodes, so Tim provided me with the following code for a “no shuffle” repartition:
```python
def no_shuffle_repartition(mt, path1, path2, n_parts):
    # checkpoint once, then read back with explicit partition intervals,
    # which re-splits the data at read time instead of triggering a Spark shuffle
    mt = mt.checkpoint(path1)
    return hl.read_matrix_table(path1, _intervals=mt._calculate_new_partitions(n_parts)).checkpoint(path2)
```
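As I understand it, reading back with `_intervals` lets Hail pick the new partition boundaries at read time, so there’s no shuffle for a preemptible node to kill mid-flight. Usage looks something like this (the paths and partition count are placeholders):

```python
gen = no_shuffle_repartition(
    gen,
    'gs://some/checkpoint1.mt',  # intermediate write (placeholder path)
    'gs://some/checkpoint2.mt',  # final repartitioned copy (placeholder path)
    250,
)
```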