Yep, thank you! I am running some other analyses right now but I’ll post a log tomorrow if I still have issues upon rerunning.
@tpoterba it looks like upload_log was removed from the codebase
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-93dcd1342068> in <module>
----> 1 hl.upload_log(bucket+"/pca_run.log")

AttributeError: module 'hail' has no attribute 'upload_log'
```
I found this issue that was closed in April of this year: https://github.com/hail-is/hail/issues/7392
Notably, since switching to 0.2.49 I was able to run hl.ld_prune() successfully (runtime approx. 4 hrs), which brought my variant row count down from 580k to 340k with r2=0.2. Previously this function produced a SparkException, so we are getting somewhere!
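For anyone following along, the pruning step was along these lines (a sketch; the `mt` / `GT` names are assumptions, and r2=0.2 is the only parameter taken from my actual run):

```python
# keep a Table of variants that survive LD pruning at r^2 = 0.2
pruned = hl.ld_prune(mt.GT, r2=0.2)

# subset the MatrixTable to the pruned variants (580k -> ~340k rows here)
mt = mt.filter_rows(hl.is_defined(pruned[mt.row_key]))
```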
Even with another overnight run, the hl.hwe_normalized_pca() job still looks stuck.
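The call in question is roughly this (a sketch; k=10 is a placeholder, not my exact setting):

```python
# PCA on HWE-normalized genotype calls: returns eigenvalues,
# a Table of sample scores, and (optionally) variant loadings
eigenvalues, scores, _ = hl.hwe_normalized_pca(mt.GT, k=10, compute_loadings=False)
```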
Happy to generate a log file for this if there is a way.
Is the log from the hl.init() statement the same log? I can access it from the terminal of the Terra virtual machine, and it looks quite large now.
Oops, sorry, got confused – hl.copy_log is what I meant to point you to. It’s a convenience wrapper for copying the log from the local disk of the driver machine to a Google bucket / remote filesystem.
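Something like this, with your own bucket path:

```python
# copies the current session's Hail log from the driver's local disk
hl.copy_log('gs://my-bucket/pca_run.log')  # destination path is just an example
```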
Yeah, the log on that machine is what we want – I do expect it to be somewhat large.
Should I post in here?
Probably won’t fit. Can you email it as an attachment or Drive link to hail-team@broadinstitute.org?
For anyone else following, I wanted to post an update on this. I was finally able to run the PCA job after some help from Tim.
We found that my MatrixTable, assembled via import_plink(), was not partitioned very efficiently (you can check this via mt.n_partitions()).
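For example (using the same `gen` MatrixTable as in the snippet below):

```python
# a small count of very large partitions can mean poor parallelism
print(gen.n_partitions())
```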
We were able both to repartition the data and to save/checkpoint it as a MatrixTable for added efficiency:

```python
gen = gen.repartition(250).checkpoint('gs://some/path.mt', overwrite=True)
```
I did have some trouble with the shuffling step of the repartition due to using preemptible nodes, so Tim provided me with the following code for a “no shuffle” repartition:
```python
def no_shuffle_repartition(mt, path1, path2, n_parts):
    # checkpoint once, then read back with explicit partition intervals,
    # which re-splits the data at read time instead of triggering a Spark shuffle
    mt = mt.checkpoint(path1)
    return hl.read_matrix_table(path1, _intervals=mt._calculate_new_partitions(n_parts)).checkpoint(path2)
```
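As I understand it, reading back with `_intervals` lets Hail pick the new partition boundaries at read time, so there’s no shuffle for a preemptible node to kill mid-flight. Usage looks something like this (the paths and partition count are placeholders):

```python
gen = no_shuffle_repartition(
    gen,
    'gs://some/checkpoint1.mt',  # intermediate write (placeholder path)
    'gs://some/checkpoint2.mt',  # final repartitioned copy (placeholder path)
    250,
)
```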