Py4JError: An error occurred while calling o1.pyPersistTable when running hl.hwe_normalized_pca()

Hi.

We are creating our cluster with the following command

hailctl dataproc \
    start cluster \
    --vep GRCh38 \
    --autoscaling-policy=autoscaling_policy \
    --requester-pays-allow-annotation-db \
    --packages gnomad \
    --requester-pays-allow-buckets gnomad-public-requester-pays \
    --secondary-worker-type=non-preemptible \
    --master-machine-type=n1-highmem-8 \
    --worker-machine-type=n1-highmem-8 \
    --worker-boot-disk-size=1000 \
    --preemptible-worker-boot-disk-size=1000 \
    --properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true

We are connecting to the notebook via

hailctl dataproc connect cluster notebook

Since posting this issue, we have also tried running it as a Python script instead of a notebook, using

hailctl dataproc submit cluster script.py

We do not get an error message when running it this way; instead, the last line of output before the job fails is

[Stage 188:===================================================(2586 + 1) / 2586]

Looking in the Hail log, the last line written is

2023-06-03 17:09:44 MemoryStore: INFO: Block broadcast_300 stored as values in memory (estimated size 13.2 GiB, free 8.5 GiB)

(A 13.2 GiB broadcast block with only 8.5 GiB left free makes us suspect the driver is running out of memory, but we are not certain that is what is happening.)

We are not currently doing anything with the output. Our plan is to store it in cloud storage and read it into future scripts, but right now we are just running the PCA function on a small test set to see how it runs and to get an idea of what the output looks like and how to store it.
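For reference, the kind of thing we have in mind for storing the results is roughly this (a sketch only, not something we have run yet: the bucket paths are placeholders, and it relies on scores and loadings being Hail Tables while eigenvalues comes back as a plain Python list):

# Sketch of our storage plan (placeholder paths)
scores.write('gs://OUR_BUCKET/pca_scores.ht', overwrite=True)
loadings.write('gs://OUR_BUCKET/pca_loadings.ht', overwrite=True)

# eigenvalues is a list of floats, so write it out as text
with hl.utils.hadoop_open('gs://OUR_BUCKET/pca_eigenvalues.txt', 'w') as f:
    f.write('\n'.join(str(ev) for ev in eigenvalues))

# future scripts would then read the tables back with hl.read_table:
# scores = hl.read_table('gs://OUR_BUCKET/pca_scores.ht')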

The script we are currently running looks like this:

#!/usr/bin/env python
# coding: utf-8

# PCA methods test

## Hail Initialization
import hail as hl
from hail.plot import show
from pprint import pprint

# Stop any session left over from a previous run, then re-init with our config
hl.stop()
hl.init(spark_conf={"spark.executor.extraClassPath": "/foo/bar/baz.so"},
        log='/home/hail/hail.log')
from io import StringIO
import time
from collections import Counter

import pandas as pd
# Ignore warning when renaming copied columns
pd.options.mode.chained_assignment = None
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
# Use for numerical instead of exponential log tick notation
from matplotlib.ticker import ScalarFormatter
import plotly.express as px
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go

from google.cloud import storage
from google.cloud import bigquery as bq
import yaml
# Read in the matrix table
mt_gt = hl.read_matrix_table('gs://PATH_TO_MT')
mt_gt.describe()

# Filter the MT to a single population
mt_gt = mt_gt.filter_cols(mt_gt.super_pop == 'AMR')
print(f"Count of rows and columns: {mt_gt.count()}")
# (663351127, 678)

# Keep only the GT entry field; drop all other row, column, and entry fields
mt_gt = mt_gt.select_cols()
mt_gt = mt_gt.select_rows()
mt_gt = mt_gt.select_entries(mt_gt.GT)
# hwe_normalized_pca returns (eigenvalues, scores, loadings); loadings is
# None unless compute_loadings=True, so request it explicitly
eigenvalues, scores, loadings = hl.hwe_normalized_pca(
    mt_gt.GT, k=5, compute_loadings=True
)
print("PCA Done")

scores.describe()
# eigenvalues is a plain Python list, so print it rather than calling describe()
pprint(eigenvalues)
loadings.describe()
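Once the PCA call succeeds, our rough plan for a first look at the scores is something like the following (again just a sketch; it assumes the documented output schema, where scores is a Table keyed by sample with a length-k array field called scores):

# Sketch: flatten the length-k scores array into one column per PC
scores_pd = scores.to_pandas()
for i in range(5):
    scores_pd[f'PC{i + 1}'] = scores_pd['scores'].apply(lambda a, i=i: a[i])

# Quick PC1 vs PC2 scatter with plotly (px is imported in the script above)
fig = px.scatter(scores_pd, x='PC1', y='PC2')
fig.show()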