Hi.
We are creating our cluster with the following command:
hailctl dataproc \
start cluster \
--vep GRCh38 \
--autoscaling-policy=autoscaling_policy \
--requester-pays-allow-annotation-db \
--packages gnomad \
--requester-pays-allow-buckets gnomad-public-requester-pays \
--secondary-worker-type=non-preemptible \
--master-machine-type=n1-highmem-8 \
--worker-machine-type=n1-highmem-8 \
--worker-boot-disk-size=1000 \
--preemptible-worker-boot-disk-size=1000 \
--properties=dataproc:dataproc.logging.stackdriver.enable=true,dataproc:dataproc.monitoring.stackdriver.enable=true
We are connecting to the notebook via:
hailctl dataproc connect cluster notebook
Since posting this issue, we have also tried running it as a Python script instead of a notebook, using:
hailctl dataproc submit cluster script.py
We do not get an error message when running it this way. Instead, the last line of output before it fails is:
[Stage 188:===================================================(2586 + 1) / 2586]
Looking in the Hail log, the last line written is:
2023-06-03 17:09:44 MemoryStore: INFO: Block broadcast_300 stored as values in memory (estimated size 13.2 GiB, free 8.5 GiB)
We are not doing anything with the output yet. Our plan is to store it in cloud storage so that future scripts can read it, but for now we are only running the PCA function as a small test to see how it behaves, what the output looks like, and how best to store it.
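To make the storage plan concrete, here is a minimal sketch of what we have in mind. The function name `save_pca_outputs`, the file layout under `out_prefix`, and the `open_fn` parameter are hypothetical and not part of our current script. It assumes that `scores` and `loadings` are Hail tables with a `write` method, and that `hl.hadoop_open` can open `gs://` paths on the cluster.

```python
import json


def save_pca_outputs(eigenvalues, scores, loadings, out_prefix, open_fn=None):
    """Persist the three PCA outputs under out_prefix (hypothetical layout)."""
    if open_fn is None:
        # Assumption: on the cluster, Hail's hadoop_open handles gs:// paths.
        import hail as hl
        open_fn = hl.hadoop_open

    # eigenvalues is a plain Python list of floats, so JSON is enough.
    with open_fn(f"{out_prefix}/eigenvalues.json", "w") as f:
        json.dump(list(eigenvalues), f)

    # scores is a Hail Table; write it in Hail's native table format.
    scores.write(f"{out_prefix}/scores.ht", overwrite=True)

    # loadings is None unless compute_loadings=True was passed to the PCA call.
    if loadings is not None:
        loadings.write(f"{out_prefix}/loadings.ht", overwrite=True)
```

A later script could then load the scores with `hl.read_table` on the `scores.ht` path. The `open_fn` argument only exists so the function can be tried locally with the built-in `open`.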
The script we are currently running looks like this:
#!/usr/bin/env python
# coding: utf-8
# #### PCA methods test
# In[2]:
## Hail Initialization
import hail as hl
from hail.plot import show
from pprint import pprint
hl.stop()
hl.init(spark_conf={"spark.executor.extraClassPath": "/foo/bar/baz.so"}, log='/home/hail/hail.log')
#hl.stop()
#hl.init()
# In[3]:
#| code-fold: true
#| output: false
from io import StringIO
import time
import pandas as pd
#import pandas_gbq
# Ignore warning when renaming copied columns
pd.options.mode.chained_assignment = None
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Use for numerical instead of exponential log tick notation
from matplotlib.ticker import ScalarFormatter
import plotly.express as px
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from google.cloud import storage
import yaml
from google.cloud import bigquery as bq
from collections import Counter
# In[4]:
mt_gt = hl.read_matrix_table('gs://PATH_TO_MT')
# In[5]:
mt_gt.describe()
# In[ ]:
# Filter the MT to a single population
mt_gt = mt_gt.filter_cols(mt_gt.super_pop == 'AMR')
# In[13]:
print(f"Count of rows and columns: {mt_gt.count()}")
#663351127, 678
# In[21]:
# Drop all column and row fields, and keep only the GT entry field
mt_gt = mt_gt.select_cols()
mt_gt = mt_gt.select_rows()
mt_gt = mt_gt.select_entries(mt_gt.GT)
# In[23]:
eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt_gt.GT, k=5)
print("PCA Done")
# In[ ]:
scores.describe()
# In[ ]:
print(eigenvalues)  # eigenvalues is a plain Python list of floats, which has no .describe()
# In[ ]:
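# loadings is None unless compute_loadings=True is passed to hwe_normalized_pca
if loadings is not None:
    loadings.describe()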
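For a sense of scale, here is a back-of-the-envelope calculation using the counts printed by the script. The 8 bytes per entry is an assumption (a dense float64 matrix), not a statement about how Hail stores or processes the data internally, so the result is only a rough sanity check on how large this test actually is.

```python
# Counts copied from the mt_gt.count() output in the script above.
n_variants = 663_351_127
n_samples = 678

n_entries = n_variants * n_samples

# Assumption: 8 bytes per entry, as for a dense float64 matrix.
# Hail does not necessarily materialize the data this way.
dense_bytes = n_entries * 8
dense_tib = dense_bytes / 2**40

print(f"entries: {n_entries:,}")
print(f"dense float64 size: {dense_tib:.2f} TiB")
```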