A problem with KeyTable.from_pandas in Hail v0.1

Hi everyone,

I'm new to Hail. I have a problem annotating a VDS using a KeyTable generated from a pandas DataFrame. The code looks like this:

import numpy as np
import pandas as pd

# Load gene expression data
residuals = pd.read_table('eQTL/data/residuals_5.txt', index_col=0)
residuals = residuals.T

# Create a permutation matrix: n permuted copies of the current gene's
# expression values (gene and n are set earlier in my loop)
permutations = pd.concat([residuals[gene]] * n, axis=1)
permutations = permutations.apply(np.random.permutation)
permutations = pd.concat([residuals[gene], permutations], axis=1)
permutations.columns = ['y'] + ['p{}'.format(i + 1) for i in range(n)]
permutations.reset_index(inplace=True)

# Create a KeyTable from the pandas DataFrame
kt = KeyTable.from_pandas(permutations).key_by('index')
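For what it's worth, this step itself completes without error; for example, inspecting the schema works fine (and, as far as I can tell, no Spark job has run yet at this point):

print(kt.schema)  # succeeds and prints the inferred schema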

# Annotate the VDS samples with the KeyTable
# (vds2_cis is a VariantDataset loaded earlier)
vds2_cis = vds2_cis.annotate_samples_table(kt, root='sa.pheno')

Then I got the following error message:

Name: org.apache.toree.interpreter.broker.BrokerException
Message: Traceback (most recent call last):
  File "/tmp/kernel-PySpark-34094fbd-6f73-467d-b19e-a06c898f987f/pyspark_runner.py", line 194, in <module>
    eval(compiled_code)
  File "<string>", line 1, in <module>
  File "", line 2, in annotate_samples_table
  File "/mnt/tmp/spark-aba52f07-bfd3-4ce2-9096-34956bfd6316/userFiles-83cc5db7-9eb6-47a3-a292-96e96b5c6b3c/hail-python.zip/hail/java.py", line 121, in handle_py4j
    'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
FatalError: SparkException:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /mnt/yarn/usercache/hadoop/filecache/260/__spark_libs__8515568189444678508.zip/spark-core_2.11-2.1.0.jar
java.io.EOFException

As a workaround, I tried saving the pandas DataFrame to a file and then loading it back with hc.import_table. This works fine:

# Save the pandas DataFrame to a file
permutations.to_csv('permutations.txt', sep='\t', index=False)

# Load this file as a KeyTable
kt2 = hc.import_table('permutations.txt', impute=True, missing='').key_by('index')

# Annotate the VDS samples with the KeyTable
vds2_cis = vds2_cis.annotate_samples_table(kt2, root='sa.pheno')

This doesn't make sense to me, since kt and kt2 look almost identical. Going through a file every time adds a huge I/O overhead, because I'm looping over roughly 20k genes.
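To make the overhead concrete, the loop looks roughly like this (a sketch: genes and build_permutations are stand-ins for my actual gene list and the permutation code above):

# Sketch of the per-gene loop; build_permutations stands in for the pandas code above
for gene in genes:  # ~20k iterations
    permutations = build_permutations(residuals, gene, n)
    permutations.to_csv('permutations.txt', sep='\t', index=False)  # extra write per gene...
    kt2 = hc.import_table('permutations.txt', impute=True, missing='').key_by('index')  # ...and read per gene
    vds2_cis = vds2_cis.annotate_samples_table(kt2, root='sa.pheno')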

I'd highly appreciate it if anyone could help me troubleshoot the error above, which I get by running KeyTable.from_pandas followed by VariantDataset.annotate_samples_table.

Thank you very much!
Wei

This is a Spark setup problem. Your Spark worker nodes don't have PySpark properly installed, and KeyTable.from_pandas just runs the pandas-to-Spark DataFrame conversion inside Spark.
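Under the hood it's roughly equivalent to the following plain PySpark (a sketch of the mechanism, not Hail's actual code; sqlContext stands in for the SQLContext Hail holds):

# Sketch: why from_pandas eventually needs Python workers on the executors
spark_df = sqlContext.createDataFrame(permutations)  # pandas rows become a pickled Python RDD; still lazy
spark_df.count()  # an action: executors launch Python workers, which must be able to import pyspark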

Hi @tpoterba, thanks for the quick reply! I'm curious why KeyTable.from_pandas itself doesn't trigger the problem, but vds.annotate_samples_table does. And if the problem is in vds.annotate_samples_table, why does the KeyTable imported from a file work?

Anyway, can you point me to an online tutorial on how to install PySpark properly? Thank you!

Spark (and Hail) compute lazily: Spark jobs won't run until they need to produce a result. This means that errors don't always surface at the line of Python you expect them to, because multiple steps are fused together. The problem here was definitely the from_pandas; any Spark "action" on this table would have crashed.
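You can see the same behavior in plain PySpark (a minimal sketch, assuming a SparkContext named sc):

rdd = sc.parallelize([1, 2, 3]).map(lambda x: x / 0)  # just builds the plan; no error yet
rdd.collect()  # the action: tasks actually run, and only now does the ZeroDivisionError surface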

@tpoterba That makes a lot of sense. Thanks for your explanation!