Hi everyone,
I'm new to Hail. I have a problem annotating a vds using a KeyTable generated from a pandas dataframe. The code looks like this:
```python
# Load gene expression data
residuals = pd.read_table('eQTL/data/residuals_5.txt', index_col=0)
residuals = residuals.T

# Create a permutation matrix
permutations = pd.concat([residuals[gene]] * n, axis=1)
permutations = permutations.apply(np.random.permutation)
permutations = pd.concat([residuals[gene], permutations], axis=1)
permutations.columns = ['y'] + ['p{}'.format(i + 1) for i in range(n)]
permutations.reset_index(inplace=True)

# Create a KeyTable from the pandas dataframe
kt = KeyTable.from_pandas(permutations).key_by('index')

# Annotate the vds with the KeyTable
vds2_cis = vds2_cis.annotate_samples_table(kt, root='sa.pheno')
```
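For reference, here is a minimal, self-contained sketch of just the permutation-matrix step, using a toy residuals dataframe (the gene name `GENE1`, `n = 3`, and the sample labels are made up for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(0)

# Toy stand-ins for the real data: 5 samples, one gene, n = 3 permutations
gene = 'GENE1'
n = 3
residuals = pd.DataFrame({gene: np.arange(5.0)},
                         index=['s{}'.format(i) for i in range(5)])

# Replicate the gene's residual column n times, then permute each copy independently
permutations = pd.concat([residuals[gene]] * n, axis=1)
permutations = permutations.apply(np.random.permutation)

# Prepend the unpermuted values and name the columns y, p1, ..., pn
permutations = pd.concat([residuals[gene], permutations], axis=1)
permutations.columns = ['y'] + ['p{}'.format(i + 1) for i in range(n)]

# Turn the sample index into a regular column so it can serve as the key
permutations.reset_index(inplace=True)
```

The resulting dataframe has one row per sample, with an `index` column, the observed values in `y`, and each `p*` column holding the same values in a random order.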
Then I got the following error message:
```
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Traceback (most recent call last):
  File "/tmp/kernel-PySpark-34094fbd-6f73-467d-b19e-a06c898f987f/pyspark_runner.py", line 194, in
    eval(compiled_code)
  File "", line 1, in
  File "", line 2, in annotate_samples_table
  File "/mnt/tmp/spark-aba52f07-bfd3-4ce2-9096-34956bfd6316/userFiles-83cc5db7-9eb6-47a3-a292-96e96b5c6b3c/hail-python.zip/hail/java.py", line 121, in handle_py4j
    'Error summary: %s' % (deepest, full, Env.hc().version, deepest))
FatalError: SparkException:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /mnt/yarn/usercache/hadoop/filecache/260/__spark_libs__8515568189444678508.zip/spark-core_2.11-2.1.0.jar
java.io.EOFException
```
Alternatively, I tried saving the pandas dataframe to a file and then loading it with `hc.import_table`. This works fine:
```python
# Save the pandas dataframe to a file
permutations.to_csv('permutations.txt', sep='\t', index=False)

# Load the file as a KeyTable
kt2 = hc.import_table('permutations.txt', impute=True, missing='').key_by('index')

# Annotate the vds with the KeyTable
vds2_cis = vds2_cis.annotate_samples_table(kt2, root='sa.pheno')
```
This doesn't make sense to me, since `kt` and `kt2` look almost identical. Also, writing and reloading the file every time adds a huge I/O overhead, since I'm looping over 20k genes.
I'd greatly appreciate any help troubleshooting the error above, which I get by running `KeyTable.from_pandas` followed by `VariantDataset.annotate_samples_table`.
Thank you very much!
Wei