Error with Table.from_pandas

Hi
I’m using hail 0.2.83 and am having a problem with creating a table from a pandas df.

I’m attempting to run gnomad’s assign_population_pcs (gnomad.sample_qc.ancestry — gnomad master documentation).

The error I get is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [68], in <module>
      1 from gnomad.sample_qc.ancestry import assign_population_pcs
----> 2 pop_ht, pop_clf = assign_population_pcs(pca_scores, pca_scores.scores, known_col="cohort", n_estimators=100, prop_train=0.8, min_prob=0.5)

File ~/venv/lib/python3.8/site-packages/gnomad/sample_qc/ancestry.py:232, in assign_population_pcs(pop_pca_scores, pc_cols, known_col, fit, seed, prop_train, n_estimators, min_prob, output_col, missing_label)
    224 logger.info(
    225     "Found the following sample count after population assignment: %s",
    226     ", ".join(
    227         f"{pop}: {count}" for pop, count in Counter(pop_pc_pd[output_col]).items()
    228     ),
    229 )
    231 if hail_input:
--> 232     pops_ht = hl.Table.from_pandas(pop_pc_pd, key=list(pop_pca_scores.key))
    233     pops_ht.annotate_globals(
    234         assign_pops_from_pc_params=hl.struct(min_assignment_prob=min_prob)
    235     )
    236     return pops_ht, pop_clf

File <decorator-gen-1085>:2, in from_pandas(df, key)

File ~/venv/lib/python3.8/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File ~/venv/lib/python3.8/site-packages/hail/table.py:3293, in Table.from_pandas(df, key)
   3271 @staticmethod
   3272 @typecheck(df=pandas.DataFrame,
   3273            key=oneof(str, sequenceof(str)))
   3274 def from_pandas(df, key=[]) -> 'Table':
   3275     """Create table from Pandas DataFrame
   3276 
   3277     Examples
   (...)
   3291     :class:`.Table`
   3292     """
-> 3293     return Env.spark_backend('from_pandas').from_pandas(df, key)

File ~/venv/lib/python3.8/site-packages/hail/backend/spark_backend.py:325, in SparkBackend.from_pandas(self, df, key)
    324 def from_pandas(self, df, key):
--> 325     return Table.from_spark(Env.spark_session().createDataFrame(df), key)

File ~/venv/lib/python3.8/site-packages/pyspark/sql/session.py:673, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
    670     has_pandas = False
    671 if has_pandas and isinstance(data, pandas.DataFrame):
    672     # Create a DataFrame from pandas DataFrame.
--> 673     return super(SparkSession, self).createDataFrame(
    674         data, schema, samplingRatio, verifySchema)
    675 return self._create_dataframe(data, schema, samplingRatio, verifySchema)

File ~/venv/lib/python3.8/site-packages/pyspark/sql/pandas/conversion.py:300, in SparkConversionMixin.createDataFrame(self, data, schema, samplingRatio, verifySchema)
    298             raise
    299 data = self._convert_from_pandas(data, schema, timezone)
--> 300 return self._create_dataframe(data, schema, samplingRatio, verifySchema)

File ~/venv/lib/python3.8/site-packages/pyspark/sql/session.py:700, in SparkSession._create_dataframe(self, data, schema, samplingRatio, verifySchema)
    698     rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    699 else:
--> 700     rdd, schema = self._createFromLocal(map(prepare, data), schema)
    701 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    702 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

File ~/venv/lib/python3.8/site-packages/pyspark/sql/session.py:512, in SparkSession._createFromLocal(self, data, schema)
    509     data = list(data)
    511 if schema is None or isinstance(schema, (list, tuple)):
--> 512     struct = self._inferSchemaFromList(data, names=schema)
    513     converter = _create_converter(struct)
    514     data = map(converter, data)

File ~/venv/lib/python3.8/site-packages/pyspark/sql/session.py:439, in SparkSession._inferSchemaFromList(self, data, names)
    437 if not data:
    438     raise ValueError("can not infer schema from empty dataset")
--> 439 schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
    440 if _has_nulltype(schema):
    441     raise ValueError("Some of types cannot be determined after inferring")

File ~/venv/lib/python3.8/site-packages/pyspark/sql/types.py:1107, in _merge_type(a, b, name)
   1105 if isinstance(a, StructType):
   1106     nfs = dict((f.name, f.dataType) for f in b.fields)
-> 1107     fields = [StructField(f.name, _merge_type(f.dataType, nfs.get(f.name, NullType()),
   1108                                               name=new_name(f.name)))
   1109               for f in a.fields]
   1110     names = set([f.name for f in fields])
   1111     for n in nfs:

File ~/venv/lib/python3.8/site-packages/pyspark/sql/types.py:1107, in <listcomp>(.0)
   1105 if isinstance(a, StructType):
   1106     nfs = dict((f.name, f.dataType) for f in b.fields)
-> 1107     fields = [StructField(f.name, _merge_type(f.dataType, nfs.get(f.name, NullType()),
   1108                                               name=new_name(f.name)))
   1109               for f in a.fields]
   1110     names = set([f.name for f in fields])
   1111     for n in nfs:

File ~/venv/lib/python3.8/site-packages/pyspark/sql/types.py:1102, in _merge_type(a, b, name)
   1099     return a
   1100 elif type(a) is not type(b):
   1101     # TODO: type cast (such as int -> long)
-> 1102     raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))
   1104 # same type
   1105 if isinstance(a, StructType):

TypeError: field cohort: Can not merge type <class 'pyspark.sql.types.StructType'> and <class 'pyspark.sql.types.StringType'>

I’ve run through the function from its source code and found that the failing step is very near the end of the function, where a pandas dataframe is converted to a hail table:
pops_ht = hl.Table.from_pandas(pop_pc_pd, key=list(pop_pca_scores.key))

The error is the same as that given above:

TypeError: field known_pop: Can not merge type <class 'pyspark.sql.types.StructType'> and <class 'pyspark.sql.types.StringType'>

The datatypes of pop_pc_pd are:

s              string
known_pop      string
pca_scores     object
pop            object
prob_AFR      float64
prob_AMR      float64
prob_EAS      float64
prob_EUR      float64
prob_SAS      float64
dtype: object

list(pop_pca_scores.key) = [‘s’]
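
I can reproduce it with a toy frame (a sketch only; I suspect the pandas nullable string dtype with a missing entry is what trips up Spark’s schema inference, though I’m not certain):

import pandas as pd
import hail as hl

# Toy frame mimicking pop_pc_pd: a key column 's' plus a nullable 'string'
# column with one missing entry.
df = pd.DataFrame({
    "s": ["sample1", "sample2"],
    "known_pop": pd.array(["EUR", pd.NA], dtype="string"),
})
ht = hl.Table.from_pandas(df, key=["s"])  # raises the same TypeError on 0.2.83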

Can you suggest a fix for this?
It worked with a previous version (0.2.62 running on python 3.6) but I had to upgrade due to the recent log4j vulnerabilities.

Thank you!

Hey, sorry you ran into this.

Can you try the latest version of hail (0.2.85)? We actually rewrote from_pandas to not depend on pyspark for the 0.2.84 release, so hopefully it resolves your issue.

Hi John

Thank you for your help
I’ve just tried this with 0.2.85, but get a different error this time. The table contains NAs in the column of known populations, and it looks as though from_pandas is unable to cope with NAs:

ExpressionException                       Traceback (most recent call last)
Input In [5], in <module>
      1 from gnomad.sample_qc.ancestry import assign_population_pcs
----> 2 pop_ht, pop_clf = assign_population_pcs(pca_scores, pca_scores.scores, known_col="cohort", n_estimators=100, prop_train=0.8, min_prob=0.5)

File ~/venv/lib/python3.8/site-packages/gnomad/sample_qc/ancestry.py:232, in assign_population_pcs(pop_pca_scores, pc_cols, known_col, fit, seed, prop_train, n_estimators, min_prob, output_col, missing_label)
    224 logger.info(
    225     "Found the following sample count after population assignment: %s",
    226     ", ".join(
    227         f"{pop}: {count}" for pop, count in Counter(pop_pc_pd[output_col]).items()
    228     ),
    229 )
    231 if hail_input:
--> 232     pops_ht = hl.Table.from_pandas(pop_pc_pd, key=list(pop_pca_scores.key))
    233     pops_ht.annotate_globals(
    234         assign_pops_from_pc_params=hl.struct(min_assignment_prob=min_prob)
    235     )
    236     return pops_ht, pop_clf

File <decorator-gen-1085>:2, in from_pandas(df, key)

File ~/venv/lib/python3.8/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File ~/venv/lib/python3.8/site-packages/hail/table.py:3362, in Table.from_pandas(df, key)
   3359     if type_hint is not None:
   3360         hl_type_hints[field] = type_hint
-> 3362 new_table = hl.Table.parallelize(data, partial_type=hl_type_hints)
   3363 return new_table if not key else new_table.key_by(*key)

File <decorator-gen-1007>:2, in parallelize(cls, rows, schema, key, n_partitions, partial_type)

File ~/venv/lib/python3.8/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File ~/venv/lib/python3.8/site-packages/hail/table.py:535, in Table.parallelize(cls, rows, schema, key, n_partitions, partial_type)
    533 if partial_type is not None:
    534     partial_type = hl.tarray(hl.tstruct(**partial_type))
--> 535 rows = to_expr(rows, dtype=dtype, partial_type=partial_type)
    536 if not isinstance(rows.dtype.element_type, tstruct):
    537     raise TypeError("'parallelize' expects an array with element type 'struct', found '{}'"
    538                     .format(rows.dtype))

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:274, in to_expr(e, dtype, partial_type)
    272         raise TypeError("expected expression of type '{}', found expression of type '{}'".format(dtype, e.dtype))
    273     return e
--> 274 return cast_expr(e, dtype, partial_type)

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:280, in cast_expr(e, dtype, partial_type)
    278 assert dtype is None or partial_type is None
    279 if not dtype:
--> 280     dtype = impute_type(e, partial_type)
    281 x = _to_expr(e, dtype)
    282 if isinstance(x, Expression):

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:128, in impute_type(x, partial_type)
    127 def impute_type(x, partial_type=None):
--> 128     t = _impute_type(x, partial_type=partial_type)
    129     raise_for_holes(t)
    130     return t

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:178, in _impute_type(x, partial_type)
    176 if len(x) == 0:
    177     return partial_type
--> 178 ts = {_impute_type(element, partial_type.element_type) for element in x}
    179 unified_type = super_unify_types(*ts)
    180 if unified_type is None:

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:178, in <setcomp>(.0)
    176 if len(x) == 0:
    177     return partial_type
--> 178 ts = {_impute_type(element, partial_type.element_type) for element in x}
    179 unified_type = super_unify_types(*ts)
    180 if unified_type is None:

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:168, in _impute_type(x, partial_type)
    166 elif isinstance(x, Struct) or isinstance(x, dict) and isinstance(partial_type, tstruct):
    167     partial_type = refine(partial_type, hl.tstruct())
--> 168     t = tstruct(**{k: _impute_type(x[k], partial_type.get(k)) for k in x})
    169     return t
    170 elif isinstance(x, tuple):

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:168, in <dictcomp>(.0)
    166 elif isinstance(x, Struct) or isinstance(x, dict) and isinstance(partial_type, tstruct):
    167     partial_type = refine(partial_type, hl.tstruct())
--> 168     t = tstruct(**{k: _impute_type(x[k], partial_type.get(k)) for k in x})
    169     return t
    170 elif isinstance(x, tuple):

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:226, in _impute_type(x, partial_type)
    223     raise ExpressionException("'switch' and 'case' expressions must end with a call to either"
    224                               "'default' or 'or_missing'")
    225 else:
--> 226     raise ExpressionException("Hail cannot automatically impute type of {}: {}".format(type(x), x))

ExpressionException: Hail cannot automatically impute type of <class 'pandas._libs.missing.NAType'>: <NA>

Agh, that’s a bug. I didn’t realize pandas had added nullable integers, which is what I’m assuming this is.

We will fix this shortly. A temporary fix for you: if that column were changed to a column of floats with NaNs for the unknowns, I think we’d import that correctly, and then in hail you can convert the NaNs to missing values.
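
For a numeric column, that workaround might look something like this (a sketch only, using a hypothetical DataFrame df with a nullable column some_col keyed by s; not tested against your data):

import hail as hl

# Cast the nullable column to float64 so pd.NA becomes NaN, then import and
# turn the NaNs back into Hail missing values.
df["some_col"] = df["some_col"].astype("float64")
ht = hl.Table.from_pandas(df, key=["s"])
ht = ht.annotate(some_col=hl.or_missing(~hl.is_nan(ht.some_col), ht.some_col))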

I recently got the same error trying to use “assign_population_pcs” and am using a temporary fix for now, but do you have an estimate of how soon the fix will be ready?

Version 0.2.86 will fix this. It should come this week; I have a seemingly working branch currently.

OK, thanks!

Fantastic news, thank you!

We released hail 0.2.86, which will hopefully resolve all your pandas issues. Please let me know if it does, and I’m happy to work with you on a fix if it does not.

Hi John
Thank you for releasing v 0.2.86
I have tried this and unfortunately still get an error. The wording of the error has changed slightly:

INFO (gnomad.sample_qc.ancestry 224): Found the following sample count after population assignment: EUR: 15009, AFR: 864, SAS: 576, oth: 149, AMR: 356, EAS: 523
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/venv/lib/python3.8/site-packages/hail/expr/functions.py:252, in literal(x, dtype)
    251 try:
--> 252     dtype._traverse(x, typecheck_expr)
    253 except TypeError as e:

File ~/venv/lib/python3.8/site-packages/hail/expr/types.py:801, in tarray._traverse(self, obj, f)
    800 for elt in obj:
--> 801     self.element_type._traverse(elt, f)

File ~/venv/lib/python3.8/site-packages/hail/expr/types.py:1196, in tstruct._traverse(self, obj, f)
   1195 t = self[k]
-> 1196 t._traverse(v, f)

File ~/venv/lib/python3.8/site-packages/hail/expr/types.py:285, in HailType._traverse(self, obj, f)
    276 """Traverse a nested type and object.
    277 
    278 Parameters
   (...)
    283     the function returns ``True``.
    284 """
--> 285 f(self, obj)

File ~/venv/lib/python3.8/site-packages/hail/expr/functions.py:240, in literal.<locals>.typecheck_expr(t, x)
    239 else:
--> 240     t._typecheck_one_level(x)
    241     return True

File ~/venv/lib/python3.8/site-packages/hail/expr/types.py:561, in _tstr._typecheck_one_level(self, annotation)
    560 def _typecheck_one_level(self, annotation):
--> 561     if annotation and not isinstance(annotation, str):
    562         raise TypeError("type 'str' expected Python 'str', but found type '%s'" % type(annotation))

File ~/venv/lib/python3.8/site-packages/pandas/_libs/missing.pyx:446, in pandas._libs.missing.NAType.__bool__()

TypeError: boolean value of NA is ambiguous

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Input In [3], in <cell line: 2>()
      1 from gnomad.sample_qc.ancestry import assign_population_pcs
----> 2 pop_ht, pop_clf = assign_population_pcs(pca_scores, pca_scores.scores, known_col="known_pop", n_estimators=100, prop_train=0.8, min_prob=0.5)

File ~/venv/lib/python3.8/site-packages/gnomad/sample_qc/ancestry.py:232, in assign_population_pcs(pop_pca_scores, pc_cols, known_col, fit, seed, prop_train, n_estimators, min_prob, output_col, missing_label)
    224 logger.info(
    225     "Found the following sample count after population assignment: %s",
    226     ", ".join(
    227         f"{pop}: {count}" for pop, count in Counter(pop_pc_pd[output_col]).items()
    228     ),
    229 )
    231 if hail_input:
--> 232     pops_ht = hl.Table.from_pandas(pop_pc_pd, key=list(pop_pca_scores.key))
    233     pops_ht.annotate_globals(
    234         assign_pops_from_pc_params=hl.struct(min_assignment_prob=min_prob)
    235     )
    236     return pops_ht, pop_clf

File <decorator-gen-1085>:2, in from_pandas(df, key)

File ~/venv/lib/python3.8/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File ~/venv/lib/python3.8/site-packages/hail/table.py:3377, in Table.from_pandas(df, key)
   3374     if type_hint is not None:
   3375         hl_type_hints[field] = type_hint
-> 3377 new_table = hl.Table.parallelize(data, partial_type=hl_type_hints)
   3378 return new_table if not key else new_table.key_by(*key)

File <decorator-gen-1007>:2, in parallelize(cls, rows, schema, key, n_partitions, partial_type)

File ~/venv/lib/python3.8/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File ~/venv/lib/python3.8/site-packages/hail/table.py:536, in Table.parallelize(cls, rows, schema, key, n_partitions, partial_type)
    534 if partial_type is not None:
    535     partial_type = hl.tarray(hl.tstruct(**partial_type))
--> 536 rows = to_expr(rows, dtype=dtype, partial_type=partial_type)
    537 if not isinstance(rows.dtype.element_type, tstruct):
    538     raise TypeError("'parallelize' expects an array with element type 'struct', found '{}'"
    539                     .format(rows.dtype))

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:275, in to_expr(e, dtype, partial_type)
    273         raise TypeError("expected expression of type '{}', found expression of type '{}'".format(dtype, e.dtype))
    274     return e
--> 275 return cast_expr(e, dtype, partial_type)

File ~/venv/lib/python3.8/site-packages/hail/expr/expressions/base_expression.py:286, in cast_expr(e, dtype, partial_type)
    284     return x
    285 else:
--> 286     return hl.literal(x, dtype)

File <decorator-gen-673>:2, in literal(x, dtype)

File ~/venv/lib/python3.8/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File ~/venv/lib/python3.8/site-packages/hail/expr/functions.py:254, in literal(x, dtype)
    252         dtype._traverse(x, typecheck_expr)
    253     except TypeError as e:
--> 254         raise TypeError("'literal': object did not match the passed type '{}'"
    255                         .format(dtype)) from e
    257 if wrapper['has_expr']:
    258     return literal(hl.eval(to_expr(x, dtype)), dtype)

TypeError: 'literal': object did not match the passed type 'array<struct{s: str, known_pop: str, pca_scores: array<float64>, pop: str, prob_AFR: float64, prob_AMR: float64, prob_EAS: float64, prob_EUR: float64, prob_SAS: float64}>'

I’ve dug into the gnomad sample_qc.ancestry source code (gnomad.sample_qc.ancestry — gnomad master documentation).

The pandas dataframe that this code attempts to convert to a hail table using pops_ht = hl.Table.from_pandas(pop_pc_pd, key=list(pop_pca_scores.key)) looks like this (sample names redacted):

pop_pc_pd.head()

s	known_pop	pca_scores	pop	prob_AFR	prob_AMR	prob_EAS	prob_EUR	prob_SAS
0	EGAN0000xxxxxxx	<NA>	[0.047423655871611695, 0.020419155961920857, 0...	EUR	0.07	0.06	0.0	0.84	0.03
1	EGAN0000xxxxxxx	<NA>	[0.04086149400944195, 0.012434095833431868, -0...	EUR	0.09	0.04	0.0	0.84	0.03
2	EGAN0000xxxxxxx	<NA>	[-0.4068123837193344, 0.23227499201591015, 0.0...	AFR	1.00	0.00	0.0	0.00	0.00
3	EGAN0000xxxxxxx	<NA>	[0.04466743333831239, 0.0039506319681347326, 0...	EUR	0.06	0.06	0.0	0.85	0.03
4	EGAN0000xxxxxxx	<NA>	[0.036603157904434365, -0.003719805306811391, ...	EUR	0.03	0.05	0.0	0.88	0.04
pop_pc_pd.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17477 entries, 0 to 17476
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   s           17477 non-null  string 
 1   known_pop   2504 non-null   string 
 2   pca_scores  17477 non-null  object 
 3   pop         17477 non-null  object 
 4   prob_AFR    17477 non-null  float64
 5   prob_AMR    17477 non-null  float64
 6   prob_EAS    17477 non-null  float64
 7   prob_EUR    17477 non-null  float64
 8   prob_SAS    17477 non-null  float64
dtypes: float64(5), object(2), string(2)
memory usage: 1.2+ MB

The <NA> values occur in the known_pop column, which has the pandas nullable string dtype. I hope this helps.
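
For reference, stripping the pandas NA sentinel out before the conversion might sidestep this (a sketch only, not a confirmed fix):

# Downgrade the nullable 'string' column to plain Python objects, with None
# standing in for pd.NA, before calling from_pandas.
pop_pc_pd["known_pop"] = pop_pc_pd["known_pop"].astype(object).where(
    pop_pc_pd["known_pop"].notna(), None
)
pops_ht = hl.Table.from_pandas(pop_pc_pd, key=["s"])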

Thanks for the error report and your patience. I’ve taken another crack at it. Can you try installing this wheel and seeing if it resolves the issue?

Hi John
I think it works - the function runs without an error. However, I do have a problem with my hail setup and can’t view the output at the moment.
When I installed hail version 0.2.83 I used spark 3.1.2, which seemed to work.
With version 0.2.85, the released version of 0.2.86, and your patched version, I get an error on startup, but everything appeared to work OK until I tried to view the output hail table from this function with .show().

The error follows. Can you suggest the best version of spark/hadoop to use and how to set it up? I’m afraid I am a newcomer to this!

Thank you again for your help!

[Stage 5:>                                                          (0 + 1) / 1]
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
File ~/venv/lib/python3.8/site-packages/IPython/core/formatters.py:707, in PlainTextFormatter.__call__(self, obj)
    700 stream = StringIO()
    701 printer = pretty.RepresentationPrinter(stream, self.verbose,
    702     self.max_width, self.newline,
    703     max_seq_length=self.max_seq_length,
    704     singleton_pprinters=self.singleton_printers,
    705     type_pprinters=self.type_printers,
    706     deferred_pprinters=self.deferred_printers)
--> 707 printer.pretty(obj)
    708 printer.flush()
    709 return stream.getvalue()

File ~/venv/lib/python3.8/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File ~/venv/lib/python3.8/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File ~/venv/lib/python3.8/site-packages/hail/table.py:1350, in Table._Show.__repr__(self)
   1349 def __repr__(self):
-> 1350     return self.__str__()

File ~/venv/lib/python3.8/site-packages/hail/table.py:1347, in Table._Show.__str__(self)
   1346 def __str__(self):
-> 1347     return self._ascii_str()

File ~/venv/lib/python3.8/site-packages/hail/table.py:1373, in Table._Show._ascii_str(self)
   1370         return s[:truncate - 3] + "..."
   1371     return s
-> 1373 rows, has_more, dtype = self.data()
   1374 fields = list(dtype)
   1375 trunc_fields = [trunc(f) for f in fields]

File ~/venv/lib/python3.8/site-packages/hail/table.py:1357, in Table._Show.data(self)
   1355     row_dtype = t.row.dtype
   1356     t = t.select(**{k: hl._showstr(v) for (k, v) in t.row.items()})
-> 1357     rows, has_more = t._take_n(self.n)
   1358     self._data = (rows, has_more, row_dtype)
   1359 return self._data

File ~/venv/lib/python3.8/site-packages/hail/table.py:1504, in Table._take_n(self, n)
   1502     has_more = False
   1503 else:
-> 1504     rows = self.take(n + 1)
   1505     has_more = len(rows) > n
   1506     rows = rows[:n]

File <decorator-gen-1047>:2, in take(self, n, _localize)

File ~/venv/lib/python3.8/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File ~/venv/lib/python3.8/site-packages/hail/table.py:2174, in Table.take(self, n, _localize)
   2140 @typecheck_method(n=int, _localize=bool)
   2141 def take(self, n, _localize=True):
   2142     """Collect the first `n` rows of the table into a local list.
   2143 
   2144     Examples
   (...)
   2171         List of row structs.
   2172     """
-> 2174     return self.head(n).collect(_localize)

File <decorator-gen-1041>:2, in collect(self, _localize, _timed)

File ~/venv/lib/python3.8/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File ~/venv/lib/python3.8/site-packages/hail/table.py:1973, in Table.collect(self, _localize, _timed)
   1971 e = construct_expr(rows_ir, hl.tarray(t.row.dtype))
   1972 if _localize:
-> 1973     return Env.backend().execute(e._ir, timed=_timed)
   1974 else:
   1975     return e

File ~/venv/lib/python3.8/site-packages/hail/backend/py4j_backend.py:110, in Py4JBackend.execute(self, ir, timed)
    104     message_and_trace = (f'{error_message}\n'
    105                          '------------\n'
    106                          'Hail stack trace:\n'
    107                          f'{better_stack_trace}')
    108     raise HailUserError(message_and_trace) from None
--> 110 raise e

File ~/venv/lib/python3.8/site-packages/hail/backend/py4j_backend.py:86, in Py4JBackend.execute(self, ir, timed)
     84 # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     85 try:
---> 86     result_tuple = self._jhc.backend().executeEncode(jir, stream_codec)
     87     (result, timings) = (result_tuple._1(), result_tuple._2())
     88     value = ir.typ._from_encoding(result)

File ~/venv/lib/python3.8/site-packages/py4j/java_gateway.py:1304, in JavaMember.__call__(self, *args)
   1298 command = proto.CALL_COMMAND_NAME +\
   1299     self.command_header +\
   1300     args_command +\
   1301     proto.END_COMMAND_PART
   1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
   1305     answer, self.gateway_client, self.target_id, self.name)
   1307 for temp_arg in temp_args:
   1308     temp_arg._detach()

File ~/venv/lib/python3.8/site-packages/hail/backend/py4j_backend.py:29, in handle_java_exception.<locals>.deco(*args, **kwargs)
     27         raise FatalError('Error summary: %s' % (deepest,), error_id) from None
     28     else:
---> 29         raise FatalError('%s\n\nJava stack trace:\n%s\n'
     30                          'Hail version: %s\n'
     31                          'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     32 except pyspark.sql.utils.CapturedException as e:
     33     raise FatalError('%s\n\nJava stack trace:\n%s\n'
     34                      'Hail version: %s\n'
     35                      'Error summary: %s' % (e.desc, e.stackTrace, hail.__version__, e.desc)) from None

FatalError: InvalidClassException: org.apache.spark.sql.Row; local class incompatible: stream classdesc serialVersionUID = -7318765233437460818, local class serialVersionUID = 5249691020390965320

Is this running locally on your laptop, or on a cluster? You shouldn’t have to install Spark manually on your laptop; pip install hail should pull in pyspark as well and it should all just work.

I’m running on a cluster which is initiated using wtsi-hgi/osdataproc (GitHub), a command-line tool for creating an OpenStack cluster with Apache Spark and Apache Hadoop configured; it comes with JupyterLab and Hail (a genomic data analysis library built on Spark) installed, as well as Netdata for monitoring.
It looks like it downloads python packages (including hail), then hadoop, then spark. So I guess this may result in more than one version of spark on the cluster and cause clashes. Not sure how to test or fix this though!

Just took a peek at that repo. You might have to manually specify pyspark 3.1.2 here: updates to recent versions of hail and spark due to log4j vulnerabili… · wtsi-hgi/osdataproc@33c1634 · GitHub

or at least ensure it starts with 3.1. The hail we release on pypi is currently built with 3.1.2, and we will warn if that’s not the spark version installed. It often still works with slightly different versions (say, 3.1.3), and a user can recompile Hail for a specific Spark version, but the default pip release currently expects Spark 3.1.2.
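
A quick way to check which pyspark your environment actually resolves (a sketch):

import pyspark
print(pyspark.__version__)  # the current pypi hail build expects 3.1.2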

Hi John

There is something really odd going on. I’ve been running a number of tests today and can’t quite figure it out. I’ve been testing with the wheel file you sent above, as for some reason 0.2.87 isn’t available on pip.

I’ve tried a couple of different versions of spark, and followed the configuration instructions in the message hail gives about spark:

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar

Running on Apache Spark version 3.1.2
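
In code, I believe wiring those options into the SparkContext looks something like this (a sketch; on our cluster they may instead be set in spark-defaults.conf):

import os
import hail
import pyspark

# Locate the jar shipped inside the pip-installed hail package (HAIL_DIR).
hail_jar = os.path.join(os.path.dirname(hail.__file__), "backend", "hail-all-spark.jar")

conf = (pyspark.SparkConf()
        .set("spark.jars", hail_jar)
        .set("spark.driver.extraClassPath", hail_jar)
        .set("spark.executor.extraClassPath", "./hail-all-spark.jar"))
sc = pyspark.SparkContext(conf=conf)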

I’m running a pretty simple pipeline:

from hail import Table
import os
import pprint
from pprint import pformat
import argparse
import hail as hl
import pandas as pd
import numpy as np
import pyspark
from gnomad.sample_qc.ancestry import assign_population_pcs

sc = pyspark.SparkContext()
tmp_dir = "hdfs://spark-master:9820/"
lustre_dir = "file:///lustre/scratch123/qc/"
hl.init(sc=sc, tmp_dir=tmp_dir, default_reference="GRCh38")

pca_scores_ht_file = lustre_dir + "matrixtables/pca_scores_after_pruning.ht"
pca_scores = hl.read_table(pca_scores_ht_file)
pop_ht, pop_clf = assign_population_pcs(pca_scores, pca_scores.scores, known_col="known_pop", n_estimators=100, prop_train=0.8, min_prob=0.5)

All works well until I look at the output using pop_ht.show(); it then gives the error I had above.
However, if I save the hail table to a file, load it back from that file, and view it, it works.
I really don’t understand why this could be.
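
For reference, the detour that does work looks like this (a sketch; the path is illustrative):

# Writing the table out and reading it back avoids the .show() failure.
pop_ht.write(lustre_dir + "matrixtables/pop_ht.ht", overwrite=True)
pop_ht2 = hl.read_table(lustre_dir + "matrixtables/pop_ht.ht")
pop_ht2.show()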

Thank you again for all your help.

Sorry about the pip 0.2.87 issue. We had a problem with pip last night; it should be released now.

I would think that if you’re using the newest pip release and spark 3.1.2, that error should be resolved. I think that error basically means “there are two versions of a particular Java class in existence, and they’re getting mismatched”. It’s possible that different code paths trigger it or not, depending on whether that class gets used.

Scratch the above: the issue with deploying 0.2.87 remains unresolved. We are in the process of releasing 0.2.88, since the 0.2.87 release was broken.

Hey @Ruth_Eberhardt ,

Apologies for the churn today. We strive for a faster response to users’ issues. You should find 0.2.88 available on PyPI: hail · PyPI.

Hi John and Dan

Thank you so much for all your help. I have installed 0.2.88 and it is working well.
