Subset (matrix) table to a medium-sized list of variants

I’d like to subset a table (or matrix table) containing individual-level data to a list of a couple of hundred variants. Say the list of variants is something like variants = [["10", 123, "G", "C"], ["10", 456, "T", "A"], ...]. I have tried:

  • hl.filter_intervals() with one interval of length one for each variant, but that takes forever, and also doesn’t check whether the alleles are equal,
  • from functools import reduce
    from operator import or_
    
    match_exprs = [(mt.locus.contig == contig) & (mt.locus.position == pos) & (mt.alleles == [ref, alt]) for contig, pos, ref, alt in variants]
    mt_subset = mt.filter_rows(reduce(or_, match_exprs))
    
    which works nicely and reasonably fast for, say, 50 variants, but crashes with a StackOverflow error for 300 variants,
  • mt_subset = mt.filter_rows(hl.any(lambda x: (mt.locus.contig == hl.literal(x[0])) & (mt.locus.position == hl.literal(int(x[1]))) & (mt.alleles == hl.literal(x[2:])), variants))
    
    which ChatGPT pointed me toward, but that crashes with the following error (the traceback is copied from the actual failure, so the code in it differs slightly from my reworded version above):

Traceback (most recent call last):
[…]
, in _subset_matrix_table_to_variants
return mt.filter_rows(hl.any(lambda x: (mt.locus.contig == hl.literal(x[0])) & (mt.locus.position == hl.literal(int(x[1]))) & (mt.alleles == hl.literal(x[2:])), variants))
File “/app/.venv/lib/python3.10/site-packages/hail/expr/functions.py”, line 3531, in any
return collection.any(f)
File “”, line 2, in any
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 577, in wrapper
return original_func(*args, **kwargs)
File “/app/.venv/lib/python3.10/site-packages/hail/expr/expressions/typed_expressions.py”, line 68, in any
return hl.array(self).fold(lambda accum, elt: accum | f(elt), False)
File “”, line 2, in fold
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 577, in wrapper
return original_func(*args, **kwargs)
File “/app/.venv/lib/python3.10/site-packages/hail/expr/expressions/typed_expressions.py”, line 221, in fold
return collection._to_stream().fold(lambda x, y: f(x, y), zero)
File “”, line 2, in fold
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 577, in wrapper
return original_func(*args, **kwargs)
File “/app/.venv/lib/python3.10/site-packages/hail/expr/expressions/typed_expressions.py”, line 4522, in fold
body = to_expr(f(accum_ref, elt_ref))
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 364, in f
ret = x(*args)
File “/app/.venv/lib/python3.10/site-packages/hail/expr/expressions/typed_expressions.py”, line 221, in
return collection._to_stream().fold(lambda x, y: f(x, y), zero)
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 364, in f
ret = x(*args)
File “/app/.venv/lib/python3.10/site-packages/hail/expr/expressions/typed_expressions.py”, line 68, in
return hl.array(self).fold(lambda accum, elt: accum | f(elt), False)
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 364, in f
ret = x(*args)
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 364, in f
ret = x(*args)
File “/app/linkage_disequilibrium/ld_hail/pearson_correlations.py”, line 209, in
return mt.filter_rows(hl.any(lambda x: (mt.locus.contig == hl.literal(x[0])) & (mt.locus.position == hl.literal(x[1])) & (mt.alleles == hl.literal(x[2:])), ids))
File “”, line 2, in literal
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 577, in wrapper
return original_func(*args, **kwargs)
File “/app/.venv/lib/python3.10/site-packages/hail/expr/functions.py”, line 261, in literal
return literal(hl.eval(to_expr(x, dtype)), dtype)
File “”, line 2, in eval
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 577, in wrapper
return original_func(*args, **kwargs)
File “/app/.venv/lib/python3.10/site-packages/hail/expr/expressions/expression_utils.py”, line 223, in eval
return eval_timed(expression)[0]
File “”, line 2, in eval_timed
File “/app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py”, line 577, in wrapper
return original_func(*args, **kwargs)
File “/app/.venv/lib/python3.10/site-packages/hail/expr/expressions/expression_utils.py”, line 189, in eval_timed
return _eval_many(expression, timed=True, name=‘eval_timed’)[0]
File “/app/.venv/lib/python3.10/site-packages/hail/expr/expressions/expression_utils.py”, line 150, in _eval_many
return Env.backend().execute_many(*irs, timed=timed)
File “/app/.venv/lib/python3.10/site-packages/hail/backend/backend.py”, line 38, in execute_many
return [self.execute(MakeTuple([ir]), timed=timed)[0] for ir in irs]
File “/app/.venv/lib/python3.10/site-packages/hail/backend/backend.py”, line 38, in
return [self.execute(MakeTuple([ir]), timed=timed)[0] for ir in irs]
File “/app/.venv/lib/python3.10/site-packages/hail/backend/py4j_backend.py”, line 94, in execute
jir = self._to_java_value_ir(ir)
File “/app/.venv/lib/python3.10/site-packages/hail/backend/spark_backend.py”, line 280, in _to_java_value_ir
return self._to_java_ir(ir, self._parse_value_ir)
File “/app/.venv/lib/python3.10/site-packages/hail/backend/spark_backend.py”, line 276, in _to_java_ir
ir._jir = parse(r(finalize_randomness(ir)), ir_map=r.jirs)
File “/app/.venv/lib/python3.10/site-packages/hail/backend/spark_backend.py”, line 245, in _parse_value_ir
return self._jbackend.parse_value_ir(
File “/app/.venv/lib/python3.10/site-packages/py4j/java_gateway.py”, line 1304, in call
return_value = get_return_value(
File “/app/.venv/lib/python3.10/site-packages/hail/backend/py4j_backend.py”, line 21, in deco
return f(*args, **kwargs)
File “/app/.venv/lib/python3.10/site-packages/py4j/protocol.py”, line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o46.parse_value_ir.
: java.util.NoSuchElementException: key not found: __uid_4
at scala.collection.immutable.Map$Map1.apply(Map.scala:114)
at is.hail.expr.ir.Env.apply(Env.scala:128)
at is.hail.expr.ir.IRParser$.ir_value_expr_1(Parser.scala:890)
at is.hail.expr.ir.IRParser$.$anonfun$ir_value_expr$1(Parser.scala:820)
at is.hail.utils.StackSafe$More.advance(StackSafe.scala:64)
at is.hail.utils.StackSafe$.run(StackSafe.scala:16)
at is.hail.utils.StackSafe$StackFrame.run(StackSafe.scala:32)
at is.hail.expr.ir.IRParser$.$anonfun$parse_value_ir$1(Parser.scala:2072)
at is.hail.expr.ir.IRParser$.parse(Parser.scala:2068)
at is.hail.expr.ir.IRParser$.parse_value_ir(Parser.scala:2072)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_value_ir$2(SparkBackend.scala:710)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
at is.hail.utils.package$.using(package.scala:635)
at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
at is.hail.utils.package$.using(package.scala:635)
at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:339)
at is.hail.backend.spark.SparkBackend.$anonfun$parse_value_ir$1(SparkBackend.scala:709)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.parse_value_ir(SparkBackend.scala:708)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)

I am also aware of hl.import_locus_intervals, but would like to avoid that, because it seems I need to write out a file with the variants first or create a table akin to the result of import_locus_intervals in memory first. Also, this again doesn’t check for matching alleles.

Any help or ideas would be appreciated :slightly_smiling_face:

The NoSuchElementException thing is a bug in Hail. I’ll get that reported.

I think you might be able to go to bigger lists if you use Hail’s struct and Locus equality instead of manually writing out the equality test:

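from functools import reduce
from operator import or_
# Build one Python-side struct per [contig, pos, ref, alt] entry
variants = [hl.Struct(locus=hl.Locus(v[0], v[1]), alleles=[v[2], v[3]])
            for v in variants]
filters = [v == mt.row_key for v in variants]
mt = mt.filter_rows(reduce(or_, filters))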

What version of Hail are you using?
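
A possible workaround for the StackOverflow error seen with reduce(or_, ...), sketched under the assumption that the error comes from the nesting depth of the combined expression: functools.reduce builds a left-deep chain whose depth grows linearly with the number of variants, whereas combining neighbours pairwise keeps the depth logarithmic. The helper name balanced_reduce is hypothetical and not part of Hail, and whether this avoids the crash in practice has not been confirmed in this thread.

```python
from operator import or_


def balanced_reduce(op, items):
    """Combine items pairwise so nesting depth grows like log2(n), not n.

    Hypothetical helper, not a Hail function. `op` must be associative,
    which holds for the boolean `|` used to OR filter expressions.
    """
    items = list(items)
    if not items:
        raise ValueError("balanced_reduce() needs at least one item")
    while len(items) > 1:
        # Combine neighbours; an odd trailing item is carried over unchanged.
        paired = [op(items[i], items[i + 1]) for i in range(0, len(items) - 1, 2)]
        if len(items) % 2:
            paired.append(items[-1])
        items = paired
    return items[0]


def depth_op(left_depth, right_depth):
    """Toy operator that tracks the nesting depth of the combined result."""
    return max(left_depth, right_depth) + 1
```

With a list of Hail boolean expressions, the call would look like mt.filter_rows(balanced_reduce(or_, filters)). Using depth_op on 300 leaves shows the effect: the left-deep chain has depth 299, while the balanced version has depth 9.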

Here’s a way that should scale much better – if you’ve got more than about ten thousand variants, though, probably better to create a table from the variants and do a semi_join:

variants_tuples = [tuple(x) for x in variants]  # can't coerce heterogeneous lists into hail exprs
variants = hl.map(lambda t: hl.struct(locus=hl.locus(t[0], t[1], reference_genome='RG'), alleles=[t[2], t[3]]), hl.literal(variants_tuples))
mt_subset = mt.filter_rows(variants.contains(mt.row_key))

Sorry, both of you, for getting back to this so late. I was on a (hopefully) well-deserved vacation :slightly_smiling_face:

Thanks!

@danking that looks very nice. I tried it out, and stumbled upon some irregularities / possible bugs in Hail. See the following extract from an IPython session:

In [53]: minimt.row_key.show()
+---------------+------------+
| locus         | alleles    |
+---------------+------------+
| locus<GRCh37> | array<str> |
+---------------+------------+
| 10:60515      | ["C","T"]  |
+---------------+------------+

In [54]: vstruct
Out[54]: Struct(locus=Locus(contig=10, position=60515, reference_genome=GRCh37), alleles=['C', 'T'])

In [55]: (vstruct == minimt.row_key)
Out[55]: False

In [56]: (minimt.row_key == vstruct).show()
+---------------+------------+--------+
| locus         | alleles    | <expr> |
+---------------+------------+--------+
| locus<GRCh37> | array<str> |   bool |
+---------------+------------+--------+
| 10:60515      | ["C","T"]  |   True |
+---------------+------------+--------+

In [57]: minimt.filter_rows(vstruct == minimt.row_key).row_key.show()
+---------------+------------+
| locus         | alleles    |
+---------------+------------+
| locus<GRCh37> | array<str> |
+---------------+------------+
+---------------+------------+

In [58]: minimt.filter_rows(minimt.row_key == vstruct).row_key.show()
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
File /app/.venv/lib/python3.10/site-packages/IPython/core/formatters.py:708, in PlainTextFormatter.__call__(self, obj)
    701 stream = StringIO()
    702 printer = pretty.RepresentationPrinter(stream, self.verbose,
    703     self.max_width, self.newline,
    704     max_seq_length=self.max_seq_length,
    705     singleton_pprinters=self.singleton_printers,
    706     type_pprinters=self.type_printers,
    707     deferred_pprinters=self.deferred_printers)
--> 708 printer.pretty(obj)
    709 printer.flush()
    710 return stream.getvalue()

File /app/.venv/lib/python3.10/site-packages/IPython/lib/pretty.py:410, in RepresentationPrinter.pretty(self, obj)
    407                         return meth(obj, self, cycle)
    408                 if cls is not object \
    409                         and callable(cls.__dict__.get('__repr__')):
--> 410                     return _repr_pprint(obj, self, cycle)
    412     return _default_pprint(obj, self, cycle)
    413 finally:

File /app/.venv/lib/python3.10/site-packages/IPython/lib/pretty.py:778, in _repr_pprint(obj, p, cycle)
    776 """A pprint that just redirects to the normal repr function."""
    777 # Find newlines and replace them with p.break_()
--> 778 output = repr(obj)
    779 lines = output.splitlines()
    780 with p.group():

File /app/.venv/lib/python3.10/site-packages/hail/table.py:1492, in Table._Show.__repr__(self)
   1491 def __repr__(self):
-> 1492     return self.__str__()

File /app/.venv/lib/python3.10/site-packages/hail/table.py:1489, in Table._Show.__str__(self)
   1488 def __str__(self):
-> 1489     return self._ascii_str()

File /app/.venv/lib/python3.10/site-packages/hail/table.py:1515, in Table._Show._ascii_str(self)
   1512         return s[:truncate - 3] + "..."
   1513     return s
-> 1515 rows, has_more, dtype = self.data()
   1516 fields = list(dtype)
   1517 trunc_fields = [trunc(f) for f in fields]

File /app/.venv/lib/python3.10/site-packages/hail/table.py:1499, in Table._Show.data(self)
   1497     row_dtype = t.row.dtype
   1498     t = t.select(**{k: hl._showstr(v) for (k, v) in t.row.items()})
-> 1499     rows, has_more = t._take_n(self.n)
   1500     self._data = (rows, has_more, row_dtype)
   1501 return self._data

File /app/.venv/lib/python3.10/site-packages/hail/table.py:1646, in Table._take_n(self, n)
   1644     has_more = False
   1645 else:
-> 1646     rows = self.take(n + 1)
   1647     has_more = len(rows) > n
   1648     rows = rows[:n]

File <decorator-gen-1094>:2, in take(self, n, _localize)

File /app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File /app/.venv/lib/python3.10/site-packages/hail/table.py:2319, in Table.take(self, n, _localize)
   2285 @typecheck_method(n=int, _localize=bool)
   2286 def take(self, n, _localize=True):
   2287     """Collect the first `n` rows of the table into a local list.
   2288
   2289     Examples
   (...)
   2316         List of row structs.
   2317     """
-> 2319     return self.head(n).collect(_localize)

File <decorator-gen-1088>:2, in collect(self, _localize, _timed)

File /app/.venv/lib/python3.10/site-packages/hail/typecheck/check.py:577, in _make_dec.<locals>.wrapper(__original_func, *args, **kwargs)
    574 @decorator
    575 def wrapper(__original_func, *args, **kwargs):
    576     args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577     return __original_func(*args_, **kwargs_)

File /app/.venv/lib/python3.10/site-packages/hail/table.py:2118, in Table.collect(self, _localize, _timed)
   2116 e = construct_expr(rows_ir, hl.tarray(t.row.dtype))
   2117 if _localize:
-> 2118     return Env.backend().execute(e._ir, timed=_timed)
   2119 else:
   2120     return e

File /app/.venv/lib/python3.10/site-packages/hail/backend/py4j_backend.py:104, in Py4JBackend.execute(self, ir, timed)
    102     return (value, timings) if timed else value
    103 except FatalError as e:
--> 104     self._handle_fatal_error_from_backend(e, ir)

File /app/.venv/lib/python3.10/site-packages/hail/backend/backend.py:181, in Backend._handle_fatal_error_from_backend(self, err, ir)
    179 error_sources = ir.base_search(lambda x: x._error_id == err._error_id)
    180 if len(error_sources) == 0:
--> 181     raise err
    183 better_stack_trace = error_sources[0]._stack_trace
    184 error_message = str(err)

File /app/.venv/lib/python3.10/site-packages/hail/backend/py4j_backend.py:98, in Py4JBackend.execute(self, ir, timed)
     96 # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     97 try:
---> 98     result_tuple = self._jbackend.executeEncode(jir, stream_codec, timed)
     99     (result, timings) = (result_tuple._1(), result_tuple._2())
    100     value = ir.typ._from_encoding(result)

File /app/.venv/lib/python3.10/site-packages/py4j/java_gateway.py:1304, in JavaMember.__call__(self, *args)
   1298 command = proto.CALL_COMMAND_NAME +\
   1299     self.command_header +\
   1300     args_command +\
   1301     proto.END_COMMAND_PART
   1303 answer = self.gateway_client.send_command(command)
-> 1304 return_value = get_return_value(
   1305     answer, self.gateway_client, self.target_id, self.name)
   1307 for temp_arg in temp_args:
   1308     temp_arg._detach()

File /app/.venv/lib/python3.10/site-packages/hail/backend/py4j_backend.py:31, in handle_java_exception.<locals>.deco(*args, **kwargs)
     29     tpl = Env.jutils().handleForPython(e.java_exception)
     30     deepest, full, error_id = tpl._1(), tpl._2(), tpl._3()
---> 31     raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
     32 except pyspark.sql.utils.CapturedException as e:
     33     raise FatalError('%s\n\nJava stack trace:\n%s\n'
     34                      'Hail version: %s\n'
     35                      'Error summary: %s' % (e.desc, e.stackTrace, hail.__version__, e.desc)) from None

FatalError: ClassCastException: class org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to class is.hail.variant.Locus (org.apache.spark.sql.catalyst.expressions.GenericRow is in unnamed module of loader 'app'; is.hail.variant.Locus is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @62435e70)

Java stack trace:
java.lang.ClassCastException: class org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to class is.hail.variant.Locus (org.apache.spark.sql.catalyst.expressions.GenericRow is in unnamed module of loader 'app'; is.hail.variant.Locus is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @62435e70)
        at is.hail.expr.JSONAnnotationImpex$.exportAnnotation(AnnotationImpex.scala:124)
        at is.hail.expr.JSONAnnotationImpex$.$anonfun$exportAnnotation$5(AnnotationImpex.scala:129)
        at is.hail.expr.JSONAnnotationImpex$.$anonfun$exportAnnotation$5$adapted(AnnotationImpex.scala:128)
        at scala.collection.generic.GenTraversableFactory.tabulate(GenTraversableFactory.scala:150)
        at is.hail.expr.JSONAnnotationImpex$.exportAnnotation(AnnotationImpex.scala:128)
        at is.hail.types.virtual.Type.toJSON(Type.scala:184)
        at is.hail.expr.JSONAnnotationImpex$.$anonfun$exportAnnotation$4(AnnotationImpex.scala:125)
        at is.hail.utils.Interval.toJSON(Interval.scala:103)
        at is.hail.expr.JSONAnnotationImpex$.exportAnnotation(AnnotationImpex.scala:125)
        at is.hail.expr.JSONAnnotationImpex$.$anonfun$exportAnnotation$1(AnnotationImpex.scala:113)
        at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at scala.collection.TraversableLike.map(TraversableLike.scala:238)
        at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
        at scala.collection.AbstractTraversable.map(Traversable.scala:108)
        at is.hail.expr.JSONAnnotationImpex$.exportAnnotation(AnnotationImpex.scala:113)
        at is.hail.expr.ir.Pretty.header(Pretty.scala:405)
        at is.hail.expr.ir.Pretty.pretty$1(Pretty.scala:463)
        at is.hail.expr.ir.Pretty.$anonfun$sexprStyle$4(Pretty.scala:453)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
        at scala.collection.Iterator$ConcatIterator.next(Iterator.scala:230)
        at is.hail.utils.richUtils.RichIterator$$anon$3.next(RichIterator.scala:67)
        at is.hail.utils.prettyPrint.Doc$.advance$1(PrettyPrintWriter.scala:68)
        at is.hail.utils.prettyPrint.Doc$.render(PrettyPrintWriter.scala:139)
        at is.hail.utils.prettyPrint.Doc.render(PrettyPrintWriter.scala:163)
        at is.hail.utils.prettyPrint.Doc.render(PrettyPrintWriter.scala:167)
        at is.hail.expr.ir.Pretty.sexprStyle(Pretty.scala:466)
        at is.hail.expr.ir.Pretty.apply(Pretty.scala:429)
        at is.hail.expr.ir.Pretty$.apply(Pretty.scala:22)
        at is.hail.expr.ir.Optimize$.apply(Optimize.scala:45)
        at is.hail.expr.ir.lowering.OptimizePass.transform(LoweringPass.scala:30)
        at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
        at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
        at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
        at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
        at is.hail.expr.ir.lowering.OptimizePass.apply(LoweringPass.scala:26)
        at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
        at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
        at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
        at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
        at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:450)
        at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:486)
        at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:70)
        at is.hail.utils.package$.using(package.scala:635)
        at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:70)
        at is.hail.utils.package$.using(package.scala:635)
        at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
        at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:59)
        at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:339)
        at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:483)
        at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
        at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:482)
        at jdk.internal.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:829)



Hail version: 0.2.105-acd89e80c345
Error summary: ClassCastException: class org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to class is.hail.variant.Locus (org.apache.spark.sql.catalyst.expressions.GenericRow is in unnamed module of loader 'app'; is.hail.variant.Locus is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @62435e70)

That asymmetry in the equality test looks concerning to me. Apart from that, this solution (with the “order” in the equality statement that doesn’t lead to a crash) finished (with the wrong result, namely zero resulting filtered variants) in very reasonable time for a list of 300 variants. So that looks promising, assuming we can fix or work around the above-described issues.

@tpoterba your solution seems very slow, at least on my setup. Note, though, that I’m likely using only the driver node, perhaps plus one additional core node, which might limit scaling. It has been running for more than ten minutes now for a list of 300 variants and roughly a tenth of the chr10 1000G 2013/05/02 release. I’m aiming for a total runtime of at most a minute or so.

I am using Hail v0.2.105.

Python is doing shenanigans with equality. We should fix it to not fall back on Mapping equality when the RHS is an expression. Not sure about the Row / Locus issue, but we’ll look into it.
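
The asymmetry can be reproduced without Hail. The sketch below uses hypothetical toy classes (PlainValue, LazyExpr, PoliteValue) standing in for a Python-side value such as hl.Struct and for a lazy expression; it illustrates Python's comparison dispatch in general and is not Hail's actual implementation. Python asks the left operand's __eq__ first and only tries the right operand if the left one returns NotImplemented.

```python
class PlainValue:
    """Toy stand-in for a Python-side value (hypothetical, not a Hail class)."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        # Returns False for unknown types instead of NotImplemented,
        # so Python never consults the right-hand operand.
        if isinstance(other, PlainValue):
            return self.value == other.value
        return False

    def __repr__(self):
        return f"{type(self).__name__}({self.value!r})"


class LazyExpr:
    """Toy stand-in for a lazy expression whose == builds a new expression."""

    def __init__(self, description):
        self.description = description

    def __eq__(self, other):
        return LazyExpr(f"({self.description} == {other!r})")


class PoliteValue(PlainValue):
    """Like PlainValue, but defers to the other operand for unknown types."""

    def __eq__(self, other):
        if isinstance(other, PlainValue):
            return self.value == other.value
        return NotImplemented  # lets Python try other.__eq__(self)


value_first = PlainValue(1) == LazyExpr("row_key")    # a plain bool
expr_first = LazyExpr("row_key") == PlainValue(1)     # a LazyExpr
polite_first = PoliteValue(1) == LazyExpr("row_key")  # a LazyExpr
```

Here value_first is a plain False, which matches the behaviour of vstruct == minimt.row_key in the session above, while expr_first and polite_first are both lazy expressions. Returning NotImplemented for unknown right-hand types is one way to get the delegation described in the reply.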

You can follow these two issues in the hail-is/hail GitHub repository for updates:

  • Issue #13045: Locus (and other Python data types) in __eq__, etc. should check if the RHS is a Hail Expression and delegate to the expression instead.
  • Issue #13046: ClassCastException when comparing hl.Locus to row key.

Thanks a lot, will do!

In the meantime, I tried out @tpoterba’s suggestion of creating a table from the variants and using mt.semi_join_rows() / ht.semi_join() to subset a (matrix) table to the variants in that newly created table. That turns out to be very fast and quite readable, and thus solves my original problem :slightly_smiling_face: Thanks a lot for all of your help, great support here as usual :heart:
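
The exact code used was not posted, so the following is only a minimal sketch of the semi-join approach. The function names variant_rows and subset_to_variants are hypothetical, and the default reference genome "GRCh37" is an assumption taken from the row key shown earlier in the thread. The variant list is turned into a small Hail table keyed by (locus, alleles), like a matrix table's row key, so the join also checks the alleles.

```python
def variant_rows(variants):
    """Turn [contig, pos, ref, alt] lists into plain dict rows (hypothetical helper)."""
    return [
        {"contig": str(contig), "position": int(pos), "alleles": [ref, alt]}
        for contig, pos, ref, alt in variants
    ]


def subset_to_variants(mt, variants, reference_genome="GRCh37"):
    """Keep only the rows of `mt` whose (locus, alleles) key is in `variants`.

    Assumes `mt` is keyed by locus and alleles on `reference_genome`.
    """
    import hail as hl  # imported lazily so variant_rows() works without Hail

    schema = hl.tstruct(contig=hl.tstr, position=hl.tint32, alleles=hl.tarray(hl.tstr))
    ht = hl.Table.parallelize(variant_rows(variants), schema)
    # Key the small table exactly like the matrix table's row key.
    ht = ht.key_by(
        locus=hl.locus(ht.contig, ht.position, reference_genome=reference_genome),
        alleles=ht.alleles,
    )
    return mt.semi_join_rows(ht)
```

For a plain Table keyed the same way, ht_data.semi_join(ht) would take the place of mt.semi_join_rows(ht) in the last line.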

Hey @simeoncarstens ,

After further digging, I can’t replicate this error exactly. It seems to me that somewhere you’re annotating an interval of loci onto your minimt and that is eventually causing trouble. I think it is just a coincidence that the error only manifests when you filter and show the row key.

Can you share the construction of minimt?