Find a row in a Hail Table by a string representation of a variant

I have a Hail Table whose rows are keyed by the locus and alleles fields:

In [37]: cidr.describe()                                                                                                                                       
----------------------------------------
Global fields:
    'sourceFilePath': str 
    'genomeVersion': str 
----------------------------------------
Row fields:
    'locus': locus<GRCh38> 
    'alleles': array<str> 
    'rsid': str 
    'qual': float64 
    'filters': set<str> 
    'info': struct {
        AC: array<int32>, 
        AF: array<float64>, 
        AN: int32
    } 
    'original_alt_alleles': array<str> 
    'a_index': int32 
    'was_split': bool 
----------------------------------------
Key: ['locus', 'alleles']
----------------------------------------

I want to select a row given the string ‘20-42077813-A-ATTT’. How can I do that?

I tried the following, but it does not work:

pos = hl.Locus.parse('20:42077813', reference_genome='GRCh38')
cidr_problem = cidr[hl.struct(locus = pos, alleles = hl.literal(['A', 'ATTT']))].info.AC

which gives me an ‘ExpressionException: Cannot index with a scalar expression’ error.

You can’t look up a single element of a Table. You could filter to that single element, then collect the resulting table.
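As a side note: the lookup string here is dash-separated, while Hail’s parsers expect colon-separated ‘contig:pos:ref:alt’ strings, so one option is to split it first in plain Python. This is a hypothetical helper (`split_variant_string` is not part of Hail), just a sketch:

```python
def split_variant_string(s):
    """Split a dash-separated variant string like '20-42077813-A-ATTT'
    into the pieces a Hail lookup needs: contig, position, allele list."""
    contig, pos, ref, alt = s.split('-')
    return contig, int(pos), [ref, alt]

# Rebuild a colon-separated string of the form hl.parse_variant expects:
contig, pos, alleles = split_variant_string('20-42077813-A-ATTT')
variant_str = f"{contig}:{pos}:{alleles[0]}:{alleles[1]}"
```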

And how could I filter it to achieve the ultimate result?

final_results = cidr.filter(cidr.locus == pos).collect()
final_result = final_results[0] # Assuming there's only one
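Since collect() returns a plain Python list, the “assuming there’s only one” step can be made explicit with a small guard. This is a hypothetical helper, not part of Hail:

```python
def single_row(results):
    """Return the only element of a collect() result, raising a clear
    error if the filter matched zero or multiple rows."""
    if len(results) != 1:
        raise ValueError(f"expected exactly one row, got {len(results)}")
    return results[0]
```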

The first line - final_results = cidr.filter(cidr.locus == pos).collect() - gives me:

FatalError: AssertionError: assertion failed

What’s the rest of the error?

In [45]: final_results = cidr.filter(cidr.locus == pos).collect()                                                                                              
---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
<ipython-input-45-d80e1c9ab6a1> in <module>
----> 1 final_results = cidr.filter(cidr.locus == pos).collect()

</opt/seqr/.conda/envs/py37/lib/python3.7/site-packages/decorator.py:decorator-gen-1102> in collect(self, _localize)

~/.conda/envs/py37/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

~/.conda/envs/py37/lib/python3.7/site-packages/hail/table.py in collect(self, _localize)
   1918         e = construct_expr(rows_ir, hl.tarray(t.row.dtype))
   1919         if _localize:
-> 1920             return Env.backend().execute(e._ir)
   1921         else:
   1922             return e

~/.conda/envs/py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     96                 raise HailUserError(message_and_trace) from None
     97 
---> 98             raise e

~/.conda/envs/py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     72         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     73         try:
---> 74             result = json.loads(self._jhc.backend().executeJSON(jir))
     75             value = ir.typ._from_json(result['value'])
     76             timings = result['timings']

~/.conda/envs/py37/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

~/.conda/envs/py37/lib/python3.7/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     30                 raise FatalError('%s\n\nJava stack trace:\n%s\n'
     31                                  'Hail version: %s\n'
---> 32                                  'Error summary: %s' % (deepest, full, hail.__version__, deepest), error_id) from None
     33         except pyspark.sql.utils.CapturedException as e:
     34             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: AssertionError: assertion failed

Java stack trace:
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:156)
	at is.hail.variant.ReferenceGenome$.compare(ReferenceGenome.scala:700)
	at is.hail.variant.ReferenceGenome$.compare(ReferenceGenome.scala:707)
	at is.hail.variant.ReferenceGenome$$anon$1.compare(ReferenceGenome.scala:112)
	at is.hail.variant.ReferenceGenome$$anon$1.compare(ReferenceGenome.scala:111)
	at is.hail.annotations.ExtendedOrdering$$anon$3.compareNonnull(ExtendedOrdering.scala:11)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.rvd.PartitionBoundOrdering$$anon$2.compareNonnull(PartitionBoundOrdering.scala:42)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.rvd.PartitionBoundOrdering$$anon$2$$anon$1.compareIntervalEndpoints(PartitionBoundOrdering.scala:121)
	at is.hail.annotations.IntervalEndpointOrdering.compareNonnull(ExtendedOrdering.scala:431)
	at is.hail.annotations.ExtendedOrdering.ltNonnull(ExtendedOrdering.scala:282)
	at is.hail.annotations.ExtendedOrdering.gtNonnull(ExtendedOrdering.scala:288)
	at is.hail.annotations.ExtendedOrdering.gt(ExtendedOrdering.scala:351)
	at is.hail.rvd.RVDPartitioner$$anonfun$6.apply(RVDPartitioner.scala:163)
	at is.hail.rvd.RVDPartitioner$$anonfun$6.apply(RVDPartitioner.scala:163)
	at scala.collection.IndexedSeqOptimized$$anonfun$indexWhere$1.apply(IndexedSeqOptimized.scala:204)
	at scala.collection.IndexedSeqOptimized$$anonfun$indexWhere$1.apply(IndexedSeqOptimized.scala:204)
	at scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:195)
	at scala.collection.mutable.ArrayBuffer.segmentLength(ArrayBuffer.scala:48)
	at scala.collection.IndexedSeqOptimized$class.indexWhere(IndexedSeqOptimized.scala:204)
	at scala.collection.mutable.ArrayBuffer.indexWhere(ArrayBuffer.scala:48)
	at is.hail.rvd.RVDPartitioner.is$hail$rvd$RVDPartitioner$$firstPast$1(RVDPartitioner.scala:163)
	at is.hail.rvd.RVDPartitioner$$anonfun$7.apply(RVDPartitioner.scala:167)
	at is.hail.rvd.RVDPartitioner$$anonfun$7.apply(RVDPartitioner.scala:166)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
	at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
	at is.hail.rvd.RVDPartitioner.subdivide(RVDPartitioner.scala:166)
	at is.hail.rvd.RVDPartitioner$.generate(RVDPartitioner.scala:397)
	at is.hail.rvd.RVDPartitioner$.generate(RVDPartitioner.scala:385)
	at is.hail.rvd.RVDPartitioner.extendKey(RVDPartitioner.scala:133)
	at is.hail.rvd.IndexedRVDSpec2.read(AbstractRVDSpec.scala:467)
	at is.hail.expr.ir.RVDComponentSpec.read(AbstractMatrixTableSpec.scala:112)
	at is.hail.expr.ir.TableNativeReader.apply(TableIR.scala:882)
	at is.hail.expr.ir.TableRead.execute(TableIR.scala:1143)
	at is.hail.expr.ir.TableKeyBy.execute(TableIR.scala:1255)
	at is.hail.expr.ir.Interpret$.run(Interpret.scala:813)
	at is.hail.expr.ir.Interpret$.alreadyLowered(Interpret.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$.interpretAndCoerce$1(InterpretNonCompilable.scala:16)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:53)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at is.hail.expr.ir.InterpretNonCompilable$.rewriteChildren$1(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:54)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at is.hail.expr.ir.InterpretNonCompilable$.rewriteChildren$1(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:54)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at is.hail.expr.ir.InterpretNonCompilable$.rewriteChildren$1(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:54)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at is.hail.expr.ir.InterpretNonCompilable$.rewriteChildren$1(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:54)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at is.hail.expr.ir.InterpretNonCompilable$.rewriteChildren$1(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:54)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at is.hail.expr.ir.InterpretNonCompilable$.rewriteChildren$1(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:54)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$$anonfun$1.apply(InterpretNonCompilable.scala:25)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at is.hail.expr.ir.InterpretNonCompilable$.rewriteChildren$1(InterpretNonCompilable.scala:25)
	at is.hail.expr.ir.InterpretNonCompilable$.is$hail$expr$ir$InterpretNonCompilable$$rewrite$1(InterpretNonCompilable.scala:54)
	at is.hail.expr.ir.InterpretNonCompilable$.apply(InterpretNonCompilable.scala:58)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.transform(LoweringPass.scala:67)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3$$anonfun$1.apply(LoweringPass.scala:15)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:15)
	at is.hail.expr.ir.lowering.LoweringPass$$anonfun$apply$3.apply(LoweringPass.scala:13)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass$class.apply(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.InterpretNonCompilablePass$.apply(LoweringPass.scala:62)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:14)
	at is.hail.expr.ir.lowering.LoweringPipeline$$anonfun$apply$1.apply(LoweringPipeline.scala:12)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:12)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:28)
	at is.hail.backend.spark.SparkBackend.is$hail$backend$spark$SparkBackend$$_execute(SparkBackend.scala:362)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:346)
	at is.hail.backend.spark.SparkBackend$$anonfun$execute$1.apply(SparkBackend.scala:343)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:48)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1$$anonfun$apply$1.apply(ExecuteContext.scala:48)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:48)
	at is.hail.expr.ir.ExecuteContext$$anonfun$scoped$1.apply(ExecuteContext.scala:47)
	at is.hail.utils.package$.using(package.scala:618)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:13)
	at is.hail.expr.ir.ExecuteContext$.scoped(ExecuteContext.scala:47)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:256)
	at is.hail.backend.spark.SparkBackend.execute(SparkBackend.scala:343)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:387)
	at is.hail.backend.spark.SparkBackend$$anonfun$7.apply(SparkBackend.scala:385)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeJSON(SparkBackend.scala:385)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.63-cb767a7507c8
Error summary: AssertionError: assertion failed

Hey @NLSVTN, sorry you’re running into this assertion error. That’s a bad error message. It seems like at least one locus has an invalid chromosome.

From where did you get cidr? If you run cidr.write('/tmp/foo.t'), do you also get an error message?

This happens not only for cidr but for other datasets too. The write operation gives me a ‘connection refused’ error. I am just importing the datasets using the usual functions: import_vcf, read_table, read_matrix_table, etc. Could it be because there is no such position in the dataset?

Every time you write a dataset you get a connection refused error? Are you using Hail in Google Dataproc, on a laptop, on a virtual machine, or on a custom Spark cluster?

I’m not sure what the problem is. What happens if you run cidr._force_count()?

I am just using ipython on a server. It seems so; I have never written the dataset before, since there was no need to. cidr._force_count() just runs and returns a number: 638372.

Can you share the full script you ran that triggers the assertion error?

cidr = hl.read_table('file:///cidr_batch1_and_batch2_cleaned.ht') 
pos = hl.Locus.parse('20:42077813', reference_genome='GRCh38')
final_results = cidr.filter(cidr.locus == pos).collect()

That’s it. But I actually need to look up a specific allele, not just everything at position 20:42077813.

Regarding matching a specific variant, you’re probably looking for hl.parse_variant:

var = hl.parse_variant('20:42077813:A:ATTT')
cidr.filter((cidr.locus = var.locus) & (cidr.alleles = var.alleles)).collect()

Your original attempt would also work if you changed it like this:

pos = hl.Locus.parse('20:42077813', reference_genome='GRCh38')
cidr.filter(
    (cidr.locus = var.locus) & (cidr.alleles = hl.literal(['A', 'ATTT']))
).collect()

I’m still not sure why you’re getting the AssertionError. I’ll ask someone from the compiler team to improve the error message so that we can get more useful information next time.


This line - cidr.filter((cidr.locus = var.locus) & (cidr.alleles = var.alleles)).collect() - gives me SyntaxError: invalid syntax:

In [8]: cidr.filter((cidr.locus = var.locus) & (cidr.alleles = var.alleles)).collect()                                                                                 
  File "<ipython-input-8-3254182d042d>", line 1
    cidr.filter((cidr.locus = var.locus) & (cidr.alleles = var.alleles)).collect()
                            ^
SyntaxError: invalid syntax

In the second case you mean pos.locus, not var.locus, right?

OK, after changing = to == it started to work. But then I ran into another issue with the following code:

var = hl.parse_variant('20:42077813:A:ATTT', reference_genome='GRCh38')
cidr.filter((cidr.locus == var.locus) & (cidr.alleles == var.alleles)).collect() 

Error summary: HailException: Invalid locus '20:42077813' found. Contig '20' is not in the reference genome 'GRCh38'

OK, the parse_variant example works well with GRCh38; I just needed to prepend chr:

var = hl.parse_variant('chr20:42077813:A:ATTT', reference_genome='GRCh38')
cidr.filter((cidr.locus == var.locus) & (cidr.alleles == var.alleles)).collect()

With pos I needed to make some corrections too:

pos = hl.Locus.parse('chr20:42077813', reference_genome='GRCh38')
cidr.filter(
    (cidr.locus == pos) & (cidr.alleles == hl.literal(['A', 'ATTT']))
).collect()
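Since GRCh38 contig names carry the chr prefix while GRCh37 names do not, a small plain-Python helper (hypothetical, not part of Hail) can normalize contigs before building the lookup string, to avoid the ‘Contig not in the reference genome’ error:

```python
def normalize_contig(contig, reference_genome='GRCh38'):
    """Prepend 'chr' for GRCh38-style contig names if it is missing.
    GRCh37-style contigs are left untouched. Hypothetical helper."""
    if reference_genome == 'GRCh38' and not contig.startswith('chr'):
        return 'chr' + contig
    return contig
```

For example, `normalize_contig('20')` yields `'chr20'`, which can then be used to build the string passed to hl.parse_variant.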

Thank you very much!

The ‘connection refused’ error was happening because there are issues with my Hadoop cluster and Hail is trying to write to it. How can I write locally? I tried the file:/// prefix, but it does not seem to work, e.g.:

mt.a_index.export('file:///mt_aIndex.tsv')

The file:// prefix is what I would expect to work. If you’re using a Hadoop cluster, then your workers are presumably on different machines from the leader machine, which means each worker writes to its own local filesystem. That’s probably not what you want.

Currently I have just launched ipython. I can read the file with the file:/// prefix, but somehow I can’t export it with the same prefix.