Counting rows in hail table

Hi,

I’m trying to filter VCF files and then count the rows to get the number of variants. Is there another way to count the variants? The count command seems very slow…

I’m attaching the code, but you don’t really have to look at it :)
Thanks!

interval_table = hl.import_locus_intervals('PATH', reference_genome='GRCh38')
for filename in sorted(os.listdir('PATH')):
    mt = read_mt('PATH' + filename)
    filtered_mt = mt.filter_entries(mt.DP >= 20)
    filtered_mt = filter_mt(interval_table, filtered_mt)
    filtered_mt = filtered_mt.filter_entries((filtered_mt.GT != hl.Call([0,0], phased=True)) & (filtered_mt.GT != hl.Call([0,0])))
    filtered_ht = filtered_mt.make_table()
#     filtered_mt.describe()
    annot_ht = filtered_ht.annotate(gnomAD_AF = gnomAD_ht[filtered_ht.locus, filtered_ht.alleles].info.AF)
    annot_ht = annot_ht.filter((annot_ht.gnomAD_AF[0] < 0.001) & (~hl.is_nan(annot_ht.gnomAD_AF[0])))
    vars_counts = annot_ht.count()
    print(filename+ ' ' + str(vars_counts))

Hail tables and matrix tables are recipes: each method you call adds steps to the recipe. “Action” methods like “count”, “write”, and “collect” actually execute the recipe, so this count is executing every step in your recipe. MatrixTable.make_table is a fairly inefficient method: it takes the compact vector representation of a Hail MatrixTable and converts it into a tabular form. What are you trying to achieve with make_table?
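
To make the recipe idea concrete, here is a toy model in plain Python. It only illustrates lazy evaluation and is not how Hail is implemented; the LazyTable class and its methods are made up for this sketch.

```python
class LazyTable:
    """Toy stand-in for a lazy table: filter() records a step, count() runs them all."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)
        self.executed = []  # names of the steps that have actually run

    def filter(self, name, predicate):
        # Cheap: only appends a step to the recipe and touches no data.
        new = LazyTable(self._rows, self._steps + ((name, predicate),))
        new.executed = self.executed  # share the log so the demo can inspect it
        return new

    def count(self):
        # Expensive: every recorded step runs now, over every row.
        rows = self._rows
        for name, predicate in self._steps:
            self.executed.append(name)
            rows = [r for r in rows if predicate(r)]
        return len(rows)


table = LazyTable([
    {"DP": 10, "AF": 0.2},
    {"DP": 30, "AF": 0.0005},
    {"DP": 25, "AF": 0.3},
])
recipe = table.filter("depth", lambda r: r["DP"] >= 20).filter("rare", lambda r: r["AF"] < 0.001)
steps_run_before_count = list(recipe.executed)  # nothing has executed yet
n = recipe.count()                              # both filters run here
```

Calling recipe.count() a second time would run both filters again, which is why a slow count usually points at the steps before it rather than at the counting.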


What do read_mt and filter_mt do? The fastest possible filtering will happen when you read from a Hail Matrix Table format (not a VCF) and when you use a filter_rows command that performs equalities or inequalities on the mt.locus. This is generally faster than joining against a table of locus intervals. We should teach Hail to be smarter, but until then, for small tables of intervals (say, fewer than 100), you should do this instead:

intervals = interval_table.interval.collect()
mt = mt.key_rows_by('locus')
mt = mt.filter_rows(
    hl.literal(intervals).any(lambda x: x.contains(mt.locus))
)
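
In plain Python terms, this filter keeps a row when at least one interval contains its locus. The sketch below is a toy model of that check, not Hail's own types: the Interval class and the in_any_interval helper are made up for illustration, and the half-open default (start included, end excluded) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    contig: str
    start: int
    end: int
    includes_start: bool = True
    includes_end: bool = False

    def contains(self, contig, position):
        # A locus on another contig can never fall inside this interval.
        if contig != self.contig:
            return False
        after_start = position > self.start or (self.includes_start and position == self.start)
        before_end = position < self.end or (self.includes_end and position == self.end)
        return after_start and before_end


def in_any_interval(intervals, contig, position):
    # Linear scan over the list: fine for a short list of intervals.
    return any(iv.contains(contig, position) for iv in intervals)


intervals = [Interval("chr1", 29209, 29547), Interval("chr1", 629800, 630038)]
```

Because every row is checked against every interval, the cost grows with the length of the list, which is why this pattern is suggested only for small interval tables.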

Where does gnomAD_ht come from?


When you write ~hl.is_nan(...), are you trying to check for missing values in the gnomad_AF array? NaN (not a number) and missing values are distinct in Hail. You should use ~hl.is_missing(...) to check for missing values. NaN only appears when you divide zero by zero or perform similar invalid floating point operations.
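
As a rough analogy in plain Python (this is not Hail code), a missing value behaves like None, while NaN is an ordinary float value. The is_rare helper below is hypothetical and only shows why the two cases need separate checks.

```python
import math


def is_rare(af, threshold=0.001):
    """Return True only for a present, valid allele frequency below the threshold."""
    if af is None:       # analogue of a missing value: no frequency was recorded
        return False
    if math.isnan(af):   # analogue of NaN: a float left over from an invalid operation
        return False
    return af < threshold


nan = float("inf") - float("inf")  # an invalid float operation yields NaN
```

A check written only for NaN lets None through to the comparison, and a check written only for None lets NaN through, so it matters which of the two the data can actually contain.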


When you write

    filtered_mt = filtered_mt.filter_entries((filtered_mt.GT != hl.Call([0,0], phased=True)) & (filtered_mt.GT != hl.Call([0,0])))

it sounds like you’re trying to keep only the calls that are not homozygous reference, is that right? Use .is_hom_ref() on the call:

filtered_mt = filtered_mt.filter_entries(~filtered_mt.GT.is_hom_ref())
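
One check can replace both comparisons because hom-ref depends only on the alleles, not on phasing. A toy version in plain Python, as an illustration rather than Hail's implementation (the function name and the tuple representation are made up here):

```python
def is_hom_ref(alleles):
    """True when the call has at least one allele and every allele index is 0 (the reference)."""
    return len(alleles) > 0 and all(a == 0 for a in alleles)


# Phasing is tracked separately from the allele indices, so 0/0 and 0|0
# reduce to the same allele tuple in this toy representation.
calls = {"0/0": (0, 0), "0|0": (0, 0), "0/1": (0, 1), "1|1": (1, 1)}
kept = [name for name, alleles in calls.items() if not is_hom_ref(alleles)]
```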

Hail is designed to run efficiently on all inputs massively in parallel. If you have a list of chromosome-chunked VCF files, you should import them together like this:

paths = [os.path.join('PATH', f) for f in sorted(os.listdir('PATH'))]
vcfs = hl.import_vcf(paths, reference_genome='GRCh38')
vcfs.write('.../dataset.mt')

instead of looping over them.


Are you trying to count the total number of non-hom-ref calls passing your filter conditions? Try this instead:

mt = hl.read_matrix_table(...)
# filter to regions of interest
intervals = interval_table.interval.collect()
mt = mt.key_rows_by('locus')
mt = mt.filter_rows(
    hl.literal(intervals).any(lambda x: x.contains(mt.locus))
)
# filter to rare gnomad variants
mt = mt.annotate_rows(g = gnomAD_ht[mt.locus, mt.alleles])
mt = mt.filter_rows(hl.all(
    mt.g.info.AF[0] < 0.001,
    ~hl.is_missing(mt.g.info.AF[0])
))
# count non-hom-ref calls matching our filter criteria
result = mt.aggregate_entries(
    hl.agg.count_where(hl.all(
        mt.DP >= 20,
        ~mt.GT.is_hom_ref()
    ))
)

Thank you very much!
This is really helpful.
But I’m getting this error when I run:

result = ind_mt.aggregate_entries(hl.agg.count_where(hl.all(ind_mt.DP >= 20,~ind_mt.GT.is_hom_ref())))

The error:

---------------------------------------------------------------------------
FatalError                                Traceback (most recent call last)
/tmp/ipykernel_12394/2731096755.py in <module>
      7 # # count hom refs matching our filter criteria
      8 
----> 9 result = ind_mt.aggregate_entries(hl.agg.count_where(hl.all(ind_mt.DP >= 20,~ind_mt.GT.is_hom_ref())))

<decorator-gen-1254> in aggregate_entries(self, expr, _localize)

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/matrixtable.py in aggregate_entries(self, expr, _localize)
   2094         agg_ir = ir.MatrixAggregate(base._mir, expr._ir)
   2095         if _localize:
-> 2096             return Env.backend().execute(agg_ir)
   2097         else:
   2098             return construct_expr(ir.LiftMeOut(agg_ir), expr.dtype)

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
    102             return (value, timings) if timed else value
    103         except FatalError as e:
--> 104             self._handle_fatal_error_from_backend(e, ir)
    105 
    106     async def _async_execute(self, ir, timed=False):

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/backend/backend.py in _handle_fatal_error_from_backend(self, err, ir)
    179         error_sources = ir.base_search(lambda x: x._error_id == err._error_id)
    180         if len(error_sources) == 0:
--> 181             raise err
    182 
    183         better_stack_trace = error_sources[0]._stack_trace

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/backend/py4j_backend.py in execute(self, ir, timed)
     96         # print(self._hail_package.expr.ir.Pretty.apply(jir, True, -1))
     97         try:
---> 98             result_tuple = self._jbackend.executeEncode(jir, stream_codec)
     99             (result, timings) = (result_tuple._1(), result_tuple._2())
    100             value = ir.typ._from_encoding(result)

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1303         answer = self.gateway_client.send_command(command)
   1304         return_value = get_return_value(
-> 1305             answer, self.gateway_client, self.target_id, self.name)
   1306 
   1307         for temp_arg in temp_args:

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/backend/py4j_backend.py in deco(*args, **kwargs)
     29             tpl = Env.jutils().handleForPython(e.java_exception)
     30             deepest, full, error_id = tpl._1(), tpl._2(), tpl._3()
---> 31             raise fatal_error_from_java_error_triplet(deepest, full, error_id) from None
     32         except pyspark.sql.utils.CapturedException as e:
     33             raise FatalError('%s\n\nJava stack trace:\n%s\n'

FatalError: AssertionError: assertion failed

Java stack trace:
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208)
	at is.hail.annotations.ExtendedOrdering$$anon$6.compareNonnull(ExtendedOrdering.scala:190)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.annotations.ExtendedOrdering$$anon$9.compareIntervalEndpoints(ExtendedOrdering.scala:397)
	at is.hail.annotations.IntervalEndpointOrdering.compareNonnull(ExtendedOrdering.scala:431)
	at is.hail.annotations.ExtendedOrdering.compare(ExtendedOrdering.scala:301)
	at is.hail.annotations.ExtendedOrdering$$anon$8.compare(ExtendedOrdering.scala:385)
	at scala.math.Ordering$$anon$2.compare(Ordering.scala:125)
	at java.base/java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
	at java.base/java.util.TimSort.sort(TimSort.java:234)
	at java.base/java.util.Arrays.sort(Arrays.java:1441)
	at scala.collection.SeqLike.sorted(SeqLike.scala:659)
	at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
	at scala.collection.mutable.ArrayOps$ofRef.sorted(ArrayOps.scala:198)
	at scala.collection.SeqLike.sortBy(SeqLike.scala:634)
	at scala.collection.SeqLike.sortBy$(SeqLike.scala:634)
	at scala.collection.mutable.ArrayOps$ofRef.sortBy(ArrayOps.scala:198)
	at is.hail.utils.Interval$.union(Interval.scala:236)
	at is.hail.expr.ir.ExtractIntervalFilters$.extractAndRewrite(ExtractIntervalFilters.scala:178)
	at is.hail.expr.ir.ExtractIntervalFilters$.extractAndRewrite(ExtractIntervalFilters.scala:181)
	at is.hail.expr.ir.ExtractIntervalFilters$.extractPartitionFilters(ExtractIntervalFilters.scala:281)
	at is.hail.expr.ir.ExtractIntervalFilters$.$anonfun$apply$1(ExtractIntervalFilters.scala:299)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.MapIR$.$anonfun$mapBaseIR$1(MapIR.scala:13)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at is.hail.expr.ir.MapIR$.mapBaseIR(MapIR.scala:13)
	at is.hail.expr.ir.ExtractIntervalFilters$.apply(ExtractIntervalFilters.scala:285)
	at is.hail.expr.ir.Optimize$.$anonfun$apply$4(Optimize.scala:25)
	at is.hail.expr.ir.Optimize$.$anonfun$apply$1(Optimize.scala:18)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.Optimize$.runOpt$1(Optimize.scala:18)
	at is.hail.expr.ir.Optimize$.$anonfun$apply$2(Optimize.scala:25)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.Optimize$.apply(Optimize.scala:22)
	at is.hail.expr.ir.lowering.OptimizePass.transform(LoweringPass.scala:30)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$3(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.$anonfun$apply$1(LoweringPass.scala:16)
	at is.hail.utils.ExecutionTimer.time(ExecutionTimer.scala:81)
	at is.hail.expr.ir.lowering.LoweringPass.apply(LoweringPass.scala:14)
	at is.hail.expr.ir.lowering.LoweringPass.apply$(LoweringPass.scala:13)
	at is.hail.expr.ir.lowering.OptimizePass.apply(LoweringPass.scala:26)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1(LoweringPipeline.scala:15)
	at is.hail.expr.ir.lowering.LoweringPipeline.$anonfun$apply$1$adapted(LoweringPipeline.scala:13)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
	at is.hail.expr.ir.lowering.LoweringPipeline.apply(LoweringPipeline.scala:13)
	at is.hail.expr.ir.CompileAndEvaluate$._apply(CompileAndEvaluate.scala:47)
	at is.hail.backend.spark.SparkBackend._execute(SparkBackend.scala:416)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$2(SparkBackend.scala:452)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$3(ExecuteContext.scala:69)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.backend.ExecuteContext$.$anonfun$scoped$2(ExecuteContext.scala:69)
	at is.hail.utils.package$.using(package.scala:640)
	at is.hail.annotations.RegionPool$.scoped(RegionPool.scala:17)
	at is.hail.backend.ExecuteContext$.scoped(ExecuteContext.scala:58)
	at is.hail.backend.spark.SparkBackend.withExecuteContext(SparkBackend.scala:310)
	at is.hail.backend.spark.SparkBackend.$anonfun$executeEncode$1(SparkBackend.scala:449)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.backend.spark.SparkBackend.executeEncode(SparkBackend.scala:448)
	at jdk.internal.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:834)



Hail version: 0.2.95-513139587f57
Error summary: AssertionError: assertion failed

Thanks
Ofer

This is a bug in Hail. I’ll flag it to the team, both for a fix and for a possible workaround.

Can you share one or two of the intervals from intervals? I just want to make sure it’s what I expect.

[Interval(start=Locus(contig=chr1, position=29209, reference_genome=GRCh38), end=Locus(contig=chr1, position=29547, reference_genome=GRCh38), includes_start=True, includes_end=False),
 Interval(start=Locus(contig=chr1, position=629800, reference_genome=GRCh38), end=Locus(contig=chr1, position=630038, reference_genome=GRCh38), includes_start=True, includes_end=False),
 Interval(start=Locus(contig=chr1, position=633869, reference_genome=GRCh38), end=Locus(contig=chr1, position=634124, reference_genome=GRCh38), includes_start=True, includes_end=False),...]

But I see that it does not match the intervals in my file (i.e. the ones the interval table was imported from):

interval_table = hl.import_locus_intervals('PATH', reference_genome='GRCh38')

I’m searching my file for the first interval (start 29209, end 29547) and I can’t find it… weird…

Thank you so much!

Hi,

So what is the fastest / most efficient way to count the number of records in a Hail table?
It seems like count() and count_where() both take a lot of time…

Thanks!

table.count() can take a long time if the pipeline it’s executing is expensive. Since Hail is lazy, when you call count() or write(), the transformations above it (like variant_qc and subsequent filters) are all executed as part of that query. If count() is taking a long time, it’s probably variant_qc or other expensive aggregations that are slow, not the counting!
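
If the same filtered table is needed more than once, one option is to pay for the pipeline a single time by writing the result to disk and continuing from the saved copy. A minimal sketch, assuming ht is a Hail Table and using its checkpoint and count methods; the helper name and the output path are placeholders chosen for illustration:

```python
def materialize_then_count(ht, checkpoint_path):
    """Run the pipeline once, save the result, and count the saved table.

    `ht` is assumed to be a hail.Table; `checkpoint_path` is a hypothetical
    location that the caller is able to write to.
    """
    # checkpoint() writes the table and returns it re-read from disk, so the
    # upstream filters and annotations are executed once, here.
    saved = ht.checkpoint(checkpoint_path, overwrite=True)
    # Later actions on `saved` start from the written data, not from the recipe.
    return saved, saved.count()
```

The first call still costs as much as running the whole pipeline; the saving comes when the returned table is counted, filtered, or exported again afterwards.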

Thanks!

This is what I’m doing:
First, I’m filtering my hail table - filtered_ht = ht.filter(ht.gnomad_AF < 0.001)
Then, I’m counting the records - filtered_ht.count()

The first step is very quick, but the second takes a lot of time!
What is the alternative?

Thanks!