Counting number of variants for each interval

Hi,

I have a list of intervals (the rows in cm_in_genes dataframe), and I’m trying to calculate for each row the number of variants that are found in t. Therefore, I’m iterating over the cm_in_genes table, filtering according to the rows and counting the number of rows after the filtering.

After some iterations, I’m getting this error:
image

Is there another efficient way to do this?
I will appreciate your help :slight_smile:

Thanks in advance,
Ofer

Hey @Ofer ,

You’re creating a huge Hail expression which our compiler is failing to parse. That’s almost certainly the cause of the StackOverflowError [1]. In each loop, you mutate the t variable, so the last iteration is actually applying all your filters. I think what you actually want is this [2]:

for index, row in cm_in_genes.iterrows():
    temp = t.filter(...)
    cm_in_genes.at[index, 'trio_001...'] = temp.count()

The above will only apply the one, correct, filter. However, Hail is not like pandas: the methods on Tables and MatrixTables build a recipe. They do not eagerly compute things. All the computation happens when you call an action function like count, write, or export.

In general, you should fully embrace Hail and express your entire analysis within Hail. That allows Hail to 1) optimize your code across the whole analysis and 2) process the data as few times as possible. Here’s how you can count variants per interval in a more efficient manner:

i = hl.Table.from_pandas(cm_in_genes)
i = i.key_by(interval = hl.locus_interval(
    hl.literal('chr') + hl.str(i.chrom),
    i.CM_start,
    i.CM_end
))
# key_by is a "shuffle" so we need to write and read after it.
i.write("gs://.../my_intervals.ht")
i = hl.read_table("gs://.../my_intervals.ht")

# Hail annoyingly throws away the key after a join, so we
# copy the interval into another field for later use
i = i.annotate(interval_copy = i.interval)
t = t.annotate(the_interval_or_NA = i[t.locus].interval_copy)
t = t.filter(hl.is_defined(t.the_interval_or_NA))
t = t.annotate(the_interval = t.the_interval_or_NA)
t = t.group_by(t.the_interval).aggregate(hl.agg.count())
# save all that hard work to a file
t = t.write("gs://.../count_by_interval.ht")

# peek at the results
t = hl.read_table("gs://.../count_by_inteval.ht")
t.show()

[1] In the future, it’s helpful to include the full stack trace as text in your post, so that we can verify assumptions like this.
[2] In the future, it’s helpful to include the code as text because we can copy it and modify it as necessary.

1 Like

Thanks!

I’m trying to implement the first line -

i = hl.Table.from_pandas(cm_in_genes)

But I’m getting this error:

---------------------------------------------------------------------------
ExpressionException                       Traceback (most recent call last)
/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in raise_for_holes(t)
    239             try:
--> 240                 raise_for_holes(vt)
    241             except ExpressionException as exc:

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in raise_for_holes(t)
    231     if t is None:
--> 232         raise ExpressionException("Hail cannot impute type")
    233     if t in (tbool, tint32, tint64, tfloat32, tfloat64, tstr, tcall):

ExpressionException: Hail cannot impute type

The above exception was the direct cause of the following exception:

ExpressionException                       Traceback (most recent call last)
/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in raise_for_holes(t)
    252         try:
--> 253             raise_for_holes(t.element_type)
    254         except ExpressionException as exc:

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in raise_for_holes(t)
    241             except ExpressionException as exc:
--> 242                 raise ExpressionException(f'cannot impute field {k}') from exc
    243         return

ExpressionException: cannot impute field chrom

The above exception was the direct cause of the following exception:

ExpressionException                       Traceback (most recent call last)
/tmp/ipykernel_13276/3652694852.py in <module>
----> 1 i = hl.Table.from_pandas(cm_in_genes)
      2 # i = i.key_by(interval = hl.locus_interval(
      3 #     hl.literal('chr') + hl.str(i.chrom),
      4 #     i.CM_start,
      5 #     i.CM_end

<decorator-gen-1184> in from_pandas(df, key)

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/table.py in from_pandas(df, key)
   3406                 hl_type_hints[field] = type_hint
   3407 
-> 3408         new_table = hl.Table.parallelize(data, partial_type=hl_type_hints)
   3409         return new_table if not key else new_table.key_by(*key)
   3410 

<decorator-gen-1106> in parallelize(cls, rows, schema, key, n_partitions, partial_type)

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
    575     def wrapper(__original_func, *args, **kwargs):
    576         args_, kwargs_ = check_all(__original_func, args, kwargs, checkers, is_method=is_method)
--> 577         return __original_func(*args_, **kwargs_)
    578 
    579     return wrapper

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/table.py in parallelize(cls, rows, schema, key, n_partitions, partial_type)
    534         if partial_type is not None:
    535             partial_type = hl.tarray(hl.tstruct(**partial_type))
--> 536         rows = to_expr(rows, dtype=dtype, partial_type=partial_type)
    537         if not isinstance(rows.dtype.element_type, tstruct):
    538             raise TypeError("'parallelize' expects an array with element type 'struct', found '{}'"

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in to_expr(e, dtype, partial_type)
    273             raise TypeError("expected expression of type '{}', found expression of type '{}'".format(dtype, e.dtype))
    274         return e
--> 275     return cast_expr(e, dtype, partial_type)
    276 
    277 

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in cast_expr(e, dtype, partial_type)
    279     assert dtype is None or partial_type is None
    280     if not dtype:
--> 281         dtype = impute_type(e, partial_type)
    282     x = _to_expr(e, dtype)
    283     if isinstance(x, Expression):

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in impute_type(x, partial_type)
    128 def impute_type(x, partial_type=None):
    129     t = _impute_type(x, partial_type=partial_type)
--> 130     raise_for_holes(t)
    131     return t
    132 

/cs/labs/michall/ofer.feinstein/my_env/lib/python3.7/site-packages/hail/expr/expressions/base_expression.py in raise_for_holes(t)
    253             raise_for_holes(t.element_type)
    254         except ExpressionException as exc:
--> 255             raise ExpressionException('cannot impute array elements') from exc
    256         return
    257     if isinstance(t, tdict):

ExpressionException: cannot impute array elements

I’m using Hail version 0.2.95-513139587f5…
Do you know the reason for this error?

Thanks!

This means that Hail cannot guess the type of some column of the cm_in_genes pandas DataFrame. Probably a column with only NAs?

If you have this interval list as a UCSC BED file, you can use Hail’s hl.import_bed directly. If it’s some other file, you could try hl.import_table.

Or you could remove the pandas column that are all NAs, NaNs, or empty lists so that Hail can guess the type correctly. If you share the first few rows of the pandas data frame here I can probably identify the problematic column.

1 Like