How to parse CSQ (VEP) field inside Hail 0.2?

Forgive me for the long post. :sweat:
I want to calculate MAC grouped_by gene (mt.sample_qc.n_singleton()) for several subsets of variants and filtering, I must use specific field (for example “Consequence”, “CANONICAL” and “Gene”) for canonical transcripts. How can I do it inside Hail 0.2? I have tried to hl.annotate_cols() with extracting fields from transcripts, that had passed mt.info.CSQ.filter(), but did not succeed at every stage.
This is a casual string of CSQ (VEP-annotation) for each transcript per variant:

[“C|missense_variant|MODERATE|AGRN|ENSG00000188157|Transcript|ENST0000037…”]

This is my hardcoding logic, that did not fit:

consequence = 1
canonical = 23
gene = 4
correct_transcript = hl.eval(mt.info.CSQ.filter(lambda x: all([hl.eval(hl.eval(x.split(’|’))[consequence]) == ‘missense_variant’, hl.eval(hl.eval(x.split(’|’))[canonical]) == ‘YES’)))
mt = annotate_rows(gene_name = hl.eval(correct_transcript[gene]))
hl.agg.group_by(mt.gene_name, hl.agg.sum(mt.sample_qc.n_singleton))

mt.describe()
----------------------------------------
Row fields:

‘info’: struct {

CSQ: array<str>
}
----------------------------------------
mt.info.CSQ
<ArrayExpression of type array<str>>

  1. The first issue is that it is not clear how to handle the ArrayExpression of StringExpression's.
  2. How to split in this case StringExpression (I suppose) and extract special field by a separator from a string.
  3. It’s not clear how to apply the ArrayExpression.filter() or the alternative way in this case.
  4. Should I create an additional variable (hl.annotate_rows()) with values from parsed CSQ-field necessarily?
  5. How to exclude variants from dataset if correct_transcript doesn’t pass the condition-filter (None or empty [])?
  6. Can I use hl.eval() for a sequence of conversions or I should hardcode nested hl.eval(hl.eval(...)) for each operation?
  7. Maybe it can be done somehow easier?

Dealing with VEP schemas is one of the most difficult problems to solve in Hail (or outside of Hail, I’m sure) so you should not be concerned about having trouble with this!

One of the first points to make is that hl.eval is something that won’t appear in most pipelines that process tables or matrix tables – it’s used in our docs to print the results of simple expressions for teaching purposes, and I think that’s probably a bad idea. Instead, you’ll be passing expressions into annotate_rows, which knows how to apply those transformations to every row of a table / MatrixTable.

Here’s my attempt to write the VEP parser you’ve described:

consequence = 1
canonical = 23
gene = 4
split_csq = mt.info.CSQ  # it looks like this is already an array?
split_transcripts = split_csq.map(lambda x: x.split('|'))

kept_transcripts = split_transcripts.filter(
    lambda tx: (tx[consequence] == 'missense_variant') & (tx[canonical] == 'YES'))

mt = mt.annotate_rows(kept_transcripts = kept_transcripts)
# there can be more than one canonical transcript. If you want each to be treated separately, we can use explode. Otherwise, can use `head()`
mt = mt.explode_rows(mt.kept_transcripts) # option 1
mt = mt.annotate_rows(kept_transcripts = mt.kept_transcripts.head())) # option 2

# it looks like you want number of singleton variants per gene. This is just an aggregation of the rows.
mt_rows = mt.rows()
gene_ht = mt_rows.group_by(mt_rows.kept_transcripts.gene) \
    .aggregate(n_singleton = hl.agg.count_where(mt_rows.variant_qc.AC[1] == 1))

Thank you!

  1. I assume that the “split” of hail supports regex, so we should use escaping of special symbols here:
    split_csq.map(lambda x: x.split('\|')).
  2. How to omit empty lists, which I got from split_transcripts.filter() ?
locus	alleles	
locus<GRCh37>	array<str>	array<array<str>>
1:957584	["C","A"]	[]
1:957593	["C","T"]	[["T","missense_variant","MODERATE","AGRN","ENSG00000188157","Transcript"...
1:957604	["G","A"]	[]
1:957624	["G","A"]	[["A","missense_variant","MODERATE","AGRN","ENSG00000188157","Transcript"...
1:957633	["T","C"]	[["C","missense_variant","MODERATE","AGRN","ENSG00000188157","Transcript"...
  1. I suppose mt_rows.kept_transcripts doesn’t contain “gene” field, and I need to annotate_rows() before.

You may find the overview a more accessible and broad introduction to Hail than the reference documentation.

  1. Yes, the pipe should be escaped.
  2. Table.filter removes rows of a table, try filtering on hl.len(table.kept_transcripts[0]) > 0
  3. kept_transcipts is a field of a MatrixTable, or, after you call .rows(), a field of a Table, but as you’ve noticed it’s an array of arrays of strings. If these inner arrays have interesting structure rather than being arbitrary length collections of values, then you should consider converting the inner arrays to Hail Structs:
t.annotate(kept_transcripts=t.kept_transcripts.map(lambda x:
    hl.struct(alt=x[0], type=x[1], severity=x[2], gene=x[3], tid=x[4], ...)))

It’s still not clear for me how I should apply hl.len(table.kept_transcripts[0]) > 0 to ArrayExpression of ArrayExpressions.

# I made a bool-mask for next filtering:
(hl.len(kept_transcripts) > 0).show(5)

locus	alleles	
locus<GRCh37>	array<str>	bool
1:957624	["G","A"]	false
1:957640	["C","T"]	true
1:957677	["C","T"]	false
1:957742	["C","T"]	true
1:957743	["C","G"]	false

# Even so, I don't know how to apply it. I tried to use your solution,
# but I still get unfiltered data:
kept_transcripts.filter(lambda tx: hl.len(tx) > 0)

locus	alleles	
locus<GRCh37>	array<str>	array<array<str>>
1:957624	["G","A"]	[]
1:957640	["C","T"]	[["T","synonymous_variant","LOW","AGRN","ENSG00000188157","Transcript","E...
1:957677	["C","T"]	[]
1:957742	["C","T"]	[["T","synonymous_variant","LOW","AGRN","ENSG00000188157","Transcript","E...
1:957743	["C","G"]	[]

OK, so, Tim used a somewhat advanced feature of Hail: expressions. I think the subtle difference between expressions and MatrixTables/Tables is causing confusion.

An expression is a hail value or a combination of hail values. Some simple examples:

x = hl.literal(3) # literal converts a python value to a hail value
a_str = hl.literal("abc")
y = x * 2 # python values are automatically converted in many cases
len_of_a_str = hl.len(a_str)

A key aspect of Hail is that expressions can be keyed. An keyed expression is an element of a Table or MatrixTable.

t = hl.utils.range_table(10) # a table with a field idx with values 0 to 9
t # a Table
t.idx # a Table-keyed expression

You can show a Table or an expression. The difference is more clear when a Table has many columns:

import hail as hl
t = hl.utils.range_table(10) # a table with a field idx with values 0 to 9
t = t.annotate(idx_sqr = t.idx * t.idx,
               idx_cube = t.idx * t.idx * t.idx)
t.show()
t.idx.show()
t.idx_sqr.show()
(t.idx_sqr < 9).show()
t.describe()

Run the above. You’ll see that Table.show shows all the fields on a table. Expression.show shows the expressed value and the key, if it exists. t.idx is its own key, so it is printed alone. t.idx_sqr's key is t.idx, so both are printed. Same for the boolean expression.

MatrixTables generalize a Table into two dimensions: there is both a row-key and a column-key. Row-keyed values are called “row fields”. Column-keyed values are called “column fields”. Row and column keyed values are called “entry fields”.

mt = hl.utils.range_matrix_table(3, 3)
mt = mt.annotate_entries(product = mt.row_idx * mt.col_idx,
                         sum = mt.row_idx + mt.col_idx)
mt = mt.annotate_rows(row_idx_sqr = mt.row_idx * mt.row_idx)
mt = mt.annotate_cols(col_idx_sqr = mt.col_idx * mt.col_idx)
mt.show()
mt.col_idx_sqr.show()
mt.row_idx_sqr.show()
mt.product.show()
(mt.product == mt.sum).show()

Run the above. You’ll see again how expressions with a variety of keys are displayed.


Ok, let’s rewrite Tim’s code to avoid saving expressions in python variables. We’ll only save table or matrix tables in python variables. I think this will avoid confusion.

consequence = 1
canonical = 23
gene = 4

mt = mt.annotate_rows(kept_transcripts =
    mt.info.CSQ.map(
        lambda x: x.split('|')
    ).filter(
        lambda tx: (tx[consequence] == 'missense_variant') & (tx[canonical] == 'YES')
    )
)
# there can be more than one canonical transcript. If you want each to be treated separately, we can use explode. Otherwise, can use `head()`
mt = mt.explode_rows(mt.kept_transcripts) # option 1
mt = mt.annotate_rows(kept_transcripts = mt.kept_transcripts.head())) # option 2

# it looks like you want number of singleton variants per gene. This is just an aggregation of the rows.
mt_rows = mt.rows()
gene_ht = mt_rows.group_by(mt_rows.kept_transcripts.gene) \
    .aggregate(n_singleton = hl.agg.count_where(mt_rows.variant_qc.AC[1] == 1))

OK, so, let’s remove the last line and replace it with this:

mt_rows = mt_rows.filter(hl.len(mt_rows.kept_transcripts) >= 1)
mt_rows = mt_rows.annotate(kept_transcripts =
    mt_rows.kept_transcripts.map(
        lambda x: hl.struct(alt=x[0], type=x[1], severity=x[2], gene=x[3], tid=x[4], rest=x[5:])
    )
)

I filter to rows that have at least one transcript. Then I convert each transcript in the kept_transcripts array (a field of the Table mt_rows) to a struct that has some friendly names for its elements. All the elements that I chose not to name I kept as an array and gave the name rest.

Ok, one more thing, there can be multiple transcripts per locus and alleles because genes can overlap. So let’s copy every row for each transcript it has, converting the kept_transcripts array-of-structs field to a struct field.

mt_rows = mt_rows.explode(mt_rows.kept_transcripts)

Now we can group_by:

gene_ht = mt_rows.group_by(mt_rows.kept_transcripts.gene) \
    .aggregate(n_singleton = hl.agg.count_where(mt_rows.variant_qc.AC[1] == 1))

Thank you! Everything is working fine!
I have a last optional issue.
Can I make one object (mt) which contain several different results of filtering? For example:

  1. kept_transcripts_synonymic
  2. kept_transcripts_missense,
  3. kept_transcripts_LoF

yes, absolutely - you can add multiple lines as shown here:

mt = mt.annotate_rows(kept_transcripts_synonymous =
    mt.info.CSQ.map(
        lambda x: x.split('|')
    ).filter(
        lambda tx: (tx[consequence] == 'synonymous') & (tx[canonical] == 'YES')
    ),
    kept_transcripts_missense =
    mt.info.CSQ.map(
        lambda x: x.split('|')
    ).filter(
        lambda tx: (tx[consequence] == 'synonymous') & (tx[canonical] == 'YES')
    ),
    ...
)