Substring match in array<str> info column

I have an MT (MatrixTable) that looks something like this:

+---------------+---------------+----------------------+
| locus         | alleles       | info.CLNDISDB        |
+---------------+---------------+----------------------+
| locus<GRCh37> | array<str>    | array<str>           |
+---------------+---------------+----------------------+
| 2:47630512    | ["A","AG"]    | ["A", "A|B"]         |
| 2:47690234    | ["T","TAATG"] | ["B"]                |
| 2:47693860    | ["T","TA"]    | ["A"]                |
| 2:47705430    | ["TTAA","T"]  | ["B|C", "C|D"]       |
| 2:48026310    | ["C","CTA"]   | ["B", "A|C"]         |
+---------------+---------------+----------------------+

I want to filter all the rows that contain the string 'B'. My current query uses the .contains() function as follows:

mt.filter_rows(
    (~hl.is_missing(mt.info['CLNDISDB'])) &
    (mt.info['CLNDISDB'].contains('B'))
)

But I only get rows 2 and 5, when I want all 4 rows that match 'B' somewhere in the info (i.e. rows 1, 2, 4 and 5). Is there a way to do a partial match with the .contains() function? Say, a substring or regex match against a string array expression?

So you don’t want to be calling the ArrayExpression.contains method, which checks whether any element of the array exactly matches the argument. You want to call the StringExpression.contains method on each element of the array. I think mt.info['CLNDISDB'].any(lambda element: element.contains("B")) should do it.
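
To make the difference concrete, here is a plain-Python analogue of the two checks, run over the CLNDISDB values from the table in the question. It does not use Hail at all; the dict and variable names are just for illustration, and the comments note which Hail method each line is meant to mimic.

```python
# Plain-Python analogue of the table in the question (no Hail needed).
# Keys are the 1-based row numbers, values are the info.CLNDISDB arrays.
rows = {
    1: ["A", "A|B"],
    2: ["B"],
    3: ["A"],
    4: ["B|C", "C|D"],
    5: ["B", "A|C"],
}

# Mimics ArrayExpression.contains('B'):
# true only if some element equals "B" exactly.
exact = [n for n, vals in rows.items() if "B" in vals]

# Mimics .any(lambda element: element.contains("B")):
# true if some element has "B" as a substring.
substring = [n for n, vals in rows.items() if any("B" in v for v in vals)]

print(exact)      # [2, 5]
print(substring)  # [1, 2, 4, 5]
```

On the Hail side, the filter from the question would then presumably become mt.filter_rows(hl.is_defined(mt.info['CLNDISDB']) & mt.info['CLNDISDB'].any(lambda element: element.contains('B'))), keeping the same missingness check as before.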

Yeah, this is exactly what I need for this particular use case. I’m guessing we can also apply some form of regex on the strings too right? Instead of just substring matches?

Yup, instead of StringExpression.contains you can use StringExpression.matches. See the Hail documentation for StringExpression.
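
As a rough sketch of where a regex helps, the plain-Python example below (using the standard re module, not Hail) matches "B" only when it is a whole pipe-delimited token. Row 6 is a hypothetical value that is not in the original table; it is added only to show a case where a substring match and the regex disagree. The pattern itself is an assumption about what you might want to match. Hail evaluates regexes with Java syntax, so check the StringExpression.matches docs for whether it matches a substring or the full string by default before reusing a pattern there.

```python
import re

# Rows from the question, plus a hypothetical row 6 (not in the original
# table) where "B" appears only inside a longer token.
rows = {
    1: ["A", "A|B"],
    2: ["B"],
    3: ["A"],
    4: ["B|C", "C|D"],
    5: ["B", "A|C"],
    6: ["AB|C"],  # hypothetical, for illustration only
}

# "B" as a whole token: preceded by start-of-string or "|",
# and followed by "|" or end-of-string.
token_b = re.compile(r"(^|\|)B(\||$)")

# Keep rows where at least one array element matches the pattern.
regex_rows = [n for n, vals in rows.items()
              if any(token_b.search(v) for v in vals)]

# For comparison: a plain substring check also picks up row 6.
substring_rows = [n for n, vals in rows.items()
                  if any("B" in v for v in vals)]

print(regex_rows)      # [1, 2, 4, 5]
print(substring_rows)  # [1, 2, 4, 5, 6]
```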