Converting array<array<str>> expression to array<str>

jeji29 · October 19, 2021, 12:15pm

Hello, Hail beginner here!
I am working with the 'transcript_consequences' field inside the vep row field:

 transcript_consequences: array<struct {
            allele_num: int32, 
            amino_acids: str, 
            appris: str, 
            biotype: str, 
            cadd_phred: float64, 
            cadd_raw: float64, 
            canonical: int32, 
            ccds: str, 
            cdna_start: int32, 
            cdna_end: int32, 
            cds_end: int32, 
            cds_start: int32, 
            codons: str, 
            consequence_terms: array<str>, 
            distance: int32, 
            domains: array<struct {
                db: str, 
                name: str
            }>

I only wanted to see the consequence_terms, so I annotated them as a new row field:

d = d.annotate_rows(consequence_terms= d.vep.transcript_consequences['consequence_terms'])

Now all the values in my new 'consequence_terms' row field are array<array<str>> expressions, such as [['missense_variant']].
This is causing me a problem, because I want to know whether these variants are 'loss of function' variants.

I have tried:

LoF_mutation = hl.array(['stop_gained', 'frameshift_variant', 'splice_region_variant', 'splice_acceptor_variant','splice_donor_variant', 'missense_variant'])
d = d.annotate_rows(is_LoF = hl.if_else(LoF_mutation.contains(d.consequence_terms), 'Y', 'N'))

But it shows the error:
HailException: no conversion found for contains(, array<str>, array<array<str>>) => bool.
Is there any way to convert my array<array<str>> values into a simpler expression, such as a plain array<str> or a list?
The nesting also seems unnecessary, because all my consequence_terms values just have two pairs of brackets around them.

Thanks in advance

danking · October 19, 2021, 1:58pm

Hey @jeji29 ,

I’m sorry to hear you’re having trouble with Hail! The direct answer to the question in your post title is flatten:

d.vep.transcript_consequences['consequence_terms'].flatten()
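
To make the effect of flatten concrete, here is a plain-Python analogy (this is not Hail code; the helper name flatten_terms and the sample data are made up for illustration). Each inner list stands for the consequence terms of one transcript, and flattening concatenates them into a single list:

```python
from itertools import chain


def flatten_terms(per_transcript_terms):
    """Concatenate the per-transcript term lists into one flat list."""
    return list(chain.from_iterable(per_transcript_terms))


# Hypothetical variant with two transcripts.
nested = [["missense_variant"], ["stop_gained", "splice_region_variant"]]
flat = flatten_terms(nested)
print(flat)  # ['missense_variant', 'stop_gained', 'splice_region_variant']
```

A value like [['missense_variant']] therefore becomes ['missense_variant'], which is the array<str> shape you were asking about.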

However, I think you actually need to use set intersection here. You have a list of loss of function consequences (LoF_mutation) and you have a list of all the consequences across all the transcripts of this variant (d.vep.transcript_consequences['consequence_terms']). If you want to know if there is at least one loss of function consequence you can use this:

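# One flat array<str> holding every consequence term across all transcripts
all_consequences = d.vep.transcript_consequences['consequence_terms'].flatten()

d = d.annotate_rows(
    # Convert both arrays to sets; a non-empty intersection means at least one LoF term
    is_LoF = hl.len(hl.set(LoF_mutation).intersection(hl.set(all_consequences))) != 0
)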
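
If it helps to see the logic outside Hail, here is a rough plain-Python sketch of the same check. It is only an analogy: LOF_TERMS copies the list from your post, and has_lof is a hypothetical helper name, not a Hail function.

```python
# Terms copied from the LoF_mutation list in the question.
LOF_TERMS = {
    "stop_gained",
    "frameshift_variant",
    "splice_region_variant",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "missense_variant",
}


def has_lof(per_transcript_terms):
    """Return True if any term from any transcript is in LOF_TERMS."""
    # Flatten the nested lists into one set of distinct terms.
    all_terms = {term for terms in per_transcript_terms for term in terms}
    # A non-empty intersection means at least one matching term.
    return len(LOF_TERMS & all_terms) != 0


print(has_lof([["missense_variant"]]))                        # True
print(has_lof([["synonymous_variant"], ["intron_variant"]]))  # False
```

The set intersection asks "do these two collections share any element?", which is why it works here where contains did not: contains checks for one element, not for overlap between two collections.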

I also want to point you at one more resource, the gnomAD Python library. It contains a function called process_consequences which finds the most severe consequence for you:

from gnomad.utils.vep import process_consequences
d = process_consequences(d)
d.vep.worst_consequence_term.describe()

--------------------------------------------------------
Type:
        str
--------------------------------------------------------
Source:
    <hail.table.MatrixTable object at 0x7fc41357c050>
Index:
    ['row']
--------------------------------------------------------
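
For intuition only, the idea of "most severe consequence" can be sketched in plain Python as picking the term that ranks highest in a severity ordering. This is not the gnomAD implementation: SEVERITY_ORDER below is a short illustrative subset that I am assuming for the example, and worst_consequence is a hypothetical helper.

```python
# Illustrative subset, most severe first (an assumption for this sketch).
SEVERITY_ORDER = [
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "missense_variant",
    "splice_region_variant",
    "synonymous_variant",
]


def worst_consequence(per_transcript_terms):
    """Return the most severe known term, or None if no term is ranked."""
    rank = {term: i for i, term in enumerate(SEVERITY_ORDER)}
    # Flatten and keep only the terms that appear in the ranking.
    known = [t for terms in per_transcript_terms for t in terms if t in rank]
    return min(known, key=rank.__getitem__) if known else None


terms = [["missense_variant"], ["stop_gained", "splice_region_variant"]]
print(worst_consequence(terms))  # stop_gained
```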

jeji29 · October 20, 2021, 1:06pm

Thanks! The flatten() and intersection() functions solved my problem.
