How can one get a list (set) of unique values from an annotation?

Hi,

I’m trying to find out all the values va.vep.transcript_consequences.lof takes in my variant dataset, but I can’t find how.

I think an expression such as 'va.vep.transcript_consequences.map(t => t.lof).toSet()' would work, but which function should I pass it to?

Thanks,

Stephane

I think you want query_variants here:

# Collect the distinct lof values across all transcripts of all variants.
csq_set = vds.query_variants(
    '''variants
           .flatMap(v => va.vep.transcript_consequences.map(t => t.lof))
           .collectAsSet()''')
print(csq_set)

Alternatively, you can get out the counts of each unique value:

from collections import Counter

# counter() returns a mapping from each distinct value to its count.
csqs = vds.query_variants(
    '''variants
           .flatMap(v => va.vep.transcript_consequences.map(t => t.lof))
           .counter()''')
print(Counter(csqs).most_common())

Yes, query_variants is what I was looking for, thank you.

I still don’t fully understand how the expressions work though… neither of the following (getting the set of chromosomes) works:

vds.query_variants('v.map(v => v.contig).collectAsSet()')
vds.query_variants('''variants.flatMap(v => v.contig).collectAsSet()''')

What am I doing wrong?

The expression language is extremely confusing. query_variants exposes one top-level object, variants, which is an Aggregable. Aggregables are unordered distributed collections of things, like rows or columns of the VDS or its annotation tables.

The most confusing thing about them is that they carry an implicit “scope” around – extra variables you can access for free and can’t map away. In query_variants, the variants aggregable is an Aggregable[Variant] that has v and va in its scope.

Aggregables support ‘aggregator’ operations like count, collect, stats, counter, and more. These functions work on the elements in the aggregable, so usually you’ll need to change the elements with map, filter, and flatMap. The difference between map and flatMap is that map changes elements one-to-one, while flatMap can change the number of elements in the Aggregable because the function supplied returns an array.
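To make the map versus flatMap distinction concrete, here is a rough analogy in plain Python. It does not use Hail at all; the `variants` list and its `lofs` field are made-up stand-ins for the variants aggregable and va.vep.transcript_consequences, and the list comprehensions only approximate what the aggregable methods do.

```python
from collections import Counter

# Hypothetical stand-in data: each "variant" carries a list of per-transcript
# lof values, loosely mirroring va.vep.transcript_consequences.
variants = [
    {"contig": "1", "lofs": ["HC", "LC"]},
    {"contig": "1", "lofs": []},
    {"contig": "2", "lofs": ["HC"]},
]

# map: exactly one output element per input element.
contigs = [v["contig"] for v in variants]

# flatMap: each input yields a collection and the results are concatenated,
# so the number of elements can grow or shrink.
lofs = [lof for v in variants for lof in v["lofs"]]

contig_set = set(contigs)   # roughly what collectAsSet() gives you
lof_counts = Counter(lofs)  # roughly what counter() gives you

print(contig_set)
print(lof_counts)
```

In this picture, three variants map to three contigs, but flatMap over the transcript lists also yields three lof values only by coincidence: one variant contributes two, one contributes none, and one contributes one.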

For the contigs, you’ll want to use map, not flatMap, because each variant has exactly one contig. If you swap that out in your second line, it’ll work!

Your first line is incorrect because v is not a top-level variable in query_variants; it is only available inside the scope of variants.
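Putting both corrections together, the contig query would look something like the sketch below. The `contig_set` helper and the `CONTIG_QUERY` name are just for illustration (they are not part of Hail), and `vds` is assumed to be the same VariantDataset used earlier in the thread.

```python
# Start from the top-level `variants` aggregable and use map, since the
# function returns a single contig per variant rather than an array.
CONTIG_QUERY = 'variants.map(v => v.contig).collectAsSet()'


def contig_set(vds):
    """Return the set of contigs; `vds` is assumed to be a Hail VariantDataset."""
    return vds.query_variants(CONTIG_QUERY)
```

You would call it as `contig_set(vds)`; the one-liner `vds.query_variants('variants.map(v => v.contig).collectAsSet()')` is equivalent.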

Thank you, that is very helpful. 🙂