Getting info from variant list / using python variables

Stephane_Bourgeois · October 24, 2017, 11:23am

Hi,

I got the list of variants from a LD matrix:

ld_mat = vds.filter_intervals(Interval.parse('1:START-3M')).ld_matrix()
variants = ld_mat.variant_list()

I’m trying to get va.rsid for each of these variants.

I couldn’t find an efficient way to use, for example, Variant(contig=1, start=904165, ref=G, alts=[AltAllele(ref=G, alt=A)]), which is variants[0], to select a specific variant, and used the following instead:

rsid = vds.filter_variants_expr('v.contig=="1" && v.start==904165 && v.ref=="G" && v.alt=="A"').query_variants('variants.map(v => va.rsid).collect()')[0]

Is there a way to use the variant definition directly, without having to match each of its sub-elements?

My second question is about using Python variables in Hail expressions. I don’t see how to do that. For example:

for i,x in enumerate(variants):
    chr = x.contig
    pos = x.start
    ref = x.ref
    alt = x.alt()
    rsid = vds.filter_variants_expr('v.contig==chr && v.start==pos && v.ref==ref && v.alt==alt').query_variants('variants.map(v=> va.rsid).collect()')[0]
    print "rsid of variant " + str(i+1) + " is " + rsid

tpoterba · October 24, 2017, 2:03pm

I don’t have time to respond fully but I think I can help get you back on track!

The filter_variants_list method should let you select a specific variant much more easily. It’s also way faster!

Stephane_Bourgeois · October 24, 2017, 2:39pm

Yes, this is what I was looking for. One more quick question, though: the documentation doesn’t say (or I didn’t find it) whether the output follows the same order as the input variant list. Does it?

tpoterba · October 24, 2017, 3:03pm

Nope, the order of results following the collect aggregator is actually not guaranteed (it can change run to run)! The best thing to do is collect the variant too.

danking · October 24, 2017, 6:52pm

For your second question, you can use string interpolation, but keep in mind that every call to query_variants is a separate Hail job. Your code will be much more efficient if you can phrase your queries completely in Hail, for example, using filter_variants_list.
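A minimal sketch of the string interpolation mentioned here, assuming the field names used in the original question (v.contig, v.start, v.ref, v.alt). variant_filter_expr is a hypothetical helper, not a Hail function; it only builds the expression string, so it runs without Hail:

```python
# Sketch only: builds the expression string; no Hail call is made here.
# variant_filter_expr is a hypothetical helper name, not part of Hail.

def variant_filter_expr(contig, start, ref, alt):
    """Return a filter expression with the Python values spliced in."""
    # String fields need double quotes inside the expression;
    # start is an integer and is inserted without quotes.
    return ('v.contig == "{0}" && v.start == {1} && '
            'v.ref == "{2}" && v.alt == "{3}"').format(contig, int(start), ref, alt)


# Values taken from the first variant in the original question.
expr = variant_filter_expr("1", 904165, "G", "A")
print(expr)
```

The resulting string could then be passed to vds.filter_variants_expr(expr) as in the original post, at the cost of one Hail job per call.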

Stephane_Bourgeois · October 25, 2017, 11:30am

Thank you both, filter_variants_list and string interpolation allow me to get what I need, though maybe not in the most efficient way:

I didn’t find out how to collect more than one item at a time: vds.filter_variants_list(ldvariants).query_variants('variants.map(v => va.rsid).collect()')
How can I collect v and rsid at once, so I can later match them?

From a list of variants (such as ld_mat.variant_list()), is there a function doing the opposite of Variant.parse(), i.e. turning the variant into a chr:start:ref:alt string?

Here is an example of what I am doing, the final goal is to appropriately name the rows and columns of ld_matrix().to_local_matrix().toArray()

vds = hc.read('data/1kg.vds')
tu = ("1","START","3M")
ld_mat = vds.filter_intervals(Interval.parse('{0}:{1}-{2}'.format(*tu))).ld_matrix()
ldvariants = ld_mat.variant_list()
variant_df = vds.filter_variants_list(ldvariants).variants_table().to_pandas()
ids = {}
for r in range(0,len(variant_df.index)):
    vid = str(variant_df.iloc[r]["v.contig"]) + ":" + str(variant_df.iloc[r]["v.start"]) + ":" + str(variant_df.iloc[r]["v.altAlleles"][0][0]) + ":" + str(variant_df.iloc[r]["v.altAlleles"][0][1])
    rsid = variant_df.iloc[r]["va.rsid"]
    ids[vid] = rsid

tpoterba · October 25, 2017, 11:53am

How can I collect v and rsid at once, so I can later match them?

You can construct structs in the expression language with this syntax: {a: 5, b: "hello"}. So:

variants_dict = { 
  x.v: x.rsid for x in 
  vds.filter_variants_list(ldvariants)\
        .query_variants('variants.map(v => {v: str(v), rsid: va.rsid}).collect()') 
}
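Because the order of the collected results is not guaranteed, keying them by variant is what makes later matching safe. A small sketch of that idea, using a namedtuple as a stand-in for the structs returned by collect() (the Row type, the second position, and both rsid values are made-up placeholders, not real data):

```python
from collections import namedtuple

# Stand-in for the collected structs; in practice these come from Hail.
Row = namedtuple("Row", ["v", "rsid"])

# Placeholder result, deliberately NOT in the same order as the input list.
collected = [
    Row(v="1:904739:G:A", rsid="rs_second"),
    Row(v="1:904165:G:A", rsid="rs_first"),
]

# Same shape as the dict comprehension above: variant string -> rsid.
rsid_by_variant = {row.v: row.rsid for row in collected}

# Look the rsids up in whatever order the input variant list had.
input_order = ["1:904165:G:A", "1:904739:G:A"]
rsids_in_input_order = [rsid_by_variant[v] for v in input_order]
print(rsids_in_input_order)
```

The lookup depends only on the keys, so it gives the same answer however the collected rows happen to be ordered on a given run.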

From a list of variants (such as ld_mat.variant_list()), is there a function doing the opposite of Variant.parse(), i.e. turning the variant into a chr:start:ref:alt string?

I think the __str__ method of Variant produces that. So in either python or the expression language, str(v).

tpoterba · October 25, 2017, 11:59am

Looking at your posted code, I think I can offer an improvement. In 0.1, converting objects from Java to Python or vice versa is extremely slow. This means that filter_variants_list with a bunch of variants, or to_pandas, can be pretty terrible.

Instead of using filter_variants_list with the variant list from the LD matrix, can’t you just filter_intervals again? That should have the same set of variants and will be much faster.

And instead of bringing the table down to pandas, use the aggregator I posted above!

Stephane_Bourgeois · October 25, 2017, 12:47pm

Oh… I see, very convenient, and much easier than what I was doing, and indeed, thanks to your aggregator structure, I don’t need the table anymore.

str(variant) does indeed produce chr:start:ref:alt! Fantastic, and so simple.

As for your point about Java to Python: for what I need to do (regional plots), I need to get the LD matrix into pandas, no way around that. But to construct the dictionary, I could indeed use filter_intervals (some variants are dropped during LD calculations, but having a few extra variants in the dictionary won’t significantly impact the performance). From now on I’ll try to limit crossing between Java and Python as much as I can, thanks. Do you foresee any improvement on that side in a future version, or is it a limitation that can’t be overcome? (Just curious, as I don’t really get how this works.)
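For the row and column names, a rough sketch of how a dictionary with a few extra variants could still be used, assuming the LD variants have already been turned into chr:start:ref:alt strings with str(). axis_labels is a hypothetical helper, and all values below are made-up placeholders:

```python
# Sketch only: axis_labels is a hypothetical helper, not part of Hail or pandas.

def axis_labels(ld_variant_strings, rsid_by_variant):
    """Return one label per LD variant, in the LD matrix order.

    Falls back to the variant string when no rsid is known for it.
    """
    return [rsid_by_variant.get(v) or v for v in ld_variant_strings]


# Placeholder dictionary: a superset of the LD variants, one rsid missing.
rsid_by_variant = {
    "1:904165:G:A": "rs_first",
    "1:904739:G:A": None,        # variant present, rsid annotation missing
    "1:905000:C:T": "rs_extra",  # extra variant not in the LD matrix
}
ld_variants = ["1:904165:G:A", "1:904739:G:A"]

labels = axis_labels(ld_variants, rsid_by_variant)
print(labels)
```

Extra entries in the dictionary are simply never looked up, and the labels follow the order of the LD variant list, so they line up with the matrix rows and columns.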

Just in case I find myself in the situation where I need the variant set to match exactly a list of variants from an interval, would filter_intervals() before filter_variants_list() help?

vds.filter_intervals(Interval.parse('{0}:{1}-{2}'.format(*tu))).filter_variants_list(ldvariants)

One last question, is it possible to get ld_matrix() to output r2 instead of r?

Thanks so much for your help.

tpoterba · October 25, 2017, 1:38pm

Do you foresee any improvement on that side in a future version

Yes! In 0.2, this should be essentially free. What we’re doing now is using py4j to communicate objects over a super slow socket. In 0.2, we’ll use shared memory to pass a reference over that slow socket, requiring only a few bytes of communication and no conversion work.

would filter_intervals() before filter_variants_list() help?

Only to reduce the Python to Java conversion. Both methods are optimized to do O(data kept) work if the keep parameter is True (default).

is it possible to get ld_matrix() to output r2 instead of r?

Not at the moment. We intend to support more operations on large distributed matrices in the future, but for now, just square the values before you plot.
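Squaring the values is a one-liner once the matrix is local. A minimal sketch on plain nested lists, with made-up correlation values (square_entries is a hypothetical helper name):

```python
# Sketch only: square_entries is a hypothetical helper; values are placeholders.

def square_entries(matrix):
    """Return a new matrix with every entry squared (r -> r2)."""
    return [[value * value for value in row] for row in matrix]


r = [
    [1.0, -0.5],
    [-0.5, 1.0],
]
r2 = square_entries(r)
print(r2)
```

If the local matrix is already a NumPy array, as with to_local_matrix().toArray() in the code earlier in this thread, the elementwise power operator (array ** 2) should give the same result without a helper.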
