Getting info from variant list / using python variables

Hi,

I got the list of variants from a LD matrix:

ld_mat = vds.filter_intervals(Interval.parse('1:START-3M')).ld_matrix()
variants = ld_mat.variant_list()

I’m trying to get va.rsid for each of these variants.

  1. I couldn’t find an efficient way to use a variant object directly to select a specific variant, for example variants[0], which is Variant(contig=1, start=904165, ref=G, alts=[AltAllele(ref=G, alt=A)]). I used the following instead:

rsid = vds.filter_variants_expr('v.contig=="1" && v.start==904165 && v.ref=="G" && v.alt=="A"').query_variants('variants.map(v => va.rsid).collect()')[0]

Is there a way to use the variant definition directly, without having to match each of its sub-elements?

  2. My second question is about using Python variables in Hail expressions. I don’t see how to do that. For example:

for i, x in enumerate(variants):
    chr = x.contig
    pos = x.start
    ref = x.ref
    alt = x.alt()
    rsid = vds.filter_variants_expr('v.contig==chr && v.start==pos && v.ref==ref && v.alt==alt').query_variants('variants.map(v => va.rsid).collect()')[0]
    print "rsid of variant " + str(i+1) + " is " + rsid

I don’t have time to respond fully but I think I can help get you back on track!

The filter_variants_list method should let you select a specific variant much more easily. It’s also way faster!

Yes, this is what I was looking for. One more quick question, though: the documentation doesn’t say (or I didn’t find it) whether the output follows the same order as the input variant list. Does it?

Nope, the order of results following the collect aggregator is actually not guaranteed (it can change run to run)! The best thing to do is collect the variant too.
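
One way to act on this, sketched in plain Python: keep the variant next to each collected value, build a lookup table, and read it back in the order of the input list. The pairs below are made-up stand-in data rather than real query output, and reorder_by_input is a hypothetical helper name, not a Hail function.

```python
def reorder_by_input(input_variants, collected_pairs):
    """Return the collected values in the order of input_variants.

    collected_pairs is an iterable of (variant, value) tuples in arbitrary
    order. Variants that are missing from the results map to None.
    """
    lookup = dict(collected_pairs)
    return [lookup.get(v) for v in input_variants]


# Stand-in data: made-up variants and rsIDs, deliberately out of order.
input_variants = ["1:100:G:A", "1:200:C:T", "1:300:A:G"]
collected = [("1:300:A:G", "rs3"), ("1:100:G:A", "rs1")]

ordered = reorder_by_input(input_variants, collected)
```

Because the lookup is keyed on the variant, the result no longer depends on the order in which the collect aggregator happens to return things.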

For your second question, you can use string interpolation, but keep in mind that every call to query_variants is a separate Hail job. Your code will be much more efficient if you can phrase your queries completely in Hail, for example by using filter_variants_list.
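
A minimal sketch of the string-interpolation approach, in case it helps: the Python values are formatted into the expression text before the string is handed to Hail. The helper name variant_filter_expr is made up for illustration, and the field names (v.contig, v.start, v.ref, v.alt) are simply copied from the expression earlier in this thread.

```python
def variant_filter_expr(contig, start, ref, alt):
    """Build a filter expression string from plain Python values.

    Hypothetical helper, not part of Hail. String fields are wrapped in
    double quotes so the expression language sees string literals.
    """
    template = 'v.contig == "{0}" && v.start == {1} && v.ref == "{2}" && v.alt == "{3}"'
    return template.format(contig, int(start), ref, alt)


expr = variant_filter_expr("1", 904165, "G", "A")
# The resulting string would then be passed on, e.g. vds.filter_variants_expr(expr)
```

Calling this once per variant inside a loop still launches one job per variant, so it is best kept for one-off lookups.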

Thank you both, filter_variants_list and string interpolation allow me to get what I need, though maybe not in the most efficient way:

I didn’t find out how to collect more than one item at a time: vds.filter_variants_list(ldvariants).query_variants('variants.map(v => va.rsid).collect()')
How can I collect v and rsid at once, so I can later match them?

From a list of variants (such as ld_mat.variant_list()), is there a function doing the opposite of variant.parse(), i.e. turning the variant into a chr:start:ref:alt string?

Here is an example of what I am doing. The final goal is to appropriately name the rows and columns of ld_matrix().to_local_matrix().toArray()

vds = hc.read('data/1kg.vds')
tu = ("1","START","3M")
ld_mat = vds.filter_intervals(Interval.parse('{0}:{1}-{2}'.format(*tu))).ld_matrix()
ldvariants = ld_mat.variant_list()
variant_df = vds.filter_variants_list(ldvariants).variants_table().to_pandas()
ids = {}
for r in range(0,len(variant_df.index)):
    vid = str(variant_df.iloc[r]["v.contig"]) + ":" + str(variant_df.iloc[r]["v.start"]) + ":" + str(variant_df.iloc[r]["v.altAlleles"][0][0]) + ":" + str(variant_df.iloc[r]["v.altAlleles"][0][1])
    rsid = variant_df.iloc[r]["va.rsid"]
    ids[vid] = rsid

How can I collect v and rsid at once, so I can later match them?

You can construct structs in the expression language with this syntax: {a: 5, b: "hello"}. So:

variants_dict = { 
  x.v: x.rsid for x in 
  vds.filter_variants_list(ldvariants)\
        .query_variants('variants.map(v => {v: str(v), rsid: va.rsid}).collect()') 
}

From a list of variants (such as ld_mat.variant_list()), is there a function doing the opposite of variant.parse(), i.e. turning the variant into a chr:start:ref:alt string?

I think the __str__ method of Variant produces that. So in either python or the expression language, str(v).

Looking at your posted code, I think I can offer an improvement. In 0.1, converting objects from Java to Python or vice versa is extremely slow. This means that filter_variants_list with a large number of variants, or to_pandas, can be pretty terrible.

Instead of using filter_variants_list with the variant list from the LD matrix, can’t you just filter_intervals again? That should have the same set of variants and will be much faster.

And instead of bringing the table down to pandas, use the aggregator I posted above!

Oh… I see, very convenient, and much easier than what I was doing, and indeed, thanks to your aggregator structure, I don’t need the table anymore.

str(variant) does indeed produce chr:start:ref:alt! Fantastic, and so simple.

As for your point about Java to Python, for what I need to do (regional plots), I need to get the ld matrix into pandas, no way around that, but to construct the dictionary, I could indeed use filter_intervals (some variants are dropped during LD calculations, but having a few extra variants in the dictionary won’t significantly impact the performance). From now on I’ll try to limit the cross between Java and Python as much as I can, thanks. Do you foresee any improvement on that side in a future version, or is it a limitation that can’t be overcome? (Just curious, as I don’t really get how this works)

Just in case I find myself in the situation where I need the variant set to match exactly a list of variants from an interval, would filter_intervals() before filter_variants_list() help?

vds.filter_intervals(Interval.parse('{0}:{1}-{2}'.format(*tu))).filter_variants_list(ldvariants)

One last question, is it possible to get ld_matrix() to output r2 instead of r?

Thanks so much for your help.

Do you foresee any improvement on that side in a future version

Yes! In 0.2, this should be essentially free. What we’re doing now is using py4j to communicate objects over a super slow socket. In 0.2, we’ll use shared memory to pass a reference over that slow socket, requiring only a few bytes of communication and no conversion work.

would filter_intervals() before filter_variants_list() help?

Only to reduce the Python to Java conversion. Both methods are optimized to do O(data kept) work if the keep parameter is True (default).

is it possible to get ld_matrix() to output r2 instead of r?

Not at the moment. We intend to support more operations on large distributed matrices in the future, but for now, just square the values before you plot.
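
Squaring is element-wise, so it can be done once the matrix is on the Python side. A minimal sketch on a plain list of lists with made-up values; square_entries is a hypothetical helper name. If the matrix is already a NumPy array, arr ** 2 should do the same thing.

```python
def square_entries(r_matrix):
    """Return a new matrix with every entry squared (r -> r^2)."""
    return [[value * value for value in row] for row in r_matrix]


# Made-up 2x2 correlation matrix, for illustration only.
r = [[1.0, -0.5],
     [-0.5, 1.0]]

r2 = square_entries(r)
```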
