[Breaking Change] filter_samples_list now takes a list

As title. See the new docs: https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_samples_list

It would be nice to add to the docs an example showing how to read a file, import it as a keytable, convert it to a list, and finally filter using filter_samples_list.
For example:

sdrop = hc.import_keytable(stroot + 'samples_to_drop.txt',
                           config=hail.TextTableConfig(noheader=True)).key_by('_0')

sdropL = [item._0 for item in sdrop.collect()]

vds.filter_samples_list(sdropL, keep=False)

Why go through a keytable?

samples = []
with open(filename, 'r') as f:
    for line in f:
        samples.append(line.strip())

vds = vds.filter_samples_list(samples)
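For illustration, the keep/remove semantics of filter_samples_list can be modeled in plain Python. This is only a toy sketch of the behavior, not Hail's implementation: with keep=True the listed samples are retained, with keep=False they are dropped.

```python
def filter_samples_list(all_samples, samples, keep=True):
    """Toy model of filter_samples_list semantics: with keep=True, retain
    the listed samples; with keep=False, drop them."""
    wanted = set(samples)
    return [s for s in all_samples if (s in wanted) == keep]

all_samples = ['NA12878', 'NA12891', 'NA12892', 'NA19240']
print(filter_samples_list(all_samples, ['NA19240'], keep=False))
# → ['NA12878', 'NA12891', 'NA12892']
```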

Has this been put into version 0.2? Or is this just a case of filtering columns as a bgen file now imports as a MatrixTable? In which case I can’t find how to filter columns based on a text file/list.
Also, the same question holds for filtering variants. filter_samples_list() and
filter_variants_table() were the two methods I was really hoping to use in Hail!

The easiest way to do both of these is going to be by importing to a table.

Suppose I have a file that looks like this:

NA12878
NA12891
NA12892
...

And a file that looks like this:

Variant
1:1:A:T
1:5:C:CC
...

Then the easiest way to keep these samples and these sites is going to be:

sample_table = hl.import_table(sample_file, no_header=True, key='f0')
variant_table = hl.import_table(variant_file)

# parse the chr:pos:ref:alt to locus / alleles fields
variant_table = variant_table.key_by(**hl.parse_variant(variant_table['Variant']))

# filter to samples in the table
mt = mt.filter_cols(hl.is_defined(sample_table[mt.col_key]))

# filter to variants in the table
mt = mt.filter_rows(hl.is_defined(variant_table[mt.row_key]))
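For intuition, what hl.parse_variant plus the is_defined semi-join accomplish can be sketched in plain Python (a toy model, not Hail's actual implementation): parse each 'chr:pos:ref:alt' string into a (locus, alleles) key, then keep only rows whose key appears in the table.

```python
def parse_variant(v):
    # Split 'chr:pos:ref:alt' into a hashable (locus, alleles) key,
    # loosely analogous to what hl.parse_variant produces.
    chrom, pos, ref, alt = v.split(':')
    return ((chrom, int(pos)), (ref, alt))

# keys from the variant table file
variant_table = {parse_variant(v) for v in ['1:1:A:T', '1:5:C:CC']}

# rows of the dataset
dataset_rows = ['1:1:A:T', '1:2:G:C', '1:5:C:CC']

# semi-join: keep only rows whose key is present in the table
kept = [v for v in dataset_rows if parse_variant(v) in variant_table]
print(kept)  # → ['1:1:A:T', '1:5:C:CC']
```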

ok, yep, that makes sense.

My variants are a list of rsids and I’m running into the error:
TypeError: key_by() got an unexpected keyword argument 'locus'

I’ve tried changing the mt key to 'rsid' so that they match, but this shuts down the SparkContext! (Error summary: SparkException: Job 3 cancelled because SparkContext was shut down)

With the samples, when I tried to filter them, I get a Py4JError (a different one every time I try!)

Thanks for your help Tim!

oh, oops – what version are you using? The key_by interface was changed about a week ago.

What py4j error? that sounds bad.

Ah ok, I’m using a version from the 8th of May - I can’t find the actual number. I’ll re-pull.

I’ve had several…
Py4JError: An error occurred while calling o95.selectCols
Py4JError: An error occurred while calling o39.annotateColsTable
Py4JError: An error occurred while calling o112.selectCols

They all seem to be from here:
./Hail/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py

I was getting similar ones this morning as I was trying to run through the tutorials (I decided to leave them and see if I could do what I actually wanted to do in Hail).

Can you make a new discuss post with the issues you’re seeing? Possibly it’s due to a mismatch between the Python and the jar (if it’s something about “no method with signature blah blah blah”).