[Breaking Change] filter_samples_list now takes a list

As title. See the new docs: https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_samples_list

It would be nice to add an example to the help showing how to read a file, load it as a keytable, convert it to a list, and finally filter using filter_samples_list.
For example:

sdrop=hc.import_keytable(stroot + 'samples_to_drop.txt',
                           config=hail.TextTableConfig(noheader=True)).key_by('_0')

sdropL = [item._0 for item in sdrop.collect()]

vds = vds.filter_samples_list(sdropL, keep=False)

Why go through keytable?

samples = []
with open(filename, 'r') as f:
    for line in f:
        samples.append(line.strip())

vds = vds.filter_samples_list(samples)
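
If the file might contain blank lines or repeated IDs, a slightly more defensive version of the plain-Python read could look like the sketch below. Note that `read_sample_ids` is a hypothetical helper name made up for illustration, not part of Hail, and it assumes one sample ID per line with no header.

```python
from pathlib import Path
from typing import List


def read_sample_ids(path: str) -> List[str]:
    """Return the unique sample IDs from a one-ID-per-line file, in file order."""
    seen = set()
    ids = []
    for line in Path(path).read_text().splitlines():
        sample_id = line.strip()
        # Skip blank lines and IDs that were already recorded.
        if sample_id and sample_id not in seen:
            seen.add(sample_id)
            ids.append(sample_id)
    return ids
```

The returned list can then be passed to filter_samples_list in the same way as `samples` above.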

Has this been put into version 0.2? Or is this just a case of filtering columns, since a bgen file now imports as a MatrixTable? If so, I can't find how to filter columns based on a text file/list.
Also, the same question holds for filtering variants. filter_samples_list() and
filter_variants_table() were the two methods I was really hoping to use in Hail!

The easiest way to do both of these is going to be by importing to a table.

Suppose I have a file that looks like this:

NA12878
NA12891
NA12892
...

And a file that looks like this:

Variant
1:1:A:T
1:5:C:CC
...

Then the easiest way to keep these samples and these sites is going to be:

sample_table = hl.import_table(sample_file, no_header=True, key='f0')
variant_table = hl.import_table(variant_file)

# parse the chr:pos:ref:alt to locus / alleles fields
variant_table = variant_table.key_by(**hl.parse_variant(variant_table['Variant']))

# filter to samples in the table
mt = mt.filter_cols(hl.is_defined(sample_table[mt.col_key]))

# filter to variants in the table
mt = mt.filter_rows(hl.is_defined(variant_table[mt.row_key]))
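
To make the logic of the two steps above concrete (parse each `chr:pos:ref:alt` string into a key, then keep only the entries whose key appears in the table), here is a rough plain-Python sketch. It is not how Hail implements `parse_variant` or the keyed lookup; `Variant`, `parse_variant_string`, and `keep_listed` are hypothetical names, and the sketch assumes exactly one alternate allele per string.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    contig: str
    position: int
    alleles: Tuple[str, ...]


def parse_variant_string(text: str) -> Variant:
    """Parse 'chr:pos:ref:alt' into a hashable Variant (single alt assumed)."""
    # Split from the right so a contig name containing ':' stays intact.
    contig, pos, ref, alt = text.strip().rsplit(":", 3)
    return Variant(contig, int(pos), (ref, alt))


def keep_listed(variants: Iterable[Variant], wanted: Iterable[Variant]) -> List[Variant]:
    """Keep only variants whose key is present in `wanted`, preserving order."""
    wanted_set = set(wanted)
    return [v for v in variants if v in wanted_set]
```

A string with fewer than four colon-separated fields raises a ValueError here, which is usually what you want for a malformed line.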

ok, yep, that makes sense.

My variants are a list of rsids and I'm running into the error:
TypeError: key_by() got an unexpected keyword argument 'locus'

I've tried changing the mt key to 'rsid' so that they match, but this shuts down the SparkContext! (Error summary: SparkException: Job 3 cancelled because SparkContext was shut down)

With the samples, when I try to filter them, I get a Py4JError (a different one every time I try!)

Thanks for your help Tim!

oh, oops - what version are you using? The key_by interface was changed about a week ago.

What py4j error? that sounds bad.

Ah ok, I'm using a version from the 8th May - I can't find the actual version number. I'll re-pull.

I've had several...
Py4JError: An error occurred while calling o95.selectCols
Py4JError: An error occurred while calling o39.annotateColsTable
Py4JError: An error occurred while calling o112.selectCols

They all seem to be from here:
./Hail/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py

I was getting similar ones this morning as I was trying to run through the tutorials (I decided to leave them and see if I could do what I actually wanted to do in Hail).

can you make a new discuss post with the issues you're seeing? Possibly it's due to a mismatch between the Python and the Jar (if it's something about "no method with signature blah blah blah")