As title. See the new docs: https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_samples_list
It would be nice to add an example to the help showing how to read a file into a keytable, convert it to a list, and finally filter with it:
sdrop = hc.import_keytable(stroot + 'samples_to_drop.txt', config=hail.TextTableConfig(noheader=True)).key_by('_0')
sdropL = [item._0 for item in sdrop.collect()]
vds.filter_samples_list(sdropL, keep=False)
Why go through keytable?
samples = []
with open(filename, 'r') as f:
    for line in f:
        samples.append(line.strip())
vds = vds.filter_samples_list(samples)
Has this been put into version 0.2? Or is this just a case of filtering columns as a bgen file now imports as a MatrixTable? In which case I can’t find how to filter columns based on a text file/list.
Also, the same question holds for filtering variants. filter_samples_list() and
filter_variants_table() were the two methods I was really hoping to use in Hail!
The easiest way to do both of these is going to be by importing to a table.
Suppose I have a file that looks like this:
NA12878
NA12891
NA12892
...
And a file that looks like this:
Variant
1:1:A:T
1:5:C:CC
...
Then the easiest way to keep these samples and these sites is going to be:
sample_table = hl.import_table(sample_file, no_header=True, key='f0')
variant_table = hl.import_table(variant_file)

# parse the chr:pos:ref:alt to locus / alleles fields
variant_table = variant_table.key_by(**hl.parse_variant(variant_table['Variant']))

# filter to samples in the table
mt = mt.filter_cols(hl.is_defined(sample_table[mt.col_key]))

# filter to variants in the table
mt = mt.filter_rows(hl.is_defined(variant_table[mt.row_key]))
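If the sample IDs already live in a Python list (as in the older filter_samples_list() workflow), a set literal may be simpler than a table for short lists. The following is only a sketch: read_ids and filter_cols_by_ids are hypothetical helper names, and it assumes the column key field is called `s`, which may not match your dataset.

```python
import os
import tempfile


def read_ids(path):
    """Read one identifier per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def filter_cols_by_ids(mt, ids, keep=True):
    """Keep (or drop) the columns of `mt` whose `s` field is in `ids`.

    Assumption: the sample ID is stored in a column field named `s`.
    """
    import hail as hl  # imported lazily; only needed when filtering

    # An explicit type lets an empty list work too.
    id_set = hl.literal(set(ids), 'set<str>')
    return mt.filter_cols(id_set.contains(mt.s), keep=keep)


# Demonstrate the pure-Python part on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "samples.txt")
    with open(path, "w") as f:
        f.write("NA12878\nNA12891\n\nNA12892\n")
    sample_ids = read_ids(path)
```

With a MatrixTable in hand, this would be used as mt = filter_cols_by_ids(mt, sample_ids), or with keep=False to drop the listed samples. For very long lists the table approach above is probably the better fit, since the literal is shipped with the query.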
ok, yep, that makes sense.
My variants are a list of rsids and I’m running into the error:
TypeError: key_by() got an unexpected keyword argument 'locus'
I’ve tried changing mt key to ‘rsid’ so that they match but this shuts down the SparkContext! (Error summary: SparkException: Job 3 cancelled because SparkContext was shut down)
With the samples, when I tried to filter them, I get a Py4JError (a different one every time I try!)
Thanks for your help Tim!
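For the rsID case, one option that avoids re-keying the MatrixTable is to key the imported table by the rsID column and index it with the rsID row field. This is a sketch under assumptions: the file has one rsID per line with no header, the MatrixTable has a row field named `rsid`, and filter_rows_by_rsid is a hypothetical helper name. It has not been checked against the build from this thread and does not address the SparkContext or Py4J errors.

```python
def filter_rows_by_rsid(mt, rsid_file, keep=True):
    """Keep (or drop) the rows of `mt` whose `rsid` appears in `rsid_file`.

    Assumptions: `rsid_file` has one rsID per line and no header, and
    `mt` has a row field named `rsid`.
    """
    import hail as hl  # imported lazily; only needed when filtering

    # With no header, the single column is named f0; use it as the key.
    rsid_table = hl.import_table(rsid_file, no_header=True, key='f0')

    # Index the table by the rsid row field instead of the row key,
    # so the MatrixTable stays keyed by locus and alleles.
    return mt.filter_rows(hl.is_defined(rsid_table[mt.rsid]), keep=keep)
```

Usage would look like mt = filter_rows_by_rsid(mt, 'rsids.txt'), where 'rsids.txt' is a placeholder path.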
oh, oops – what version are you using? The key_by interface was changed about a week ago.
What py4j error? that sounds bad.
Ah ok, I’m using a version from the 8th of May; I can’t find the actual version number. I’ll pull again and rebuild.
I’ve had several…
Py4JError: An error occurred while calling o95.selectCols
Py4JError: An error occurred while calling o39.annotateColsTable
Py4JError: An error occurred while calling o112.selectCols
They all seem to be from here:
I was getting similar ones this morning as I was trying to run through the tutorials (I decided to leave them and see if I could do what I actually wanted to do in Hail).
can you make a new discuss post with the issues you’re seeing? Possibly it’s due to a mismatch between the Python and the Jar (if it’s something about “no method with signature blah blah blah”)