As title. See the new docs: https://hail.is/hail/hail.VariantDataset.html#hail.VariantDataset.filter_samples_list
It would be nice to add an example to the help showing how to read a file, import it as a keytable, turn it into a list, and finally filter with filter_samples_list.
For example:
sdrop = hc.import_keytable(stroot + 'samples_to_drop.txt',
                           config=hail.TextTableConfig(noheader=True)).key_by('_0')
sdropL = [item._0 for item in sdrop.collect()]
vds = vds.filter_samples_list(sdropL, keep=False)
Why go through keytable?
samples = []
with open(filename, 'r') as f:
    for line in f:
        samples.append(line.strip())
vds = vds.filter_samples_list(samples)
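A slightly more defensive version of the loop above, skipping blank lines and surrounding whitespace (plain Python, no Hail needed; the helper name is made up):

```python
def read_id_list(filename):
    # Read one identifier per line, ignoring blank lines and
    # stripping surrounding whitespace from each entry.
    with open(filename) as f:
        return [line.strip() for line in f if line.strip()]
```

The resulting list can be passed straight to filter_samples_list.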
Has this been put into version 0.2? Or is this just a case of filtering columns, since a BGEN file now imports as a MatrixTable? In that case I can't find how to filter columns based on a text file/list.
Also, the same question holds for filtering variants. filter_samples_list() and
filter_variants_table() were the two methods I was really hoping to use in Hail!
The easiest way to do both of these is going to be by importing to a table.
Suppose I have a file that looks like this:
NA12878
NA12891
NA12892
...
And a file that looks like this:
Variant
1:1:A:T
1:5:C:CC
...
Then the easiest way to keep these samples and these sites is going to be:
sample_table = hl.import_table(sample_file, no_header=True, key='f0')
variant_table = hl.import_table(variant_file)
# parse the chr:pos:ref:alt to locus / alleles fields
variant_table = variant_table.key_by(**hl.parse_variant(variant_table['Variant']))
# filter to samples in the table
mt = mt.filter_cols(hl.is_defined(sample_table[mt.col_key]))
# filter to variants in the table
mt = mt.filter_rows(hl.is_defined(variant_table[mt.row_key]))
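As an aside, hl.parse_variant in the snippet above turns a 'chr:pos:ref:alt' string into locus and alleles fields. Roughly the same split in plain Python looks like this (the helper name and the dict layout are made up for illustration, not part of Hail):

```python
def split_variant_string(v):
    # Split a "chr:pos:ref:alt" string into its parts;
    # the position becomes an int, ref and alt become the allele list.
    chrom, pos, ref, alt = v.split(':')
    return {'contig': chrom, 'position': int(pos), 'alleles': [ref, alt]}
```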
ok, yep, that makes sense.
My variants are a list of rsids and I'm running into the error:
TypeError: key_by() got an unexpected keyword argument 'locus'
I've tried changing the mt key to 'rsid' so that they match, but this shuts down the SparkContext! (Error summary: SparkException: Job 3 cancelled because SparkContext was shut down)
With the samples, when I try to filter them I get a Py4JError (a different one every time I try!).
Thanks for your help Tim!
Oh, oops: what version are you using? The key_by interface was changed about a week ago.
What Py4JError? That sounds bad.
Ah ok, I'm using a version from the 8th May; I can't find the actual number. I'll re-pull.
I've had several…
Py4JError: An error occurred while calling o95.selectCols
Py4JError: An error occurred while calling o39.annotateColsTable
Py4JError: An error occurred while calling o112.selectCols
They all seem to be from here:
./Hail/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py
I was getting similar ones this morning as I was trying to run through the tutorials (I decided to leave them and see if I could do what I actually wanted to do in Hail).
Can you make a new Discuss post with the issues you're seeing? Possibly it's due to a mismatch between the Python and the jar (if it's something about 'no method with signature blah blah blah').