Filter samples from MatrixTable

nonchev · September 10, 2020, 8:53pm

Hello,

I want to keep only the individuals (samples) of interest in a MatrixTable.
I found this in a previous post:

mt.filter_cols(mt.s == 'NA00001')

which works for a single sample, but how can I filter if I have a list of samples, such as:

samples = ['NA00001', 'NA00002']

johnc1231 · September 10, 2020, 8:59pm

mt.filter_cols(hl.array(samples).contains(mt.s))
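As a rough sketch of what that expression does: hl.array(samples) turns the Python list into a Hail array expression, and .contains(mt.s) is evaluated once per column, giving True when that column's sample ID is in the list; filter_cols then keeps the columns where it is True. The plain-Python model below mirrors that logic on a list of IDs. The helper name filter_cols_model and the extra ID NA00003 are made up for illustration, and the Hail lines in the comments assume an existing MatrixTable mt whose column key is a string field s.

```python
def filter_cols_model(column_ids, samples, keep=True):
    """Plain-Python model of mt.filter_cols(hl.array(samples).contains(mt.s), keep=keep)."""
    wanted = set(samples)  # a set makes each membership check cheap
    # The predicate is evaluated once per column; a column survives when
    # the predicate's value matches the keep flag.
    return [s for s in column_ids if (s in wanted) == keep]


column_ids = ["NA00001", "NA00002", "NA00003"]  # hypothetical column keys
samples = ["NA00001", "NA00002"]

kept = filter_cols_model(column_ids, samples)                 # columns in the list
dropped = filter_cols_model(column_ids, samples, keep=False)  # columns not in the list

# Hail equivalent (assumes `import hail as hl` and an existing MatrixTable `mt`):
#   mt_kept = mt.filter_cols(hl.array(samples).contains(mt.s))
#   mt_rest = mt.filter_cols(hl.array(samples).contains(mt.s), keep=False)
```

Passing keep=False flips the filter, so the same expression can be used to drop the listed samples instead of keeping them.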

Shiri.Margalit · April 8, 2021, 12:33pm

Hi,
I’m using exactly this solution and I’m getting this error:
TypeError: array: parameter 'collection': expected expression of type set or array or dict<('any', 'any')>, found

This is my code:
samples = table['target panel prefix']
mt = mt.filter_cols(hl.array(samples).contains(mt.s))

What can be the reason?
Thanks!
Shiri

tpoterba · April 8, 2021, 1:14pm

samples here isn’t a Python list; it’s a table field. If you do:

samples = table['target panel prefix'].collect()

This should work, I think.
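To spell out the difference: table['target panel prefix'] is a Hail expression that describes the field rather than holding its values, and .collect() is what brings the values back as a Python list that hl.array accepts. Below is a minimal sketch of a guard for this. The names as_python_list and FakeField are hypothetical and exist only so the example runs without Hail installed.

```python
def as_python_list(values):
    """Return a plain Python list from either a local collection or a
    Hail-style expression that exposes .collect()."""
    if hasattr(values, "collect"):
        # Hail expressions such as table['target panel prefix'] are lazy;
        # collect() evaluates them and returns local Python values.
        return list(values.collect())
    return list(values)


class FakeField:
    """Stand-in for a Hail table field, used only for this illustration."""

    def __init__(self, data):
        self._data = data

    def collect(self):
        return list(self._data)


samples = as_python_list(FakeField(["NA00001", "NA00002"]))
# With Hail, this would be followed by the line from the thread:
#   mt = mt.filter_cols(hl.array(samples).contains(mt.s))
```

Note that collect() brings every value to the local machine, which is fine for a list of sample IDs but can be expensive for very large fields.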

Shiri.Margalit · April 8, 2021, 1:19pm

Thank you so much, it worked!

shuang · October 21, 2021, 1:59pm

Hi,
I have a text file that lists all the sample IDs, without a header. It looks like:

Now I want to filter these samples out of my MT. What should I do?
With Hail v0.1 I did:

to_remove = hc.import_table('outliers.txt', no_header=True).key_by('f0')
vds_filtered = vds.filter_samples_table(to_remove, keep=False)

I am new to v0.2. Based on the answer above, I think it should be something like:

to_remove = hl.import_table('outliers.txt', no_header=True).key_by('f0')
samples_to_remove = to_remove.f0.collect()  # collect the ID field, not whole rows
filtered_mt = mt.filter_cols(hl.array(samples_to_remove).contains(mt.s), keep=False)

But I am not sure about it, and I do not understand what '.contains(mt.s)' does. Any help?
Thanks a lot!

shuang · October 22, 2021, 3:19pm

I got it to work with:

sample_table = hl.import_table('outlier.txt', no_header=True).key_by('f0')
filt_mt = mt.filter_cols(hl.is_defined(sample_table[mt.col_key]), keep=False)

tpoterba · October 22, 2021, 3:21pm

That solution is exactly what we would have suggested!

There’s also a set of anti_join methods, which translate to “remove keys appearing in this other table”, that can make this a little more terse:

sample_table = hl.import_table('outlier.txt', no_header=True).key_by('f0')
filt_mt = mt.anti_join_cols(sample_table)
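For readers who want the logic of the two table-based approaches in this thread spelled out, here is a plain-Python sketch. hl.is_defined(sample_table[mt.col_key]) is True for columns whose key appears in sample_table, so filtering with keep=False drops them, which is also what anti_join_cols does. The function names, the sample IDs, and the one-ID-per-line file layout below are assumptions made for illustration.

```python
import os
import tempfile


def read_ids(path):
    """Read one ID per line from a header-less text file, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def anti_join(column_ids, other_keys):
    """Keep columns whose key is NOT in other_keys (models mt.anti_join_cols)."""
    return [s for s in column_ids if s not in other_keys]


def semi_join(column_ids, other_keys):
    """Keep columns whose key IS in other_keys (models mt.semi_join_cols)."""
    return [s for s in column_ids if s in other_keys]


# Write a small hypothetical outlier file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("NA00002\n\nNA00003\n")
outliers = read_ids(tmp.name)
os.remove(tmp.name)

column_ids = ["NA00001", "NA00002", "NA00003", "NA00004"]
remaining = anti_join(column_ids, outliers)  # samples left after removing outliers
removed = semi_join(column_ids, outliers)    # samples that matched the outlier file
```

The table-based versions have the advantage that the IDs never need to be collected into a Python list, so they stay practical when the list of samples is long.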

shuang · October 22, 2021, 3:31pm

Thanks Tim, and good to know about this “.anti_join_cols” method!
