How to filter phenotype file based on sample ID?

rancui · December 9, 2019, 7:32pm

I’m trying to filter a phenotype file, pheno.tsv.bgz, down to a selected set of 50K individuals. This is what I did:
pheno = hl.import_table('pheno.tsv.bgz')
samples = hl.import_table("50K.sample", delimiter=r'\s+').key_by('ID_1')
pheno_filt = pheno.filter(pheno.is_defined(samples[pheno.s]))
It didn’t work; the error was: Table instance has no field, method, or property 'is_defined'

What’s the best way to do this?

johnc1231 · December 9, 2019, 7:38pm

I believe the issue is you should be calling hl.is_defined, not pheno.is_defined

rancui · December 9, 2019, 7:48pm

That’s it. Thanks!

tpoterba · December 9, 2019, 8:47pm

Also note that this is a semi_join; using it might make the code a bit more readable.

rancui · December 10, 2019, 4:09pm

Does the output from either semi_join or is_defined preserve the order of my samples file?

tpoterba · December 10, 2019, 4:11pm

the important thing is that filter does not change order. semi_join is implemented as exactly what you’re doing here!

def semi_join(self, other: 'Table') -> 'Table':
    """docstring..."""
    return self.filter(hl.is_defined(other.index(self.key)))

rancui · December 10, 2019, 4:21pm

I see. I should really be using semi_join…
So if I semi_join sample IDs “1,2,3,4” with “4,3” will the outcome be ordered as “3,4” or “4,3”?
The context is that I’m subsampling both genotypes and phenotypes, and I’d like the outcomes in both to be in exactly the same order. If both outcomes follow the order of the subset of sample IDs I have, then I’m good.

tpoterba · December 10, 2019, 4:30pm

I believe that currently neither semi_join nor your is_defined join (which do the same thing) preserves the order of the input table. I think we can consider this a bug. There are two workarounds right now:

Add a field that refers to the original order after importing the phenos table:

pheno = hl.import_table('pheno.tsv.bgz').add_index()

... key by and filter ...
pheno = pheno.order_by(pheno.idx)
...

Don’t do a join, instead do a local set lookup. This won’t scale as well as the above.

# samples is keyed by 'ID_1', so collect that field into a Python set
keep_samples_set = hl.literal(set(samples.ID_1.collect()))
pheno_filt = pheno.filter(keep_samples_set.contains(pheno.s))
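
The effect of the two workarounds can be illustrated with a plain-Python sketch that does not use Hail at all. The row dictionaries, the `idx` values and the `keep` set are invented illustration data, and sorting by `s` is only an assumed stand-in for the reordering that a keyed join may cause.

```python
# Plain-Python model of the two workarounds (no Hail involved).
# All sample IDs and values below are invented for illustration.
rows = [{"s": "4", "height": 1.80}, {"s": "1", "height": 1.65},
        {"s": "3", "height": 1.72}, {"s": "2", "height": 1.90}]
keep = {"4", "3"}

# Workaround 1: record the original position, let the join reorder
# the rows by key, then sort by the recorded position again.
indexed = [dict(row, idx=i) for i, row in enumerate(rows)]
by_key = sorted(indexed, key=lambda row: row["s"])      # stand-in for key_by
joined = [row for row in by_key if row["s"] in keep]    # stand-in for the join
restored = sorted(joined, key=lambda row: row["idx"])   # like order_by(idx)

# Workaround 2: a local set lookup filters the rows where they are.
looked_up = [row for row in rows if row["s"] in keep]

print([row["s"] for row in joined])     # rows in key order
print([row["s"] for row in restored])   # rows back in file order
print([row["s"] for row in looked_up])  # rows in file order
```

In this model the set lookup is simpler, but the whole set has to fit on one machine, which matches the caveat that it will not scale as well.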

tpoterba · December 10, 2019, 4:34pm

the basic understanding here is that in Hail, keyed tables should be thought of as sorted by the key (here, lexicographically by sample ID). However, key_by does not guarantee that the ordering of the returned table is stable with respect to its parent. A foreign key join like you’ve written down (pheno_filt = pheno.filter(hl.is_defined(samples[pheno.s]))) does internal key_by operations, and therefore does not preserve the original order.

We can add a guarantee that the original order is preserved, though this will come at a performance cost.
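
To make the "sorted by the key" picture concrete for the “1,2,3,4” versus “4,3” question, here is a small plain-Python model. It is not Hail code; `semi_join_model` is a hypothetical helper that only assumes the result comes back in lexicographic key order, as described above.

```python
def semi_join_model(left_ids, right_ids):
    """Hypothetical model: keep left IDs found on the right, in key order."""
    wanted = set(right_ids)
    return sorted(sample_id for sample_id in left_ids if sample_id in wanted)


# Under this model the order of the "4,3" subset file does not matter.
result = semi_join_model(["1", "2", "3", "4"], ["4", "3"])

# String keys compare character by character, so "10" sorts before "2".
lexicographic = semi_join_model(["2", "10", "1"], ["1", "2", "10"])

print(result)
print(lexicographic)
```

Under that assumption, two tables keyed by the same sample IDs end up in the same key order regardless of how the subset file was sorted.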

rancui · December 10, 2019, 4:39pm

Thanks for the quick responses! I have two more follow-up questions:

How do people usually subsample genotypes and phenotypes in hail? Is there a more idiomatic way to tell hail I want a matching subset of genotypes and phenotypes?
Can I easily check whether or not two keyed tables have the same order of keys?

tpoterba · December 10, 2019, 4:41pm

Can you describe what you want going in and going out? Running filter_cols will probably do what you want.
I think you shouldn’t really be thinking much about key order – you should be thinking about key identity. Where is order coming into your workflow?
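
For checking key identity rather than key order, one option is to compare the sample IDs as plain Python lists once they are on the driver. The sketch below is ordinary Python, not a Hail API; `compare_keys` is a hypothetical helper, and the two ID lists are invented placeholders for values that would be collected from each dataset (for example with a `.collect()` call like the one shown earlier).

```python
from collections import Counter


def compare_keys(ids_a, ids_b):
    """Report how two lists of sample IDs differ, by identity and by order."""
    set_a, set_b = set(ids_a), set(ids_b)
    return {
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
        "duplicated_in_a": sorted(k for k, n in Counter(ids_a).items() if n > 1),
        "duplicated_in_b": sorted(k for k, n in Counter(ids_b).items() if n > 1),
        "same_order": list(ids_a) == list(ids_b),
    }


geno_ids = ["3", "4"]    # invented IDs standing in for the genotype samples
pheno_ids = ["4", "3"]   # same individuals, different order
report = compare_keys(geno_ids, pheno_ids)
print(report)
```

If `only_in_a`, `only_in_b` and the duplicate lists are all empty, the two datasets describe the same individuals, which is what a key-based join relies on; `same_order` is reported only for curiosity.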

rancui · December 10, 2019, 4:59pm

OK you are right. I actually don’t need the key orders to be exactly the same. I’m overthinking this.
Here’s my code for subsampling genotype and phenotype. I did use filter_cols for the genotype filtering.

ds = hl.import_bgen(bgen_fname,
                    entry_fields=['dosage'],
                    sample_file="autosomes.sample",
                    variants = variants,
                    index_file_map=index_file_map)
samples = hl.import_table("autosomes.50K.sample", delimiter=r'\s+').key_by('ID_1')
# down sample to 50_000 samples
bgen = ds.filter_cols(hl.is_defined(samples[ds.s]))      
bgen.write('ukb_imp_v3_50K.{}.mt'.format(chrom),overwrite=True)
pheno = hl.import_table('pheno.tsv.bgz').key_by('s')
pheno_filt = pheno.semi_join(samples)
pheno_filt.write('pheno.50K.mt',overwrite=True)

I think this would work for my purpose, which is to do a GWAS for the down-sampled dataset.

tpoterba · December 10, 2019, 5:03pm

Yep, this looks very straightforward!

Also, there’s a semi_join_cols for MTs:

bgen = ds.filter_cols(hl.is_defined(samples[ds.s]))      
# can be
bgen = ds.semi_join_cols(samples)

I think it also might be more straightforward to annotate the phenotypes onto the downsampled MT instead of filtering and writing a separate file. That seems like the most simple design here – and if you just need to use the phenotypes, then read_matrix_table(...).cols() will only load the column information, not the genotype data (should be as efficient as reading a table).

rancui · December 10, 2019, 7:08pm

Annotating sounds like a great way to do this. Would the fact that there are hundreds (if not thousands) of phenotype fields for each individual be a problem?
What’s the idea behind annotating, and how would I do that with the data I have? Would this work?

pheno = hl.import_table('pheno.tsv.bgz').key_by('s')
ds = hl.import_bgen(bgen_fname,
                        entry_fields=['dosage'],
                        sample_file="autosomes.sample",
                        variants = variants,
                        index_file_map=index_file_map)
samples = hl.import_table("autosomes.50K.sample",
                        delimiter=r'\s+').key_by('ID_1')

bgen = ds.semi_join_cols(samples)
bgen = bgen.annotate_cols(pheno = pheno)
bgen.write('ukb_imp_v3_50K.{}.mt'.format(chrom),
                        overwrite=True)

tpoterba · December 11, 2019, 12:38pm

Hmm, yes, possibly. But for 50k it should be fine.

This line:

bgen = bgen.annotate_cols(pheno = pheno)

Needs to be

bgen = bgen.annotate_cols(pheno = pheno[bgen.s])
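
The idea behind `pheno[bgen.s]` is a keyed lookup: for each column, the phenotype row whose key matches that column's sample ID is attached, and a column with no matching row gets a missing value. A plain-Python analogy, with invented sample IDs and phenotype values and with `None` standing in for Hail's missing value:

```python
# Plain-Python analogy of annotating columns by key (no Hail involved).
# The phenotype table behaves like a dictionary keyed by sample ID.
pheno_by_sample = {
    "A": {"height": 1.80, "bmi": 24.1},
    "B": {"height": 1.65, "bmi": 21.7},
}

# Column keys of the genotype data; "C" has no phenotype row.
column_ids = ["B", "C", "A"]

# One lookup per column: columns are neither dropped nor reordered.
annotated = [{"s": s, "pheno": pheno_by_sample.get(s)} for s in column_ids]

for col in annotated:
    print(col)
```

Passing the whole table (`pheno = pheno`) gives nothing to look up per column, which is why the key expression is needed.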

rancui · December 11, 2019, 2:58pm

The annotating went well. But when I’m trying to write the data into a matrix table I got this:

Exception while sending command.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

I’m running version 0.2.21-f16fd64e0d77.
I did a little search on the forum, and it looks like maybe I need to update something? If that’s the case, could you help me by writing down exactly what I need to do, if that’s not too much trouble?
