How to filter phenotype file based on sample ID?

I’m trying to filter a phenotype file pheno.tsv.bgz down to a selected set of 50K individuals. This is what I did:
pheno = hl.import_table('pheno.tsv.bgz')
samples = hl.import_table("50K.sample",delimiter='\s+').key_by('ID_1')
pheno_filt = pheno.filter(pheno.is_defined(samples[pheno.s]))
It didn’t work; the error was: Table instance has no field, method, or property 'is_defined'

What’s the best way to do this?

I believe the issue is that you should be calling hl.is_defined, not pheno.is_defined.

That’s it. Thanks!

Also note that this is a semi_join, which might make the code a bit more readable.

Does the output of either semi_join or the is_defined filter preserve the order in my samples file?

The important thing is that filter does not change order. semi_join is implemented as exactly what you’re doing here! :slight_smile:

def semi_join(self, other: 'Table') -> 'Table':
    return self.filter(hl.is_defined(other.index(self.key)))

I see. I should really be using semi_join
So if I semi_join sample IDs “1,2,3,4” with “4,3” will the outcome be ordered as “3,4” or “4,3”?
The context is that I’m subsampling both genotypes and phenotypes, and I’d like the outcomes of both to be in exactly the same order. If both outcomes follow the order of the subset of sample IDs I have, then I’m good.

I believe that currently neither semi_join nor your is_defined join (which are doing the same thing) preserve the order of the input table. I think we can consider this a bug. There are two workarounds right now:

  1. Add a field that refers to the original order after importing the phenos table:
pheno = hl.import_table('pheno.tsv.bgz').add_index()

... key by and filter ...
pheno = pheno.order_by(pheno.idx)
  2. Don’t do a join; instead, do a local set lookup. This won’t scale as well as the above.
keep_samples_set = hl.literal(set(samples.ID_1.collect()))
pheno_filt = pheno.filter(keep_samples_set.contains(pheno.s))

The basic mental model here is that in Hail, keyed tables should be thought of as sorted by the key (here, lexicographically by sample ID). However, key_by does not guarantee that the ordering of the returned table is stable with respect to its parent. A foreign-key join like the one you’ve written (pheno_filt = pheno.filter(hl.is_defined(samples[pheno.s]))) does internal key_by operations, and therefore does not preserve the original order.

We can add a guarantee that the original order is preserved, though this will come at a performance cost.

Thanks for the quick responses! I have two more follow-up questions:

  1. How do people usually subsample genotypes and phenotypes in Hail? Is there a more idiomatic way to tell Hail I want a matching subset of genotypes and phenotypes?

  2. Can I easily check whether or not two keyed tables have the same order of keys?

  1. Can you describe what you want going in and going out? Running filter_cols will probably do what you want.
  2. I think you shouldn’t really be thinking much about key order – you should be thinking about key identity. Where is order coming into your workflow?

OK you are right. I actually don’t need the key orders to be exactly the same. I’m overthinking this.
Here’s my code for subsampling genotype and phenotype. I did use filter_cols for the genotype filtering.

samples = hl.import_table("autosomes.50K.sample", delimiter='\s+').key_by('ID_1')
ds = hl.import_bgen(bgen_fname,
                    entry_fields=['GT'],   # import_bgen requires entry_fields
                    variants=variants)
# down-sample to 50,000 samples
bgen = ds.filter_cols(hl.is_defined(samples[ds.s]))
pheno = hl.import_table('pheno.tsv.bgz').key_by('s')
pheno_filt = pheno.semi_join(samples)

I think this would work for my purpose, which is to do a GWAS for the down-sampled dataset.

Yep, this looks very straightforward!

Also, there’s a semi_join_cols for MTs:

bgen = ds.filter_cols(hl.is_defined(samples[ds.s]))      
# can be
bgen = ds.semi_join_cols(samples)

I think it also might be more straightforward to annotate the phenotypes onto the downsampled MT instead of filtering and writing a separate file. That seems like the most simple design here – and if you just need to use the phenotypes, then read_matrix_table(...).cols() will only load the column information, not the genotype data (should be as efficient as reading a table).

Annotating sounds like a great way to do this. Would the fact that there are hundreds (if not thousands) of phenotype information for each individual be a problem?
What’s the idea behind annotating and how to do that with the data I have? Would this work?

pheno = hl.import_table('pheno.tsv.bgz').key_by('s')
samples = hl.import_table("autosomes.50K.sample", delimiter='\s+').key_by('ID_1')
ds = hl.import_bgen(bgen_fname,
                    entry_fields=['GT'],
                    variants=variants)

bgen = ds.semi_join_cols(samples)
bgen = bgen.annotate_cols(pheno = pheno)

Hmm, yes, possibly. But for 50k it should be fine.

This line:

bgen = bgen.annotate_cols(pheno = pheno)

Needs to be

bgen = bgen.annotate_cols(pheno = pheno[bgen.s])

The annotating went well. But when I’m trying to write the data into a matrix table I got this:

Exception while sending command.
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

I’m running version 0.2.21-f16fd64e0d77.
I did a little search on the forum, and it looks like maybe I need to update something? If that’s the case, can you help me by writing down exactly what I need to do, please? If that’s not too much trouble.