Filter sample id not in trio matrix

Hi,

I have a large matrix table containing samples with both parents as well as samples with only one parent or no parent information. I would like to split it into a trio matrix and a case-control matrix that holds the remaining samples, i.e. those not in the trio matrix.

The trio matrix can be generated using the trio_matrix() function.
I know one way to build the case-control matrix is to exclude every sample whose s appears in proband.s, father.s, or mother.s. Is there a more efficient way to do this, something like a non-trio_matrix?

Thanks,

Hi @qing,

Could you give a bit more detail on what you’re trying to do? What is the schema of your matrix table (the output of mt.describe()), or at least the relevant part of it? Do you have code for the “non efficient” way to make the case-control matrix, or if not could you explain a bit more what should be in the case-control matrix?

Thanks for the reply.

Here’s the mt.describe() of my original matrix table:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'pheno': struct {
        fam_id: str, 
        pat_id: str, 
        mat_id: str, 
        is_female: bool, 
        is_case: bool
    }
    'sample_id': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AF: array<float64>, 
        AQ: array<int32>, 
        AC: array<int32>, 
        AN: int32, 
        segdup_flag: str
    }
    'is_y': bool
----------------------------------------
Entry fields:
    'GT': call
    'RNC': array<str>
    'DP': int32
    'AD': array<int32>
    'SB': array<int32>
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['sample_id']
Row key: ['locus', 'alleles']
----------------------------------------

Here’s the output from trio_mt.describe():

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    'id': str
    'proband': struct {
        s: str, 
        pheno: struct {
            fam_id: str, 
            pat_id: str, 
            mat_id: str, 
            is_female: bool, 
            is_case: bool
        }, 
        sample_id: str, 
    }
    'father': struct {
        s: str, 
        pheno: struct {
            fam_id: str, 
            pat_id: str, 
            mat_id: str, 
            is_female: bool, 
            is_case: bool
        }, 
        sample_id: str, 
    }
    'mother': struct {
        s: str, 
        pheno: struct {
            fam_id: str, 
            pat_id: str, 
            mat_id: str, 
            is_female: bool, 
            is_case: bool
        }, 
        sample_id: str, 
    }
    'is_female': bool
    'fam_id': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AF: array<float64>, 
        AQ: array<int32>, 
        AC: array<int32>, 
        AN: int32, 
        segdup_flag: str
    }
    'is_y': bool
----------------------------------------
Entry fields:
    'proband_entry': struct {
        GT: call, 
        RNC: array<str>, 
        DP: int32, 
        AD: array<int32>, 
        SB: array<int32>, 
        GQ: int32, 
        PL: array<int32>
    }
    'father_entry': struct {
        GT: call, 
        RNC: array<str>, 
        DP: int32, 
        AD: array<int32>, 
        SB: array<int32>, 
        GQ: int32, 
        PL: array<int32>
    }
    'mother_entry': struct {
        GT: call, 
        RNC: array<str>, 
        DP: int32, 
        AD: array<int32>, 
        SB: array<int32>, 
        GQ: int32, 
        PL: array<int32>
    }
----------------------------------------
Column key: ['id']
Row key: ['locus', 'alleles']
----------------------------------------

On rethinking this question, I noticed that both matrix tables use the sample ID as the column key: it is called “sample_id” in mt and “id” in trio_mt. The “id” in trio_mt should be the union of the IDs from proband, father, and mother. So, to exclude the trio samples, I could filter sample_id in mt against id in trio_mt. Is that correct?

My question, then, is how to annotate the columns of mt with the id from trio_mt. Here’s my code:

trio_mt = hl.trio_matrix(mt, pedigree, complete_trios=True)
mt = mt.annotate_cols(id_in_trio = trio_mt[mt.sample_id].id)
cc_mt = mt.filter_cols(mt.sample_id == mt.id_in_trio, keep=False)

But I got this error message:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-c24456f7622f> in <module>
      1 trio_mt = hl.trio_matrix(mt, pedigree, complete_trios=True)
----> 2 mt = mt.annotate_cols(id_in_trio = trio_mt[mt.sample_id].id)
      3 cc_mt = mt.filter_cols(mt.sample_id == mt.id_in_trio, keep=False)

/gpfs/home/qwu24/ngs/lib/python3.7/site-packages/hail/matrixtable.py in __getitem__(self, item)
    627             except TypeError as e:
    628                 raise invalid_usage from e
--> 629         raise invalid_usage
    630 
    631     @property

TypeError: MatrixTable.__getitem__: invalid index argument(s)
  Usage 1: field selection: mt['field']
  Usage 2: Entry joining: mt[mt2.row_key, mt2.col_key]

  To join row or column fields, use one of the following:
    rows:
       mt.index_rows(mt2.row_key)
       mt.rows().index(mt2.row_key)
       mt.rows()[mt2.row_key]
    cols:
       mt.index_cols(mt2.col_key)
       mt.cols().index(mt2.col_key)
       mt.cols()[mt2.col_key]

Any help would be appreciated.

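The error is saying that, apart from field selection, the index notation mt[...] on a matrix table can only be used to join entries, which takes a (row key, column key) pair. To join column fields with a single column key, use one of the cols forms listed in the message, such as index_cols.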

I can’t say whether filtering on the id of trio_mt is what you want, but to do that you can use anti_join_cols:

cc_mt = mt.anti_join_cols(trio_mt.cols())
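
One caveat worth checking: as far as I understand the trio_matrix documentation, the column key id of trio_mt is the proband’s ID, not a union of proband, father, and mother. If so, an anti-join on that key drops only the probands and keeps their parents. The plain-Python sketch below models the difference without needing Hail; the sample IDs and the anti_join helper are made up for illustration and are not part of any library.

```python
# Plain-Python model of a column anti-join; no Hail needed.
# All sample IDs below are hypothetical.

# Column keys of the original matrix table (mt.sample_id).
all_samples = ["kid1", "dad1", "mom1", "kid2", "mom2", "solo3"]

# One record per complete trio, mirroring the proband/father/mother
# structs in trio_mt. The column key `id` is assumed to be the proband's ID.
trios = [
    {"id": "kid1", "proband": "kid1", "father": "dad1", "mother": "mom1"},
]


def anti_join(samples, keys):
    """Keep the samples whose ID is absent from `keys`, preserving order."""
    excluded = set(keys)
    return [s for s in samples if s not in excluded]


# Anti-join on the trio key alone: only probands are dropped.
without_probands = anti_join(all_samples, [t["id"] for t in trios])

# Anti-join on every trio member: probands and both parents are dropped.
trio_members = [t[role] for t in trios for role in ("proband", "father", "mother")]
case_control = anti_join(all_samples, trio_members)

print(without_probands)  # ['dad1', 'mom1', 'kid2', 'mom2', 'solo3']
print(case_control)      # ['kid2', 'mom2', 'solo3']
```

If the parents should also be excluded, one possible approach in Hail is to collect proband.sample_id, father.sample_id, and mother.sample_id from trio_mt into a Python set, wrap it with hl.literal, and pass a contains check on mt.sample_id to filter_cols with keep=False. I have not run this against your data, so treat it as a starting point.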