Filter sample id not in trio matrix

qing · April 12, 2022, 8:41pm

Hi,

I have a large matrix table containing samples with both parents as well as samples that have only one parent or no parent information. I would like to split it into a trio matrix and a case-control matrix that holds the remaining samples, i.e. those not in the trio matrix.

The trio matrix can be generated using the trio_matrix() function.
I know one way to build the case-control matrix is to filter by sample ID, excluding every s that appears in proband.s, father.s, or mother.s. I just want to know if there is a more efficient way to do this, something like a non-trio matrix?

Thanks,

patrick-schultz · April 12, 2022, 8:52pm

Hi @qing,

Could you give a bit more detail on what you’re trying to do? What is the schema of your matrix table (the output of mt.describe()), or at least the relevant part of it? Do you have code for the “non efficient” way to make the case-control matrix, or if not could you explain a bit more what should be in the case-control matrix?

qing · April 12, 2022, 9:32pm

Thanks for the reply.

Here’s the mt.describe() of my original matrix table:

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
    'pheno': struct {
        fam_id: str, 
        pat_id: str, 
        mat_id: str, 
        is_female: bool, 
        is_case: bool
    }
    'sample_id': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AF: array<float64>, 
        AQ: array<int32>, 
        AC: array<int32>, 
        AN: int32, 
        segdup_flag: str
    }
    'is_y': bool
----------------------------------------
Entry fields:
    'GT': call
    'RNC': array<str>
    'DP': int32
    'AD': array<int32>
    'SB': array<int32>
    'GQ': int32
    'PL': array<int32>
----------------------------------------
Column key: ['sample_id']
Row key: ['locus', 'alleles']
----------------------------------------

Here’s the output from trio_mt.describe():

----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    'id': str
    'proband': struct {
        s: str, 
        pheno: struct {
            fam_id: str, 
            pat_id: str, 
            mat_id: str, 
            is_female: bool, 
            is_case: bool
        }, 
        sample_id: str, 
    }
    'father': struct {
        s: str, 
        pheno: struct {
            fam_id: str, 
            pat_id: str, 
            mat_id: str, 
            is_female: bool, 
            is_case: bool
        }, 
        sample_id: str, 
    }
    'mother': struct {
        s: str, 
        pheno: struct {
            fam_id: str, 
            pat_id: str, 
            mat_id: str, 
            is_female: bool, 
            is_case: bool
        }, 
        sample_id: str, 
    }
    'is_female': bool
    'fam_id': str
----------------------------------------
Row fields:
    'locus': locus<GRCh38>
    'alleles': array<str>
    'rsid': str
    'qual': float64
    'filters': set<str>
    'info': struct {
        AF: array<float64>, 
        AQ: array<int32>, 
        AC: array<int32>, 
        AN: int32, 
        segdup_flag: str
    }
    'is_y': bool
----------------------------------------
Entry fields:
    'proband_entry': struct {
        GT: call, 
        RNC: array<str>, 
        DP: int32, 
        AD: array<int32>, 
        SB: array<int32>, 
        GQ: int32, 
        PL: array<int32>
    }
    'father_entry': struct {
        GT: call, 
        RNC: array<str>, 
        DP: int32, 
        AD: array<int32>, 
        SB: array<int32>, 
        GQ: int32, 
        PL: array<int32>
    }
    'mother_entry': struct {
        GT: call, 
        RNC: array<str>, 
        DP: int32, 
        AD: array<int32>, 
        SB: array<int32>, 
        GQ: int32, 
        PL: array<int32>
    }
----------------------------------------
Column key: ['id']
Row key: ['locus', 'alleles']
----------------------------------------

On rethinking this question, I noticed that both matrix tables use the sample ID as the column key, labeled “sample_id” in mt and “id” in trio_mt. The “id” in trio_mt should be the union of the IDs from proband, father, and mother. So, to exclude the trio samples, I could filter the sample_id values in mt against the id values in trio_mt. Is that correct?

Then my question is how to annotate the columns of mt using the id from trio_mt. Here’s my code:

trio_mt = hl.trio_matrix(mt, pedigree, complete_trios=True)
mt = mt.annotate_cols(id_in_trio = trio_mt[mt.sample_id].id)
cc_mt = mt.filter_cols(mt.sample_id == mt.id_in_trio, keep=False)

But I got this error message:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-c24456f7622f> in <module>
      1 trio_mt = hl.trio_matrix(mt, pedigree, complete_trios=True)
----> 2 mt = mt.annotate_cols(id_in_trio = trio_mt[mt.sample_id].id)
      3 cc_mt = mt.filter_cols(mt.sample_id == mt.id_in_trio, keep=False)

/gpfs/home/qwu24/ngs/lib/python3.7/site-packages/hail/matrixtable.py in __getitem__(self, item)
    627             except TypeError as e:
    628                 raise invalid_usage from e
--> 629         raise invalid_usage
    630 
    631     @property

TypeError: MatrixTable.__getitem__: invalid index argument(s)
  Usage 1: field selection: mt['field']
  Usage 2: Entry joining: mt[mt2.row_key, mt2.col_key]

  To join row or column fields, use one of the following:
    rows:
       mt.index_rows(mt2.row_key)
       mt.rows().index(mt2.row_key)
       mt.rows()[mt2.row_key]
    cols:
       mt.index_cols(mt2.col_key)
       mt.cols().index(mt2.col_key)
       mt.cols()[mt2.col_key]

Any help would be appreciated.

patrick-schultz · April 13, 2022, 12:17pm

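The error is saying that on a matrix table, the index notation mt[...] can only be used to select a field by name or to index an entry with a (row key, column key) pair. To join column fields, use one of the forms the message lists, such as trio_mt.cols()[mt.sample_id].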

I can’t say whether filtering on the id in trio_mt is what you want. But to achieve that, you can use anti_join_cols:

cc_mt = mt.anti_join_cols(trio_mt.cols())
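
One caveat worth checking: as far as I understand, the id column key of a trio matrix is the proband's sample ID, so the anti-join above would drop probands but keep their parents. If the goal is to remove every trio member, the set of IDs to exclude has to include fathers and mothers too. Below is a minimal plain-Python sketch of that set logic, with no Hail involved. The Trio tuple and both function names are hypothetical, made up for illustration, and the definition of a "complete" trio (both parents known and all three samples present) is my assumption about what complete_trios=True is meant to select.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Trio(NamedTuple):
    """Hypothetical stand-in for one pedigree row (not Hail's own Trio class)."""
    s: str                 # proband sample ID
    pat_id: Optional[str]  # father sample ID, None if unknown
    mat_id: Optional[str]  # mother sample ID, None if unknown


def trio_member_ids(trios: Iterable[Trio], all_samples: Iterable[str]) -> Set[str]:
    """Collect every sample ID that belongs to a complete trio."""
    present = set(all_samples)
    members: Set[str] = set()
    for trio in trios:
        ids = (trio.s, trio.pat_id, trio.mat_id)
        # Assumption: a trio is complete only when both parents are known
        # and the proband, father, and mother are all in the dataset.
        if all(i is not None and i in present for i in ids):
            members.update(ids)
    return members


def non_trio_ids(all_samples: Iterable[str], trios: Iterable[Trio]) -> Set[str]:
    """Return the samples left over once all trio members are removed."""
    samples = set(all_samples)
    return samples - trio_member_ids(trios, samples)
```

In Hail, a set like the one returned by trio_member_ids could presumably be wrapped in hl.literal and used in mt.filter_cols(..., keep=False) on sample_id. I have not run that wiring against this dataset, so treat it as a starting point rather than a tested recipe.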
