Annotate column with a list, then merge two matrix tables

Hello - I have two matrix tables, geno and pheno. geno is read in from BGEN files, so the sample IDs are already there. pheno is organized such that each row corresponds to a sample. I have a list of sample IDs that I would like to annotate onto the rows of pheno:

samples = [ ] # list of sample IDs as strings (eg ['001', '002', '003'])
pheno_mt.annotate_rows(s = samples) # this does not work

But when I run pheno.describe() I cannot see the s field. How can I annotate so that samples[i] corresponds to pheno_mt row key i?

Once this is done, I need to subset pheno to reflect only the samples in geno, since we do not have geno data on some samples we have pheno data for.

Something like:

pheno_subset_mt = pheno.filter_cols(pheno.s.contains(geno.s))

but I’m sure my syntax is wrong somewhere.

Thanks in advance!

Update:

samples = [ ] # list of sample IDs as strings (eg ['001', '002', '003'])
pheno_mt = pheno_mt.annotate_rows(s = samples)

The above succeeds in annotating the rows, but it adds the entire list to each row.
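
A minimal sketch of one way to make samples[i] line up with row i, assuming the rows of pheno_mt are already in the same order as samples, is to attach a row index with add_row_index and use it to index into the literal list:

import hail as hl

samples = ['001', '002', '003']  # hypothetical sample IDs
pheno_mt = pheno_mt.add_row_index('row_idx')  # adds an int64 row index field
# look each row's ID up in the broadcast list by that index
pheno_mt = pheno_mt.annotate_rows(s = hl.literal(samples)[hl.int32(pheno_mt.row_idx)])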

Hi @dmoracze ,

I think you’ve said you have two matrix tables:

  • geno, rows are variants, columns are samples, entries are genotypes
  • phenos, rows are samples, columns are phenotypes, entries are measurements of said phenotype

I think you want one matrix table:

  • geno, rows are variants, columns are samples, entries are genotypes, and one sample annotation for every phenotype

I created sample data to match your data. My phenos file looks like this:

sample_id,weight,height
abc123,150,72
def456,135,60
ghi789,170,66

I think this does what you want, but it’s a bit complicated:

In [1]: import hail as hl 
   ...:  
   ...: mt = hl.import_bgen('/tmp/geno.bgen', entry_fields=['GP']) 
   ...: mt.describe() 
   ...:  
   ...: ph = hl.import_matrix_table('/tmp/phenos', delimiter=',', row_fields={'sample_id': hl.tstr}, row_key=['sample_id']) 
   ...: ph.describe() 
   ...:  
   ...: # collect all phenotypes as a struct 
   ...: all_phenotypes = ph.aggregate_cols(hl.agg.collect(ph.col_id)) 
   ...: ph = ph.annotate_rows(phenos_dict = hl.dict(hl.agg.collect((ph.col_id, ph.x)))) 
   ...: ph = ph.annotate_rows(**{ 
   ...:     phenotype: ph.phenos_dict[phenotype] 
   ...:     for phenotype in all_phenotypes 
   ...: }) 
   ...:  
   ...: # annotate phenotypes into the column fields of genos 
   ...: mt = mt.annotate_cols(**ph.rows()[mt.s]) 
   ...: mt.cols().show() 
   ...: mt.show()                                                                                                                                                                             
2021-05-24 16:30:11 Hail: INFO: Number of BGEN files parsed: 1
2021-05-24 16:30:11 Hail: INFO: Number of samples in BGEN files: 3
2021-05-24 16:30:11 Hail: INFO: Number of variants across all BGEN files: 10
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    's': str
----------------------------------------
Row fields:
    'locus': locus<GRCh37>
    'alleles': array<str>
    'rsid': str
    'varid': str
----------------------------------------
Entry fields:
    'GP': array<float64>
----------------------------------------
Column key: ['s']
Row key: ['locus', 'alleles']
----------------------------------------
----------------------------------------
Global fields:
    None
----------------------------------------
Column fields:
    'col_id': str
----------------------------------------
Row fields:
    'sample_id': str
----------------------------------------
Entry fields:
    'x': int32
----------------------------------------
Column key: ['col_id']
Row key: ['sample_id']
----------------------------------------
2021-05-24 16:30:11 Hail: INFO: Coerced sorted dataset
2021-05-24 16:30:11 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
2021-05-24 16:30:11 Hail: INFO: Coerced sorted dataset
+----------+----------------------------+--------+--------+
| s        | phenos_dict                | height | weight |
+----------+----------------------------+--------+--------+
| str      | dict<str, int32>           |  int32 |  int32 |
+----------+----------------------------+--------+--------+
| "abc123" | {"height":72,"weight":150} |     72 |    150 |
| "def456" | {"height":60,"weight":135} |     60 |    135 |
| "ghi789" | {"height":66,"weight":170} |     66 |    170 |
+----------+----------------------------+--------+--------+
+---------------+--------------------------------+------------------------------+------------------------------+------------------------------+
| locus         | alleles                        | 'abc123'.GP                  | 'def456'.GP                  | 'ghi789'.GP                  |
+---------------+--------------------------------+------------------------------+------------------------------+------------------------------+
| locus<GRCh37> | array<str>                     | array<float64>               | array<float64>               | array<float64>               |
+---------------+--------------------------------+------------------------------+------------------------------+------------------------------+
| 1:10177       | ["A","AC"]                     | [0.00e+00,1.00e+00,0.00e+00] | [0.00e+00,1.00e+00,0.00e+00] | [0.00e+00,1.00e+00,0.00e+00] |
| 1:10235       | ["T","TA"]                     | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10352       | ["T","TA"]                     | [0.00e+00,1.00e+00,0.00e+00] | [0.00e+00,1.00e+00,0.00e+00] | [0.00e+00,1.00e+00,0.00e+00] |
| 1:10505       | ["A","T"]                      | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10506       | ["C","G"]                      | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10511       | ["G","A"]                      | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10539       | ["C","A"]                      | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10542       | ["C","T"]                      | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10579       | ["C","A"]                      | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] | [1.00e+00,0.00e+00,0.00e+00] |
| 1:10616       | ["CCGCCGTTGCAAAGGCGCGCCG","C"] | [0.00e+00,0.00e+00,1.00e+00] | [0.00e+00,0.00e+00,1.00e+00] | [0.00e+00,0.00e+00,1.00e+00] |
+---------------+--------------------------------+------------------------------+------------------------------+------------------------------+

I think you should import your phenos as a table, not a matrix table. The code for that is much simpler:

In [1]: import hail as hl 
   ...:  
   ...: mt = hl.import_bgen('/tmp/geno.bgen', entry_fields=['GP']) 
   ...: mt.describe() 
   ...:  
   ...:  
   ...: ph = hl.import_table('/tmp/phenos', delimiter=',', impute=True, key='sample_id') 
   ...: ph.describe() 
   ...:  
   ...: # annotate phenotypes into the column fields of genos 
   ...: mt = mt.annotate_cols(**ph[mt.s]) 
   ...: mt.cols().show() 
   ...: mt.show()                                                                                                                                                                              

Yes, this is the case. I’m ultimately trying to join them to run a GWAS, one per column in pheno.

Sorry I hit return too quickly! See my edited reply.

Wonderful, thank you. I’ll try this tomorrow.

Hi @danking, I formatted my pheno file in a similar way to your toy example, except that my sample IDs are all numeric, though I want them read in as strings, with the rest of the table being floats.

My full pheno file is 16 GB and I ran into a memory error using the command you gave:

ph = hl.import_table('/tmp/phenos', delimiter=',', impute=True, key='sample_id') 

This is a bit confusing because I’m running it from a compute node on our HPC with an allotment of 64 GB of memory.

The types= kwarg to hl.import_table lets you specify types for each field. I think it cannot be used in combination with impute, which is unfortunate.
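
For example, a minimal sketch (the phenotype field names and types here are hypothetical):

import hail as hl

# assumption: sample_id should stay a string while the phenotype columns are floats
types = {'sample_id': hl.tstr, 'height': hl.tfloat64, 'weight': hl.tfloat64}
ph = hl.import_table('/tmp/phenos', delimiter=',', key='sample_id', types=types)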

Regarding memory, have you set PYSPARK_SUBMIT_ARGS?
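
For reference, one way to set it from Python, sketched under the assumption that it happens before Hail initializes Spark (the 48g value is just an example):

import os

# must be set before hl.init() so the Spark driver picks it up
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 48g pyspark-shell'

import hail as hl
hl.init()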

I have not. Currently I’m running this in a Jupyter notebook while I get the pipeline sorted out; then I’ll batch the GWAS over all the phenos using a Python script.

I’ll try setting this env variable.

Hi there, I tried to increase the memory with:

hl.init(spark_conf={'spark.driver.memory': '64g'})
pheno = hl.import_table(idpsFile, delimiter=',', key='subj', types=keys)

where keys is a dict mapping field names to types. I’m no longer getting the memory error, but I’m receiving a new one:

FatalError: MethodTooLargeException: Method too large: __C426etypeEncode.__m427ENCODE_r_tuple_of_o_stringANDo_float64ANDo_float64ANDo_float64ANDo_float64ANDo_float64ANDo_float64ANDo_float64ANDo_float64.... 

The first column of this table is a string and the next 79,800 columns are floats. I’m not sure if I’m doing something wrong or if this is a bug in Hail, since tables might not normally be this wide?

This is not a bug, but an expected limitation of our current compilation strategy on the JVM. The size of the code generated scales with the number of table fields, and the JVM has limits on the size of generated methods and classes.

Hail (and nearly every query engine) will start to break down when the schema gets too large: as a rough rule of thumb, above 1,000 fields or so.

The solution here is not to use import_table, but import_matrix_table. This will import your pheno file as a 2-dimensional matrix, instead of a table with 80k fields.

Once you have that matrix table, you can annotate groups of phenotypes onto another matrix table by using filter_cols first, then a method like localize_entries to convert to a table.
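
A rough sketch of that pattern, assuming the phenotype matrix table pheno_mt has row key 'subj', column key 'col_id', and entry field 'x' (the latter two are the import_matrix_table defaults shown earlier), that geno_mt is the BGEN-derived matrix table keyed by 's', and that fc1, fc2, fc3 stand in for whichever group of phenotypes you want:

import hail as hl

wanted = hl.literal({'fc1', 'fc2', 'fc3'})  # hypothetical group of phenotype columns
subset = pheno_mt.filter_cols(wanted.contains(pheno_mt.col_id))

# localize_entries converts the matrix table to a table: each row keeps its key plus an
# array of entry structs, and the column records land in a global field
pheno_tbl = subset.localize_entries('pheno_entries', 'pheno_cols')

# join the per-sample array of phenotype values onto the columns of the genotype matrix table
geno_mt = geno_mt.annotate_cols(phenos = pheno_tbl[geno_mt.s].pheno_entries)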

@tpoterba, I’m trying to read in and save as a matrix table but I’m getting memory errors again.

# create a dict of keys
keys = dict()
keys['subj'] = hl.tstr
for i in range(1,79801):
    keys[f'fc{i}'] = hl.tfloat64

hl.init(spark_conf={'spark.driver.memory': '64g'})
idpsFile = 'ALL_IDPs_i_deconf_wSubj_wCols.csv'
pheno_mt = hl.import_matrix_table(idpsFile, delimiter=',', row_key='subj', row_fields=keys)
pheno_mt.write('idps_i_deconf.mt')
Hail version: 0.2.64-1ef70187dc78
Error summary: OutOfMemoryError: GC overhead limit exceeded

I’m running this on a node with 64 GB of memory. The CSV table is 16 GB. Does Hail need significantly more memory to read in and save a matrix table than the size of the actual file?

Example of the table (11,533 x 79801)

subj,fc1,fc2,fc3,...
100,0.55774264400953,0.371827323689553,0.679924402402685,...
101,-0.858469745958063,-0.224915104171631,0.888783404436071,...

The memory overflow here is due to the footprint of compiler intermediate data structures – Hail hasn’t actually started reading the data in the file beyond the header by the time it runs OOM.

The problem with your code above is that you read all fields as row fields, rather than as entries of the matrix. You’ve still got 80k fields! What we want is 1-2 row fields for the key, and to have 80k columns of the matrix.

I think the following will do it:

pheno_mt = hl.import_matrix_table(idpsFile, delimiter=',', row_key='subj', row_fields={'subj': hl.tstr}, entry_type=hl.tfloat64)

Thank you, this worked!