Exporting data from MatrixTable into TSV

Hi all! I know that I am able to export a Table into a TSV file, but can I choose a few entries/rows/cols from a MatrixTable and export it into a TSV? Thank you so much!

Exporting columns to a TSV is not possible because TSVs have no obvious way to represent column metadata. You could export a single entry field to a TSV:

small_mt.GT.export('output/gt.tsv')

If you want to include more row fields you can add them as key fields:

small_mt = small_mt.key_rows_by(*small_mt.row_key, 'rowfield1', 'rowfield2', ...)
small_mt.GT.export('output/gt-with-some-row-fields.tsv')

You can export all of the entry fields, but they’ll be exported as a JSON dictionary:

small_mt.entry.export('output/all-entry-fields-as-json-dict.tsv')
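If you later need to read that JSON-dict TSV back in plain Python, the standard library is enough. This is a sketch, not Hail API: the file contents, column names, and field names below are hypothetical stand-ins for whatever your export actually produces.

```python
import csv
import io
import json

# Simulated contents of a TSV exported with entry structs as JSON:
# key column(s) first, then one JSON-encoded entry struct per sample.
# All names and values here are made up for illustration.
tsv_text = (
    "locus\tsample_1\tsample_2\n"
    '1:1000\t{"GT":"0/1","DP":12}\t{"GT":"1/1","DP":9}\n'
)

rows = []
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
for record in reader:
    # Decode every non-key column from its JSON representation.
    decoded = {k: (json.loads(v) if k != "locus" else v)
               for k, v in record.items()}
    rows.append(decoded)

print(rows[0]["sample_1"]["DP"])  # 12
```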

More examples are in the Hail documentation.


Aside: If you’re just trying to get the data out of Hail and into Python, you might try localize_entries followed by collect:

In [9]: mt = hl.utils.range_matrix_table(2, 2) 
   ...: mt = mt.annotate_entries(x = (mt.row_idx + 1) * (mt.col_idx + 1)) 
   ...: t = mt.localize_entries('entries') 
   ...: t.collect()
Out[9]: 
[Struct(row_idx=0, entries=[Struct(x=1), Struct(x=2)]),
 Struct(row_idx=1, entries=[Struct(x=2), Struct(x=4)])]

If you’re trying to convert Hail MTs to pandas, you probably want something like this:

In [28]: import hail as hl
    ...: import pandas as pd
    ...: 
    ...: mt = hl.utils.range_matrix_table(2, 2) 
    ...: mt = mt.annotate_entries(x = (mt.row_idx + 1) * (mt.col_idx + 1)) 
    ...: mt = mt.annotate_rows(row_field = hl.str('hello! ') + hl.str(mt.row_idx)) 
    ...:  
    ...:  
    ...: sample_names = hl.delimit(hl.array([hl.str(mt[key_field]) for key_field in mt.col_key]), '_').collect() 
    ...: t = mt.localize_entries('entries') 
    ...: t = t.annotate(**{ 
    ...:     s + '_' + field: t.entries[i][field] 
    ...:     for i, s in enumerate(sample_names) 
    ...:     for field in mt.entry 
    ...: }) 
    ...: t = t.drop('entries') 
    ...: df = pd.DataFrame(t.collect()) 
    ...: df.set_index(list(mt.row_key), inplace=True) 
    ...: df
Out[28]: 
         0_x  1_x row_field
row_idx                    
0          1    2  hello! 0
1          2    4  hello! 1

Thank you so much for your response @danking! I am trying to export, but it seems to be taking a while. Is this an expensive operation?

How many rows and how many columns are in your matrix table?

I believe there are at least a few hundred thousand rows…

If you have even a modest 1,000 samples (columns) and 500,000 rows, the absolute best-case scenario is that your pandas data frame is about this big:

In [5]: 1000 * 500_000 * 8 / 1024 / 1024 / 1024
Out[5]: 3.725290298461914

That's roughly 3.7 GiB, and in reality it's almost certainly bigger.
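That arithmetic generalizes. Here is a small helper (the function name is mine, not part of Hail) for estimating the in-memory lower bound for other shapes, assuming a dense table of 8-byte values:

```python
def min_dataframe_gib(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Lower-bound estimate of a dense numeric table's size in GiB."""
    return n_rows * n_cols * bytes_per_value / 1024 ** 3

# 500,000 variants x 1,000 samples, 8 bytes per value:
print(round(min_dataframe_gib(500_000, 1_000), 2))  # 3.73
```

Real data frames carry Python object overhead for strings and structs, so treat this as a floor, not an estimate of actual usage.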

How many columns do you have? And do you have any code in between:

mt = hl.read_matrix_table(...)

and

t = mt.localize_entries('entries')

?

If you share your whole script I can help you find the slow parts, but, in general, you should avoid loading an entire genetics dataset into memory (e.g. into pandas). Most genetics datasets require streaming operations, which Hail uses automatically.

I just tried this:
new_mt.entries().PBT_GT.export("gt.tsv")
and it somehow worked. The file is around 100 MB, even though it has a few million lines. Does this file size make sense, or was the output truncated?

I can’t say whether that file size makes sense because I do not know how many columns/samples you have. In general, I have no reason to believe that command will truncate the output.

If you’re exporting to text, I recommend using the .tsv.bgz extension (e.g. gt.tsv.bgz), which will automatically block-compress the file and should be faster.
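The .bgz format is block gzip, so the size benefit is essentially that of ordinary gzip (with the bonus that blocks can be decompressed in parallel). A rough standard-library illustration of why compression helps here, using plain gzip rather than bgzip and a made-up genotype line:

```python
import gzip

# Repetitive TSV-like text compresses very well; real genotype columns
# are typically at least this repetitive. The line below is invented.
line = "1:12345\t0/1\t0/0\t1/1\n"
raw = (line * 10_000).encode()
compressed = gzip.compress(raw)

# The compressed file is a small fraction of the original size.
print(len(compressed) < len(raw) // 10)  # True
```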

new_mt.entries().PBT_GT.export("gt.tsv") will create one row per variant, per sample. So you’ll have n_variants * n_samples rows. This is different from the VCF representation which has one row per variant.
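In concrete terms, using the (hypothetical) numbers from earlier in the thread, the flattening from matrix to long format multiplies out like this:

```python
n_variants = 500_000  # "a few hundred thousand rows", assumed
n_samples = 1_000     # assumed column count

# entries() flattens the matrix: one output row per (variant, sample) pair,
# versus one row per variant in the VCF representation.
n_tsv_rows = n_variants * n_samples
print(f"{n_tsv_rows:,}")  # 500,000,000
```

So a TSV from entries() can easily have hundreds of millions of lines even when the original VCF had only hundreds of thousands.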