Exporting data from MatrixTable into TSV

CuriousGeneticist · December 13, 2021, 9:56pm

Hi all! I know that I am able to export a Table into a TSV file, but can I choose a few entries/rows/cols from a MatrixTable and export it into a TSV? Thank you so much!

danking · December 14, 2021, 3:35pm

Exporting columns to a TSV is not possible because TSVs have no obvious way to represent column metadata. You could export a single entry field to a TSV:

small_mt.GT.export('output/gt.tsv')

If you want to include more row fields you can add them as key fields:

small_mt = small_mt.key_rows_by(*mt.row_key, 'rowfield1', 'rowfield2', ...)
small_mt.GT.export('output/gt-with-some-row-fields.tsv')

You can export all of the entry fields, but they’ll be exported as a JSON dictionary:

small_mt.entry.export('output/all-entry-fields-as-json-dict.tsv')

More examples here.

Aside: If you’re just trying to get the data out of Hail and into Python, you might try localize_entries followed by collect:

In [9]: mt = hl.utils.range_matrix_table(2, 2) 
   ...: mt = mt.annotate_entries(x = (mt.row_idx + 1) * (mt.col_idx + 1)) 
   ...: t = mt.localize_entries('entries') 
   ...: t.collect()                                                                                                                                                                                                 
Out[9]: 
[Struct(row_idx=0, entries=[Struct(x=1), Struct(x=2)]),
 Struct(row_idx=1, entries=[Struct(x=2), Struct(x=4)])]

If you’re trying to convert Hail MTs to pandas, you probably want something like this:

In [28]: import hail as hl
    ...: import pandas as pd
    ...: 
    ...: mt = hl.utils.range_matrix_table(2, 2) 
    ...: mt = mt.annotate_entries(x = (mt.row_idx + 1) * (mt.col_idx + 1)) 
    ...: mt = mt.annotate_rows(row_field = hl.str('hello! ') + hl.str(mt.row_idx)) 
    ...:  
    ...:  
    ...: sample_names = hl.delimit(hl.array([hl.str(mt[key_field]) for key_field in mt.col_key]), '_').collect() 
    ...: t = mt.localize_entries('entries') 
    ...: t = t.annotate(**{ 
    ...:     s + '_' + field: t.entries[i][field] 
    ...:     for i, s in enumerate(sample_names) 
    ...:     for field in mt.entry 
    ...: }) 
    ...: t = t.drop('entries') 
    ...: df = pd.DataFrame(t.collect()) 
    ...: df.set_index(list(mt.row_key), inplace=True) 
    ...: df                                                                                                                                                                                                         
Out[28]: 
         0_x  1_x row_field
row_idx                    
0          1    2  hello! 0
1          2    4  hello! 1

CuriousGeneticist · December 14, 2021, 7:02pm

Thank you so much for your response @danking! I am trying to export, but it seems to be taking a while. Is this an expensive operation?

danking · December 14, 2021, 7:08pm

How many rows and how many columns are in your matrix table?

CuriousGeneticist · December 14, 2021, 7:17pm

I believe there are at least a few hundred thousand rows…

danking · December 14, 2021, 7:35pm

If you have even a modest 1000 samples (columns), the absolute best case scenario is that your pandas data frame is about this big:

In [5]: 1000 * 500_000 * 8 / 1024 / 1024 / 1024                                                                                                                                                                     
Out[5]: 3.725290298461914

Nearly 3.7 GB. In reality it’s almost certainly bigger.

How many columns do you have? And do you have any code in between:

mt = hl.read_matrix_table(...)

and

t = mt.localize_entries('entries')

?

If you share your whole script I can help you find slow parts, but, in general, you should avoid loading an entire genetics dataset into memory (e.g. into pandas). Most genetics datasets are require the use of streaming operations which Hail automatically uses.

CuriousGeneticist · December 14, 2021, 8:33pm

I just tried this:
new_mt.entries().PBT_GT.export(“gt.tsv”)
and it somehow worked. The file was around 100+ MB although they are a few million lines. Does this file size make sense, or was it truncated?

danking · December 14, 2021, 8:45pm

I can’t say if that file size makes sense because I do not know how many columns/samples you have. In general, I have no reason to believe that command will truncate the output.

If you’re exporting to text, I recommend using gt.tsv.bgz which will automatically compress the file and should be faster.

new_mt.entries().PBT_GT.export("gt.tsv") will create one row per variant, per sample. So you’ll have n_variants * n_samples rows. This is different from the VCF representation which has one row per variant.

Topic		Replies	Views
Write matrix table to a csv file Hail Query & hailctl	7	699	October 21, 2021
Looking for a function to export a matrix table to tsv like export_samples() in hail0.1 Hail Query & hailctl	8	525	July 30, 2020
Matrix table - custom entry field Hail Query & hailctl	5	461	September 11, 2020
Export Specific Column From MT Hail Query & hailctl	4	664	June 24, 2020
Opposite of to_matrix_table_row_major() Hail Query & hailctl	2	290	June 5, 2023

Exporting data from MatrixTable into TSV

Related topics