Speed up of mt.entries().to_pandas()


i have wrote this code:

hl_samples = hl.array(samples) - individuals
for mt_path in mt_paths:
mt = hl.methods.read_matrix_table(mt_path)

    # filter samples
    mt = mt.filter_cols(hl_samples.contains(mt.s))
    df = mt.entries().to_pandas()

    # format df
    df[['ref','alt']] = pd.DataFrame(df.alleles.tolist(), index=df.index)
    df['GT'] = df['GT.alleles'].apply(_sum)
    df = df.drop(['alleles', 'GT.alleles', 'GT.phased'], axis=1)
    df.columns = ['chrom', 'pos', 'sample', 'DP', 'GQ', 'ref', 'alt', 'GT']
    df = df[['sample', 'chrom', 'pos', 'ref', 'alt', 'GT', 'DP', 'GQ']]

    # append to final df with all variants per sample
    res_df = res_df.append(df, ignore_index = True)

However df = mt.entries().to_pandas() takes 95% of the execution time and this is very slow.
Generally I am converting to panda.dataframe just because I didnt fount a way to do this df operations in HAIL.
Can you suggest me how can I speed it up?


res_df would look like this in the end:

First of all, hail is lazily executed. So when you run to_pandas, it actually executes your whole pipeline at once (reading, filtering, writing to pandas). No matter what you do, that’s going to show up as the majority of the execution time.

Secondly, entries is always going to be relatively slow. You’re restructuring all your data, and converting it to an inefficient representation. The point of having MatrixTables is to avoid having to make a giant entries table.

Thirdly, you can do everything you’re doing here in hail. Consider referencing the cheat sheets here: https://hail.is/docs/0.2/cheatsheets.html to see functions available on MatrixTables.

You’re essentially just using Hail as a VCF importer in the above code, and then doing everything in Pandas. Don’t do that, it’s going to be a bad experience.