Speed up of mt.entries().to_pandas()

nonchev · September 11, 2020, 10:57am

Hello,

i have wrote this code:

hl_samples = hl.array(samples) - individuals
for mt_path in mt_paths:
mt = hl.methods.read_matrix_table(mt_path)

    # filter samples
    mt = mt.filter_cols(hl_samples.contains(mt.s))
    df = mt.entries().to_pandas()

    # format df
    df[['ref','alt']] = pd.DataFrame(df.alleles.tolist(), index=df.index)
    df['GT'] = df['GT.alleles'].apply(_sum)
    df = df.drop(['alleles', 'GT.alleles', 'GT.phased'], axis=1)
    df.columns = ['chrom', 'pos', 'sample', 'DP', 'GQ', 'ref', 'alt', 'GT']
    df = df[['sample', 'chrom', 'pos', 'ref', 'alt', 'GT', 'DP', 'GQ']]

    # append to final df with all variants per sample
    res_df = res_df.append(df, ignore_index = True)

However df = mt.entries().to_pandas() takes 95% of the execution time and this is very slow.
Generally I am converting to panda.dataframe just because I didnt fount a way to do this df operations in HAIL.
Can you suggest me how can I speed it up?

Cheers,

nonchev · September 11, 2020, 11:06am

res_df would look like this in the end:

johnc1231 · September 11, 2020, 1:35pm

First of all, hail is lazily executed. So when you run to_pandas, it actually executes your whole pipeline at once (reading, filtering, writing to pandas). No matter what you do, that’s going to show up as the majority of the execution time.

Secondly, entries is always going to be relatively slow. You’re restructuring all your data, and converting it to an inefficient representation. The point of having MatrixTables is to avoid having to make a giant entries table.

Thirdly, you can do everything you’re doing here in hail. Consider referencing the cheat sheets here: https://hail.is/docs/0.2/cheatsheets.html to see functions available on MatrixTables.

You’re essentially just using Hail as a VCF importer in the above code, and then doing everything in Pandas. Don’t do that, it’s going to be a bad experience.

Topic		Replies	Views
Mt.key_cols_by().cols().flatten().to_pandas() is too slow Hail Query & hailctl	5	342	August 7, 2023
Very slow write and count operations after densify Hail Query & hailctl	0	316	January 31, 2023
Speeding up filtering and conversion of BlockMatrix Hail Query & hailctl	0	14	January 16, 2025
Can Hail convert mt.show() output to dataframe Hail Query & hailctl	3	428	January 13, 2023
Prepare hail entries for spark.ml Hail Query & hailctl	3	446	October 29, 2019

Speed up of mt.entries().to_pandas()

Related topics