SNP dosages to numpy/pandas?

This is a bit grody, but you could rewrite BlockMatrix.to_numpy yourself:

import os
import numpy as np
from hail.linalg import BlockMatrix
from hail.utils import new_temp_file, new_local_temp_file, local_path_uri

def to_numpy(bm, _force_blocking=False):
    # Large matrices (more than 2^31 entries) are exported block by block.
    if bm.n_rows * bm.n_cols > 1 << 31 or _force_blocking:
        path = new_temp_file()
        bm.export_blocks(path, binary=True)
        return BlockMatrix.rectangles_to_numpy(path, binary=True)

    # Otherwise write everything to one local file and read it back with NumPy.
    path = new_local_temp_file()
    try:
        uri = local_path_uri(path)
        bm.tofile(uri)
        return np.fromfile(path).reshape((bm.n_rows, bm.n_cols))
    finally:
        # Remove the temp file whether or not the read succeeded.
        try:
            os.remove(path)
        except FileNotFoundError:
            pass
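
In case the last branch looks mysterious: as far as I understand it, tofile dumps the entries as raw row-major float64 with no header, which is why the result has to be reshaped after np.fromfile. Here is a minimal sketch of that round trip using plain NumPy only (no Hail involved; the 2 x 3 array is just a made-up stand-in for a small BlockMatrix):

```python
import os
import tempfile

import numpy as np

# Made-up stand-in for the entries of a small matrix: 2 rows x 3 columns.
original = np.arange(6, dtype=np.float64).reshape((2, 3))

# Reserve a local temp file and close the handle so it can be reopened by name.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    original.tofile(path)      # raw float64 values, row-major, no shape stored
    flat = np.fromfile(path)   # default dtype is float64; comes back 1-D
    restored = flat.reshape(original.shape)
finally:
    os.remove(path)
```

The file carries no shape information, so the caller has to supply (n_rows, n_cols) itself, just as the function above does.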

Sorry if this is a stupid question, but where do I put that?

Dan’s snippet is missing the line that monkey-patches our code:

BlockMatrix.to_numpy = to_numpy

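Put the function definition and that assignment near the top of your script or notebook, after importing Hail. Once they have run, any code in your pipeline that calls BlockMatrix.to_numpy will use the patched version.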
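
If monkey-patching is new to you, here is a toy sketch of the mechanism. It uses a hypothetical stand-in class (Matrix, with a to_list method) instead of Hail so that it runs anywhere; none of these names come from Hail.

```python
class Matrix:
    """Hypothetical stand-in for a library class such as BlockMatrix."""

    def __init__(self, rows):
        self.rows = rows

    def to_list(self):
        # Pretend the method shipped by the library does not work for us.
        raise NotImplementedError("shipped version is broken")


# An instance created before the patch is applied.
m = Matrix([(1, 2), (3, 4)])


def to_list(self):
    # Replacement with the same calling convention: the instance comes first.
    return [list(row) for row in self.rows]


# The monkey-patch: rebind the attribute on the class itself.
Matrix.to_list = to_list

# Methods are looked up on the class at call time, so even the
# instance created earlier now uses the replacement.
result = m.to_list()
```

The same lookup rule is why a single assignment on BlockMatrix is enough: everything that later calls the method by name gets the replacement.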

Hi @Kevin_Anderson and Hail community.

This answer is probably too late.

Anyway, if you want to go from genomics data files to TFRecords (which you can plug into TensorFlow or PyTorch) via Hail, DNARecords is probably exactly what you need.

I have created a feature request to see what the Hail community thinks about it.

Hope this helps!

Regards,
Andrés.