Accessing BlockMatrix data in parallel

Hi, I have a custom Python generator that’s pulling data out of a Hail BlockMatrix to use for deep learning, as previously discussed here.

This has been working pretty well by itself, but when I try to parallelise the data generator (by running multiple workers), I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-4cdb9d3fb71c> in <module>
----> 1 for X,Y in train_loader:
      2     print(X)

~/anaconda3/envs/hailpyt/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    817             else:
    818                 del self._task_info[idx]
--> 819                 return self._process_data(data)
    820 
    821     next = __next__  # Python 2 compatibility

~/anaconda3/envs/hailpyt/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
    844         self._try_put_index()
    845         if isinstance(data, ExceptionWrapper):
--> 846             data.reraise()
    847         return data
    848 

~/anaconda3/envs/hailpyt/lib/python3.6/site-packages/torch/_utils.py in reraise(self)
    383             # (https://bugs.python.org/issue2651), so we work around it.
    384             msg = KeyErrorMessage(msg)
--> 385         raise self.exc_type(msg)

TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ch283/anaconda3/envs/hailpyt/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ch283/anaconda3/envs/hailpyt/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/data/ch283/MyIterableDataset.py", line 62, in __iter__
    yield self.__get_data(batch)
  File "/data/ch283/MyIterableDataset.py", line 36, in __get_data
    X = self.bm.filter_cols(batch).to_numpy()
  File "/data/ch283/modelfit_norm.py", line 36, in to_numpy
    bm.tofile(uri)
  File "<decorator-gen-1467>", line 2, in tofile
  File "/home/ch283/anaconda3/envs/hailpyt/lib/python3.6/site-packages/hail/typecheck/check.py", line 614, in wrapper
    return __original_func(*args_, **kwargs_)
  File "/home/ch283/anaconda3/envs/hailpyt/lib/python3.6/site-packages/hail/linalg/blockmatrix.py", line 1170, in tofile
    Env.backend().execute(BlockMatrixWrite(self._bmir, writer))
  File "/home/ch283/anaconda3/envs/hailpyt/lib/python3.6/site-packages/hail/backend/spark_backend.py", line 296, in execute
    result = json.loads(self._jhc.backend().executeJSON(jir))
  File "/home/ch283/anaconda3/envs/hailpyt/lib/python3.6/json/__init__.py", line 348, in loads
    'not {!r}'.format(s.__class__.__name__))
TypeError: the JSON object must be str, bytes or bytearray, not 'JavaMap'
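
For reference, train_loader is built along these lines (a rough sketch of my setup; the import, constructor arguments, batch size, and worker count are illustrative):

from torch.utils.data import DataLoader
from MyIterableDataset import MyIterableDataset

dataset = MyIterableDataset(bm, labels, batch_size=64)  # wraps the Hail BlockMatrix
train_loader = DataLoader(dataset, batch_size=None, num_workers=4)  # fails once num_workers > 0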

Wondering if anyone can shed some light on this. Is it due to multiple workers trying to interact with the BlockMatrix at once?

This is a Hail bug; I’ll put a fix in ASAP and respond here.

Can you share what version of Hail you are using? hl.version() is sufficient.

Thanks!

hl.version() gives me 0.2.50-32fc1de02d32. I may be due for an update.

Can you try the latest version of Hail, 0.2.63? I’m having trouble reproducing this error on that version.

Yes, I just upgraded and tried it - same error.

I’m having trouble replicating this. This works for me; does it work for you?

import hail as hl

mt = hl.balding_nichols_model(1, 2, 2)
bm = hl.linalg.BlockMatrix.from_entry_expr(mt.GT.n_alt_alleles())
bm = bm.filter_cols([0])
bm.tofile('/tmp/foo')

Can you create a small example that also fails?

That works for me too. I’m trying to come up with a small example, but it’s a bit challenging as I don’t get that error until parallelisation is introduced - I can get a NumPy array from a BlockMatrix no problem with only one worker. I’ll see if I can come up with something.

Ahh, yes, Hail currently relies on Py4J, which is not thread-safe. I actually have a fix in progress for this. Let me get back to you.


Are the worker “processes” actual Python processes or are they threads?

I believe they’re Python processes. I’ve subclassed PyTorch’s IterableDataset class.
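
Roughly, it looks like this (a stripped-down sketch; the constructor arguments and label handling are placeholders, and the real class does more):

from torch.utils.data import IterableDataset, get_worker_info

class MyIterableDataset(IterableDataset):
    def __init__(self, bm, labels, batch_size):
        self.bm = bm                  # hail.linalg.BlockMatrix of features
        self.labels = labels          # per-column targets (placeholder)
        self.batch_size = batch_size

    def __get_data(self, batch):
        X = self.bm.filter_cols(batch).to_numpy()  # each worker process calls back into Hail here
        Y = self.labels[batch]
        return X, Y

    def __iter__(self):
        n_cols = self.bm.shape[1]
        batches = [list(range(i, min(i + self.batch_size, n_cols)))
                   for i in range(0, n_cols, self.batch_size)]
        info = get_worker_info()
        if info is not None:          # shard the batches across DataLoader workers
            batches = batches[info.id::info.num_workers]
        for batch in batches:
            yield self.__get_data(batch)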

Hail’s backend isn’t thread-safe right now. We’ll fix that by synchronizing calls to the evaluation method, but this will probably mean that you won’t benefit much from the parallelism of PyTorch.

Hi @ch283,

Because you are working with PyTorch, maybe this Feature Request could be of interest.

Regards,
Andrés.

I am running Hail locally on a big machine with 94 cores. I am planning to run some Hail code inside a ThreadPoolExecutor, with each thread working on independent .mt files (reading non-block gzip files and converting each one to a .mt file). Would that work?
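
The pattern would look roughly like this (a sketch; hl.import_vcf with force=True stands in for my actual conversion step, and the paths are placeholders):

import hail as hl
from concurrent.futures import ThreadPoolExecutor

def convert(path):
    mt = hl.import_vcf(path, force=True)  # force=True reads non-block gzip on a single core
    mt.write(path.replace('.gz', '.mt'), overwrite=True)

paths = [...]  # the ~100 .gz files
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(convert, paths))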

How many matrix tables do you want to generate? I don’t think this is going to work with a ThreadPoolExecutor in Python, but Hail does have some utilities for co-scheduling lots of writes at the same time.

It’s about 100 .gz files that I need to read and convert to .mt files. Is there a better way to do it?

It seems that ThreadPoolExecutor speeds it up quite a bit, and I can see multiple jobs running in parallel in the Spark dashboard. Without it, only one job runs at a time.

Would you be able to share some of these utilities for scheduling multiple writes together?

Through an oversight, it appears not to be documented, but you can use:

hl.experimental.write_matrix_tables(mts, path_prefix)
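
For example (a sketch, assuming mts is a list of matrix tables built up front; the import call and prefix are placeholders):

import hail as hl

mts = [hl.import_vcf(p, force=True) for p in paths]           # build the lazy matrix tables
hl.experimental.write_matrix_tables(mts, '/data/converted/')  # co-schedules all of the writes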

If TPE is working for you, though, no need to break it!

TPE works about 90% of the time; sometimes it crashes randomly. Maybe something that Hail is using is not thread-safe. It usually crashes with an error saying it could not unify some type X with some type Y. I hope that the times it works, it does not somehow produce readable but incorrect results.