Annotate_cols out of memory issues

import hail as hl

# get_ukbb_data, array_sample_map_ht, and array_sample_map are helpers
# from our UKBB QC codebase.

data_source = 'broad'
freeze = 5

hardcalls = get_ukbb_data(data_source, freeze, raw=False, split=True, adj=False)

sample_map_ht = hl.read_table(array_sample_map_ht(data_source, freeze))
sample_map = hl.import_table(array_sample_map(freeze), delimiter=',', quote='"')
sample_map = sample_map.key_by(s=sample_map.eid_26041)

hardcalls = hardcalls.annotate_cols(**sample_map_ht[hardcalls.s])
hardcalls = hardcalls.annotate_cols(**sample_map[hardcalls.ukbb_app_26041_id])

hardcalls = hardcalls.select_cols('batch', 'batch.c')
hardcalls = hardcalls.transmute_cols(batch_num=hardcalls['batch'],
                                     batch=hardcalls['batch.c'])
hardcalls.cols()._force_count()

This also crashed.

I cannot replicate the memory blow-up with the above pipeline on simulated data with 500K columns.

My attempt:

1078:2019-12-19 19:15:38 root: INFO: RegionPool: REPORT_THRESHOLD: 11.6M allocated (128.0K blocks / 11.5M chunks), thread 15: Thread-4
1079:2019-12-19 19:15:38 root: INFO: RegionPool: REPORT_THRESHOLD: 11.7M allocated (192.0K blocks / 11.5M chunks), thread 15: Thread-4
1080:2019-12-19 19:15:38 root: INFO: RegionPool: REPORT_THRESHOLD: 11.8M allocated (256.0K blocks / 11.5M chunks), thread 15: Thread-4
1081:2019-12-19 19:15:38 root: INFO: RegionPool: REPORT_THRESHOLD: 11.8M allocated (320.0K blocks / 11.5M chunks), thread 15: Thread-4
1082:2019-12-19 19:15:38 root: INFO: RegionPool: REPORT_THRESHOLD: 11.9M allocated (384.0K blocks / 11.5M chunks), thread 15: Thread-4
1083:2019-12-19 19:15:38 root: INFO: RegionPool: REPORT_THRESHOLD: 11.9M allocated (448.0K blocks / 11.5M chunks), thread 15: Thread-4
1084:2019-12-19 19:15:38 root: INFO: RegionPool: REPORT_THRESHOLD: 16.0M allocated (4.5M blocks / 11.5M chunks), thread 15: Thread-4
1085:2019-12-19 19:15:38 root: INFO: RegionPool: REPORT_THRESHOLD: 32.6M allocated (13.4M blocks / 19.2M chunks), thread 15: Thread-4
1086:2019-12-19 19:15:39 root: INFO: RegionPool: REPORT_THRESHOLD: 64.0M allocated (33.3M blocks / 30.7M chunks), thread 15: Thread-4
1141:2019-12-19 19:15:40 root: INFO: RegionPool: REPORT_THRESHOLD: 135.7M allocated (70.6M blocks / 65.1M chunks), thread 15: Thread-4
1194:2019-12-19 19:15:42 root: INFO: RegionPool: REPORT_THRESHOLD: 256.0M allocated (152.8M blocks / 103.3M chunks), thread 15: Thread-4
1216:2019-12-19 19:15:44 root: INFO: RegionPool: REPORT_THRESHOLD: 256.0K allocated (256.0K blocks / 0 chunks), thread 67: Executor task launch worker for task 1
1217:2019-12-19 19:15:44 root: INFO: RegionPool: REPORT_THRESHOLD: 512.0K allocated (512.0K blocks / 0 chunks), thread 67: Executor task launch worker for task 1
1218:2019-12-19 19:15:44 root: INFO: RegionPool: REPORT_THRESHOLD: 1.0M allocated (1.0M blocks / 0 chunks), thread 67: Executor task launch worker for task 1
1219:2019-12-19 19:15:44 root: INFO: RegionPool: REPORT_THRESHOLD: 256.0K allocated (256.0K blocks / 0 chunks), thread 71: Executor task launch worker for task 5
1220:2019-12-19 19:15:44 root: INFO: RegionPool: REPORT_THRESHOLD: 512.0K allocated (512.0K blocks / 0 chunks), thread 71: Executor task launch worker for task 5
1221:2019-12-19 19:15:44 root: INFO: RegionPool: REPORT_THRESHOLD: 1.0M allocated (1.0M blocks / 0 chunks), thread 71: Executor task launch worker for task 5

For comparison, the same report from the test.log attached above:

4941:2019-12-18 21:28:11 root: INFO: RegionPool: REPORT_THRESHOLD: 4.8M allocated (192.0K blocks / 4.6M chunks), thread 19: Thread-6
4942:2019-12-18 21:28:11 root: INFO: RegionPool: REPORT_THRESHOLD: 4.9M allocated (256.0K blocks / 4.6M chunks), thread 19: Thread-6
4943:2019-12-18 21:28:11 root: INFO: RegionPool: REPORT_THRESHOLD: 4.9M allocated (320.0K blocks / 4.6M chunks), thread 19: Thread-6
4944:2019-12-18 21:28:11 root: INFO: RegionPool: REPORT_THRESHOLD: 5.0M allocated (384.0K blocks / 4.6M chunks), thread 19: Thread-6
4945:2019-12-18 21:28:11 root: INFO: RegionPool: REPORT_THRESHOLD: 5.1M allocated (448.0K blocks / 4.6M chunks), thread 19: Thread-6
4946:2019-12-18 21:28:11 root: INFO: RegionPool: REPORT_THRESHOLD: 8.1M allocated (3.4M blocks / 4.6M chunks), thread 19: Thread-6
4947:2019-12-18 21:28:11 root: INFO: RegionPool: REPORT_THRESHOLD: 16.0M allocated (8.3M blocks / 7.7M chunks), thread 19: Thread-6
4992:2019-12-18 21:28:12 root: INFO: RegionPool: REPORT_THRESHOLD: 32.0M allocated (24.3M blocks / 7.7M chunks), thread 19: Thread-6
4993:2019-12-18 21:28:12 root: INFO: RegionPool: REPORT_THRESHOLD: 65.2M allocated (36.0M blocks / 29.2M chunks), thread 19: Thread-6
4994:2019-12-18 21:28:13 root: INFO: RegionPool: REPORT_THRESHOLD: 129.7M allocated (78.9M blocks / 50.8M chunks), thread 19: Thread-6
4995:2019-12-18 21:28:14 root: INFO: RegionPool: REPORT_THRESHOLD: 258.7M allocated (164.7M blocks / 94.0M chunks), thread 19: Thread-6
4996:2019-12-18 21:28:16 root: INFO: RegionPool: REPORT_THRESHOLD: 512.0M allocated (334.7M blocks / 177.3M chunks), thread 19: Thread-6
4997:2019-12-18 21:28:21 root: INFO: RegionPool: REPORT_THRESHOLD: 1.0G allocated (673.9M blocks / 350.1M chunks), thread 19: Thread-6
4998:2019-12-18 21:28:31 root: INFO: RegionPool: REPORT_THRESHOLD: 2.0G allocated (1.3G blocks / 692.6M chunks), thread 19: Thread-6
4999:2019-12-18 21:28:51 root: INFO: RegionPool: REPORT_THRESHOLD: 4.0G allocated (2.7G blocks / 1.3G chunks), thread 19: Thread-6
6017:2019-12-18 21:29:31 root: INFO: RegionPool: REPORT_THRESHOLD: 8.0G allocated (5.3G blocks / 2.7G chunks), thread 19: Thread-6
6018:2019-12-18 21:30:51 root: INFO: RegionPool: REPORT_THRESHOLD: 16.0G allocated (10.6G blocks / 5.4G chunks), thread 19: Thread-6
6019:2019-12-18 21:33:41 root: INFO: RegionPool: REPORT_THRESHOLD: 32.0G allocated (21.3G blocks / 10.7G chunks), thread 19: Thread-6

Can I have a log file from running the above script, so I can make a better comparison? From the logs, it looks like there's an aggregation and a scan in there.

Aha! I think I found the problem – you were doing a foreign-key join on columns here (which is totally fine, you should be able to do this), but that’s implemented using a scan.

And column scans are TOOOOOOTALLY busted.

I can replicate with:

import hail as hl

hl.init(log='oom.log')

# A single column scan over 1M columns is enough to blow up memory:
mt = hl.utils.range_matrix_table(n_rows=1, n_cols=1000000)
mt = mt.annotate_cols(scan_thing=hl.scan.count())
mt._force_count_rows()

:fire: :fire: :fire_engine: :fire_engine: :fire_engine: :fire_engine: :fire_engine:

OK, this was real bad. Fixed here:


thank you!!!