ConnectionError after mt.aggregate_cols(hl.agg.collect(…)) and hl.nd.array() in _linear_skat

Dear Hail Team,

I am learning to use Hail to handle large genomic data. Thank you for providing such a wonderful tool to the community.

I am encountering a technical issue. While testing the _linear_skat function on moderately sized data (7944 SNPs and 442075 samples, 76.4 MB on disk) on my personal computer (MacBook Pro M3, 32 GB memory), I received ConnectionError messages. The computation appears to use an unreasonably large amount of memory. Below is the simplified code that I believe causes the problem. The error occurs when hl.nd.array() is applied to the result of mt.aggregate_cols(hl.agg.collect(…)), a one-column 2D array containing 442075 ones. It does not occur with smaller sample sizes, such as 2000. Although I can work around this issue using other Python functions, I would like to understand why it happens and learn general principles for avoiding such issues when using Hail functions in the future. I have included the full script and error messages below to provide all necessary information. If there is a more efficient way to share information, please let me know.

Thank you very much for your assistance.



Hail version: 0.2.129-41126be2df04
Python version: 3.9.6
Java version:
openjdk version "11.0.22" 2024-01-16 LTS
OpenJDK Runtime Environment Zulu11.70+15-CA (build 11.0.22+7-LTS)
OpenJDK 64-Bit Server VM Zulu11.70+15-CA (build 11.0.22+7-LTS, mixed mode)


import hail as hl

# Load the matrix table.
# A bigger dataset (7944 SNPs and 442075 samples, 76.4 MB on disk) that leads to the error messages when running the code below
mt = hl.read_matrix_table('/Users/zheyangwu/Desktop/temp/mt_chr22')
# A smaller dataset (7944 SNPs and 2000 samples, 578 KB) that does not cause the error messages
# mt = hl.read_matrix_table('/Users/zheyangwu/Desktop/temp/mt_1k_cases_controls_chr22')

# Data processing flow of the _linear_skat function in Hail,
# much simplified to focus on the essential code that leads
# to the error messages
covariates = [1.0]

# Redefine the matrix table and set up the column fields
mt = mt._select_all(
    col_exprs=dict(covariates=covariates)  # covariates extends [1.0] to [1.0, 1.0, 1.0, …] for all samples
)  # This works fine

# Retrieve covmat
covmat = mt.aggregate_cols(
    hl.agg.collect(mt.covariates),  # matrix of covariates
)  # This works fine

covmat2 = hl.nd.array(hl.literal([[1.0]] * 442075))  # This works fine if run before the code below.

covmat = hl.nd.array(covmat)  # !!! This raises the ConnectionError messages when using the larger data with 442,075 samples.

Error message:


Hi @Zheyang,

Thank you for your kind words, it’s very fulfilling to hear that our work is benefitting the community.

I think that’s a bug, because if you pass _localize=True to aggregate_cols, then calling show on the hl.nd.array result works. I’m sorry for the inconvenience. You can follow the progress at the link below:


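While the bug is open, here is a minimal sketch of the localized route. It assumes the covariates are collected with _localize=True, so that aggregate_cols hands back an ordinary Python list of lists that NumPy can convert directly. The list below is only a stand-in for that result (no Hail calls are made here), and covariates_to_ndarray is a hypothetical helper, not a Hail function.

```python
import numpy as np


def covariates_to_ndarray(rows):
    """Convert a localized list of per-sample covariate lists to an (n, k) float64 array.

    Hypothetical helper for illustration; it is not part of Hail.
    """
    arr = np.asarray(rows, dtype=np.float64)
    if arr.ndim != 2:
        raise ValueError(f"expected a 2D covariate matrix, got {arr.ndim} dimension(s)")
    return arr


# Stand-in for the localized aggregation result, i.e. what
#   mt.aggregate_cols(hl.agg.collect(mt.covariates), _localize=True)
# is assumed to return for the data in this thread: one [1.0] per sample.
rows = [[1.0]] * 442075

covmat = covariates_to_ndarray(rows)
print(covmat.shape)               # (n_samples, n_covariates)
print(covmat.nbytes / 1e6, "MB")  # raw size of the float64 matrix
```

The raw matrix is 442075 × 8 bytes, only a few megabytes, so the covariate data on its own should fit comfortably in memory on a laptop.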
PS: We’re going to be transitioning this forum to GitHub in the near future. We monitor GitHub issues more regularly than this forum, so you’ll likely get more timely responses if you create an issue or post on our Zulip thread! Cheers!