Hail stuck when running array data

hyslin · October 3, 2023, 4:50pm

Hi,

Each time I run a Hail GWAS in All of Us with array data, I encounter issues when trying to save a checkpoint or when running logistic regression. I believe my code is concise, and considering the data size in the latest version (v7) of the array data, it should be manageable. However, the program consistently gets stuck, usually at (0 + 48) / 74 .

2023-10-03 16:24:19.862 Hail: INFO: logistic_regression_rows: running wald on 6529 samples for response variable y,
    with input variable x, and 5 additional covariates...
[Stage 4:>                                                        (0 + 48) / 74]

The array data I used was:

# read in mircoarray data
mt = hl.read_matrix_table('gs://fc-aou-datasets-controlled/v7/microarray/hail.mt/')

After annotating this matrix table with my phenotype/ancestry file, conducting some quick QC, and attempting to run logistic regression on each variant, the program gets stuck.

However, if I randomly select some variants across the genome to run the GWAS, the program runs smoothly and at a good speed.

# only use a subset of intervals
mt = hl.filter_intervals(mt,[hl.parse_locus_interval(x,)for x in allintervals[1:4000] ])

…

2023-10-03 16:21:06.348 Hail: INFO: logistic_regression_rows: running wald on 6529 samples for response variable y,
    with input variable x, and 5 additional covariates...
2023-10-03 16:22:20.280 Hail: INFO: wrote table with 745 rows in 64 partitions to /tmp/persist_table4hGORAWhdh
    Total size: 45.17 KiB
    * Rows: 45.15 KiB
    * Globals: 11.00 B
    * Smallest partition: 1 rows (125.00 B)
    * Largest partition:  26 rows (1.47 KiB)

So, I assumed that the size of the matrix table might be the cause of the issue. However, how do I resolve this?

danking · October 4, 2023, 2:14pm

Hey @hyslin ,

I’m sorry to hear Hail is giving you trouble!

How many variants and samples are in this matrix table? mt.count() will give you that. It looks like you have 74 partitions, which is a very very small number of partitions, particularly if you’re working with a large dataset. It also looks like your cluster only has 48 cores, which is very small for working with large datasets. Folks typically use large clusters (100-1000 cores) for brief periods of time. Using an autoscaling policy is really helpful for doing this, but I’m not sure if those are available inside the AoU RWB yet.

Your second example is running logistic regression on just 745 rows, so I’m not surprised that it is faster.

I expect logistic regression to scale roughly linearly in the variants.

Can you also share the full script you ran? There could be innocuous looking things that are slowing down the analysis.

hyslin · October 4, 2023, 8:53pm

Hi danking,

Thanks you for your reply!

I had 1739268 variants and 6529 samples before QC. I didn’t get a chance to check how many variants and samples were left after QC because it took too long to compute. I guess QC is the most time intensive and computation heavy step?

Initially, I used the default Hail Genomic Analysis environment setting plus 10 Preemptible workers. As I increased those preemp. workers to 100, the program no longer got stuck at ( 0 + 48 ) / 74, but it still ran quite slowly and seemed to stop working at (12 + 62 ) / 74. Are there any other configurations I could change to significantly reduce the computation time?

Here I am pasting my full script for the first example:

## import packages
import os
import hail as hl

## set up hail
hl.init(default_reference = "GRCh38")

## get the workspace bucket
bucket = os.getenv("WORKSPACE_BUCKET")
bucket

## what's in my bucket
!gsutil ls {bucket}/data

## read in mircoarray data
mt = hl.read_matrix_table('gs://fc-aou-datasets-controlled/v7/microarray/hail.mt/')

## read ancestry+phenotype file
sample_filename = 'gs://......../data/afa.tsv'
afa = (hl.import_table(sample_filename,
                              types={'research_id':hl.tstr},
                              impute=True,
                              key='research_id')
             )

## annotate ancestry and phenotype
mt = mt.annotate_cols(anc_pheno = afa[mt.s])
mt = mt.filter_cols(hl.is_defined(mt.anc_pheno), keep=True)

## how many variants and samples are there 
mt.count()

## simple QC
mt = hl.variant_qc(mt)
mt = mt.filter_rows(mt.variant_qc.AF[1] > 0.01, keep=True)
mt = mt.filter_rows(mt.variant_qc.call_rate > 0.95, keep=True)

## how many variants and samples are there after QC
#mt.count()

## also skip this
#mt = mt.checkpoint('gs://......./data/output/mt_checkpoint.mt')

## make covariates
covariates = [1.0, mt.anc_pheno.is_male, 
              mt.anc_pheno.PC1, 
              mt.anc_pheno.PC2, 
              mt.anc_pheno.PC3]

## run logistic regression for each row
log_reg = hl.logistic_regression_rows(
    test='wald',
    y=mt.anc_pheno.is_case,
    x=mt.GT.n_alt_alleles(),
    covariates=covariates
)

## save to my bucket
log_reg.export(f'{bucket}/data/output/log_reg.tsv.bgz')

Any suggestions would be greatly appreciated!

danking · October 4, 2023, 9:10pm

If most of your variants survive the filter, I’d expect this to take ~2000 minutes which is over a day.

Since you’re doing the QC process in the same pipeline as the logistic regression, it’s hard to know what the slow part is. Usually folks use two steps:

# do QC ...
mt.rows().select().write('gs://bucket/rows-that-pass-qc.ht')

ht = hl.read_table('gs://bucket/rows-that-pass-qc.ht')
print(f'{ht.count()} rows passing QC')  # counting a table after read is instant

mt = hl.read_matrix_table(...)
mt = mt.semi_join_rows(ht)  # filter to the passing variants

This also lets us diagnose how long we expect this to take.

Hmm. 74 partitions means you’re doing ~23000 logistic regressions per partition (assuming most variants pass filters). That’s a lot of regressions! We need to make this parameter a public & supported parameter, but for now try using _n_partitions to increase the number of partitions:

hl.read_matrix_table(..., _n_partitions=750)

I definitely recommend using checkpoint after importing a non-Hail-format dataset:

sample_filename = 'gs://......../data/afa.tsv'
afa = (hl.import_table(sample_filename,
                              types={'research_id':hl.tstr},
                              impute=True,
                              key='research_id')
             )
afa = afa.checkpoint('gs://...', overwrite=True)

I recommend against directly exporting as generating a compressed text file can be memory intensive. Instead try this:

log_reg.write('f{bucket}/data/output/log_reg.ht')
log_reg = hl.read_table('f{bucket}/data/output/log_reg.ht')
log_reg.export(...)

hyslin · October 9, 2023, 5:28pm

Thank you so much, danking!

I modified my code using the suggestions you provided, and they have been very helpful! I have separated my code into different notebooks for running in the background. However, as I ran the QC notebook, I encountered an issue when writing out columns (samples). Could this be due to a memory issue? The error message appears sporadically, which is quite disappointing. What steps can I take to prevent this from happening?
(for running this notebook I used 100 preemptible workers and 5000 partitions)

The very last step that cause error:

# write out QC'ed variants and samples
mt.rows().select().write(f'{bucket}/data/qc_var.ht', overwrite=True)
mt.cols().select().write(f'{bucket}/data/qc_sam.ht', overwrite=True)

2023-10-08 20:31:10.452 Hail: INFO: wrote table with 1046146 rows in 74 partitions to gs://fc-secure-80e50d85-fbc9-4359-9a7b-e4948a1089e0/data/qc_var.ht
2023-10-08 20:31:11.236 Hail: WARN: cols(): Resulting column table is sorted by 'col_key'.
    To preserve matrix table column order, first unkey columns with 'key_cols_by()'
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1224, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/opt/conda/lib/python3.7/site-packages/py4j/java_gateway.py", line 1229, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving

Topic		Replies	Views
Error summary: HailException Hail Query & hailctl	4	543	September 26, 2022
Speed/streaming problems with Hail 0.2 Hail Query & hailctl	6	778	March 20, 2019
Missing value and logistic regression Hail Query & hailctl	5	790	October 2, 2020
GWAS results with 0 rows Hail Query & hailctl	16	570	August 23, 2021
Improve writing time for GWAS results Hail Query & hailctl	2	461	November 20, 2020

Hail stuck when running array data

Related topics