The problem
I am running a linear regression as the last step in a burden-testing pipeline. It runs normally through a large computation (judging by the stage counts in the progress indicator, e.g. [x + y] / 2586), then hangs in what appears to be a collect step of the linear regression (again judging by the stage counts: the progress indicator stalls at [0 + 1] / 17).
When it hangs, memory and CPU usage drop, and the job runs indefinitely without returning results or crashing. This same code worked before, on the same dataset, but stopped working after I added the gnomAD annotations (even though I drop them after filtering). I will post more detailed output after rerunning the job; I didn't capture the original output.
Details below. Thanks!
The method
- Using a large MatrixTable of variants (rows) and genomes (columns)
- Adding the gnomAD genome sites annotations to the rows
- Filtering for rare variants and by “most severe consequence” using the gnomAD annotations, then dropping the annotations to keep row sizes manageable
- Annotating with HGDP data to get gene symbols, grouping by gene symbol, and filtering out groups with no variants or no gene symbol
- Adding covariates (age, sex, ancestry), principal components, and height to the burden table
- Running a linear regression
The code
This isn’t the full source code, just the relevant pieces:
genomes = hl.read_matrix_table('<my_matrix_table>')
genomes = genomes.key_rows_by(genomes.locus, genomes.alleles)
db = hl.experimental.DB(region='us', cloud='gcp')
genomes = db.annotate_rows_db(genomes, 'gnomad_genome_sites')
genomes = genomes.filter_rows(hl.agg.call_stats(genomes.GT, genomes.alleles).AF[1] < 0.01)
genomes = genomes.filter_rows(
    (genomes.gnomad_genome_sites.vep.most_severe_consequence == 'start_lost') |
    (genomes.gnomad_genome_sites.vep.most_severe_consequence == 'stop_gained') |
    (genomes.gnomad_genome_sites.vep.most_severe_consequence == 'splice_donor_variant') |
    (genomes.gnomad_genome_sites.vep.most_severe_consequence == 'splice_acceptor_variant'))
genomes = genomes.drop('gnomad_genome_sites')
...
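(As an aside, the four OR'd equality checks above amount to a single set-membership test; in Hail this could be written with `hl.literal(...).contains(...)`. A plain-Python sketch of the same filtering logic, with the consequence terms taken from the code above:)

```python
# Loss-of-function consequence terms used in the filter above.
LOF_CONSEQUENCES = {
    'start_lost',
    'stop_gained',
    'splice_donor_variant',
    'splice_acceptor_variant',
}

def keep_variant(most_severe_consequence: str) -> bool:
    """Equivalent of the four OR'd comparisons: keep a row iff its
    most severe consequence is one of the loss-of-function terms."""
    return most_severe_consequence in LOF_CONSEQUENCES
```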
burden = (
genomes
.group_rows_by(genomes.vep_info.gene_symbol)
.aggregate(n_variants = hl.agg.count_where(genomes.GT.n_alt_alleles() > 0))
)
burden = burden.filter_rows(hl.agg.sum(burden.n_variants) > 0)
burden = burden.filter_rows(hl.is_missing(burden.gene_symbol), keep=False)
...
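(For reference, the group-and-aggregate step computes, per (gene, sample) pair, the number of qualifying variants the sample carries, then drops genes with no carriers or no symbol. A minimal pure-Python sketch of that logic, using toy tuples rather than a MatrixTable:)

```python
from collections import defaultdict

def burden_counts(variants):
    """variants: iterable of (gene_symbol, sample_id, n_alt_alleles).
    Returns {gene: {sample: n_variants}} -- the analogue of
    group_rows_by(gene_symbol).aggregate(count_where(n_alt_alleles > 0)),
    with missing gene symbols and genes lacking any carriers dropped."""
    counts = defaultdict(lambda: defaultdict(int))
    for gene, sample, n_alt in variants:
        if gene is None:   # filter_rows(is_missing(gene_symbol), keep=False)
            continue
        if n_alt > 0:      # count_where(GT.n_alt_alleles() > 0)
            counts[gene][sample] += 1
    # filter_rows(agg.sum(n_variants) > 0): keep only genes with carriers
    return {g: dict(s) for g, s in counts.items() if sum(s.values()) > 0}
```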
covariates = hl.import_table('gs://my_bucket/phenotype.csv', impute=True, delimiter=',').key_by('shipping_id')
covariates = covariates.annotate(sex_binary=hl.if_else(covariates.sex == 'M', 0, 1))
covariates = covariates.select(covariates.age, covariates.sex_binary, covariates.hare)
burden = burden.annotate_cols(cov = covariates[burden.s])
...
burden = burden.annotate_cols(pca = pca[burden.s])
burden = burden.annotate_cols(height = height[burden.s])
...
burden_results = hl.linear_regression_rows(
y=burden.height.Height,
x=burden.n_variants,
covariates=[1.0,
burden.cov.sex_binary,
burden.pca.PC1,
burden.pca.PC2,
burden.pca.PC3])
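(For context on what this step computes: `linear_regression_rows` fits, for each gene, an ordinary-least-squares model of the phenotype on the burden score plus the covariates. A minimal numpy sketch of one such per-row fit, on synthetic data, with all names hypothetical:)

```python
import numpy as np

def fit_row(y, x, covs):
    """One row's regression: y ~ intercept + x + covariates.
    Returns the coefficient on x, analogous to the beta that
    linear_regression_rows reports for one gene."""
    # Design matrix: intercept column, burden score, then covariates.
    X = np.column_stack([np.ones_like(x), x] + list(covs))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]
```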
Software versions
Hail 0.2
Dataproc cluster (1 master / 2 workers / up to 1,000 secondary workers)
gnomAD genome sites annotations (v3.1.x, I believe; Hail doesn’t report which version it fetches)