I am trying to run a GWAS using Hail and am running into some problems when trying to run the analyses by chromosome. I have already converted each of the chromosome VCF files into their corresponding MatrixTables. When I load a single chromosome via the following code, everything works fine:
path_to_matrix_table = f"{bucket}/data/chr_1.mt"
mt = hl.read_matrix_table(path_to_matrix_table)
path_to_phenotype = f"{bucket}/data/phenotypes.tsv"
table = hl.import_table(path_to_phenotype, types={‘year_birth’: hl.tint32, ‘sex_binary’: hl.tint32}, delimiter=‘\t’)).key_by(“sample”)
Combine the phenotype data into the Matrix Table
mt = mt.annotate_cols(pheno = table[mt.s])
Call built in QC methods
mt = hl.sample_qc(mt)
mt = hl.variant_qc(mt)
Filter samples for missingness
mt = mt.filter_cols((mt.sample_qc.call_rate >= 0.97))
print(‘After filter, %d/%d samples remain.’ % (mt.count_cols(), initial_num_samples))
Filter out variants with allele frequency lower than 0.03 and HWE value less than 1e-6
mt = mt.filter_rows((mt.variant_qc.AF[1] > 0.03) & (mt.variant_qc.p_value_hwe > 1e-6))
print(‘After filter, %d/%d variants remain.’ % (mt.count_rows(), initial_num_variants))
Perform GWAS using logistic regression for phenotype of sex with no covariates
gwas = hl.logistic_regression_rows(test=‘wald’, y=mt.pheno.sex_binary,
x=mt.GT.n_alt_alleles(),
covariates=[1.0])
gwas.row.describe()
However, when I instead try and create a combined matrix table via:
Load and combine multiple matrix tables from workspace bucket
Create a empty list in which individual MatrixTable’s for each chromosome will be stored
chr_mt_list =
Retrieve MatrixTables for specified range of chromosomes and append them to list (chr1 and chr2)
for i in range(1, 3):
curr_path = f"{bucket}/data/chr_{i}.mt"
curr_chr_mt = hl.read_matrix_table(curr_path)
chr_mt_list.append(curr_chr_mt)
Combine all MatrixTables into one MatrixTable and print out number of samples (cols) and number of variants (rows)
mt = hl.MatrixTable.union_rows(*chr_mt_list)
initial_num_variants = mt.count_rows()
initial_num_samples = mt.count_cols()
print(‘Samples: %d Variants: %d’ % (mt.count_cols(), mt.count_rows()))
and then try to run the QC methods (i.e. mt = mt.filter_rows((mt.variant_qc.AF[1] > 0.03) & (mt.variant_qc.p_value_hwe > 1e-6))), the code gets stuck and doesn’t update at all. The only displayed message is: [Stage 2:> (0 + 1) / 1]. And this does not update.
What should I do?