Hello - I’m trying to run a large number of GWASs, but I’m currently testing one to make sure I understand the output. I sent one job off to the cluster and returned a day later to an output that says 0 rows were saved. What might I check to troubleshoot this? Thanks!
Running on Apache Spark version 2.4.1
SparkUI available at http://cn2434:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.64-1ef70187dc78
2021-07-31 16:41:12 Hail: INFO: Reading table without type imputation
Loading field 'locus' as type locus<GRCh37> (user-supplied)
Loading field 'bp' as type str (user-supplied)
Loading field 'alleles' as type str (not specified)
Loading field 'info' as type float64 (user-supplied)
2021-07-31 16:41:14 Hail: INFO: Number of BGEN files parsed: 1
2021-07-31 16:41:14 Hail: INFO: Number of samples in BGEN files: 487409
2021-07-31 16:41:14 Hail: INFO: Number of variants across all BGEN files: 7402791
2021-07-31 16:41:59 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-07-31 16:42:01 Hail: INFO: linear_regression_rows: running on 11224 samples for 1 response variable y,
with input variable x, and 1 additional covariate...
2021-07-31 21:26:51 Hail: INFO: wrote table with 0 rows in 1450 partitions to /data/dmoracze/gwas/test_gwas_chr1
Total size: 29.75 KiB
* Rows: 29.74 KiB
* Globals: 11.00 B
* Smallest partition: 0 rows (21.00 B)
* Largest partition: 0 rows (21.00 B)
I think I’ve checked all the input data to make sure everything is in order, though the final matrix table is so large that I can’t easily inspect its contents. Here is the script I used to produce the above output:
# read in and prep the geno matrixtable
import hail as hl
import pandas as pd
import os, sys
from glob import glob

hl.init(spark_conf={'spark.driver.memory': '64g'})

# get bgen files together
top_dir = '/data/ukbgene'
bgenFile = glob(f'{top_dir}/bgen/ukb_imp_chr1_v3.bgen')
sampleFile = f'{top_dir}/ukb22875_imp_chr1_22_v3_s487320.sample'
mfiFile = glob(f'{top_dir}/mfi/ukb_mfi_chr1_v3.txt')

# fc idps
idpsFile = '/data/dmoracze/gwas/ALL_IDPs_i_deconf.csv'

# subject list
subjsFile = '/data/dmoracze/gwas/good_subjs.csv'
with open(subjsFile) as f:
    subjs = f.read().split('\n')

# read in mfi files
mfi_tsv = glob(f'{top_dir}/mfi/chr1_mfi_info.tsv')
mfi_info = hl.import_table(mfi_tsv,
                           types={
                               'locus': hl.tlocus(reference_genome='GRCh37'),
                               'bp': hl.tstr, 'info': hl.tfloat64
                           }, delimiter='\t')
mfi_info = mfi_info.key_by(locus=mfi_info.locus, alleles=mfi_info.alleles.split(','))

# read in the bgen files
genotype_info = hl.import_bgen(bgenFile, entry_fields=['dosage'], sample_file=sampleFile)

# add mfi to variants and filter on INFO score
geno_mt = genotype_info.annotate_rows(mfi=mfi_info[genotype_info.row_key])
geno_mt = geno_mt.filter_rows(geno_mt.mfi.info > 0.8)

# subset to only include subjects from the subjects list
geno_subset_mt = geno_mt.filter_cols(hl.literal(subjs).contains(geno_mt.s))

# just do 1 row for now
batch_size = 1
batch_index = 0

# read in the pheno matrix table
phenos_mt = hl.read_matrix_table('idps_i_deconf.mt')
phenos_mt = phenos_mt.add_col_index()
pheno_start = batch_size * batch_index
pheno_end = pheno_start + batch_size

# grab the current batch of IDPs
phenos_mt = phenos_mt.filter_cols((phenos_mt.col_idx >= pheno_start) & (phenos_mt.col_idx < pheno_end))
phenos_table = phenos_mt.annotate_rows(phenos=hl.agg.collect(phenos_mt.x)).rows()

# annotate genotypes with the phenotypes, keyed by sample ID
geno_subset_mt = geno_subset_mt.annotate_cols(phenos=phenos_table[geno_subset_mt.s].phenos)

# do the gwas
batch_results = hl.linear_regression_rows(
    [geno_subset_mt.phenos[i] for i in range(batch_size)],
    geno_subset_mt.dosage,
    covariates=[1.0],
    pass_through=[geno_subset_mt.rsid],
    block_size=128)

# save the results
batch_results.write('/data/dmoracze/gwas/test_gwas_chr1')
hl.stop()
I think you’re running into the second warning in the docs for linear_regression_rows. You very likely have different patterns of missingness in each of your phenotypes. Instead, try this:
batch_results = hl.linear_regression_rows(
    [[geno_subset_mt.phenos[i]] for i in range(batch_size)],
    ...)  # remaining arguments unchanged
See the warning box in the docs for more details on what this does.
I should not have any missingness in the phenotypes as I filtered all of those out before creating the matrix table.
No worries, I’m relatively new to this analysis. I got that number from this thread.
I’m checking how many rows this loses. However, counting the rows of the geno matrix tables takes forever (e.g. geno_mt.count_rows()). I’m not sure if this is normal or not?
Hmm. I would hope that it is really fast, but it’s possible the optimizer isn’t eliminating the read of the genotypes. Try using this for the import BGEN line instead:
genotype_info.count_rows() was very fast on 1 chromosome, but when I tried to read in and count all 22 it took a really long time. There are 7402791 rows in this matrix table.
geno_mt.count_rows() is taking a really long time (more than 15 minutes so far, still running).
Hmm. You might try importing the mfi table with at least as many partitions as the genotype matrix table has. You can use the min_partitions parameter of import_table.
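Concretely, that would mean adding min_partitions to the import_table call from the script above. A sketch under the poster's paths (the value 1450 is just an example, taken from the partition count in the write log, not a recommendation):

```python
import hail as hl

# Re-import the mfi table with a minimum partition count so the join against
# the genotype matrix table doesn't funnel through too few partitions.
mfi_info = hl.import_table(
    '/data/ukbgene/mfi/chr1_mfi_info.tsv',
    types={
        'locus': hl.tlocus(reference_genome='GRCh37'),
        'bp': hl.tstr, 'info': hl.tfloat64
    },
    delimiter='\t',
    min_partitions=1450)  # match or exceed the genotype MT's partition count
```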
@danking I changed the mfi filter to 0.01 and I’m still getting results with 0 rows. I changed it to something ridiculously small (1e-10) and I’m still getting 0 rows in the mfi-filtered matrix table:
Rows for which the condition evaluates to missing are removed from the dataset. You can get a quick sense of how many are missing with
geno_mt.mfi.info.summarize()
Your alleles array looks wrong to me. An alleles array should contain all the possible alleles at the locus, but yours don’t look like real alleles: ['AC'. I think there’s just a string-parsing issue here, and I think you can fix it.
It looks like the text file you started with had the Python representation of an array of strings. That’s kind of hard to parse in Hail, unfortunately. If you instead had ["AC", "A"] (with double quotes) you could use hl.parse_json(the_string, dtype=hl.tarray(hl.tstr)).
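As a plain-Python illustration of the same idea (the alleles string below is hypothetical): replacing the single quotes with double quotes turns the Python repr into valid JSON, which then parses into a proper array:

```python
import json

raw = "['AC', 'A']"            # Python-repr of an alleles array (hypothetical)
fixed = raw.replace("'", '"')  # single -> double quotes makes it valid JSON
alleles = json.loads(fixed)
print(alleles)  # ['AC', 'A']
```

In Hail you could do the same quote substitution on the column with the string expression's replace method before calling hl.parse_json, assuming none of the values themselves contain quotes.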
Alleles need to be in the same order to count as the same key. It’s certainly possible to join by locus alone if you know the loci are distinct, or to convert the alleles to an unordered set and join on that.