Hello - I’m trying to run a large number of GWASs, but I’m currently testing one to make sure I understand the output. I sent one job off to the cluster and returned a day later to an output that says 0 rows were saved. What might I check to troubleshoot this? Thanks!
Running on Apache Spark version 2.4.1
SparkUI available at http://cn2434:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.64-1ef70187dc78
2021-07-31 16:41:12 Hail: INFO: Reading table without type imputation
Loading field 'locus' as type locus<GRCh37> (user-supplied)
Loading field 'bp' as type str (user-supplied)
Loading field 'alleles' as type str (not specified)
Loading field 'info' as type float64 (user-supplied)
2021-07-31 16:41:14 Hail: INFO: Number of BGEN files parsed: 1
2021-07-31 16:41:14 Hail: INFO: Number of samples in BGEN files: 487409
2021-07-31 16:41:14 Hail: INFO: Number of variants across all BGEN files: 7402791
2021-07-31 16:41:59 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-07-31 16:42:01 Hail: INFO: linear_regression_rows: running on 11224 samples for 1 response variable y,
with input variable x, and 1 additional covariate...
2021-07-31 21:26:51 Hail: INFO: wrote table with 0 rows in 1450 partitions to /data/dmoracze/gwas/test_gwas_chr1
Total size: 29.75 KiB
* Rows: 29.74 KiB
* Globals: 11.00 B
* Smallest partition: 0 rows (21.00 B)
* Largest partition: 0 rows (21.00 B)
I think I’ve checked all the input data to make sure everything is in order, though the final matrix table is so large that I can’t easily inspect its contents. Here is the script I used to produce the above output:
# read in and prep the geno matrixtable
import hail as hl
import pandas as pd
import os, sys
from glob import glob

hl.init(spark_conf={'spark.driver.memory': '64g'})

# get bgen files together
top_dir = '/data/ukbgene'
bgenFile = glob(f'{top_dir}/bgen/ukb_imp_chr1_v3.bgen')
sampleFile = f'{top_dir}/ukb22875_imp_chr1_22_v3_s487320.sample'
mfiFile = glob(f'{top_dir}/mfi/ukb_mfi_chr1_v3.txt')

# fc idps
idpsFile = '/data/dmoracze/gwas/ALL_IDPs_i_deconf.csv'

# subject list
subjsFile = '/data/dmoracze/gwas/good_subjs.csv'
with open(subjsFile) as f:
    subjs = f.read().split('\n')

# read in mfi files
mfi_tsv = glob(f'{top_dir}/mfi/chr1_mfi_info.tsv')
mfi_info = hl.import_table(mfi_tsv,
                           types={
                               'locus': hl.tlocus(reference_genome='GRCh37'),
                               'bp': hl.tstr, 'info': hl.tfloat64
                           }, delimiter='\t')
mfi_info = mfi_info.key_by(locus=mfi_info.locus, alleles=mfi_info.alleles.split(','))

# read in the bgen files
genotype_info = hl.import_bgen(bgenFile, entry_fields=['dosage'], sample_file=sampleFile)

# add mfi to variants and filter on INFO score
geno_mt = genotype_info.annotate_rows(mfi=mfi_info[genotype_info.row_key])
geno_mt = geno_mt.filter_rows(geno_mt.mfi.info > 0.8)

# subset to only include subjects from the subjects list
geno_subset_mt = geno_mt.filter_cols(hl.literal(subjs).contains(geno_mt.s))

# just do 1 row for now
batch_size = 1
batch_index = 0

# read in the pheno matrix table
phenos_mt = hl.read_matrix_table('idps_i_deconf.mt')
phenos_mt = phenos_mt.add_col_index()
pheno_start = batch_size * batch_index
pheno_end = pheno_start + batch_size

# grab the current batch of IDPs
phenos_mt = phenos_mt.filter_cols((phenos_mt.col_idx >= pheno_start) & (phenos_mt.col_idx < pheno_end))
phenos_table = phenos_mt.annotate_rows(phenos=hl.agg.collect(phenos_mt.x)).rows()

# annotate genotypes with the phenotypes, keyed by sample ID
geno_subset_mt = geno_subset_mt.annotate_cols(phenos=phenos_table[geno_subset_mt.s].phenos)

# do the gwas
batch_results = hl.linear_regression_rows(
    [geno_subset_mt.phenos[i] for i in range(batch_size)],
    geno_subset_mt.dosage,
    covariates=[1.0],
    pass_through=[geno_subset_mt.rsid],
    block_size=128)

# save the results
batch_results.write('/data/dmoracze/gwas/test_gwas_chr1')
hl.stop()
I think you’re running into the second warning in the docs for linear_regression_rows. You very likely have different patterns of missingness in each of your phenotypes. Instead, try this:
batch_results = hl.linear_regression_rows(
    [[geno_subset_mt.phenos[i]] for i in range(batch_size)],
    ...)  # remaining arguments unchanged
See the warning box in the docs for more details on what this does.
I should not have any missingness in the phenotypes as I filtered all of those out before creating the matrix table.
No worries, I’m relatively new to this analysis. I got that number from this thread.
I’m checking how many rows this loses. However, counting the rows of the geno matrix tables takes forever (e.g. geno_mt.count_rows()). I’m not sure if this is normal or not?
Hmm. I would hope that it is really fast, but it’s possible the optimizer isn’t eliminating the read of the genotypes. Try using this for the import BGEN line instead:
genotype_info.count_rows() was very fast on 1 chromosome, but when I tried to read in and count all 22 it took a really long time. There are 7402791 rows in this matrix table.
geno_mt.count_rows() is taking a really long time (more than 15 minutes so far, still running).
Hmm. You might try importing the mfi table with at least as many partitions as the genotype matrix table has. You can use the min_partitions parameter of import_table.
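Concretely, that would mean adding min_partitions to the import_table call from the script above. A sketch under the poster's paths (the value 1450 is just an example, taken from the partition count in the write log, not a recommendation):

```python
import hail as hl

# Re-import the mfi table with a minimum partition count so the join against
# the genotype matrix table doesn't funnel through too few partitions.
mfi_info = hl.import_table(
    '/data/ukbgene/mfi/chr1_mfi_info.tsv',
    types={
        'locus': hl.tlocus(reference_genome='GRCh37'),
        'bp': hl.tstr, 'info': hl.tfloat64
    },
    delimiter='\t',
    min_partitions=1450)  # match or exceed the genotype MT's partition count
```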
@danking I changed the mfi filter to 0.01 and I’m still getting results with 0 rows. I changed it to something ridiculously small (1e-10) and I’m still getting 0 rows in the mfi-filtered matrix table:
Rows for which the condition evaluates to missing are removed from the dataset. You can get a quick sense of how many are missing with
geno_mt.mfi.info.summarize()
Your alleles array looks wrong to me. An alleles array should contain all the possible alleles at the locus, but yours don’t look like real alleles: ['AC'. I think there’s just a string-parsing issue here, and I think you can fix it.
It looks like the text file you started with had the Python representation of an array of strings. That’s kind of hard to parse in Hail, unfortunately. If you instead had ["AC", "A"] (with double quotes) you could use hl.parse_json(the_string, dtype=hl.tarray(hl.tstr)).
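As a plain-Python illustration of the same idea (the alleles string below is hypothetical): replacing the single quotes with double quotes turns the Python repr into valid JSON, which then parses into a proper array:

```python
import json

raw = "['AC', 'A']"            # Python-repr of an alleles array (hypothetical)
fixed = raw.replace("'", '"')  # single -> double quotes makes it valid JSON
alleles = json.loads(fixed)
print(alleles)  # ['AC', 'A']
```

In Hail you could do the same quote substitution on the column with the string expression's replace method before calling hl.parse_json, assuming none of the values themselves contain quotes.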
Alleles need to be in the same order to count as the same key. It’s certainly possible to join by locus alone if you know the loci are distinct, or to convert the alleles to an unordered set and join on that.