Hello, I’m using this thread as a starting point.
The workflow above shows how to run a GWAS on one chromosome using the UK Biobank BGEN file.
My ultimate goal is to run a full GWAS on 79,800 phenotypic measures, in this case functional connectivity from resting-state MRI. I have a subset of 11,533 subjects that I’d like to run this on. I could use some help with 1) setting up this analysis and 2) optimizing it.
To start, I’d like to get a GWAS running on one phenotype measure, which I could then parallelize 79,800 times.
The code below is from that thread; I’ve changed it a bit for my own application.
import hail as hl
import pandas as pd
import os, sys
bgenFile = 'ukb_imp_chr1_v3.bgen'
sampleFile = 'ukb22875_imp_chr1_22_v3_s487320.sample'
mfiFile = 'ukb_mfi_chr1_v3.txt'
idpsFile = 'ALL_IDPs_i_deconf.csv'
subjsFile = 'good_subjs.csv'
# initialize the Hail environment
hl.init()
# import pheno table
pheno = hl.import_table(idpsFile, delimiter=',')
My idpsFile is an 11,533 x 79,800 CSV matrix (subjects by phenotype measures). It contains only data, with no column or row names. Do I need to add column and row names to the data file for indexing, or is this something I can annotate using Hail?
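To make the question concrete, here is a rough sketch of one option I can see: adding the names myself before import, using only the Python standard library. It assumes that subjsFile holds one subject ID per line, in the same order as the rows of the matrix (I have not confirmed that this is the right way to pair them). The `add_labels` helper, the `subject_id` column, and the `idp_0`, `idp_1`, ... names are placeholders I made up.

```python
import csv


def add_labels(matrix_path, subjects_path, out_path, prefix="idp_"):
    """Copy a headerless matrix CSV, adding a header row and a subject ID column.

    Assumes (hypothetically) that subjects_path lists one subject ID per line,
    in the same order as the rows of the matrix. Returns the number of data rows.
    """
    with open(subjects_path, newline="") as f:
        subject_ids = [row[0].strip() for row in csv.reader(f) if row]

    n_rows = 0
    with open(matrix_path, newline="") as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines in the matrix file
            if n_rows == 0:
                # Placeholder column names: idp_0, idp_1, ...
                header = ["subject_id"] + [f"{prefix}{j}" for j in range(len(row))]
                writer.writerow(header)
            if n_rows >= len(subject_ids):
                raise ValueError("more matrix rows than subject IDs")
            writer.writerow([subject_ids[n_rows]] + row)
            n_rows += 1

    if n_rows != len(subject_ids):
        raise ValueError("fewer matrix rows than subject IDs")
    return n_rows
```

If that is the recommended route, I assume the labeled file could then be read with hl.import_table(out_path, delimiter=',', key='subject_id'). The alternative I can see in the docs is no_header=True, which names the fields f0, f1, and so on, but that still leaves the subject IDs to be attached separately.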
Also, what is best practice for running a GWAS on all chromosomes? Should I run a regression separately for each? If not, how would I prepare the data for an all-chromosome run?
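For the all-chromosome case, this is a small sketch of how I would build the list of input files, assuming the other autosomes follow the same naming pattern as my chr1 file (the `bgen_paths` helper and its `template` argument are hypothetical names of mine, not part of Hail).

```python
def bgen_paths(chromosomes=range(1, 23), template="ukb_imp_chr{chrom}_v3.bgen"):
    """Return one BGEN file name per chromosome.

    The template is an assumption based on the chr1 file name used above.
    """
    return [template.format(chrom=c) for c in chromosomes]


paths = bgen_paths()
print(paths[0], paths[-1], len(paths))
```

My understanding is that hl.import_bgen accepts a list of paths, so this list could in principle be passed in a single call (after indexing each file with hl.index_bgen), but I do not know whether that is preferable to one run per chromosome.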
I’ll likely have more questions, but let’s start there. Thanks in advance for your help!