Hi, I’m new to Hail. I would like to run a GWAS (linear regression), but the genotype data is stored in an HDF5 file. The HDF5 file contains every sample’s allele count at each locus. I think it should be straightforward to use these allele counts to run a GWAS in Hail, but I’m not sure about the following:
How can I read HDF5 into Hail? Using the h5py Python package, I can read the .hdf5 file into Python (h5py.File('xx.hdf5', 'r')), but how can I pass the allele counts in this HDF5 file to Hail?
Supposing it is possible to pass the data to Hail, how do I run a GWAS using allele counts (which are 0, 1, or 2)? I’ll also need to compute PCA; can I do that in Hail using allele counts? And how do I include other covariates stored in a separate file?
After talking to John offline, I realize I made an error: you would have to convert the HDF5 file into a format that Hail can import, e.g. a VCF file.
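As a rough illustration of that conversion, here is a minimal sketch that writes allele counts out as a bare-bones VCF. It makes several assumptions that are not from your post: the function name `allele_counts_to_vcf` is made up, the matrix is oriented with variants as rows and samples as columns (transpose first if your HDF5 file stores it the other way), and the chromosome, position, and ref/alt alleles for each variant are available from somewhere, since allele counts alone do not carry them. Any value other than 0, 1, or 2 is written as a missing genotype.

```python
import io


def allele_counts_to_vcf(counts, sample_ids, variants, out):
    """Write a minimal VCF from an allele-count matrix (hypothetical helper).

    counts:     iterable of rows, one row per variant, one value per sample
                (nested lists or a 2-D NumPy array both work)
    sample_ids: list of sample names, in column order
    variants:   list of (chrom, pos, ref, alt) tuples, in row order (assumed known)
    out:        any writable text file object
    """
    # Alt-allele count -> unphased diploid genotype call
    gt_codes = {0: "0/0", 1: "0/1", 2: "1/1"}

    out.write("##fileformat=VCFv4.2\n")
    out.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
    header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    out.write("\t".join(header + list(sample_ids)) + "\n")

    for (chrom, pos, ref, alt), row in zip(variants, counts):
        # Values outside {0, 1, 2} (e.g. -1 or NaN) become missing calls
        gts = [gt_codes.get(c, "./.") for c in row]
        fields = [str(chrom), str(pos), ".", ref, alt, ".", ".", ".", "GT"]
        out.write("\t".join(fields + gts) + "\n")


# Toy example with made-up data: 2 variants x 3 samples, one missing call
buf = io.StringIO()
allele_counts_to_vcf(
    counts=[[0, 1, 2], [2, -1, 0]],
    sample_ids=["S1", "S2", "S3"],
    variants=[("1", 1000, "A", "G"), ("1", 2000, "C", "T")],
    out=buf,
)
vcf_text = buf.getvalue()
```

For real data you would pass an open file instead of the in-memory buffer, and the resulting file should then be loadable with `hl.import_vcf`. For a large matrix it is probably better to read the HDF5 dataset in chunks rather than all at once.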
Thanks for your reply!
I can use the h5py package to read the HDF5 file into Python and extract the genotype data as a NumPy matrix. Could I somehow convert it to a Hail MatrixTable, or to another format that Hail understands?