Read HDF5 file and run GWAS

Hi, I’m new to Hail. I would like to run a GWAS (linear regression), but the genotype data is stored in an HDF5 file. In the HDF5 file, I have every sample’s allele count at each locus. I think it should be straightforward to use these allele counts to run a GWAS in Hail, but I’m not sure about two things:

  1. How can I read HDF5 into Hail? Using the h5py Python package, I can read the .hdf5 file into Python (h5py.File('xx.hdf5', 'r')), but how can I pass the allele counts in this HDF5 file to Hail?
  2. Supposing the data can be passed to Hail, how do I run a GWAS using allele counts (which are 0, 1, 2)? I’ll also need to compute PCA; can I do that in Hail using allele counts? And how do I include other covariates stored in a separate file?

Here is one HDF5 file that you can check:

Thank you very much for your help!

Hi Liverpool! Thank you for your interest in Hail!

  1. Yes, you should be able to import HDFS files into Hail, which will then be formatted into a Hail matrix table for ease of computation.
  2. As for a linear regression on genotype or allele counts, I would highly suggest looking through our GWAS tutorial.
    If you would like a video tutorial:

Kumar, they are not talking about “HDFS”. It’s a 5, not an S.

After talking to John offline: I made an error. You would first have to convert the HDF5 file into a format that Hail can import, e.g. a VCF file.

An HDF5 file importer is something we should add in the future, but unfortunately we haven’t yet.

Thanks for your reply!
I can use the h5py package to read the HDF5 file into Python and extract the genotype data as a NumPy matrix. Could I somehow convert that to a Hail matrix table, or to another format that Hail can understand?

It’s not super easy to go directly from numpy => hail – it may be easier to go through a text intermediate and use hl.import_matrix_table.
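
A minimal sketch of the text-intermediate route, in case it helps. It assumes the allele counts have already been pulled out of the HDF5 file into a variants-by-samples matrix (a NumPy array works after `.tolist()`). The helper names `write_allele_count_tsv` and `import_allele_counts`, the `variant` row field, and the `n_alt` entry name are all made up for illustration; only `hl.import_matrix_table` and `MatrixTable.rename` are Hail's own.

```python
import csv


def write_allele_count_tsv(path, variant_ids, sample_ids, counts):
    """Write a variants-by-samples allele-count matrix as a tab-separated file.

    counts[i][j] is the allele count (0, 1 or 2) of sample j at variant i.
    A missing value given as None is written as "NA".
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        # Header: the row field name followed by one column ID per sample.
        writer.writerow(["variant"] + list(sample_ids))
        for variant_id, row in zip(variant_ids, counts):
            cells = ["NA" if c is None else int(c) for c in row]
            writer.writerow([variant_id] + cells)


def import_allele_counts(path):
    """Load the TSV into a Hail MatrixTable (needs Hail installed)."""
    import hail as hl  # imported lazily so the writer above works without Hail

    mt = hl.import_matrix_table(
        path,
        row_fields={"variant": hl.tstr},  # hypothetical row field name
        row_key="variant",
        entry_type=hl.tint32,
        missing="NA",
    )
    # import_matrix_table calls the entry field `x`; rename it for readability.
    return mt.rename({"x": "n_alt"})
```

After `write_allele_count_tsv("ac.tsv", variant_ids, sample_ids, matrix.tolist())`, calling `import_allele_counts("ac.tsv")` should give a matrix table keyed by `variant`, with samples under `col_id`. The numeric entry field can then be passed as `x` to `hl.linear_regression_rows`, and `hl.pca` accepts a numeric entry expression, so neither step needs genotype calls.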

Thank you @tpoterba! I will try that.