PLINK glm vs linreg3

Hey, I’m curious about trying linreg3 for GWAS on UK Biobank, and I’m wondering how it compares to PLINK 2.0’s --glm with logistic regression and Firth fallback?

linreg3 was a Hail 0.1 function that ran regressions on blocks of variants, using BLAS3 rather than BLAS2 routines to fit the linear model, which improved performance significantly. Hail 0.2 (which should definitely be used instead of Hail 0.1, and has seen ~8 person-years of work since 0.1 was deprecated) includes this functionality in the linear_regression_rows function when multiple y values are passed.
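For example, here is a minimal sketch of that batched call (the matrix table and field names are illustrative placeholders):

import hail as hl

# passing a list of response variables batches the regressions and
# uses the BLAS3 block code path; `mt` and its fields are placeholders
results = hl.linear_regression_rows(
    y=[mt.pheno1, mt.pheno2, mt.pheno3],
    x=mt.dosage,
    covariates=[1.0])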

However, I should note that although 0.2 is superior to 0.1 on almost every performance benchmark, import_bgen / linear regression is one of the only benchmarks where 0.2 is slightly slower than 0.1. This is due to the presence of an extra copy into Hail’s off-heap data representation, and the fact that our current infrastructure is incompatible with a special-cased BGEN=>linear regression optimization we used in 0.1. I expect that in the next few months, we’ll be able to reclaim 0.1’s performance, or improve it by using a faster on-disk format (zlib decompression is super slow!).

In order to write something similar to PLINK’s --glm, you’d need to explicitly choose linear or logistic regression depending on the type of phenotype (quantitative or binary); see the sketch below. We also don’t currently have built-in Firth fallback, though I don’t think that would be a hard feature to add.
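A rough version of that dispatch could look like the following (a hypothetical sketch, not a built-in feature; it assumes a matrix table mt with a pheno column field and a dosage entry field):

import hail as hl

# hypothetical dispatch on phenotype type, mimicking PLINK's --glm;
# `mt`, `pheno`, and `dosage` are illustrative placeholders
if mt.pheno.dtype == hl.tfloat64:
    # quantitative phenotype: linear regression
    results = hl.linear_regression_rows(
        y=mt.pheno, x=mt.dosage, covariates=[1.0])
else:
    # binary (0/1) phenotype: logistic regression (test='firth' is also
    # available, but there is no automatic Wald-to-Firth fallback)
    results = hl.logistic_regression_rows(
        test='wald', y=mt.pheno, x=mt.dosage, covariates=[1.0])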

Also, to answer what may be your original question: I believe that 0.1’s linreg3 and 0.2’s linear_regression_rows are faster than PLINK for multiple-phenotype regression, due to the BLAS3 optimizations I mentioned above. Hail will probably be a little slower for logistic regression in terms of CPU time (though Hail’s scalability means you can more easily use a cluster to parallelize compute), but it may be faster if you’re running on multiple phenotypes at once, since I think PLINK needs to go to disk for every phenotype, while Hail can batch them.

Thanks for the thorough response. A clarification question: what do you mean by PLINK has to do I/O while Hail can batch phenotypes? I can set up qsub on Sun Grid Engine to run one phenotype at a time, no problem. My main concern is that with large BGENs, ~150 GB per chromosome, I can’t reasonably merge them into one file and perform PCA and logistic regression. I understand from Ben Neale’s work using your software that it is quite feasible to run logistic regression using all of the chromosomes at once, but I’m a bit confused about how it would incorporate them all. I think PLINK would need to run them one by one because the BGENs are so large, or else hit OOM errors.

Maybe I’m misunderstanding something? Perhaps a throwaway example command might give some clarification. I’m probably just not getting it.

Jim Havrilla

what do you mean by PLINK has to do I/O while Hail can batch phenotypes?

I’m speaking to computational efficiency, which is related to cost when you’re paying marginally per CPU hour. It sounds like that’s not your core concern, though.

Hail’s scalability means that you can write the same code whether you’re running on a 1 GB dataset or a 1 TB one. You can process everything at once without having to worry about qsub/SGE!

Roughly, the code would look like this:

import hail as hl

# import all 22 autosomes at once; import_bgen accepts a list of files
list_of_bgens = [f'/path/to/chr{chrom}.bgen' for chrom in range(1, 23)]
mt = hl.import_bgen(list_of_bgens,
                    entry_fields=['GT', 'GP'],
                    sample_file='/path/to/sample/file')
# do any necessary sample/variant filtering here

# join phenotypes onto the matrix table columns
# (the phenotype table must be keyed by sample ID for the join on mt.s)
pheno_table = hl.import_table('/path/to/pheno_file.tsv', ...)
mt = mt.annotate_cols(pheno = pheno_table[mt.s])

gwas_results = hl.linear_regression_rows(
    y=[mt.pheno.pheno1, mt.pheno.pheno2, ...],
    x=mt.GP[1] + 2 * mt.GP[2],  # expected alt-allele dosage from GP
    covariates=[1.0, ...])      # intercept first, then PCs and other covariates
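
To write the results out, the returned table can be exported, e.g. (the path is a placeholder):

# export per-variant results to a block-gzipped TSV
gwas_results.export('/path/to/gwas_results.tsv.bgz')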

This will scale to the full biobank. Nothing in Hail assumes that you can put an entire dataset in memory – methods are out-of-core, running row-by-row through the data.

Thanks, and this is probably a stupid question (and possibly a bit unrelated), but is there a way to pass multiple BGENs in for PCA and logistic/linear regression, so the whole genome can be run at once instead of doing QC and merging into one unified file as with PLINK (at least, that’s how I understand PLINK works)?

Yes. import_bgen accepts a list of files.
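
For PCA on that same multi-file import, something like the following should work (a minimal sketch; it assumes GT was included in entry_fields at import):

# PCA across all imported BGENs at once; GT must be an entry field
eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt.GT, k=10)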

Thanks! Good to know I can use multiple files for PCA and linear regression at once! Another probably similar stupid question: can I pass multiple gVCFs in for a SKAT test? Or do I have to joint-call/merge them using GLnexus or GATK4? Currently, merging them is a big bottleneck for me.

You’ll want to merge first before running SKAT. However, Hail has a joint caller: https://hail.is/docs/0.2/experimental/vcf_combiner.html?highlight=vcf_combiner. Its output differs from that of GATK4 and similar tools: it is a “sparse MatrixTable” format that preserves REF blocks and is designed to support very large (1M+ sample) datasets. You’ll want to densify after joint calling to get an analog of the traditional joint-called VCF format; see the docs.
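A minimal sketch of that workflow (all paths here are placeholders; see the linked docs for the full set of options):

import hail as hl

# joint-call gVCFs into a sparse MatrixTable with the experimental combiner
gvcf_paths = ['/path/to/sample1.g.vcf.gz', '/path/to/sample2.g.vcf.gz']
hl.experimental.run_combiner(gvcf_paths,
                             out_file='/path/to/combined.mt',
                             tmp_path='/path/to/tmp')

# densify to get an analog of a traditional joint-called VCF
mt = hl.read_matrix_table('/path/to/combined.mt')
dense_mt = hl.experimental.densify(mt)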

Ok, so I should not use import_gvcfs (though the run did finish). Another question that was a bit unclear to me from the documentation (https://hail.is/docs/0.2/getting_started.html#installation): if I am running this on the head node of an SGE-based cluster (I would assume that’s a Spark cluster), is the Spark jobs page that I can access (in my case) from port 4040 based solely on the head node, or does Spark automatically use the machine resources on the other compute nodes? Normally I have to qsub on my cluster, and I know that spark-qsub and spark-qrsh exist, but I can’t seem to see how it’s distributing the memory usage with qstat. I just want to be certain I’m not going to crash the head node or anything like that if I run a full, non-test-size job – for now I’m just looking at one gene across all the gVCFs, and it seems to load successfully.

I also want to say you guys have been insanely helpful and I really appreciate it.

Spark and SGE are distinct cluster management systems. If a spark-qsub binary exists, then I imagine your IT department has somehow built a Spark-on-SGE system. Usually there’s a binary called spark-submit which accepts a Python file and starts a Spark job. Can you spark-qsub /path/to/hail-script.py?

We also maintain tools for using managed Spark clusters in Google Cloud.

You can also run Hail on one computer with many cores. Hail will use all the cores available on that machine.

Just to be a bit more explicit, try this:

cat >my-script.py <<EOF
import hail as hl
# tiny simulated dataset: 3 populations, 10 samples, 10 variants
mt = hl.balding_nichols_model(3, 10, 10)
mt.show()
EOF
spark-qsub my-script.py

Yeah, we don’t have spark-qsub set up. I was more curious how I could use Hail on the cluster to make the best use of my resources. My guess is that when I run it, it is just using the head node. I guess I’m asking: should I be starting an interactive session or a qsub job so that Hail can have access to more resources when I run it? Or am I limiting it, or perhaps breaking it, if I do so?

You can definitely run on a beefy interactive node or in a single qsub job that requests a node with lots of resources.

This will run Spark in local mode, which by default allocates all the cores it finds. You can set the number of cores used with the following in your Python script:

import hail as hl
hl.init(master='local[N]')  # replace N with the number of cores you want

One caveat: I ran it on the head node, and it was able to create matrix tables for a small gene in 49k VCFs no problem (still working on fixing vcf_combiner, though), but I haven’t tried the whole exome yet, so I don’t know how well that will work out.

These questions are best answered by the IT department that supports and maintains your SGE cluster. I’ve never seen Spark running on SGE before, so I suspect whatever they did is custom to your institution.

Awesome, thanks! Once I find a quick workaround for merging those gVCF matrix tables into one, I’m in business.

Jim Havrilla

Thanks, I think Tim answered my question pretty well. I do want to thank you all again for being very polite and trying to help me out. It’s definitely appreciated.

Jim Havrilla