[Breaking Change] Hail 0.2 removed KinshipMatrix, linear_mixed_regression


#1

(Update: Hail 0.2 has a function linear_mixed_regression_rows which most closely implements the old linear_mixed_regression function)

The KinshipMatrix class is gone. Instead, realized_relationship_matrix and genetic_relatedness_matrix both return a BlockMatrix with improved performance. The gcta_grm and gcta_grm_bin on KinshipMatrix are therefore gone as well. BlockMatrix supports export, as well as and conversion to_numpy for further processing if small enough (less than 46k samples, assuming sufficient memory on master).

The function linear_mixed_regression is also gone, in favor of an approach that is far more flexible and scalable. The core of this approach is the new class LinearMixedModel. While it’s already possible to implement the linear_mixed_regression in Python using this class, we will soon add additional Python functions / scripts to simplify the process and post an overview on the forum.


Log of breaking changes in 0.2 beta
#2

I am new to hail and was really excited to see that LMM can be run. I am using the guide here:

https://hail.is/docs/devel/stats/hail.stats.LinearMixedModel.html?highlight=kinship#hail.stats.LinearMixedModel.from_kinship.

I used:

model, p = LinearMixedModel.from_mixed_effects(y, x, z)

In this case, z was the output from:
z=hl.genetic_relatedness_matrix(mt.GT)

Is this the correct usage? The error message indicated that my z had more than 46k entries, but your description in the above post sounded like the 46k cutoff was only if converting to numpy. I am also unclear when to use .from_mixed_effects vs. .from_kinship and whether .genetic_relatedness_matrix is the correct way to build z/k or if another method is preferred. Thanks for your help.


#3

Hi Liz,

I think the higher-level commands linear_mixed_model and linear_mixed_regression will help clarify the interface. These are the simplest way to create a LinearMixedModel, fit the ratio of variance parameters, and do a GWAS. The commands from_kinship and from_mixed_effects are lower level constructors of LinearMixedMode, and called by linear_mixed_model depending on whether k or z_t is passed in. Creating a LinearMixedModel object directly is useful if you’re running regression for many phenotypes or sets of covariates (as in conditional analysis for fine mapping signal). That’s discussed in the docs for linear_mixed_regression.

There is a 46k cutoff for starting from a pre-computed kinship (which could be any positive semi-definite matrix). If you have more than 46k samples, then you’ll need to use fewer than 46k markers and pass in z_t rather than k. By default this corresponds to using the realized relationship matrix, RRM, though if you normalize variants yourself using expected variance under HWE rather than empirical variance you can get the effect of using the GRM. The restriction that both dimensions not exceed 46k is related to the max size of a Java array, since we localize either the kinship matrix or the linkage disequilibrium matrix and then take its eigendecomposition. We hope to engineer this restriction away in the future as well as add distributed eigendecomposition.

If you tell me how many samples you have, how many markers you’d like to use, and how many phenotypes you have on these samples, I may be able to recommend a strategy.


#4

Thanks so much for your fast response. The original post indicated that linear_mixed_regression is no longer a function, so I thought that the only way for me to do this type of analysis anymore was via LinearMixedModel. Are you saying I can still use linear_mixed_regression and linear_mixed_model if I call it in some way other than hl.linear_mixed_regression?

Thanks also for the clarification on the 46k sample | variant cutoff. To be clear, this does mean that if I have more than 46k people, then I can only do a gwas on 46k or fewer variants, correct? I was used to using FastLMM in the past and setting up a smaller list of LD-pruned variants for the relationship matrix, but it sounds like here the requirement is to use the complete set of variants I want to analyze in the gwas and have them included in the z_t.

I am currently only doing one phenotype but am using the UKBB data of >400k people. Thank you so much for any suggestions you have on how best to execute this.


#5

Sorry for the confusion: we did remove linear_mixed_regression months ago, and then recently added a command with the same name with a totally different API and implementation.

You can GWAS as many variants as you like with linear_mixed_regression. The restriction is on the number of samples or marker variants when defining the underlying linear_mixed_model. The latter command only uses the markers (z_t), which is why there’s a filter_rows in the example.

We’ve basically divided FastLMM into three steps:

  1. define the model using samples and either markers or kinship matrix
  2. fit the (global) model
  3. use the fit global model to perform a GWAS

Yes, you’d want to prune to a smaller list, biased toward more common variants. There is a data science question as to whether 46k variants will fully control for all inflation when using 400K+ samples, which we have yet to rigorously explore. We’re also interested in running this on UKBB to take advantage of savings over many phenotypes (after the work of doing the first one, the rest should run basically as fast as linear regression), but so far Hail’s new implementation has mainly been run on, say, 30k exomes. I’d recommend experimenting with GWAS on a subset of the genome before running all variants.


#6

ah, thank you, this is so incredibly helpful. it turns out we hadn’t updated our hail version in a bit, so we didn’t have this command available at all. we will get the latest version.

your description all makes sense, and I just want to make sure I have the commands right for the steps:

first, I make an LD-pruned matrix table with <46k variants and however many samples I want, and I use hl.realized_relationship_matrix to make the RRM (a.k.a., z_t).

Then I define my model using my phenotype, covariates, and that z_t.

then I fit the model, and then I use that fit global model to do a GWAS on however many samples and variants I want.

sound right?
thank you again!!


#7

Close! The point is, with more samples than markers, we want to avoid constructing the RRM at all (it would be 400k by 400k and we certainly aren’t going to take its eigendecomposition). The solution is to not use realized_relationship_matrix at all, but just pass in z_t = mt.dosage() or z_t = mt.GT.n_alt_alleles() or what have you on the filtered matrix table (and I recommend repartitioning after massively filtering before calling linear_mixed_model). These markers ARE the random effects in the mixed model (see the top level docs in the LinearMixedModel class), and under the hood from_random_effects will compute the correlation matrix between the random effects (the linkage disequilibrium matrix) and convert its eigenvectors to the the top 46k eigenvectors of the RRM, which are all the interesting eigenvectors since the rank of the RRM is bounded by the number of markers.


#8

oh! wonderful, this makes perfect sense. thank you again!!


#9

Hi, thanks again for your input. I got LMM to work in Hail with a smaller dataset but have hit an error when working with the full UK biobank dataset. Here is the command and error:

model, _ =hl.linear_mixed_model(y=mtn_filter1.Height1,x=[1, mtn_filter1.YOB, mtn_filter1.is_female],z_t=mtn_filter1.GT.n_alt_alleles(),p_path=‘output2’)

ValueError: incompatible shapes for matrix multiplication: (13373, 426047) and (1, 426047)

Is this error expected? It is 13373 variants and 426047 people.


#10

Hmm, looks like a transpose issue. Can you add the full stack trace so I can see what line is throwing the error?


#11

ValueError Traceback (most recent call last)
in ()
----> 1 model, _ =hl.linear_mixed_model(y=mtn_filter1.Height1,x=[1, mtn_filter1.YOB, mtn_filter1.is_female],z_t=mtn_filter1.GT.n_alt_alleles(),p_path=‘output2’)

in linear_mixed_model(y, x, z_t, k, p_path, overwrite, standardize, mean_impute)

/efs/home/hail/hail/python/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
558 def wrapper(original_func, *args, **kwargs):
559 args
, kwargs
= check_all(__original_func, args, kwargs, checkers, is_method=is_method)
–> 560 return original_func(*args, **kwargs)
561
562 return wrapper

/efs/home/hail/hail/python/hail/methods/statgen.py in linear_mixed_model(y, x, z_t, k, p_path, overwrite, standardize, mean_impute)
795 normalize=standardize).T # variance is 1 / n
796 m = z_bm.shape[1]
–> 797 model, p = LinearMixedModel.from_random_effects(y_nd, x_nd, z_bm, p_path, overwrite)
798 if standardize:
799 model.s = model.s * (n / m) # now variance is 1 / m

in from_random_effects(cls, y, x, z, p_path, overwrite, max_condition_number, complexity_bound)

/efs/home/hail/hail/python/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
558 def wrapper(original_func, *args, **kwargs):
559 args
, kwargs
= check_all(__original_func, args, kwargs, checkers, is_method=is_method)
–> 560 return original_func(*args, **kwargs)
561
562 return wrapper

/efs/home/hail/hail/python/hail/stats/linear_mixed_model.py in from_random_effects(cls, y, x, z, p_path, overwrite, max_condition_number, complexity_bound)
1106 BlockMatrix.from_numpy§.write(p_path, overwrite=overwrite)
1107 if p_is_bm:
-> 1108 py, px = (p @ y).to_numpy(), (p @ x).to_numpy()
1109 else:
1110 py, px = p @ y, p @ x

in matmul(self, b)

/efs/home/hail/hail/python/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
558 def wrapper(original_func, *args, **kwargs):
559 args
, kwargs
= check_all(__original_func, args, kwargs, checkers, is_method=is_method)
–> 560 return original_func(*args, **kwargs)
561
562 return wrapper

/efs/home/hail/hail/python/hail/linalg/blockmatrix.py in matmul(self, b)
1440 “”"
1441 if isinstance(b, np.ndarray):
-> 1442 return self @ BlockMatrix.from_numpy(b, self.block_size)
1443 else:
1444 if self.n_cols != b.n_rows:

in matmul(self, b)

/efs/home/hail/hail/python/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
558 def wrapper(original_func, *args, **kwargs):
559 args
, kwargs
= check_all(__original_func, args, kwargs, checkers, is_method=is_method)
–> 560 return original_func(*args, **kwargs)
561
562 return wrapper

/efs/home/hail/hail/python/hail/linalg/blockmatrix.py in matmul(self, b)
1443 else:
1444 if self.n_cols != b.n_rows:
-> 1445 raise ValueError(f’incompatible shapes for matrix multiplication: {self.shape} and {b.shape}’)
1446 return BlockMatrix(self._jbm.dot(b._jbm))
1447

ValueError: incompatible shapes for matrix multiplication: (13373, 426047) and (1, 426047)


#12

Thanks for finding this bug and your patience, it’s now fixed!

(the issue was that the 1d NumPy array y was being promoted to a row vector rather than a column vector)


#13

oh lovely, thanks! I can’t wait to give it a try.


#14

so… I’m back. I have a new error. this command worked in a smaller dataset but not in the full big ukbb dataset. full stack trace below:

2018-10-22 17:54:59 Hail: INFO: Wrote all 420 blocks of 13373 x 426047 matrix with block size 4096.
2018-10-22 20:38:59 Hail: INFO: wrote 420 items in 420 partitions

ValueError Traceback (most recent call last)
in ()
1 #test on simple model
2 #testing on a smaller dataset worked, but this doesn’t
----> 3 model, _ =hl.linear_mixed_model(y=mtn_filter1.Height1,x=[1, mtn_filter1.YOB, mtn_filter1.is_female],z_t=mtn_filter1.GT.n_alt_alleles(),p_path=‘s3://output5’)

in linear_mixed_model(y, x, z_t, k, p_path, overwrite, standardize, mean_impute)

Code/Hail-devel-90a5cab4aab8/python/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
558 def wrapper(original_func, *args, **kwargs):
559 args
, kwargs
= check_all(__original_func, args, kwargs, checkers, is_method=is_method)
–> 560 return original_func(*args, **kwargs)
561
562 return wrapper

Code/Hail-devel-90a5cab4aab8/python/hail/methods/statgen.py in linear_mixed_model(y, x, z_t, k, p_path, overwrite, standardize, mean_impute)
874 normalize=standardize).T # variance is 1 / n
875 m = z_bm.shape[1]
–> 876 model, p = LinearMixedModel.from_random_effects(y_nd, x_nd, z_bm, p_path, overwrite)
877 if standardize:
878 model.s = model.s * (n / m) # now variance is 1 / m

in from_random_effects(cls, y, x, z, p_path, overwrite, max_condition_number, complexity_bound)

Code/Hail-devel-90a5cab4aab8/python/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
558 def wrapper(original_func, *args, **kwargs):
559 args
, kwargs
= check_all(__original_func, args, kwargs, checkers, is_method=is_method)
–> 560 return original_func(*args, **kwargs)
561
562 return wrapper

Code/Hail-devel-90a5cab4aab8/python/hail/stats/linear_mixed_model.py in from_random_effects(cls, y, x, z, p_path, overwrite, max_condition_number, complexity_bound)
1111
1112 if low_rank:
-> 1113 model = LinearMixedModel(py, px, s, y, x, p_path)
1114 else:
1115 model = LinearMixedModel(py, px, s, p_path=p_path)

in init(self, py, px, s, y, x, p_path)

Code/Hail-devel-90a5cab4aab8/python/hail/typecheck/check.py in wrapper(__original_func, *args, **kwargs)
558 def wrapper(original_func, *args, **kwargs):
559 args
, kwargs
= check_all(__original_func, args, kwargs, checkers, is_method=is_method)
–> 560 return original_func(*args, **kwargs)
561
562 return wrapper

Code/Hail-devel-90a5cab4aab8/python/hail/stats/linear_mixed_model.py in init(self, py, px, s, y, x, p_path)
289 raise ValueError(‘for low-rank, set both y and x; for full-rank, do not set y or x.’)
290
–> 291 _check_dims(py, ‘py’, 1)
292 _check_dims(px, ‘px’, 2)
293 _check_dims(s, ‘s’, 1)

Code/Hail-devel-90a5cab4aab8/python/hail/linalg/utils/misc.py in _check_dims(a, name, ndim, min_size)
231 def _check_dims(a, name, ndim, min_size=1):
232 if len(a.shape) != ndim:
–> 233 raise ValueError(f’{name} must be {ndim}-dimensional, ’
234 f’found {a.ndim}’)
235 for i in range(ndim):

ValueError: py must be 1-dimensional, found 2


#15

can we move this to a separate thread? These update posts should really be singleton posts.

Just copy the last post into a new topic in #help-discuss


#16

Oh goodness, I’ll fix this today. (don’t bother moving the post, the error is related, apologies for the inconvenience)


#17

github is having some issues, but once this is in you should be all set: https://github.com/hail-is/hail/pull/4606


#18

update: it’s fixed.