Question regarding threshold for hail imputation/"No complete samples" error

So I am trying to run a GWAS pipeline on a large number of phenotypes (~1300) for a specific chromosome. I am getting the following error:

FatalError: HailException: No complete samples: each sample is missing its phenotype or some covariate

Java stack trace:
is.hail.utils.HailException: No complete samples: each sample is missing its phenotype or some covariate

This gives me the impression that some of the ~1300 phenotypes are NA for every sample. I filtered the data and found that while none of them have 100% NA values, quite a few have >99% NA values. If my line of thinking is right, there is a threshold at which Hail fails to impute values for a field. I had assumed that threshold was 100% NA values, but since no phenotype reaches it, I want to know whether such a threshold exists and what it is. If my understanding of the error is wrong, please feel free to correct me and let me know what the issue is.

Hi @cobalt !

Hail does not impute values. We made an intentional & pervasive design choice to require our users to be explicit about things like imputation, filtering, or centering.

I suspect you’ll find that all of your samples are missing at least one of the phenotype or any covariate. You can prove this to yourself with this Hail query:

mt = mt.annotate_cols(
    can_be_used_for_regression = (
        hl.is_defined(mt.phenotype) &
        hl.is_defined(mt.covariate_1) &
        hl.is_defined(mt.covariate_2) &
        ...
    )
)
mt.can_be_used_for_regression.show()

You might also try inspecting these fields directly, for example, by collecting them into a local Pandas DataFrame and looking at that.

cols = mt.cols().select('phenotype', 'covariate_1', 'covariate_2', ...)
cols = cols.to_pandas()
cols

Ah, and I forgot to mention, if you want to mean impute the missing values, you can try something like this:

means = mt.aggregate_cols(hl.struct(
    mean_phenotype = hl.agg.mean(mt.phenotype),
    mean_covariate_1 = hl.agg.mean(mt.covariate_1),
    ...
))
mt = mt.annotate_cols(
    phenotype = hl.coalesce(mt.phenotype, means.mean_phenotype),
    ...
)
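
To make the semantics concrete, here is a minimal plain-Python sketch of what mean imputation does to a single field. It does not use Hail; `mean_impute` is a hypothetical helper, and `None` stands in for a missing value. The mean is taken over the observed values only, and each missing value is replaced by it.

```python
from statistics import fmean


def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to average: this sketch leaves the field unchanged.
        return list(values)
    mean = fmean(observed)
    return [mean if v is None else v for v in values]


phenotype = [1.0, None, 3.0, None]
imputed = mean_impute(phenotype)
print(imputed)
```

A field with no observed values at all cannot be mean-imputed; what to do in that case is a choice left to you.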

Hi @danking , thank you for your precise and informative answers. I have some followup questions I would like to ask. Firstly, I was under the impression that the linear_regression_rows() function performed imputation of missing values. Is this assumption wrong? Secondly, the code you provided works given mt.phenotype is a float or float expression, but if my mt.phenotype is a structure of a collection of float expressions, it fails. Do you have any advice regarding this dilemma? Again, thank you for the reply, appreciate it.

Hey @cobalt !

Do you recall where you got that impression? I’d like to update the documentation to be more clear! The method linear_regression_rows will remove incomplete samples (that leads to the error you see above), but Hail will never modify the covariates or independent variables.

Do you mean that you have a Python list of response variables? For example,

results = hl.linear_regression_rows(
    y=[mt.pheno1, mt.pheno2, ...],
    ...
)

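You can compute the mean of each phenotype across samples and store it as a column field like this:

phenos = ['pheno1', 'pheno2', ...]
# Aggregate over columns (samples) to get one mean per phenotype.
means = mt.aggregate_cols(hl.struct(**{
    p: hl.agg.mean(mt[p]) for p in phenos
}))
mt = mt.annotate_cols(**{
    p + '_mean': means[p] for p in phenos
})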

This uses the Python ** syntax to create lots of new fields programmatically. It also uses the mt[string] syntax which lets you use the string name of a field to refer to it. These two expressions refer to the same field:

mt.field_one
mt['field_one']
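
If the ** syntax is unfamiliar, this small plain-Python sketch may help. It has nothing Hail-specific in it: `annotate` is a hypothetical stand-in for any function that accepts new fields as keyword arguments, and `record` is made-up data. Unpacking a dictionary comprehension with ** is equivalent to writing each keyword argument by hand.

```python
def annotate(**fields):
    # Hypothetical stand-in for a method that takes new fields as keyword arguments.
    return dict(fields)


phenos = ['pheno1', 'pheno2']
record = {'pheno1': 1.5, 'pheno2': None}

# Writing each keyword argument out by hand...
by_hand = annotate(
    pheno1_seen=record['pheno1'] is not None,
    pheno2_seen=record['pheno2'] is not None,
)

# ...is the same as building the names from strings and unpacking with **.
generated = annotate(**{p + '_seen': record[p] is not None for p in phenos})

print(by_hand == generated)
```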

You can then mean-impute all your phenotypes like this:

mt = mt.annotate_cols(**{
    p: hl.coalesce(mt[p], mt[p + '_mean']) for p in phenos
})

Hi @danking , you stated in your first reply that "you’ll find that all of your samples are missing at least one of the phenotype or any covariate". I understand why this would cause an issue for linear_regression_rows. You also mentioned that no imputation is performed unless explicitly specified. But when I filter my ~1000 phenotypes by percentage of missing values, I am able to run linear_regression_rows on phenotypes that still have <=60% missing values. Doesn’t this contradict the claim that linear_regression_rows will fail if any sample has at least one missing phenotype? I checked the table and there are many instances where a sample has more than one missing phenotype, but the regression still works. I am confused by this apparent contradiction. Any clarification would be greatly appreciated.

It has nothing to do with how one sample varies across all its phenotypes. It also has nothing to do with the percent of missing values.

Construct a matrix whose columns are samples and whose rows are: a particular phenotype, the first covariate, the second covariate, …, the last covariate. You can’t do a linear regression unless there is at least one column with no missing values.
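
Here is a small plain-Python sketch of that check, with made-up data and `None` standing in for a missing value. `complete_samples` is a hypothetical helper, not a Hail function. Both phenotypes below are 60% missing, yet only one of them has a usable sample, because what matters is whether the phenotype and all covariates are defined for the same sample.

```python
def complete_samples(phenotype, covariates):
    """Indices of samples where the phenotype and every covariate are present."""
    return [
        i for i in range(len(phenotype))
        if phenotype[i] is not None
        and all(cov[i] is not None for cov in covariates)
    ]


# Five samples, two covariates (made-up values).
age = [30.0, 41.0, None, 55.0, 62.0]
sex = [0.0, 1.0, 1.0, 0.0, None]
covariates = [age, sex]

pheno_a = [None, 2.1, None, None, 0.7]  # 60% missing
pheno_b = [None, None, 1.3, None, 0.4]  # 60% missing

usable_a = complete_samples(pheno_a, covariates)
usable_b = complete_samples(pheno_b, covariates)

print(usable_a)  # sample 1 has the phenotype and both covariates
print(usable_b)  # every sample with pheno_b lacks a covariate
```

In this toy data, a regression on pheno_b would have no complete samples even though it is no sparser than pheno_a.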