Question regarding threshold for hail imputation/"No complete samples" error

cobalt · September 26, 2022, 7:46am

I am trying to run a GWAS pipeline on a large number of phenotypes (~1300) for a specific chromosome, and I am hitting the following error:

FatalError: HailException: No complete samples: each sample is missing its phenotype or some covariate

Java stack trace:
is.hail.utils.HailException: No complete samples: each sample is missing its phenotype or some covariate

This gives me the impression that some of the ~1300 phenotypes are NA for every sample. I applied some filters and found that, while none of them have 100% NA values, quite a few have >99% NA values. My question is whether there is a threshold of missingness above which Hail fails to impute values for a field. I had assumed the threshold would be 100% NA values, but since I couldn’t find any such field, I want to know whether a lower threshold exists. If my understanding of the error is wrong, please feel free to correct me and let me know what the issue is.

danking · September 26, 2022, 2:11pm

Hi @cobalt !

Hail does not impute values. We made an intentional & pervasive design choice to require our users to be explicit about things like imputation, filtering, or centering.

I suspect you’ll find that every one of your samples is missing either the phenotype or at least one covariate. You can check this yourself with this Hail query:

mt = mt.annotate_cols(
    can_be_used_for_regression = (
        hl.is_defined(mt.phenotype) &
        hl.is_defined(mt.covariate_1) &
        hl.is_defined(mt.covariate_2) &
        ...
    )
)
mt.can_be_used_for_regression.show()

You might also try inspecting these fields directly, for example, by collecting them into a local Pandas DataFrame and looking at that.

cols = mt.cols().select('phenotype', 'covariate_1', 'covariate_2', ...)
cols = cols.to_pandas()
cols

Ah, and I forgot to mention, if you want to mean impute the missing values, you can try something like this:

means = mt.aggregate_cols(hl.struct(
    mean_phenotype = hl.agg.mean(mt.phenotype),
    mean_covariate_1 = hl.agg.mean(mt.covariate_1),
    ...
))
mt = mt.annotate_cols(
    phenotype = hl.coalesce(mt.phenotype, means.mean_phenotype),
    ...
)
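To make the semantics concrete, here is a minimal plain-Python sketch of what that snippet does, with no Hail involved: the mean is taken over the defined values only, and each missing value then falls back to it. The helper name mean_impute and the sample values are made up for illustration, and None stands in for a Hail missing value.

```python
from statistics import fmean

def mean_impute(values):
    """Replace None with the mean of the defined values (hypothetical helper)."""
    defined = [v for v in values if v is not None]
    if not defined:
        # Nothing to average: an entirely missing field cannot be mean-imputed.
        return list(values)
    mean = fmean(defined)
    return [mean if v is None else v for v in values]

phenotype = [1.0, None, 3.0, None]
print(mean_impute(phenotype))  # [1.0, 2.0, 3.0, 2.0]
```

Note that a field with no defined values at all stays missing, so mean imputation alone will not rescue it.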

cobalt · September 28, 2022, 1:27pm

Hi @danking , thank you for your precise and informative answers. I have some follow-up questions. Firstly, I was under the impression that the linear_regression_rows() function performed imputation of missing values. Is this assumption wrong? Secondly, the code you provided works when mt.phenotype is a float or float expression, but it fails if mt.phenotype is a struct containing a collection of float expressions. Do you have any advice for that case? Again, thank you for the reply, I appreciate it.

danking · September 28, 2022, 4:05pm

Hey @cobalt !

Do you recall where you got that impression? I’d like to update the documentation to be more clear! The method linear_regression_rows will remove incomplete samples (that leads to the error you see above), but Hail will never modify the covariates or independent variables.

Do you mean that you have a Python list of response variables? For example,

gwas = hl.linear_regression_rows(
    y=[mt.pheno1, mt.pheno2, ...],
    ...
)

You can compute the mean of every phenotype like this:

phenos = ['pheno1', 'pheno2', ...]
# aggregate_cols aggregates over samples (columns) and returns a local struct of means
means = mt.aggregate_cols(hl.struct(**{
    p: hl.agg.mean(mt[p]) for p in phenos
}))

This uses the Python ** syntax to create lots of new fields programmatically. It also uses the mt[string] syntax which lets you use the string name of a field to refer to it. These two expressions refer to the same field:

mt.field_one
mt['field_one']

You can then mean-impute all your phenotypes like this:

# replace each missing phenotype value with that phenotype's mean
mt = mt.annotate_cols(**{
    p: hl.coalesce(mt[p], means[p]) for p in phenos
})

cobalt · October 19, 2022, 7:45am

Hi @danking , you stated in your first reply that " you’ll find that all of your samples are missing at least one of the phenotype or any covariate". I understand why this would cause an issue for the linear regression rows function. You also mention that no imputation is performed unless explicitly specified. But when I filter my ~1000 phenotypes based on percentage of missing values, I am able to run linear regression rows on phenotypes that still have <=60% missing values. Doesn’t this contradict the claim that linear regression rows will fail if any sample has at least 1 missing phenotype? I checked the table and there are many instances where a sample has more than 1 missing phenotype, yet linear regression rows still works. I am confused by this apparent contradiction. Any clarification would be greatly appreciated.

danking · October 19, 2022, 1:36pm

It has nothing to do with how one sample varies across all its phenotypes. It also has nothing to do with the percent of missing values.

For each phenotype separately, construct a matrix whose columns are samples and whose rows are: that particular phenotype, the first covariate, the second covariate, …, the last covariate. You can’t do a linear regression unless there is at least one column with no missing values.
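Here is a small plain-Python sketch of that matrix view, using made-up data and a hypothetical helper count_complete (None stands in for a missing value; nothing here is Hail API). For each phenotype it counts the samples where the phenotype and every covariate are all defined.

```python
def count_complete(phenotype, covariates):
    """Count samples where the phenotype and all covariates are defined."""
    return sum(
        1
        for i, y in enumerate(phenotype)
        if y is not None and all(c[i] is not None for c in covariates)
    )

# Five samples, two covariates; the last sample lacks the second covariate.
covariates = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.0, 1.0, 0.0, 0.0, None],
]
phenotypes = {
    # 60% missing, but two samples are still complete.
    "pheno_a": [2.5, None, None, 3.1, None],
    # Defined only for the sample that lacks a covariate: no complete samples.
    "pheno_b": [None, None, None, None, 4.2],
}
complete = {name: count_complete(y, covariates) for name, y in phenotypes.items()}
print(complete)  # {'pheno_a': 2, 'pheno_b': 0}
```

In this toy data pheno_a is usable despite being 60% missing, while pheno_b has zero complete samples even though it is not entirely NA, which is the situation the "No complete samples" error describes.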
