Manipulate columns from the first occurrences table from the UK Biobank in Hail

I’m attempting to manipulate columns from the first occurrences table from the UK Biobank in Hail. The table is a collection of columns (each of length n = number of patients), each headed by a three-letter ICD code. Each row value is either an empty string ("") or the date of first occurrence (e.g., 2002-10-1) of that ICD code in the participant’s medical data. There is also a column of patient identifiers (to match rows with patients).

I am hoping to replace the dates in each row with the ICD code header for the column (and then gather the defined values for each patient). This seems to require iterating over each column, and I can’t figure out how to accomplish this with a Hail table and Hail syntax.

This Stack Overflow question asks and answers a similar question using pandas: https://stackoverflow.com/questions/37032043/how-to-replace-a-value-in-a-pandas-dataframe-with-column-name-based-on-a-conditi

Some things I’ve tried (with the table described above loaded as fo_table):

Option #1: (ExpressionException: Cannot index with a scalar expression)

# grab the list of column keys
fo_table_cols = dict(fo_table.row_value).keys()

fo_table2=fo_table.annotate(icd=hl.set(hl.array(list(fo_table_cols)).map(lambda x: hl.if_else(hl.is_defined(fo_table[x]), x, hl.null(hl.tstr)))))

Option #2: (ExpressionException: Hail cannot automatically impute type of <class 'collections.abc.KeysView'>)

fo_table2=fo_table.annotate(icds_to_keep=hl.array(fo_table.row.keys()).map(lambda k: hl.if_else(fo_table.row()[k]!="", k, "")))

Thanks in advance,

Kelly

Alright, so I think the rough structure of what you want is:

# Iterating over a struct expression yields its field names.
all_field_names = list(ht.row)
# Now make sure the field names list contains only the fields you want to do this for;
# add a condition in place of the ..... to test for that.
field_names = [field_name for field_name in all_field_names if .....]

ht = ht.annotate(**{field_name: hl.if_else(ht[field_name] != "", field_name, "") for field_name in field_names})

The Python ** syntax is a little tricky if you’ve never seen it before; it just unpacks a Python dictionary into keyword arguments (kwargs).
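
For anyone who hasn't seen it, here is a small plain-Python sketch of what ** does. The annotate function below is a hypothetical stand-in that only echoes its keyword arguments (it is not Hail's Table.annotate), and the field names are made up:

```python
def annotate(**new_fields):
    # Hypothetical stand-in: simply return the keyword arguments received.
    return new_fields


field_names = ["A01", "B02"]  # made-up column names

# Build a dict mapping each field name to its new value.
new_fields = {name: name.lower() for name in field_names}

# Unpacking the dict with ** is the same as writing each keyword by hand.
unpacked = annotate(**new_fields)
explicit = annotate(A01="a01", B02="b02")
```

Both calls receive the same keyword arguments, which is why a dict comprehension over the field names can stand in for typing out one keyword per column.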

EDIT: Added closing paren

Also, in case it’s helpful, I’ve parsed this file into a Hail MatrixTable before: https://github.com/Nealelab/ukb_common/blob/master/utils/phenotype_loading.py#L613 - the code isn’t super pretty, but it gets the job done.


That did it! (just missing a closing paren on the last command…)

Thanks for the help and introduction to the python ** syntax.
