Manipulate columns from the first occurrences table from the UK Biobank in hail

ksbarrett · October 21, 2020, 4:46pm

I’m attempting to manipulate columns from the first occurrences table from the UK Biobank in Hail. The table is a collection of columns (length n = number of patients), each headed by a three-letter ICD code. Each row value is either an empty string ("") or the date of first occurrence (e.g., 2002-10-1) of that ICD code in the participant’s medical data. There is also a column of patient identifiers (to match rows with patients).

I am hoping to replace the dates in each row with the ICD code header for the column (and then gather the defined values for each patient). This seems to require iterating over each column, and I can’t figure out how to accomplish this using a Hail table and Hail syntax.

This Stack Overflow question asks and answers something similar using pandas: https://stackoverflow.com/questions/37032043/how-to-replace-a-value-in-a-pandas-dataframe-with-column-name-based-on-a-conditi

Some things I’ve tried (with the table described above loaded as fo_table):

Option #1: (ExpressionException: Cannot index with a scalar expression)

# grab the list of column keys
fo_table_cols = dict(fo_table.row_value).keys()

fo_table2=fo_table.annotate(icd=hl.set(hl.array(list(fo_table_cols)).map(lambda x: hl.if_else(hl.is_defined(fo_table[x]), x, hl.null(hl.tstr)))))

Option #2: (ExpressionException: Hail cannot automatically impute type of <class 'collections.abc.KeysView'>)

fo_table2=fo_table.annotate(icds_to_keep=hl.array(fo_table.row.keys()).map(lambda k: hl.if_else(fo_table.row()[k]!="", k, "")))

Thanks in advance,

Kelly

johnc1231 · October 21, 2020, 4:53pm

Alright, so I think the rough structure of what you want is:

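# Iterating over ht.row yields the names of all row fields (key fields included).
all_field_names = [field_name for field_name in ht.row]
# Make sure the field names list contains only the fields you want to do this for
# (for example, not the patient identifier column). Add a condition in place of
# the ..... to test for that.
field_names = [field_name for field_name in all_field_names if .....]

# Replace each non-empty value with the name of its column; leave empty strings as they are.
ht = ht.annotate(**{field_name: hl.if_else(ht[field_name] != "", field_name, "") for field_name in field_names})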

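The Python ** syntax is a little tricky if you’ve never seen it before: it just passes a Python dictionary as keyword arguments (kwargs), so each key becomes an argument name and each value becomes that argument’s value.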

EDIT: Added closing paren
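
In case it helps anyone new to it, here is a minimal sketch of the ** unpacking in plain Python, with no Hail involved. The annotate function below is a hypothetical stand-in (it is not Hail's Table.annotate), and the column names and row values are made up for illustration; the point is only that a dictionary built by a comprehension can be passed as keyword arguments.

```python
def annotate(**fields):
    # Hypothetical stand-in for a method that accepts keyword arguments.
    # It returns what it received so the unpacking can be inspected.
    return fields


# Made-up three-letter ICD column names and one made-up patient row.
field_names = ["A01", "B02"]
row = {"A01": "2002-10-1", "B02": ""}

# Build {column name: new value}, keeping the column name where a date is present.
new_fields = {name: (name if row[name] != "" else "") for name in field_names}

# These two calls are equivalent: ** turns each dictionary key into an argument name.
unpacked = annotate(**new_fields)
explicit = annotate(A01="A01", B02="")

print(unpacked)
```

The comprehension inside the annotate call in the post above plays the same role as new_fields here, except that its values are Hail expressions rather than plain strings.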

konradjk · October 21, 2020, 4:55pm

Also, in case it’s helpful, I’ve parsed this file into a Hail MatrixTable before: https://github.com/Nealelab/ukb_common/blob/master/utils/phenotype_loading.py#L613 - the code isn’t super pretty, but it gets the job done.

ksbarrett · October 21, 2020, 5:11pm

That did it! (just missing a closing paren on the last command…)

Thanks for the help and the introduction to the Python ** syntax.
