Selecting samples from a MatrixTable using a list of IDs

Hello world,

I have just started getting my feet wet in Hail, and already have stumbled upon an issue I need help with.
I want to select a subset of samples from my Hail MatrixTable, given a list of sample IDs.
I tried some solutions mentioned in previous posts, but I keep getting an error.
I tried running:

filtered_mt = mt.filter_cols(hl.array(selectedIndividuals['ID']).contains(mt.s))

This gives me an error that says:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I’m sure I’m missing something quite simple here. Can anyone in the Hail community please help me out with this?

Thanks in advance,
Joy

hi joy,

i think i need a little more information to help out here. what does selectedIndividuals look like? feel free to replace the actual data with example data; i just need to see what the type and structure of it is.

–iris

Hi @iris-garden,
Thanks for your reply. selectedIndividuals is a pandas dataframe, with the sample identifiers in the “ID” column.
I want to subset the matrixTable mt with only the samples listed in the ‘ID’ column.

Thanks again for the help. Let me know if any other details might help.

Best,
Joydeep

okay, looks like the issue is that we’re not able to directly convert a pandas DataFrame column into a hail array. one option would be to convert the column to a Python list first:

filtered_mt = mt.filter_cols(
    hl.array(selectedIndividuals['ID'].tolist()).contains(mt.s)
)

but if your DataFrame has a lot of values, that might end up being slow, so instead you could convert the DataFrame to a hail Table, and filter using that instead:

# convert the whole DataFrame, keyed by ID so it can be joined on mt.s:
ht = hl.Table.from_pandas(selectedIndividuals).key_by("ID")

# or convert just the ID column, if you'd prefer:
ids = selectedIndividuals[["ID"]]
ht = hl.Table.from_pandas(ids).key_by("ID")

# filter using the Table we just created
# (ht[mt.s] is missing for any sample whose ID is not in ht)
mt = mt.filter_cols(hl.is_defined(ht[mt.s]))

hope that helps!
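
For anyone who lands here with the same ValueError: the message itself comes from pandas, which raises it whenever a whole Series is used where a single True/False is expected. Presumably something in the conversion to a Hail array does that, though I have not traced exactly where. Below is a minimal pandas-only sketch of the difference between the Series and the list returned by .tolist(). The DataFrame contents and the name `selected` are made up for illustration, since the real IDs are not shown in this thread.

```python
import pandas as pd

# Hypothetical stand-in for selectedIndividuals (the real data is not shown).
selected = pd.DataFrame({"ID": ["sample_1", "sample_2", "sample_3"]})

id_column = selected["ID"]    # a pandas Series, not a plain Python sequence
id_list = id_column.tolist()  # a plain Python list of str

# Treating the Series as a single boolean reproduces the error from the question.
try:
    bool(id_column)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(type(id_column).__name__)
print(type(id_list).__name__, id_list)
print(error_message)
```

The list is the kind of plain Python value that hl.array is being given in the .tolist() version of the filter above, which is why that version avoids the error.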