I have just started getting my feet wet in Hail, and already have stumbled upon an issue I need help with.
I want to select a subset of samples from my Hail MatrixTable, using a list of sample IDs.
I tried some solutions mentioned in previous posts, but I keep getting an error.
I tried running:
filtered_mt = mt.filter_cols(hl.array(selectedIndividuals['ID']).contains(mt.s))
This gives me an error that says:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I’m sure I’m missing something quite simple here. Can anyone in the Hail community please help me out with this?
Thanks in advance,
i think i need a little more information to help out here. what does selectedIndividuals look like? feel free to replace the actual data with example data; i just need to see what the type and structure of it is.
Hi @iris-garden ,
Thanks for your reply.
selectedIndividuals is a pandas dataframe, with the sample identifiers in the “ID” column.
I want to subset the matrixTable mt with only the samples listed in the ‘ID’ column.
Thanks again for the help. Let me know if any other details might help.
okay, looks like the issue is that we're not able to directly convert a pandas DataFrame column into a hail array. one option would be to convert the column to a Python list first:
filtered_mt = mt.filter_cols(hl.array(selectedIndividuals['ID'].tolist()).contains(mt.s))
but if your DataFrame has a lot of values, that might end up being slow, so instead you could convert the DataFrame to a hail Table, and filter using that instead:
# convert the whole DataFrame, keyed by ID so it can be indexed with mt.s:
ht = hl.Table.from_pandas(selectedIndividuals, key="ID")
# or convert just the ID column, if you'd prefer:
ids = pd.DataFrame()
ids["ID"] = selectedIndividuals["ID"]
ht = hl.Table.from_pandas(ids, key="ID")
# filter using the Table we just created; the ID values need to be strings to match mt.s
mt = mt.filter_cols(hl.is_defined(ht[mt.s]))
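if the ID column might contain missing values, duplicates, or non-string values, it can help to clean it up before handing it to hail. here's a minimal sketch of that idea, not a definitive implementation: `normalize_ids` and `keep_samples` are hypothetical helper names (not part of hail), and it assumes the column key of the MatrixTable is `s`, as in this thread. it uses a hail set literal rather than an array for the membership test.

```python
import math


def normalize_ids(values):
    """Return unique, non-missing IDs as stripped strings, in first-seen order."""
    seen = {}
    for v in values:
        # skip None and NaN, which pandas uses for missing entries
        if v is None or (isinstance(v, float) and math.isnan(v)):
            continue
        seen.setdefault(str(v).strip(), None)
    return list(seen)


def keep_samples(mt, values):
    """Keep only the columns of `mt` whose `s` field appears in `values`."""
    import hail as hl  # imported here so normalize_ids can be used without hail

    # the explicit type lets this work even if the cleaned list is empty
    wanted = hl.literal(set(normalize_ids(values)), 'set<str>')
    return mt.filter_cols(wanted.contains(mt.s))
```

with the names from this thread, that would be called as `filtered_mt = keep_samples(mt, selectedIndividuals["ID"])`. for a very large list of IDs, the Table approach above is probably still the better choice.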
hope that helps!