Hello world,
I have just started getting my feet wet in Hail, and already have stumbled upon an issue I need help with.
I want to select a subset of samples from my Hail MatrixTable, from a list of sample IDs.
I tried some solutions mentioned in previous posts, but I keep getting an error.
I tried running:
filtered_mt = mt.filter_cols(hl.array(selectedIndividuals['ID']).contains(mt.s))
This gives me an error that says:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I’m sure I’m missing something quite simple here. Can anyone in the Hail community please help me out with this?
Thanks in advance,
Joy
hi joy,
i think i need a little more information to help out here. what does selectedIndividuals look like? feel free to replace the actual data with example data; i just need to see what the type and structure of it is.
–iris
Hi @iris-garden ,
Thanks for your reply. selectedIndividuals is a pandas DataFrame, with the sample identifiers in the “ID” column.
I want to subset the matrixTable mt with only the samples listed in the ‘ID’ column.
Thanks again for the help. Let me know if any other details might help.
Best,
Joydeep
okay, looks like the issue is that we’re not able to directly convert a pandas DataFrame column into a hail array. one option would be to convert the column to a Python list first:
filtered_mt = mt.filter_cols(
    hl.array(selectedIndividuals['ID'].tolist()).contains(mt.s)
)
but if your DataFrame has a lot of values, that might end up being slow, so you could instead convert the DataFrame to a hail Table and filter using that:
# convert the whole DataFrame, keyed by the ID column so we can join on it
# (the ID values need to be strings to match mt.s):
ht = hl.Table.from_pandas(selectedIndividuals, key="ID")
# or convert just the ID column, if you'd prefer:
ids = selectedIndividuals[["ID"]]
ht = hl.Table.from_pandas(ids, key="ID")
# filter using the Table we just created; ht[mt.s] is missing for samples not in ht
mt = mt.filter_cols(hl.is_defined(ht[mt.s]))
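one more thing that can bite with either approach: the IDs coming out of pandas need to be strings to match mt.s, and missing values or duplicates are best dropped first. here is a minimal sketch of that cleanup, not an official recipe. the helper names clean_sample_ids and filter_to_samples are made up for illustration (they are not part of Hail or pandas), and it assumes mt.s is the string column key, as in the snippets above. it uses a set literal rather than an array, which should make the membership check cheaper for long ID lists.

```python
import math
from typing import Iterable, List


def clean_sample_ids(raw_ids: Iterable[object]) -> List[str]:
    """Return unique, non-missing IDs as strings, keeping first-seen order."""
    seen = set()
    cleaned = []
    for value in raw_ids:
        # Skip missing values: None, or the float NaN that pandas uses.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        sample_id = str(value).strip()
        if sample_id and sample_id not in seen:
            seen.add(sample_id)
            cleaned.append(sample_id)
    return cleaned


def filter_to_samples(mt, sample_ids):
    """Keep only the columns of `mt` whose key `s` is in `sample_ids`.

    Hypothetical helper: assumes `mt` is a MatrixTable keyed by a string
    column field named `s`.
    """
    # Imported here so clean_sample_ids can be used without Hail installed.
    import hail as hl

    # An explicit dtype keeps this working even if the ID list is empty.
    wanted = hl.literal(set(sample_ids), dtype="set<str>")
    return mt.filter_cols(wanted.contains(mt.s))
```

with the DataFrame from this thread, that would be called as filtered_mt = filter_to_samples(mt, clean_sample_ids(selectedIndividuals['ID'].tolist())). note that the cleanup only turns values into strings as-is, so if pandas read numeric IDs as floats (e.g. 123.0), they would need fixing in the DataFrame first.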
hope that helps!