x=
[[(some text,[a]), (some text,[b]), (some text,[c]).........]]
[[(some text,[d]), (some text,[e]), (some text,[f]).........]]
[[(some text,[g]), (some text,[h]), (some text,[k]).........]]
[[(some text,[i]), (some text,[x]), (some text,[y]).........]]
[[(some text,[z]), (some text,[t]), (some text,[w]).........]]
[[(some text,[t]), (some text,[g]), (some text,[u]).........]]
type(x)
pandas.core.series.Series
I want to create a series that contains only the values of the lists inside the tuples, such as [a], [u], or [w].
How can I extract them? Thank you.
UPDATE: I realized the way I phrased the question was confusing, so I have changed it; it now represents my problem better. Basically, I need to extract all of [a], [u], [w], and so on, row by row. This is tokenized text data; the tokens are words in sentences. Sorry for the confusion.
Given a Series s:
s = pd.Series(x)
We can first take the first element out of each row (since each row is a nested list), explode it, and use the str accessor to get the second element of each tuple; then take the element out of each singleton list to get the raw data. Finally, groupby the index and join the tokens.
out = s.str[0].explode().str[1].str[0].groupby(level=0).apply(','.join)
Output:
0 a,b,c
1 d,e,f
2 g,h,k
3 i,x,y
4 z,t,w
5 t,g,u
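To make each link in that chain easier to follow, here is a minimal step-by-step sketch. The sample data is an assumption: it is a made-up, two-row Series that mirrors the shape shown in the question (each row is a list holding one inner list of (text, [token]) tuples), and the intermediate names `unwrapped`, `tuples`, `singletons`, and `tokens` are only illustrative.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's Series:
# each row is a list containing one inner list of (text, [token]) tuples.
s = pd.Series([
    [[("some text", ["a"]), ("some text", ["b"]), ("some text", ["c"])]],
    [[("some text", ["d"]), ("some text", ["e"]), ("some text", ["f"])]],
])

unwrapped = s.str[0]            # drop the outer list: each row is now a list of tuples
tuples = unwrapped.explode()    # one tuple per row; the original index is repeated
singletons = tuples.str[1]      # second element of each tuple, e.g. ['a']
tokens = singletons.str[0]      # unwrap the singleton list, e.g. 'a'

# Regroup by the original row index and join the tokens of each row.
out = tokens.groupby(level=0).apply(','.join)
print(out)

# If a list of tokens per row is preferable to a joined string:
as_lists = tokens.groupby(level=0).agg(list)
print(as_lists)
```

The `.str[...]` accessor is used here for positional indexing into lists and tuples rather than strings, which works because the Series has object dtype. Grouping with `level=0` relies on explode keeping the original row labels.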