python pandas iterable-unpacking

Extracting a list from inside a tuple inside a nested list in a pd.Series


x =
    [[(some text,[a]), (some text,[b]), (some text,[c]).........]]
    [[(some text,[d]), (some text,[e]), (some text,[f]).........]]
    [[(some text,[g]), (some text,[h]), (some text,[k]).........]]
    [[(some text,[i]), (some text,[x]), (some text,[y]).........]]
    [[(some text,[z]), (some text,[t]), (some text,[w]).........]]
    [[(some text,[t]), (some text,[g]), (some text,[u]).........]]

type(x)

pandas.core.series.Series

I want to create a series that contains only the values of the list inside each tuple, such as [a], [u], or [w].

How can I extract them? Thank you.

UPDATE: I realized the way I phrased the question was confusing, so I have changed it. It now represents my problem better. Basically, I need to extract all of the [a], [u], [w] values row by row. This is tokenized text data; the values are words in sentences. Sorry for the confusion.


Solution

  • Given Series s,

    s = pd.Series(x)
    

    we can first take the first element out of each row (since each row is a nested list), explode the result, and use the str accessor to get the second element of each tuple; then take the element out of each singleton list to get the raw data. Finally, group by the index and join the tokens.

    out = s.str[0].explode().str[1].str[0].groupby(level=0).apply(','.join)
    

    Output:

    0    a,b,c
    1    d,e,f
    2    g,h,k
    3    i,x,y
    4    z,t,w
    5    t,g,u
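
To make the chain easier to follow, here is a minimal sketch that runs it one step at a time. The sample data is hypothetical: the question elides the real text and tokens, so the strings below are placeholders, and the sketch assumes every inner list holds exactly one token, as in the question.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's x:
# each row is a list wrapping a list of (text, [token]) tuples.
x = [
    [[("some text", ["a"]), ("some text", ["b"]), ("some text", ["c"])]],
    [[("some text", ["d"]), ("some text", ["e"]), ("some text", ["f"])]],
]
s = pd.Series(x)

inner = s.str[0]           # unwrap the outer list: one list of tuples per row
tuples = inner.explode()   # one tuple per row; the original index is repeated
lists = tuples.str[1]      # second element of each tuple: the singleton list
tokens = lists.str[0]      # the single token inside each list

# Group by the original row index and join that row's tokens.
out = tokens.groupby(level=0).apply(",".join)
print(out)

# Alternative (not from the answer above): a plain comprehension that keeps
# each row's tokens as a list instead of joining them into one string.
per_row = s.apply(lambda row: [tok for _, toks in row[0] for tok in toks])
print(per_row)
```

The comprehension variant also handles inner lists with more than one token, whereas `.str[0]` keeps only the first token of each inner list.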