Create new column from existing one with more values in it


I have a column with the following values:

import pandas as pd

d = {'id': [1, 2, 3, 4, 5],
     'value': [['Red', 'Blue', 'Yellow'],
               ['Blue', 'Yellow', 'Orange'],
               ['Green', 'Purple', 'Yellow', 'Red'],
               ['Violet', 'Blue', 'Green', 'Red', 'Brown'],
               ['Blue', 'Green']]}

df = pd.DataFrame(data = d)

I want to break each value in the column (a list of strings) into consecutive pairs, forming a new column or list like this:

d = {'value': [['Red', 'Blue'],
               ['Blue', 'Yellow'],
               ['Blue', 'Yellow'],
               ['Yellow', 'Orange'],
               ['Green', 'Purple'],
               ['Purple', 'Yellow'],
               ['Yellow', 'Red'],
               ['Violet', 'Blue'],
               ['Blue', 'Green'],
               ['Green', 'Red'],
               ['Red', 'Brown'],
               ['Blue', 'Green']]}

df = pd.DataFrame(data = d)

I do the splitting with apply(lambda x: ...), but it returns only one pair of values per row.

def splitter(row):
    for first, second in zip(row, row[1:]):
        return [first, second]

pairs_list = df_gr.status.apply(lambda x: splitter(x))

I know it can be done with an iterrows() loop, but I'd like to know a more efficient method.


Solution

  • Use a list comprehension with a sliding-window function and pass the result to the DataFrame constructor:

    from itertools import islice
    
    #https://stackoverflow.com/a/6822773/2901002
    def window(seq, n=2):
        """Return a sliding window (of width n) over data from the iterable.

        s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...
        """
        it = iter(seq)
        result = tuple(islice(it, n))
        if len(result) == n:
            yield result
        for elem in it:
            result = result[1:] + (elem,)
            yield result
            
    df = pd.DataFrame({'new': [list(y) for x in df['value'] for y in window(x)]})
    print(df)
                     new
    0        [Red, Blue]
    1     [Blue, Yellow]
    2     [Blue, Yellow]
    3   [Yellow, Orange]
    4    [Green, Purple]
    5   [Purple, Yellow]
    6      [Yellow, Red]
    7     [Violet, Blue]
    8      [Blue, Green]
    9       [Green, Red]
    10      [Red, Brown]
    11     [Blue, Green]
    

    Or, more simply, adapt another solution that uses slicing (this works because the values are nested lists):

    window_size = 2
    
    #https://stackoverflow.com/a/6822773/2901002
    df = pd.DataFrame({'new': [x[i: i + window_size] for x in df['value'] 
                               for i in range(len(x) - window_size + 1)]})
    print(df)
                     new
    0        [Red, Blue]
    1     [Blue, Yellow]
    2     [Blue, Yellow]
    3   [Yellow, Orange]
    4    [Green, Purple]
    5   [Purple, Yellow]
    6      [Yellow, Red]
    7     [Violet, Blue]
    8      [Blue, Green]
    9       [Green, Red]
    10      [Red, Brown]
    11     [Blue, Green]
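
As a side note on why the `splitter` in the question returns only one pair: the `return` sits inside the `for` loop, so the function exits on the first iteration. Below is a minimal sketch of a corrected version combined with `apply` and `DataFrame.explode`. Keeping the `id` column next to each pair, and the column name `pair`, are assumptions for illustration and not part of the original question. The sketch starts from the original input frame, since the snippets above reassign `df`.

```python
import pandas as pd

d = {'id': [1, 2, 3, 4, 5],
     'value': [['Red', 'Blue', 'Yellow'],
               ['Blue', 'Yellow', 'Orange'],
               ['Green', 'Purple', 'Yellow', 'Red'],
               ['Violet', 'Blue', 'Green', 'Red', 'Brown'],
               ['Blue', 'Green']]}
df = pd.DataFrame(data=d)


def splitter(row):
    # Collect every consecutive pair in the list; a `return` placed
    # inside the loop would stop after the first pair.
    return [[first, second] for first, second in zip(row, row[1:])]


# 'pair' is a hypothetical column name chosen for this sketch.
# apply() yields one list of pairs per row; explode() then turns
# each pair into its own row while repeating the matching id.
pairs = (df.assign(pair=df['value'].apply(splitter))
           .explode('pair')
           .drop(columns='value')
           .reset_index(drop=True))
print(pairs)
```

This still calls a Python function once per row, so it is unlikely to be faster than the list comprehensions above. Its main use is when the pairs need to stay linked to their original `id`.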