python pandas data-cleaning

Turning sets of adjacent cells in the same row into a new row in Pandas


After scraping data from an unordered list on a web page, I ended up with rows like the sample data below. The first four columns look just like I want them to. But item no. 2 (and every following item) of the same category ends up in the same row as the first item.

import pandas as pd

data = [['cateogory1', "cat1_item1", "foo", "http://example.com", "cat1_item2", "foo", "http//def.com"], ['cateogory2', "cat2_item1", "foo", "http://uvw.com", "cat2_item2", "foo", "http//xyz.com"]]

df = pd.DataFrame(data, columns=['category', 'wpb_wrapper 2',
       'wpb_wrapper 3', 'wpb_wrapper href', 'wpb_wrapper 4', 'wpb_wrapper 5',
       'wpb_wrapper href 2'])
df


Instead, each set of three cells should be a new row, starting with the same category value.

data = [['category1', "cat1_item1", "foo", "http://example.com"], ["category1", "cat1_item2", "foo", "http//def.com"], ['category2', "cat2_item1", "foo", "http://uvw.com"], ["category2", "cat2_item2", "foo", "http//xyz.com"]]

df4 = pd.DataFrame(data, columns=['category', 'wpb_wrapper 2',
       'wpb_wrapper 3', 'wpb_wrapper href'])
df4


How can I split each row after every column whose name contains "href", and turn each following set of three cells into a new row while preserving the initial value for category?


Solution

  • You could try as follows:

    import pandas as pd
    import numpy as np
    
    data = [['cateogory1', "cat1_item1", "foo", "http://example.com", "cat1_item2", "foo", "http//def.com"], ['cateogory2', "cat2_item1", "foo", "http://uvw.com", "cat2_item2", "foo", "http//xyz.com"]]
    
    df = pd.DataFrame(data, columns=['category', 'wpb_wrapper 2',
           'wpb_wrapper 3', 'wpb_wrapper href', 'wpb_wrapper 4', 'wpb_wrapper 5',
           'wpb_wrapper href 2'])
    
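    # First item of each row: category plus columns 1-3.
    df1 = df.iloc[:, 0:4]
    # Second item of each row: category plus columns 4-6 (copied before renaming).
    df2 = df.iloc[:, np.r_[0, 4:7]].copy()
    df2.columns = df1.columns
    
    # Stack both blocks; a stable sort keeps the item order within each category.
    df_new = pd.concat([df1, df2], axis=0)
    df_new = df_new.sort_values('category', kind='stable').reset_index(drop=True)
    print(df_new)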
    
         category wpb_wrapper 2 wpb_wrapper 3    wpb_wrapper href
    0  cateogory1    cat1_item1           foo  http://example.com
    1  cateogory1    cat1_item2           foo       http//def.com
    2  cateogory2    cat2_item1           foo      http://uvw.com
    3  cateogory2    cat2_item2           foo       http//xyz.com
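
If the scraped rows can hold more than two items, slicing each block by hand gets tedious. Below is a minimal sketch of a more general variant, assuming every item occupies exactly three adjacent columns after `category`. The helper name `split_item_groups` and its parameters are hypothetical, not part of pandas. It reshapes the value columns with NumPy and repeats the category once per item.

```python
import numpy as np
import pandas as pd


def split_item_groups(df, key="category", group_size=3):
    """Turn each run of `group_size` columns after `key` into its own row.

    Hypothetical helper: assumes the non-key columns are laid out as
    consecutive, equally sized groups (item 1, item 2, ...).
    """
    value_cols = [c for c in df.columns if c != key]
    if len(value_cols) % group_size != 0:
        raise ValueError("value columns do not divide evenly into groups")
    n_groups = len(value_cols) // group_size

    # Row-major reshape: (n_rows, n_groups * group_size) -> (n_rows * n_groups, group_size).
    # Items from the same original row stay adjacent and in their original order.
    values = df[value_cols].to_numpy().reshape(-1, group_size)

    # Reuse the names of the first group for the output columns.
    out = pd.DataFrame(values, columns=value_cols[:group_size])
    # Repeat each category once per item so it lines up with the reshaped rows.
    out.insert(0, key, np.repeat(df[key].to_numpy(), n_groups))

    # Rows that were padded with missing values (categories with fewer items) are dropped.
    return out.dropna(subset=value_cols[:group_size], how="all").reset_index(drop=True)


data = [['cateogory1', "cat1_item1", "foo", "http://example.com", "cat1_item2", "foo", "http//def.com"],
        ['cateogory2', "cat2_item1", "foo", "http://uvw.com", "cat2_item2", "foo", "http//xyz.com"]]
df = pd.DataFrame(data, columns=['category', 'wpb_wrapper 2', 'wpb_wrapper 3', 'wpb_wrapper href',
                                 'wpb_wrapper 4', 'wpb_wrapper 5', 'wpb_wrapper href 2'])

result = split_item_groups(df)
print(result)
```

Because the reshape keeps the original row order, no sort is needed here, so categories stay in the order they were scraped.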