Search code examples
pythonpandasdataframemelt

Pair each item with the last non-null value in a row


I'm trying to make a function where I feed it a list of URLs which go through a 301 hop and it flattens it for me. I want to save the resulting list as a CSV so I can hand it to the developers who can implement it and get rid of 301 hops.

For example, my crawler will produce this list of 301 hops:

    URL1          | URL2              | URL3              | URL4
example.com/url1  | example.com/url2  |                   | 
example.com/url3  | example.com/url4  | example.com/url5  | 
example.com/url6  | example.com/url7  | example.com/url8  | example.com/10
example.com/url9  | example.com/url7  | example.com/url8  | 
example.com/url23 | example.com/url10 |                   | 
example.com/url24 | example.com/url45 | example.com/url46 | 
example.com/url25 | example.com/url45 | example.com/url46 | 
example.com/url26 | example.com/url45 | example.com/url46 | 
example.com/url27 | example.com/url45 | example.com/url46 | 
example.com/url28 | example.com/url45 | example.com/url46 | 
example.com/url29 | example.com/url45 | example.com/url46 | 
example.com/url30 | example.com/url45 | example.com/url46 | 

The output I'm trying to get is

URL1              | URL2 
example.com/url1  | example.com/url2
example.com/url3  | example.com/url5
example.com/url4  | example.com/url5
example.com/url6  | example.com/10
example.com/url7  | example.com/10
example.com/url8  | example.com/10
example.com/url23 | example.com/url10
...

I've converted the Pandas dataframe to a list of lists using the below code:

import pandas as pd
import numpy as np

csv1 = pd.read_csv('Example_301_sheet.csv', header=None)
outlist = []

def link_flat(csv):

    for row in csv.iterrows():
        index, data = row
        outlist.append(data.tolist())

    return outlist

This returns each row as a list, and they are all nested together in a list, like below:

[['example.com/url1', 'example.com/url2', nan, nan],
 ['example.com/url3', 'example.com/url4', 'example.com/url5', nan],
 ['example.com/url6',
  'example.com/url7',
  'example.com/url8',
  'example.com/10'],
 ['example.com/url9', 'example.com/url7', 'example.com/url8', nan],
 ['example.com/url23', 'example.com/url10', nan, nan],
 ['example.com/url24', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url25', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url26', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url27', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url28', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url29', 'example.com/url45', 'example.com/url46', nan],
 ['example.com/url30', 'example.com/url45', 'example.com/url46', nan]]

How do I match each URL in each nested list with the last URL in the same list to produce the above list?


Solution

  • You'll need to determine the last valid item per row using groupby + last, and then reshape your dataFrame and build a two-column mapping using melt.

    df.columns = range(len(df.columns))
    df = (
        df.assign(URL2=df.stack().groupby(level=0).last())
          .melt('URL2', value_name='URL1')  
          .drop('variable', 1)
          .dropna()
          .drop_duplicates()
          .query('URL1 != URL2')
          .sort_index(axis=1)
          .reset_index(drop=True)
    )
    

    df
                     URL1               URL2
    0    example.com/url1   example.com/url2
    1    example.com/url3   example.com/url5
    2    example.com/url6     example.com/10
    3    example.com/url9   example.com/url8
    4   example.com/url23  example.com/url10
    5   example.com/url24  example.com/url46
    6   example.com/url25  example.com/url46
    7   example.com/url26  example.com/url46
    8   example.com/url27  example.com/url46
    9   example.com/url28  example.com/url46
    10  example.com/url29  example.com/url46
    11  example.com/url30  example.com/url46
    12   example.com/url4   example.com/url5
    13   example.com/url7     example.com/10
    14   example.com/url7   example.com/url8
    15  example.com/url45  example.com/url46
    16   example.com/url8     example.com/10