I'm trying to write a function that takes a list of URLs that go through 301 hops and flattens the chains for me. I want to save the resulting list as a CSV so I can hand it to the developers, who can implement the direct redirects and get rid of the 301 hops.
For example, my crawler will produce this list of 301 hops:
URL1 | URL2 | URL3 | URL4
example.com/url1 | example.com/url2 | |
example.com/url3 | example.com/url4 | example.com/url5 |
example.com/url6 | example.com/url7 | example.com/url8 | example.com/10
example.com/url9 | example.com/url7 | example.com/url8 |
example.com/url23 | example.com/url10 | |
example.com/url24 | example.com/url45 | example.com/url46 |
example.com/url25 | example.com/url45 | example.com/url46 |
example.com/url26 | example.com/url45 | example.com/url46 |
example.com/url27 | example.com/url45 | example.com/url46 |
example.com/url28 | example.com/url45 | example.com/url46 |
example.com/url29 | example.com/url45 | example.com/url46 |
example.com/url30 | example.com/url45 | example.com/url46 |
The output I'm trying to get is:
URL1 | URL2
example.com/url1 | example.com/url2
example.com/url3 | example.com/url5
example.com/url4 | example.com/url5
example.com/url6 | example.com/10
example.com/url7 | example.com/10
example.com/url8 | example.com/10
example.com/url23 | example.com/url10
...
I've converted the pandas DataFrame to a list of lists using the code below:
import pandas as pd
import numpy as np

csv1 = pd.read_csv('Example_301_sheet.csv', header=None)

def link_flat(csv):
    # collect each row of the DataFrame as a plain Python list
    outlist = []
    for row in csv.iterrows():
        index, data = row
        outlist.append(data.tolist())
    return outlist

outlist = link_flat(csv1)
This returns each row as a list, all nested together in one outer list, like below:
[['example.com/url1', 'example.com/url2', nan, nan],
['example.com/url3', 'example.com/url4', 'example.com/url5', nan],
['example.com/url6',
'example.com/url7',
'example.com/url8',
'example.com/10'],
['example.com/url9', 'example.com/url7', 'example.com/url8', nan],
['example.com/url23', 'example.com/url10', nan, nan],
['example.com/url24', 'example.com/url45', 'example.com/url46', nan],
['example.com/url25', 'example.com/url45', 'example.com/url46', nan],
['example.com/url26', 'example.com/url45', 'example.com/url46', nan],
['example.com/url27', 'example.com/url45', 'example.com/url46', nan],
['example.com/url28', 'example.com/url45', 'example.com/url46', nan],
['example.com/url29', 'example.com/url45', 'example.com/url46', nan],
['example.com/url30', 'example.com/url45', 'example.com/url46', nan]]
How do I match each URL in each nested list with the last URL in the same list to produce the output above?
You'll need to determine the last valid item per row using groupby + last, and then reshape your DataFrame into a two-column mapping using melt.
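For reference, that first step on its own gives you the final destination per row as a Series indexed by the original row number; on your sample data it looks roughly like this:

df.stack().groupby(level=0).last()

0     example.com/url2
1     example.com/url5
2       example.com/10
3     example.com/url8
4    example.com/url10
5    example.com/url46
...
dtype: object

Putting it all together: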
df.columns = range(len(df.columns))  # positional column names (df is the frame read with header=None)
df = (
    df.assign(URL2=df.stack().groupby(level=0).last())  # last non-null URL per row = final destination
      .melt('URL2', value_name='URL1')                  # unpivot every column into a single URL1 column
      .drop(columns='variable')                         # drop melt's helper column
      .dropna()                                         # remove the NaN padding
      .drop_duplicates()                                # repeated chains collapse to one mapping
      .query('URL1 != URL2')                            # a URL doesn't need to redirect to itself
      .sort_index(axis=1)                               # order the columns URL1, URL2
      .reset_index(drop=True)
)
df
URL1 URL2
0 example.com/url1 example.com/url2
1 example.com/url3 example.com/url5
2 example.com/url6 example.com/10
3 example.com/url9 example.com/url8
4 example.com/url23 example.com/url10
5 example.com/url24 example.com/url46
6 example.com/url25 example.com/url46
7 example.com/url26 example.com/url46
8 example.com/url27 example.com/url46
9 example.com/url28 example.com/url46
10 example.com/url29 example.com/url46
11 example.com/url30 example.com/url46
12 example.com/url4 example.com/url5
13 example.com/url7 example.com/10
14 example.com/url7 example.com/url8
15 example.com/url45 example.com/url46
16 example.com/url8 example.com/10
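Since the end goal is a CSV you can hand to the developers, you can then write the mapping straight out with to_csv; the filename below is just a placeholder:

df.to_csv('flattened_301s.csv', index=False)

index=False keeps the row numbers out of the file, so the developers get only the URL1 and URL2 columns.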