I used pandas to clean a csv:
import pandas as pd
import numpy as np
df = pd.read_csv(r'C:\Users\Leo90\Downloads\data-export.csv',encoding='utf-8', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
df=df.iloc[:,[0,1,2,3,4,5,6,7]]
df=df.replace(to_replace='None',value=np.nan).dropna()
df=df.reset_index(drop=True)
columnNames = df.iloc[0]
df = df[1:]
df.columns = columnNames
df.groupby('path').head()
The processed data like the screenshot below
I want to use python to make this dataframe look like this
I know I could use str.contain
to match these strings but it can only return bool, so I can't sum the A&B columns. Are there are any solutions for this problem?
I tried str.contain
to match these strings but I can't sum A&B.
Use groupby.sum
on the starting string:
If the starting string is fixed, use .str[:n]
to slice the first n
chars
df = pd.DataFrame({'path': ['google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
# path A B
# 0 google.com/A/123 1 2
# 1 google.com/A/124 1 2
# 2 google.com/A/125 1 2
# 3 google.com/B/3333 1 2
# 4 google.com/C/11111111 1 2
# 5 google.com/C/11111113 1 2
start = df['path'].str[:12] # first 12 chars of df['path']
out = df.groupby(start).sum()
# A B
# path
# google.com/A 3 6
# google.com/B 1 2
# google.com/C 2 4
If the starting string is dynamic, use .str.extract()
to capture the desired pattern (e.g., up to the 2nd slash)
df = pd.DataFrame({'path': ['A.com/A', 'A.com/A/B/C', 'google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
# path A B
# 0 A.com/A 1 2
# 1 A.com/A/B/C 1 2
# 2 google.com/A/123 1 2
# 3 google.com/A/124 1 2
# 4 google.com/A/125 1 2
# 5 google.com/B/3333 1 2
# 6 google.com/C/11111111 1 2
# 7 google.com/C/11111113 1 2
start = df['path'].str.extract(r'^([^/]+/[^/]+)', expand=False) # up to 2nd slash of df['path']
out = df.groupby(start).sum()
# A B
# path
# A.com/A 2 4
# google.com/A 3 6
# google.com/B 1 2
# google.com/C 2 4
Breakdown of ^([^/]+/[^/]+)
regex:
regex | meaning |
---|---|
^ |
beginning of string |
( |
start capturing |
[^/]+ |
non-slash characters |
/ |
slash character |
[^/]+ |
non-slash characters |
) |
stop capturing |