python pandas dataframe numpy data-analysis

How to sum rows that start with the same string

I used pandas to clean a csv:

import pandas as pd 
import numpy as np
df = pd.read_csv(r'C:\Users\Leo90\Downloads\data-export.csv',encoding='utf-8', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
df=df.iloc[:,[0,1,2,3,4,5,6,7]]
df=df.replace(to_replace='None',value=np.nan).dropna()
df=df.reset_index(drop=True)
columnNames = df.iloc[0] 
df = df[1:] 
df.columns = columnNames
df.groupby('path').head()

The processed data like the screenshot below

Original data

I want to use python to make this dataframe look like this

enter image description here

I know I could use str.contain to match these strings but it can only return bool, so I can't sum the A&B columns. Are there are any solutions for this problem?

I tried str.contain to match these strings but I can't sum A&B.

Solution

Use groupby.sum on the starting string:

If the starting string is fixed, use .str[:n] to slice the first n chars

df = pd.DataFrame({'path': ['google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
#                     path  A  B
# 0       google.com/A/123  1  2
# 1       google.com/A/124  1  2
# 2       google.com/A/125  1  2
# 3      google.com/B/3333  1  2
# 4  google.com/C/11111111  1  2
# 5  google.com/C/11111113  1  2

start = df['path'].str[:12]  # first 12 chars of df['path']
out = df.groupby(start).sum()
#               A  B
# path              
# google.com/A  3  6
# google.com/B  1  2
# google.com/C  2  4

If the starting string is dynamic, use .str.extract() to capture the desired pattern (e.g., up to the 2nd slash)

df = pd.DataFrame({'path': ['A.com/A', 'A.com/A/B/C', 'google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
#                     path  A  B
# 0                A.com/A  1  2
# 1            A.com/A/B/C  1  2
# 2       google.com/A/123  1  2
# 3       google.com/A/124  1  2
# 4       google.com/A/125  1  2
# 5      google.com/B/3333  1  2
# 6  google.com/C/11111111  1  2
# 7  google.com/C/11111113  1  2

start = df['path'].str.extract(r'^([^/]+/[^/]+)', expand=False)  # up to 2nd slash of df['path']
out = df.groupby(start).sum()
#               A  B
# path              
# A.com/A       2  4
# google.com/A  3  6
# google.com/B  1  2
# google.com/C  2  4

Breakdown of ^([^/]+/[^/]+) regex:

regex	meaning
`^`	beginning of string
`(`	start capturing
`[^/]+`	non-slash characters
`/`	slash character
`[^/]+`	non-slash characters
`)`	stop capturing