Search code examples
pythonpandasdataframenumpydata-analysis

How to sum rows that start with the same string


I used pandas to clean a csv:

import pandas as pd 
import numpy as np
df = pd.read_csv(r'C:\Users\Leo90\Downloads\data-export.csv',encoding='utf-8', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
df=df.iloc[:,[0,1,2,3,4,5,6,7]]
df=df.replace(to_replace='None',value=np.nan).dropna()
df=df.reset_index(drop=True)
columnNames = df.iloc[0] 
df = df[1:] 
df.columns = columnNames
df.groupby('path').head()

The processed data like the screenshot below

Original data

I want to use python to make this dataframe look like this

enter image description here

I know I could use str.contain to match these strings but it can only return bool, so I can't sum the A&B columns. Are there are any solutions for this problem?

I tried str.contain to match these strings but I can't sum A&B.


Solution

  • Use groupby.sum on the starting string:

    • If the starting string is fixed, use .str[:n] to slice the first n chars

      df = pd.DataFrame({'path': ['google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
      #                     path  A  B
      # 0       google.com/A/123  1  2
      # 1       google.com/A/124  1  2
      # 2       google.com/A/125  1  2
      # 3      google.com/B/3333  1  2
      # 4  google.com/C/11111111  1  2
      # 5  google.com/C/11111113  1  2
      
      start = df['path'].str[:12]  # first 12 chars of df['path']
      out = df.groupby(start).sum()
      #               A  B
      # path              
      # google.com/A  3  6
      # google.com/B  1  2
      # google.com/C  2  4
      
    • If the starting string is dynamic, use .str.extract() to capture the desired pattern (e.g., up to the 2nd slash)

      df = pd.DataFrame({'path': ['A.com/A', 'A.com/A/B/C', 'google.com/A/123', 'google.com/A/124', 'google.com/A/125', 'google.com/B/3333', 'google.com/C/11111111', 'google.com/C/11111113'], 'A': 1, 'B': 2})
      #                     path  A  B
      # 0                A.com/A  1  2
      # 1            A.com/A/B/C  1  2
      # 2       google.com/A/123  1  2
      # 3       google.com/A/124  1  2
      # 4       google.com/A/125  1  2
      # 5      google.com/B/3333  1  2
      # 6  google.com/C/11111111  1  2
      # 7  google.com/C/11111113  1  2
      
      start = df['path'].str.extract(r'^([^/]+/[^/]+)', expand=False)  # up to 2nd slash of df['path']
      out = df.groupby(start).sum()
      #               A  B
      # path              
      # A.com/A       2  4
      # google.com/A  3  6
      # google.com/B  1  2
      # google.com/C  2  4
      

      Breakdown of ^([^/]+/[^/]+) regex:

      regex meaning
      ^ beginning of string
      ( start capturing
      [^/]+ non-slash characters
      / slash character
      [^/]+ non-slash characters
      ) stop capturing