Search code examples
pythonrpandasdataframeduplicates

Read .csv and combine duplicate columns - pandas


I have a .csv file with 893 columns that I need to read into pandas (or R) to analyze. When the spreadsheet is generated it creates duplicate columns that need to be combined into one.

The problem I am having is that when I read the .csv into pandas or R to create a dataframe it automatically assigns a number to each additional duplicate column meaning they can't be grouped easily.

The original data is formatted like this:


****** PYTHON ******

!!!EXAMPLE!!!

import pandas as pd 

d = {'Name':["Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
                                "Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"],
                       "Dates":["2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
                                 "2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
                                 "2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
                                 "2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"],
                       "Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
                       "Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
                       "Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}

d = pd.DataFrame(d)

d

****** R ******

df_date <- data.frame( Name = c("Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
                                "Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"),
                       Dates = c("2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
                                 "2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
                                 "2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
                                 "2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"),
                       Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
                       Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
                       Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))

FYI - this is just an example. The real data will be read in from a .csv file using df = pd.read_csv("dummy.csv") or equivalent.

Is there any method to either;

  1. Read in a .csv and maintain the duplicate column names so they can be grouped and analyzed

or

  1. Group the duplicate column names at the point of reading the .csv in so they can be analyzed as one column

N.B: Interestingly I noticed when making the example that it won't even allow me to create a dataframe with columns the same name.


Solution

  • The problem is that you are creating a dict where the keys are not unique, so this can't be created in the form you want it to be (the values are just overwritten with the last one). The dict itself is then correctly handed to pandas and used to create the DataFrame.

    You could use for example a different method of adding the extra columns where you can explicitly allow duplicates.

    import pandas as pd
    
    d = {'Name': ["Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim",
                  "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue"],
         "Dates": ["2010-1-1", "2010-1-2", "2010-01-5", "2010-01-17", "2010-01-20",
                   "2010-01-29", "2010-02-6", "2010-02-9", "2010-02-16", "2010-02-28",
                   "2010-1-1", "2010-1-2", "2010-01-5", "2010-01-17", "2010-01-20",
                   "2010-01-29", "2010-02-6", "2010-02-9", "2010-02-16", "2010-02-28"],
         "Event": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
    
    d = pd.DataFrame(d)
    d.insert(len(d.columns), "Event", [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], allow_duplicates=True)
    d.insert(len(d.columns), "Event", [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], allow_duplicates=True)
    

    Which gives you:

    >    Name       Dates  Event  Event  Event
      0   Jim    2010-1-1      1      1      1
      1   Jim    2010-1-2      1      1      1
      2   Jim   2010-01-5      1      1      1
      3   Jim  2010-01-17      1      1      1
      4   Jim  2010-01-20      1      1      1
      5   Jim  2010-01-29      1      1      1
      6   Jim   2010-02-6      1      1      1
      7   Jim   2010-02-9      1      1      1
      8   Jim  2010-02-16      1      1      1
      9   Jim  2010-02-28      1      1      1
      10  Sue    2010-1-1      1      1      1
      11  Sue    2010-1-2      1      1      1
      12  Sue   2010-01-5      1      1      1
      13  Sue  2010-01-17      1      1      1
      14  Sue  2010-01-20      1      1      1
      15  Sue  2010-01-29      1      1      1
      16  Sue   2010-02-6      1      1      1
      17  Sue   2010-02-9      1      1      1
      18  Sue  2010-02-16      1      1      1
      19  Sue  2010-02-28      1      1      1
    

    Edit:

    If the columns exist but they just have an unwanted numbering [.1, .2, ...] you can use the re package:

    import pandas as pd
    import re
    
    
    df = pd.read_csv("example.csv", sep=";")
    >    a  b  b.1  b.2
      0  a  1    5    7
      1  b  2    5    7
      2  c  3    6    7
      3  d  4    7    7
      4  e  5    8    7
    
    df.columns = [re.sub("(.*?)(\.\d+)", "\\1", c) for c in df.columns]
    >    a  b  b  b
      0  a  1  5  7
      1  b  2  5  7
      2  c  3  6  7
      3  d  4  7  7
      4  e  5  8  7