I have a .csv file with 893 columns that I need to read into pandas (or R) for analysis. When the spreadsheet is generated it creates duplicate columns that need to be combined into one.
The problem I am having is that when I read the .csv into pandas or R to create a dataframe, each additional duplicate column automatically gets a numeric suffix, which means the duplicates can't be grouped easily.
The original data is formatted like this:
****** PYTHON ******
import pandas as pd
d = {'Name':["Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
"Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"],
"Dates":["2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"],
"Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
"Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
"Event" : [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]}
d = pd.DataFrame(d)
d
****** R ******
df_date <- data.frame( Name = c("Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim","Jim",
"Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue","Sue"),
Dates = c("2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5","2010-01-17","2010-01-20",
"2010-01-29","2010-02-6","2010-02-9","2010-02-16","2010-02-28"),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
Event = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1))
FYI - this is just an example. The real data will be read in from a .csv file using df = pd.read_csv("dummy.csv")
or equivalent.
Is there any method to either combine the duplicate columns as the file is read in, or strip the automatic numbering afterwards so the duplicates can be grouped?
N.B.: Interestingly, while making the example I noticed that I can't even create a dataframe with duplicate column names this way.
The problem is that you are building a dict, and dict keys must be unique, so it can't be created in the form you want: the duplicate "Event" keys simply overwrite each other and only the last value survives. The resulting dict, with a single "Event" key, is then correctly handed to pandas and used to create the DataFrame.
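A minimal sketch of the key collision (hypothetical values, just to show the behaviour):

```python
# In a dict literal, later duplicate keys silently replace earlier ones.
d = {"Event": [1, 1], "Event": [2, 2], "Event": [3, 3]}
print(d)       # {'Event': [3, 3]}
print(len(d))  # 1
```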
You can instead add the extra columns with a method that explicitly allows duplicates, for example DataFrame.insert with allow_duplicates=True:
import pandas as pd
d = {'Name': ["Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim",
"Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue", "Sue"],
"Dates": ["2010-1-1", "2010-1-2", "2010-01-5", "2010-01-17", "2010-01-20",
"2010-01-29", "2010-02-6", "2010-02-9", "2010-02-16", "2010-02-28",
"2010-1-1", "2010-1-2", "2010-01-5", "2010-01-17", "2010-01-20",
"2010-01-29", "2010-02-6", "2010-02-9", "2010-02-16", "2010-02-28"],
"Event": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
d = pd.DataFrame(d)
d.insert(len(d.columns), "Event", [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], allow_duplicates=True)
d.insert(len(d.columns), "Event", [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], allow_duplicates=True)
Which gives you:
> Name Dates Event Event Event
0 Jim 2010-1-1 1 1 1
1 Jim 2010-1-2 1 1 1
2 Jim 2010-01-5 1 1 1
3 Jim 2010-01-17 1 1 1
4 Jim 2010-01-20 1 1 1
5 Jim 2010-01-29 1 1 1
6 Jim 2010-02-6 1 1 1
7 Jim 2010-02-9 1 1 1
8 Jim 2010-02-16 1 1 1
9 Jim 2010-02-28 1 1 1
10 Sue 2010-1-1 1 1 1
11 Sue 2010-1-2 1 1 1
12 Sue 2010-01-5 1 1 1
13 Sue 2010-01-17 1 1 1
14 Sue 2010-01-20 1 1 1
15 Sue 2010-01-29 1 1 1
16 Sue 2010-02-6 1 1 1
17 Sue 2010-02-9 1 1 1
18 Sue 2010-02-16 1 1 1
19 Sue 2010-02-28 1 1 1
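As a side note (a sketch with made-up values, not the question's data): pd.concat along axis=1 also preserves duplicate names, since each Series contributes its own name, so it is another way to build such a frame:

```python
import pandas as pd

name = pd.Series(["Jim", "Sue"], name="Name")
e1 = pd.Series([1, 1], name="Event")
e2 = pd.Series([1, 1], name="Event")

# concat keeps every Series' name, so the duplicate "Event" labels survive
df = pd.concat([name, e1, e2], axis=1)
print(list(df.columns))  # ['Name', 'Event', 'Event']
```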
If the columns already exist but just carry an unwanted numbering (.1, .2, ...), you can strip it with the re module:
import pandas as pd
import re
df = pd.read_csv("example.csv", sep=";")
> a b b.1 b.2
0 a 1 5 7
1 b 2 5 7
2 c 3 6 7
3 d 4 7 7
4 e 5 8 7
df.columns = [re.sub(r"\.\d+$", "", c) for c in df.columns]  # strip the trailing .1, .2, ...
> a b b b
0 a 1 5 7
1 b 2 5 7
2 c 3 6 7
3 d 4 7 7
4 e 5 8 7
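Once the suffixes are stripped and the names are duplicates again, the duplicate columns can be collapsed into one. A sketch, assuming the duplicates should be summed (the string column "a" is left out here, and the transpose trick sidesteps the axis=1 groupby that newer pandas versions removed):

```python
import pandas as pd

df = pd.DataFrame([[1, 5, 7], [2, 5, 7], [3, 6, 7]], columns=["b", "b", "b"])

# Transpose, group the now-identical labels, sum each group, transpose back.
combined = df.T.groupby(level=0).sum().T
print(combined["b"].tolist())  # [13, 14, 16]
```

Swap .sum() for .mean() (or another aggregation) if that is how the duplicates should be combined.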