Search code examples
pythonpandasglob

Reindex from duplicate axis error while using glob.glob


When I try to run this code:

import pandas as pd
import glob
files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("path/*.xlsx")]
df = files[0].append([files[i] for i in range(1,len(files))])       
df.loc[(df[df.columns[6]] == 2002040041),'New Column'] = df[df.columns[2]]

I get the following error:

ValueError Traceback (most recent call last) in

---> 5 df.loc[(df[df.columns[6]] == 2002040041),'New Column'] = df[df.columns[2]]

ValueError: cannot reindex from a duplicate axis

However, when I write a .csv, and then read it, the problem is solved.

import pandas as pd
import glob
files = [pd.read_excel(p, skipfooter=1) for p in glob.glob("path/*.xlsx")]
df = files[0].append([files[i] for i in range(1,len(files))])
df.to_csv("file.csv", encoding='utf-8-sig')
df = pd.read_csv("file.csv") 
df.loc[(df[df.columns[6]] == 2002040041),'New Column'] = df[df.columns[2]]

If I use df.to_excel instead, the error persists.

Why is this error happening? How can I solve it without writing a .csv file? All column names are the same in every file and different from each other.


Solution

  • You need default index by reset_index(drop=True):

    df = df.reset_index(drop=True)
    

    Or by parameter ignore_index=True in DataFrame.append:

    df = files[0].append([files[i] for i in range(1,len(files))], ignore_index=True)  
    

    Then code should be simplify:

    df.loc[df.iloc[:, 6] == 2002040041, 'New Column'] = df.iloc[:, 2]