Search code examples
pythonpandasnumpy

Replace affects stuff it shouldn't


I read in an excel file into a pandas dataframe. The first row in the excel file is a row of headers. I read the row of headers into a numpy array. I then use the replace function to modify the text in the headers in the array. For some reason, this also changes the actual dataframe it seems like.

I actually wrote the initial code in Python 3.7, where it works just fine. But the exact same code does not work in Python 3.13.1 or Python 3.12.4.

Below is an example of what I do (I tested the script and it runs in all three python versions but it only produces the expected result in 3.7):

import pandas as pd
import numpy as np

df = pd.read_excel("path to excelfile")

# Based on header, I read in columns from the dataframe and do stuff with them.
header = df.keys()[1:] # Yes, I want to skip the first cell in the header.

# I use header_legend to make a nice legend in my plots.
header_legend = np.array(df.keys()[1:])

for i in range(0, len(header_legend)):
    header_legend[i] = header_legend[i].replace("word1_word2", "word1 word2")

print(header)
print(header_legend)

Example CSV file (I guess just convert it to an excel file using excel, I use xlsx-format):

word_skip,word1_word2,word3_word4,word5_word6
1,10,100,1000
2,20,200,2000
3,30,300,3000
4,40,400,4000
5,50,500,5000

Now, in Python 3.7, header remained unmodified and df.keys()[1:] remained unmodified. Only header_legend is modified by the replace function. But in Python 3.12 and 3.13, header and df.keys()[1:] are also modified, although I only apply the replace function to header_legend.

Why does applying replace to header_legend also change header and df.keys()[1:] in Python 3.12 and 3.13?


Solution

  • You seem to be relying on a view of the Index, this is generally a bad practice and prone to hidden issues like yours.

    You can be explicit, modify your names and reassign them to the DataFrame:

    header = list(df)
    header[1:] = [c.replace('word1_word2', 'word1 word2') for c in header[1:]]
    df.columns = header
    

    Or use rename with a dictionary (if the column names are unique):

    df.rename(columns={c: c.replace('word1_word2', 'word1 word2')
                       for c in df.columns[1:]}, inplace=True)
    

    Output:

       word_skip  word1 word2  word3_word4  word5_word6
    0          1           10          100         1000
    1          2           20          200         2000
    2          3           30          300         3000
    3          4           40          400         4000
    4          5           50          500         5000