python pandas loops iteration train-test-split

Why is iterating through my pandas data changing the values?


I am trying to build a function that iterates through a column of a pandas DataFrame and changes its values of "yes", "maybe", or "no" to 1, 0, and -1 respectively. I've done this before using the exact same process, but this time it raises a KeyError. When it wasn't working, I simplified the function to check whether the iterator was working properly and found that the iterator seems to be changing my data. Using the code below:

def testing(data):
    print(data)
    for i in range(len(data)):
        print(data[i])

testing(train_x['Values'])

The function prints the following and then raises 'KeyError: 7':

137       no
84        no
27       yes
127    maybe
132       no
       ...  
9         no
103      yes
67        no
117    maybe
47        no
Name: Value, Length: 120, dtype: object
yes
no
no
no
no
no
no

Does anyone know why this is occurring? Does it have something to do with the values being shuffled by train_test_split? The last time I did this, I did it before the train_test_split and it worked perfectly fine, but since then I've realized that data preprocessing is more effective if done after the split in order to prevent data leakage. If the split is the problem, is there a way to solve this issue using a different iterator?


Solution

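  • train_test_split shuffles the rows but keeps their original index labels, so the index of train_x is no longer 0, 1, 2, and so on. On a Series with an integer index, data[i] looks up the row labelled i, not the row at position i. That is why the printed values do not match the order shown above: the data is not being changed, you are just reading rows by label. Labels 0 through 6 happen to be in the training set, so those lookups succeed; label 7 is not, which raises KeyError: 7. You might want to try this: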

    Replace:

    def testing(data):
        print(data)
        for i in range(len(data)):
            print(data[i])
    

    With:

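    def testing(data):
        # data.index holds labels, so look each row up by label with .loc
        # (.iloc expects positions 0..len(data)-1 and would fail or return the wrong row)
        for i in data.index:
            print(data.loc[i])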
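
To make the label-versus-position distinction concrete, here is a minimal sketch. The small Series is a hypothetical stand-in for train_x['Values'] after a shuffled split, and the names mapping and encoded are illustrative, not from the question. It also shows one possible way to do the original yes/maybe/no recoding without an explicit loop, using Series.map, which sidesteps the indexing problem entirely.

```python
import pandas as pd

# Hypothetical stand-in for train_x['Values'] after a shuffled split:
# the index labels are no longer 0, 1, 2, ...
data = pd.Series(["no", "yes", "maybe", "no"], index=[137, 0, 27, 2], name="Values")

# Label-based lookup: returns the row whose label is 0, not the first row.
by_label = data.loc[0]

# Position-based lookup: returns the first row, whatever its label is.
by_position = data.iloc[0]

# A label that is missing from this split raises KeyError,
# which is what happened with label 7 in the question.
missing_label_raised = False
try:
    data.loc[1]
except KeyError:
    missing_label_raised = True

# Iterating over the values directly needs no indexing at all.
values_in_order = list(data)

# Recode the answers in one step; the index labels are preserved.
mapping = {"yes": 1, "maybe": 0, "no": -1}
encoded = data.map(mapping)
print(encoded)
```

Because map keeps the original index, the encoded column can be assigned straight back to the DataFrame it came from. Any value that is not a key in mapping would become NaN, so it is worth checking for unexpected categories first.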