Search code examples
pythonpandasscipydata-cleaning

How can I transform a dataframe in pandas without losing my index?


I need to winsorize two columns in my dataframe of 12 columns.

Say, I have columns 'A', 'B', 'C', and 'D', each with a series of values. Given that I cleaned some NaN columns, the number of columns was reduced from 100 to 80, but they are still indexed to 100 with gaps (e.g. row 5 is missing).

I want to transform only columns 'A' and 'B' via winsorize method. To do this, I must convert my columns to a np.array.

import scipy.stats
df['A','B','C','D'] = #some values per each column
ab_df = df['A','B']
X = scipy.stats.mstats.winsorize(ab_df.values, limits=0.01)
new_ab_df = pd.DataFrame(X, columns = ['A','B'])
df = pd.concat([df['C','D'], new_ab_df], axis=1, join='inner', join_axes=[df.index])

When I convert to a np.array, then back to a pd.DataFrame, it's len() is correct at 80 but my indexes have been reset to be 0->80. How can I ensure that my transform 'A' and 'B' columns are indexed correctly? I don't think I can use the apply(), which would preserve index order and simply swap out the values instead of my approach, which creates a transformed copy of my df with only 2 columns, then concats them to the rest of my non-transformed columns.


Solution

  • You can do this inplace to the original dataframe.

    From the description of your question, it sounds like you are confusing rows and columns (i.e. you first say your dataframe has 12 columns, and then say the number of columns was reduced from 100 to 80).

    It is always best to provide a minimal example of data in your question. Lacking this, here is some data based on my assumptions:

    import numpy as np
    import scipy.stats
    import pandas as pd
    
    np.random.seed(0)
    df = pd.DataFrame(np.random.randn(7, 5), columns=list('ABCDE'))
    df.iat[1, 0] = np.nan
    df.iat[3, 1] = np.nan
    df.iat[5, 2] = np.nan
    
    >>> df
              A         B         C         D         E
    0  1.764052  0.400157  0.978738  2.240893  1.867558
    1       NaN  0.950088 -0.151357 -0.103219  0.410599
    2  0.144044  1.454274  0.761038  0.121675  0.443863
    3  0.333674       NaN -0.205158  0.313068 -0.854096
    4 -2.552990  0.653619  0.864436 -0.742165  2.269755
    5 -1.454366  0.045759       NaN  1.532779  1.469359
    6  0.154947  0.378163 -0.887786 -1.980796 -0.347912
    

    My assumption is to drop any row with a NaN, and then winsorize.

    mask = df.notnull().all(axis=1), ['A', 'B']
    df.loc[mask] = scipy.stats.mstats.winsorize(df.loc[mask].values, limits=0.4)
    

    I applied a high limit to the winsorize function so that the results are more obvious on this small dataset.

    >>> df
              A         B         C         D         E
    0  0.400157  0.400157  0.978738  2.240893  1.867558
    1       NaN  0.950088 -0.151357 -0.103219  0.410599
    2  0.378163  0.400157  0.761038  0.121675  0.443863
    3  0.333674       NaN -0.205158  0.313068 -0.854096
    4  0.378163  0.400157  0.864436 -0.742165  2.269755
    5 -1.454366  0.045759       NaN  1.532779  1.469359
    6  0.378163  0.378163 -0.887786 -1.980796 -0.347912