Tags: python, dataframe, calculated-columns

PerformanceWarning: DataFrame is highly fragmented when creating new columns


I have a 1M-row df with one column that is always 5000 characters long, drawn from A-Z and 0-9.

I parse the long column into 972 columns using:

def parse_long_string(df):
    df['a001'] = df['long_string'].str[0:2]
    df['a002'] = df['long_string'].str[2:4]
    df['a003'] = df['long_string'].str[4:13]
    df['a004'] = df['long_string'].str[13:22]
    df['a005'] = df['long_string'].str[22:31]
    df['a006'] = df['long_string'].str[31:40]
    # ...
    df['a972'] = df['long_string'].str[4994:]
    return df

When I call the function, I get the following warning:

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()

Reading "PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance", it seems this issue arises when creating more than 100 columns without specifying the data type of the new column; in my case, however, each new column is automatically a string.
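
For context, here is a minimal sketch that reproduces the warning on dummy data (the 120-column count is arbitrary, and I escalate the warning to an error just to show where it fires):

import warnings
import pandas as pd

df = pd.DataFrame({"long_string": ["x" * 5000] * 1000})

with warnings.catch_warnings():
    warnings.simplefilter("error", category=pd.errors.PerformanceWarning)
    try:
        for i in range(120):  # block count passes 100 partway through
            df[f"a{i:03d}"] = df["long_string"].str[i:i + 2]
    except pd.errors.PerformanceWarning as warning:
        print(warning)

df = df.copy()  # consolidates the blocks afterwards, but doesn't prevent the warning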

Is there a way around this other than warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)?


Solution

  • I don't know how you ended up with such a configuration, but yes, I can reproduce the PerformanceWarning with a similar dataframe and code to yours. So here is a possible solution to get rid of the warning, using concat:

    slices = {
        "a001": (0, 2),
        "a002": (2, 4),
        "a003": (4, 13),
        "a004": (13, 22),
        "a005": (22, 31),
        "a006": (31, 40),
        # ... add the rest here
        "a972": (4994, None)
    }  # I used a dict, but a list works as well (see the variant sketched below)
    
    def parse_long_string(df, mapper):
    
        # build all the new columns in one pass, so they arrive as a
        # single consolidated block rather than 972 separate inserts
        new_cols = pd.concat(
            {
                col: df["long_string"].str[s:e]
                for col, (s, e) in mapper.items()
            }, axis=1
        )
    
        # attach everything to the original frame in one operation
        return df.join(new_cols)
    
    out = parse_long_string(df, slices)
    
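
    If you prefer a list over a dict, the same approach works with (name, start, end) tuples; here is a quick sketch (the function name is just illustrative):

    slices_list = [
        ("a001", 0, 2),
        ("a002", 2, 4),
        # ... add the rest here
        ("a972", 4994, None),
    ]
    
    def parse_long_string_from_list(df, mapper):
        new_cols = pd.concat(
            {col: df["long_string"].str[s:e] for col, s, e in mapper},
            axis=1
        )
        return df.join(new_cols)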

    Output:

    print(out)
    
         long_string a001 a002 a003 a004  ... a968 a969 a970 a971 a972
    0     ILR03X...    IL   R0   3X   3D  ...   wm   xC   95   cZ   GT
    1     uluF81...    ul   uF   81   Jl  ...   98   RE   80   wc   Qk
    2     NLRCIh...    NL   RC   Ih   t4  ...   Xk   os   KL   Ge   lp
    3     ScrgOj...    Sc   rg   Oj   GS  ...   nM   8T   gy   Ju   8z
    4     saWtdD...    sa   Wt   dD   zN  ...   cf   o2   xX   hM   ze
    ...         ...   ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
    9995  4FxlzY...    4F   xl   zY   6b  ...   fi   Mb   V9   Vf   bK
    9996  hsjUFa...    hs   jU   Fa   fL  ...   Io   ka   SJ   73   hM
    9997  Sr4zFU...    Sr   4z   FU   3c  ...   yb   6a   AF   lv   P4
    9998  q4eon1...    q4   eo   n1   Kg  ...   9g   u1   dq   sj   Wa
    9999  5UxVXL...    5U   xV   XL   f2  ...   zC   6F   7T   kE   kt
    
    [10000 rows x 973 columns]
    
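
    A note on the design: df.join(new_cols) aligns on the index, and since new_cols is built from df itself, pd.concat([df, new_cols], axis=1) would produce the same result here. Either way, all 972 columns are attached in a single operation, so pandas never accumulates hundreds of one-column blocks.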

    Used input:

    import numpy as np
    import pandas as pd
    import string
    
    np.random.seed(0)
    
    df = pd.DataFrame({
        "long_string": ["".join(np.random.choice(
            [*string.printable[:62]], size=5000)) for _ in range(10000)]
    })
    
    slices = {f"a{i+1:03d}": (i*2, (i+1)*2) for i in range(972)}
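
    (Note that this comprehension builds uniform two-character slices purely to exercise the 972-column case; it does not reproduce the variable-width slices from your question.)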