I have a 1M-row DataFrame with one column that is always a 5000-character string (A-Z, 0-9).
I parse that long column into 972 columns using:
def parse_long_string(df):
    df['a001'] = df['long_string'].str[0:2]
    df['a002'] = df['long_string'].str[2:4]
    df['a003'] = df['long_string'].str[4:13]
    df['a004'] = df['long_string'].str[13:22]
    df['a005'] = df['long_string'].str[22:31]
    df['a006'] = df['long_string'].str[31:40]
    # ...
    df['a972'] = df['long_string'].str[4994:]
    return df
When I call the function, I get the following warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
From what I have read about this PerformanceWarning, the issue arises when creating more than 100 columns without specifying the data type of the new columns, but in my case each new column is automatically a string.
Is there a way around this other than warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)?
I don't know how you ended up with such a configuration, but yes, I can reproduce the PerformanceWarning with a DataFrame and code similar to yours. So here is a possible solution to get rid of the warning, using pd.concat to build all the new columns at once:
slices = {
    "a001": (0, 2),
    "a002": (2, 4),
    "a003": (4, 13),
    "a004": (13, 22),
    "a005": (22, 31),
    "a006": (31, 40),
    # ... add the rest here
    "a972": (4994, None),
}  # I used a dict but you can choose a list as well
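Typing all 972 entries by hand is impractical, so you will probably want to generate the mapper instead. Here is a minimal sketch, assuming you have the field widths in order in a hypothetical widths list (the values below are made up):

from itertools import accumulate

widths = [2, 2, 9, 9, 9, 9]          # hypothetical: one width per field, 972 entries in total
bounds = [0, *accumulate(widths)]    # running offsets: 0, 2, 4, 13, 22, 31, 40
slices = {f"a{i:03d}": (bounds[i - 1], bounds[i]) for i in range(1, len(widths) + 1)}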
def parse_long_string(df, mapper):
    new_cols = pd.concat(
        {
            col: df["long_string"].str[s:e]
            for col, (s, e) in mapper.items()
        },
        axis=1,
    )
    return df.join(new_cols)
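df.join(new_cols) simply aligns the new columns back on the same index; since they were all built from the same column, the last line could equivalently be written as a single concatenation, which also attaches everything in one step and keeps the frame de-fragmented:

    return pd.concat([df, new_cols], axis=1)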
out = parse_long_string(df, slices)
Output:
print(out)
long_string a001 a002 a003 a004 ... a968 a969 a970 a971 a972
0 ILR03X... IL R0 3X 3D ... wm xC 95 cZ GT
1 uluF81... ul uF 81 Jl ... 98 RE 80 wc Qk
2 NLRCIh... NL RC Ih t4 ... Xk os KL Ge lp
3 ScrgOj... Sc rg Oj GS ... nM 8T gy Ju 8z
4 saWtdD... sa Wt dD zN ... cf o2 xX hM ze
... ... ... ... ... ... ... ... ... ... ... ...
9995 4FxlzY... 4F xl zY 6b ... fi Mb V9 Vf bK
9996 hsjUFa... hs jU Fa fL ... Io ka SJ 73 hM
9997 Sr4zFU... Sr 4z FU 3c ... yb 6a AF lv P4
9998 q4eon1... q4 eo n1 Kg ... 9g u1 dq sj Wa
9999 5UxVXL... 5U xV XL f2 ... zC 6F 7T kE kt
[10000 rows x 973 columns]
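If you want to be sure the warning is really gone rather than just hidden, a quick check (just a sketch) is to temporarily escalate PerformanceWarning to an error around the call:

import warnings

with warnings.catch_warnings():
    # turn the warning into an exception: this raises if the frame were still
    # being built column-by-column via frame.insert
    warnings.simplefilter("error", pd.errors.PerformanceWarning)
    out = parse_long_string(df, slices)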
Used input:
import numpy as np
import pandas as pd
import string
np.random.seed(0)
df = pd.DataFrame({
    "long_string": [
        "".join(np.random.choice([*string.printable[:62]], size=5000))
        for _ in range(10000)
    ]
})
slices = {f"a{i+1:03d}": (i*2, (i+1)*2) for i in range(972)}