I have the following dataset (sample):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': ['Region1 (Y0001)', 'Region2 (Y0002)', 'Region3 (Y0003)',
              'Region4 (Y0004)', 'Region5 (Y0005)'],
    'col_2': np.arange(1, 6),
    'col_3': np.arange(6, 11),
    'col_4': np.arange(11, 16)})
NOTE: I had to change the real values, but the data types and structure are the same.
I can't figure out the error I get when using DataFrame.insert():
df.insert(df.columns.get_loc('col_1'),
          'new_col',
          df['col_1'].str.extract(r'\((\w+)\)'))
I checked that DataFrame.insert() itself works by running the following, which succeeded:
df.insert(0,'Random_Col',55)
As far as I can tell, this error came up after I upgraded pandas to 1.4.3; I didn't have this issue before. However, this doesn't explain why the above check was executed flawlessly.
How can I resolve this error?
DataFrame.insert expects 3 positional arguments: loc, which is an int; column, which is a valid column name; and value, which is either a single value or 1-dimensional data (e.g. a Series or array-like).
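To make the two accepted kinds of value concrete, here is a minimal sketch on a toy frame. The frame and the column names const and doubled are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Toy frame, invented for this illustration only.
toy = pd.DataFrame({"a": [1, 2, 3]})

# A single (scalar) value is broadcast to every row.
toy.insert(0, "const", 55)

# 1-dimensional data (here a Series) is aligned on the index.
toy.insert(1, "doubled", toy["a"] * 2)

print(list(toy.columns))  # ['const', 'doubled', 'a']
```

Both calls pass a value that is at most 1-dimensional, which is why they succeed.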
As of pandas 1.4.3, str.extract returns a DataFrame by default (expand=True), even when the pattern has a single capture group:
df['col_1'].str.extract(r'\((\w+)\)')
       0
0  Y0001
1  Y0002
2  Y0003
3  Y0004
4  Y0005
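A quick way to see the difference is to compare the type and number of dimensions of the result with and without expand=False. This is a small sketch on a two-row Series invented for the illustration; the variable names are assumptions.

```python
import pandas as pd

# Two sample values in the same shape as col_1 (made up for this sketch).
s = pd.Series(["Region1 (Y0001)", "Region2 (Y0002)"])
pattern = r"\((\w+)\)"

as_frame = s.str.extract(pattern)                 # expand=True is the default
as_series = s.str.extract(pattern, expand=False)  # one capture group -> Series

print(type(as_frame).__name__, as_frame.ndim)     # DataFrame 2
print(type(as_series).__name__, as_series.ndim)   # Series 1
```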
The error message:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
indicates that a 2-dimensional structure (a DataFrame) was provided as the value argument to insert, which is one dimension more than expected.
There are a few options to fix this.
Pass expand=False so that str.extract returns a Series:
df.insert(
    df.columns.get_loc('col_1'),
    'new_col',
    df['col_1'].str.extract(r'\((\w+)\)', expand=False)
)
Or select the single capture-group column from the returned DataFrame:
df.insert(
    df.columns.get_loc('col_1'),
    'new_col',
    df['col_1'].str.extract(r'\((\w+)\)')[0]  # Get capture group (column) 0
)
Either option produces df:
  new_col            col_1  col_2  col_3  col_4
0   Y0001  Region1 (Y0001)      1      6     11
1   Y0002  Region2 (Y0002)      2      7     12
2   Y0003  Region3 (Y0003)      3      8     13
3   Y0004  Region4 (Y0004)      4      9     14
4   Y0005  Region5 (Y0005)      5     10     15
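As a sanity check, the sketch below rebuilds the sample frame twice, applies one fix to each copy, and compares the results. The helper make_df and the names df_a and df_b are hypothetical and exist only for this check.

```python
import numpy as np
import pandas as pd

def make_df():
    # Hypothetical helper: rebuilds the sample frame from the question.
    return pd.DataFrame({
        "col_1": [f"Region{i} (Y000{i})" for i in range(1, 6)],
        "col_2": np.arange(1, 6),
        "col_3": np.arange(6, 11),
        "col_4": np.arange(11, 16),
    })

pattern = r"\((\w+)\)"

# Fix 1: ask str.extract for a Series directly.
df_a = make_df()
df_a.insert(df_a.columns.get_loc("col_1"), "new_col",
            df_a["col_1"].str.extract(pattern, expand=False))

# Fix 2: take column 0 of the DataFrame that str.extract returns.
df_b = make_df()
df_b.insert(df_b.columns.get_loc("col_1"), "new_col",
            df_b["col_1"].str.extract(pattern)[0])

print(df_a.equals(df_b))  # True
```

Since both produce the same frame, the choice is mostly stylistic; expand=False avoids building the intermediate DataFrame.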