DataFrame.insert ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

I have the following dataset (sample):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': ['Region1 (Y0001)', 'Region2 (Y0002)',
                             'Region3 (Y0003)', 'Region4 (Y0004)', 'Region5 (Y0005)'],
                   'col_2': np.arange(1, 6),
                   'col_3': np.arange(6, 11),
                   'col_4': np.arange(11, 16)})

NOTE: I had to change the real values, but the data types and structure are the same.

I can't figure out the error I get when calling DataFrame.insert() like this:

df.insert(df.columns.get_loc('col_1'),
      'new_col',
      df['col_1'].str.extract(r'\((\w+)\)'))

I checked that DataFrame.insert() itself works by running the following, and it worked!

df.insert(0,'Random_Col',55)

As far as I can tell, this error came up after I upgraded pandas to 1.4.3; I didn't have this issue before. However, that doesn't explain why the check above ran without error.

How can I resolve this error?

Solution

DataFrame.insert takes three required arguments: loc, an integer position; column, a valid column label; and value, which is either a single value or 1-dimensional data (e.g. a Series or array-like).
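As a minimal sketch of the kinds of value that insert accepts, the snippet below uses a small throwaway frame (demo and its column names are made up for illustration and are not part of the question's data):

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({'a': [1, 2, 3]})

# A scalar is broadcast to every row.
demo.insert(0, 'scalar_col', 55)

# A 1-dimensional Series is accepted and aligned on the index.
demo.insert(1, 'series_col', pd.Series(['x', 'y', 'z']))

# A plain list with one entry per row is accepted as well.
demo.insert(2, 'list_col', [10, 20, 30])

print(demo.columns.tolist())
```

This is also why the df.insert(0,'Random_Col',55) check in the question succeeds: a scalar is a valid value.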

Currently (pandas 1.4.3) str.extract returns a DataFrame by default:

df['col_1'].str.extract(r'\((\w+)\)')

       0
0  Y0001
1  Y0002
2  Y0003
3  Y0004
4  Y0005

The error message:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

indicates that a 2-dimensional structure (a DataFrame) was passed as the value to insert, which is one dimension more than insert expects.
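One way to confirm this is to inspect the type and dimensionality of what str.extract returns. The sketch below uses a shortened copy of the sample column (the variable names s and as_frame are only for illustration):

```python
import pandas as pd

# Shortened stand-in for df['col_1'] from the question.
s = pd.Series(['Region1 (Y0001)', 'Region2 (Y0002)'])

# With the default arguments the result is a one-column DataFrame.
as_frame = s.str.extract(r'\((\w+)\)')

print(type(as_frame).__name__)  # class name of the result
print(as_frame.ndim)            # number of dimensions
print(as_frame.shape)           # (rows, columns)
```

Even though there is only one column, the object is still 2-dimensional, which is what insert rejects.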

There are a few options to fix this.

Since the pattern has a single capture group, we can stop the output from expanding into a DataFrame by passing expand=False, which makes str.extract return a Series:

df.insert(
    df.columns.get_loc('col_1'),
    'new_col',
    df['col_1'].str.extract(r'\((\w+)\)', expand=False)
)

Alternatively, select a column from the output, in this case column 0:

df.insert(
    df.columns.get_loc('col_1'),
    'new_col',
    df['col_1'].str.extract(r'\((\w+)\)')[0]  # Get capture group (column) 0
)

Either option produces df:

  new_col            col_1  col_2  col_3  col_4
0   Y0001  Region1 (Y0001)      1      6     11
1   Y0002  Region2 (Y0002)      2      7     12
2   Y0003  Region3 (Y0003)      3      8     13
3   Y0004  Region4 (Y0004)      4      9     14
4   Y0005  Region5 (Y0005)      5     10     15
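As a quick sanity check, the two approaches can be compared side by side. This is only a sketch on a shortened copy of the sample column; the names col, pattern, via_expand and via_column are illustrative:

```python
import pandas as pd

# Shortened stand-in for df['col_1'] from the question.
col = pd.Series(['Region1 (Y0001)', 'Region2 (Y0002)', 'Region3 (Y0003)'])
pattern = r'\((\w+)\)'

via_expand = col.str.extract(pattern, expand=False)  # Series returned directly
via_column = col.str.extract(pattern)[0]             # column 0 of the DataFrame

# Both results are 1-dimensional and hold the same extracted values.
print(via_expand.ndim, via_column.ndim)
print(via_expand.tolist() == via_column.tolist())
```

Both results are 1-dimensional Series, so either can be passed as the value argument of insert.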