python pandas dataframe typeerror scalar

Errors in converting numeric data frame to integer in pandas -- "only integer scalar arrays can be converted to a scalar index"


I have a large dataset and am trying to convert 'object' columns that contain only numeric data to an integer dtype in Python/pandas. Every approach I have attempted produces the following error:

CODE SNIPPET (see below for options I have tried)
PATH/frame.py in __setitem__(self, key, value)
     3482              self._setitem_frame(key, value)
     3483         elif isinstance(key, (Series, np.ndarray, list, Index)):
  -->3484              self._setitem_array(key, value)
     3485         else: 

PATH/frame.py in _setitem_array(self, key, value)
     3507                  raise ValueError("Columns must be same length as key")
     3508              for k1, k2 in zip(key, value.columns):
  -->3509                  self[k1] = value[k2]
     3510           else: 
     3511              indexer = self.loc._convert_to_indexer(key, axis=1)
    
PATH/frame.py in __setitem__(self, key, value)
     3485         else: 
     3486             #set column
  -->3487             self._set_item(key, value)
     3488
     3489    def _setitem_slice(self, key, value):

PATH/frame.py in _set_item(self, key, value)
     3562
     3563     self._ensure_valid_index(value)
  -->3564     value = self._sanitize_column(key, value)
     3565     NDFrame._set_item(self, key, value)

PATH/frame.py in _sanitize_column(self, key, value, broadcast)
     3778     if broadcast and key in self.columns and value.ndim == 1: 
     3780         if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
  -->3781             existing_piece = self[key]
     3782             if isinstance(existing_piece, DataFrame):
     3783                 value = np.tile(value, (len(existing_piece.columns), 1))

PATH/frame.py in __getitem__(self, key)
     2971     if self.columns.nlevels > 1:
     2972          return self._getitem_multilevel(key)
  -->2973     return self._get_item_cache(key)
     2974
     2975     # Do we have a slicer (on rows)?

PATH/generic.py in _get_item_cache(self, item)
     3268    res = cache.get(item)
     3269    if res is None:
  -->3270         values = self._data.get(item)
     3271         res = self._box_item_values(item, values)
     3272         cache[item] = res

PATH/managers.py in get(self, item)
     958                      raise ValueError("cannot label index with a null key")
     959      
  -->960                return self.iget(loc)
     961          else:
     962
    
PATH/managers.py in iget(self, i)
     975     Otherwise return as a ndarray
     976     """
  -->977     block = self.blocks[self.blknos[i]]
     978     values = block.iget(self._blklocs[i])
     979     if values.ndim != 1:

    TypeError: only integer scalar arrays can be converted to a scalar index
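
One detail in this traceback may narrow things down. The frame at line 3781 (`existing_piece = self[key]`) sits under the condition `not self.columns.is_unique or isinstance(self.columns, MultiIndex)`, so pandas appears to be seeing either repeated column names or a MultiIndex on this data frame. That is an inference from the traceback, not something confirmed in the post. A minimal sketch for checking it, using a made-up frame in place of the real data:

```python
import pandas as pd

# Hypothetical frame with a repeated column name, standing in for the real data.
df = pd.DataFrame(
    [["207", "01", "x"], ["710", "02", "y"]],
    columns=["column 1", "column 2", "column 1"],
)

# False means at least one column label occurs more than once.
columns_unique = df.columns.is_unique

# Labels that repeat (every occurrence after the first is flagged).
duplicate_labels = df.columns[df.columns.duplicated()].tolist()

# True if the columns are hierarchical (a MultiIndex).
has_multiindex = isinstance(df.columns, pd.MultiIndex)

print(columns_unique, duplicate_labels, has_multiindex)
```

If duplicates do show up, `df["column 1"]` returns a DataFrame rather than a Series, which would be worth ruling out before retrying the conversions.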

What I have tried, all of which returned the error above:

df[["column 1", "column 2", "column 3", "column 4"]] = df[["column 1", "column 2", "column 3", "column 4"]].apply(pd.to_numeric, errors='raise')

where df is the data frame and "column 1", etc. are its column names.

I have also tried:

df["column1"] = df["column1"].astype(str).astype(int)

AND

df["column1"] = pd.to_numeric(df["column1"], errors='coerce')

which also returned the same error.

Additional attempts after the first post. I have also tried:

def convert_numbers(val):
    """
    Convert number string to integer
    """
    new_val = val
    return int(new_val)

df["column1"].apply(convert_numbers)

which again returned the same error.

I double-checked the data types: df.dtypes reports the columns I'm trying to change as "object" no matter what I do. I also confirmed that there are no missing/null values in the columns in question and that the columns are entirely numeric. One column holds three-digit values (e.g. 207, 710, 115), another holds two-digit values (01, 02, 03), and the last holds five-digit values (00001, 00002, 00003).
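
For reference, the checks described above can be sketched roughly as follows. The frame and its values are invented for illustration (the column names are the placeholders used in this post), and `dtype=object` is set only to mirror what df.dtypes reports in the question.

```python
import pandas as pd

# Invented sample data; the real columns come from the asker's database.
df = pd.DataFrame(
    {
        "column 1": ["207", "710", "115"],
        "column 2": ["01", "02", "03"],
        "column 3": ["00001", "00002", "00003"],
    },
    dtype=object,
)

cols = ["column 1", "column 2", "column 3"]

# 1. Data types of the columns to convert.
print(df[cols].dtypes)

# 2. Missing values per column (0 everywhere means no nulls).
null_counts = df[cols].isna().sum()

# 3. Whether every value in each column consists only of digits.
all_digits = {c: bool(df[c].str.isdigit().all()) for c in cols}

# Converting to numbers drops leading zeros: "00001" becomes 1.
converted = df[cols].apply(pd.to_numeric, errors="raise")
print(converted.dtypes)
```

On clean data like this the conversion succeeds, which suggests the error in the question comes from something other than the values themselves. Note that the zero-padded codes lose their padding once they are integers.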

Any help on this would be appreciated. If I find the answer, I will post it here.


Solution

  • I found an answer. The problem may be that I am working with an Oracle database connection, but I'm not sure. I would still love to hear from anyone who has a simpler way to do this in Python, but here's how I did it:

    # 'coerce' stores all non-convertible values as NA; 'ignore' keeps the original values, so the column may end up with mixed data types.
    df['column names'] = df[['column names']].apply(pd.to_numeric, errors = 'coerce').fillna(df)
    

    Beware that using errors='coerce' on non-numeric items may remove their data and switch it to NA. :) This worked, though!
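
To make the behaviour of `errors='coerce'` concrete, here is a small sketch on invented data. It shows that values which cannot be parsed become NaN, and that filling them back in from the original (roughly what the `.fillna(df)` step above does) keeps the data but leaves a column of mixed types. The Series below is hypothetical, not the data from the question.

```python
import pandas as pd

# Invented column: two numeric strings and one value that is not a number.
raw = pd.Series(["207", "n/a", "115"], dtype=object)

# errors="coerce" turns unparseable values into NaN instead of raising.
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced.tolist())  # [207.0, nan, 115.0]

# Filling the NaN slots from the original restores the lost value,
# but the column then mixes numbers and strings.
restored = coerced.fillna(raw)
print(restored.tolist())
print(restored.dtype)
```

If the column really is entirely numeric, the `fillna` step changes nothing and the result stays numeric, so it only matters when some values fail to parse.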