I have a large dataset and am trying to convert 'object' columns containing only numeric data to an integer dtype in Python/pandas. Every approach I have tried raises the following error:
CODE SNIPPET (see below for options I have tried)
PATH/frame.py in __setitem__(self, key, value)
3482 self._setitem_frame(key, value)
3483 elif isinstance(key, (Series, np.ndarray, list, Index)):
-->3484 self._setitem_array(key, value)
3485 else:
PATH/frame.py in _setitem_array(self, key, value)
3507 raise ValueError("Columns must be same length as key")
3508 for k1, k2 in zip(key, value.columns):
-->3509 self[k1] = value[k2]
3510 else:
3511 indexer = self.loc._convert_to_indexer(key, axis=1)
PATH/frame.py in __setitem__(self, key, value)
3485 else:
3486 #set column
-->3487 self._set_item(key, value)
3488
3489 def _setitem_slice(self, key, value):
PATH/frame.py in _set_item(self, key, value)
3562
3563 self._ensure_valid_index(value)
-->3564 value = self._sanitize_column(key, value)
3565 NDFrame._set_item(self, key, value)
PATH/frame.py in _sanitize_column(self, key, value, broadcast)
3778 if broadcast and key in self.columns and value.ndim == 1:
3780 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
-->3781 existing_piece = self[key]
3782 if isinstance(existing_piece, DataFrame):
3783 value = np.tile(value, (len(existing_piece.columns), 1))
PATH/frame.py in __getitem__(self, key)
2971 if self.columns.nlevels > 1:
2972 return self._getitem_multilevel(key)
-->2973 return self._get_item_cache(key)
2974
2975 # Do we have a slicer (on rows)?
PATH/generic.py in _get_item_cache(self, item)
3268 res = cache.get(item)
3269 if res is None:
-->3270 values = self._data.get(item)
3271 res = self._box_item_values(item, values)
3272 cache[item] = res
PATH/managers.py in get(self, item)
958 raise ValueError("cannot label index with a null key")
959
-->960 return self.iget(loc)
961 else:
962
PATH/managers.py in iget(self, i)
975 Otherwise return as a ndarray
976 """
-->977 block = self.blocks[self._blknos[i]]
978 values = block.iget(self._blklocs[i])
979 if values.ndim != 1:
TypeError: only integer scalar arrays can be converted to a scalar index
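The frames above pass through a check on `self.columns.is_unique`, so one thing that may be worth ruling out is repeated column names, which a database query can return. This is only a guess from the traceback, not something confirmed in the post. The sample frame and its column names below are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a repeated column name, as a SQL join might return.
df = pd.DataFrame(
    [["207", "01", "x"]],
    columns=["column 1", "column 2", "column 1"],
)

# False whenever any column name appears more than once.
print(df.columns.is_unique)

# Names of the columns that repeat an earlier column.
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)

# One option (an assumption about what you want): keep the first occurrence only.
deduped = df.loc[:, ~df.columns.duplicated()]
print(deduped.columns.tolist())
```

If `dupes` is empty for your frame, duplicate names are not the cause and this check can be ignored.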
What I have tried, all of which returned the error above:
df[["column 1", "column 2", "column 3", "column 4"]] = df[["column 1", "column 2", "column 3", "column 4"]].apply(pd.to_numeric, errors='raise')
where df is the DataFrame and "column 1", etc. are the column names.
I have also tried:
df["column1"] = df["column1"].astype(str).astype(int)
AND
df["column1"] = pd.to_numeric(df["column1"], errors='coerce')
which also returned the same error. Additional attempts after the first post: I have also tried:
def convert_numbers(val):
    """
    Convert number string to integer
    """
    new_val = val
    return int(new_val)

df["column1"].apply(convert_numbers)
which again returned the same error.
I double-checked the data types: df.dtypes shows the columns I'm trying to change as "object" no matter what I do. I also double-checked the data, and there are no missing/null values in the columns in question. The columns are entirely numeric. One column holds three-digit values (e.g. 207, 710, 115), another holds two-digit values (01, 02, 03), and the last holds five-digit values (00001, 00002, 00003).
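For reference, here is a minimal sketch of what the conversion looks like on a small frame that imitates the three formats described above. The sample values and the three column names are made up for illustration, and this does not reproduce the error from the traceback.

```python
import pandas as pd

# Hypothetical sample data mimicking the three column formats described above.
df = pd.DataFrame({
    "column 1": ["207", "710", "115"],
    "column 2": ["01", "02", "03"],
    "column 3": ["00001", "00002", "00003"],
})

cols = ["column 1", "column 2", "column 3"]

# Convert each column separately; errors="raise" fails loudly on bad values.
converted = df[cols].apply(pd.to_numeric, errors="raise")

# Zero padding is not preserved once the values are integers.
print(converted.dtypes)
print(converted)
```

Note that values such as 01 or 00001 lose their leading zeros after conversion, so if the padding carries meaning (for example in an ID code), keeping the column as strings may be the better choice.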
Any help on this would be appreciated. If I find the answer I will post it here.
I found an answer. The problem could be that I am working with an Oracle database connection, I'm not sure. I would still love to hear more comments if anyone has a simpler way to do this in Python, but here's how I did it:
# errors='coerce' stores all non-convertible values as NA; errors='ignore' keeps the original values, so the column may have mixed data types.
df['column names'] = df[['column names']].apply(pd.to_numeric, errors = 'coerce').fillna(df)
Beware that using coerce with non-numeric items may remove their data and switch it to NA. :) This worked though!
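To see which values `errors='coerce'` would blank out before relying on it, something like the following sketch may help. The series and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical column with one value that cannot be parsed as a number.
s = pd.Series(["1", "2", "abc"], dtype="object")

# Unparseable entries become NaN instead of raising an error.
coerced = pd.to_numeric(s, errors="coerce")

# Entries that were present in the original but failed to convert.
bad_values = s[coerced.isna() & s.notna()]

print(coerced)
print(bad_values)
```

If `bad_values` comes back empty, every entry converted cleanly and nothing was replaced with NaN.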