Tags: python, pandas, numpy, sorting, describe

Pandas Dataframe from NumPy array - incorrect datatypes and can't change


I am trying to sort the following Pandas DataFrame in Python:

import numpy as np
import pandas as pd

heading_cols = [
    "Video Title",
    "Up Ratings",
    "Down Ratings",
    "Views",
    "User Name",
    "Subscribers",
]
column_1 = [
    "Adelaide",
    "Brisbane",
    "Darwin",
    "Hobart",
    "Sydney",
    "Melbourne",
    "Perth",
]
column_2 = [1295, 5905, 112, 1357, 2058, 1566, 5386]
column_3 = [1158259, 1857594, 120900, 205556, 4336374, 3806092, 1554769]
column_4 = [600.5, 1146.4, 1714.7, 619.5, 1214.8, 646.9, 869.4]
column_5 = ["Bob", "Tom", "Dave", "Sally", "Rick", "Mary", "Roberta"]
column_6 = [25000, 30000, 15000, 15005, 20000, 31111, 11000]

# Generate data:
xdata_arr = np.array([column_1, column_2, column_3, column_4, column_5, column_6]).T

# Generate the DataFrame:
df = pd.DataFrame(xdata_arr, columns=heading_cols)
print(df)

The next 2 lines of code are causing problems:

# Print DataFrame and basic stats:
print(df["Up Ratings"].describe())
print(df.sort('Views', ascending=False))

Problems:

  • The sorting is not working for any column.
  • The statistics should include things like mean, std, min, max, etc. These do not show up.

The problem is that df.dtypes reports "object" for every column. This is wrong: some should be integers, but I can't figure out how to convert only the numeric ones. I have tried:
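A minimal sketch of where the non-numeric dtype likely comes from, using a shortened, illustrative subset of the data above (the names arr and df_demo are hypothetical, not part of the original code). A NumPy array holds a single dtype, so stacking string, integer and float lists into one array turns every value into a string before pandas ever sees it:

```python
import numpy as np
import pandas as pd

# Shortened, illustrative subset of the columns in the question.
names = ["Adelaide", "Brisbane", "Darwin"]
up_ratings = [1295, 5905, 112]
views = [600.5, 1146.4, 1714.7]

# One array means one dtype: mixing str, int and float coerces everything to str.
arr = np.array([names, up_ratings, views]).T
print(arr.dtype)       # a unicode string dtype
print(arr[0, 1])       # the rating 1295 is now stored as text

# The DataFrame inherits the strings, so no column is numeric
# (the question's pandas version reports them as "object").
df_demo = pd.DataFrame(arr, columns=["Video Title", "Up Ratings", "Views"])
print(df_demo.dtypes)
```

This would explain why describe() shows only count/unique/top/freq and why any sort is lexicographic rather than numeric.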

df.convert_objects(convert_numeric=True)

but this is not working. So, then I went to the NumPy array and tried to change the dtypes there:

dt = np.dtype(
    [
        (heading_cols[0], np.str_),
        (heading_cols[1], np.int16),
        (heading_cols[2], np.int16),
        (heading_cols[3], np.int16),
        (heading_cols[4], np.str_),
        (heading_cols[5], np.int16),
    ]
)

but this does not work either.

Is there a way to manually change the dtype to numeric?


Solution

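  • Like most methods in pandas, convert_objects returns a NEW object rather than modifying df in place, so the result has to be assigned back (df = df.convert_objects(convert_numeric=True)) for the change to stick.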

    In [20]: df.convert_objects(convert_numeric=True)
    Out[20]: 
      Video Title  Up Ratings  Down Ratings   Views User Name  Subscribers
    0    Adelaide        1295       1158259   600.5       Bob        25000
    1    Brisbane        5905       1857594  1146.4       Tom        30000
    2      Darwin         112        120900  1714.7      Dave        15000
    3      Hobart        1357        205556   619.5     Sally        15005
    4      Sydney        2058       4336374  1214.8      Rick        20000
    5   Melbourne        1566       3806092   646.9      Mary        31111
    6       Perth        5386       1554769   869.4   Roberta        11000
    
    In [21]: df.convert_objects(convert_numeric=True).dtypes
    Out[21]: 
    Video Title      object
    Up Ratings        int64
    Down Ratings      int64
    Views           float64
    User Name        object
    Subscribers       int64
    dtype: object
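Note that convert_objects and DataFrame.sort were both deprecated and later removed from pandas. A minimal sketch of the same idea on a current version, assuming pd.to_numeric and sort_values as the replacements and using a shortened subset of the question's data (the numeric_cols list is an illustrative name, not from the original):

```python
import numpy as np
import pandas as pd

# Shortened subset of the question's data, built the same way (all strings).
heading_cols = ["Video Title", "Up Ratings", "Views"]
xdata_arr = np.array([
    ["Adelaide", "Brisbane", "Darwin"],
    [1295, 5905, 112],
    [600.5, 1146.4, 1714.7],
]).T
df = pd.DataFrame(xdata_arr, columns=heading_cols)

# Convert only the columns known to hold numbers, and assign the result back.
numeric_cols = ["Up Ratings", "Views"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
print(df["Up Ratings"].describe())               # now includes mean, std, min, max
print(df.sort_values("Views", ascending=False))  # numeric sort, not lexicographic
```

Building the DataFrame from a dict of the original lists (one key per column) instead of a single NumPy array would avoid the string coercion in the first place, since each column then keeps its own dtype.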