python dataframe ipython gspread

DataFrame max() does not return the maximum value


Real beginner question here, but it is so simple, I'm genuinely stumped. Python/DataFrame newbie.

I've loaded a DataFrame from a Google Sheet, but any graphing or calculations I attempt produce bogus results. Loading code:

# Setup
!pip install --upgrade -q gspread

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

worksheet = gc.open('Linear Regression - Brain vs. Body Predictor').worksheet("Raw Data")

rows = worksheet.get_all_values()

# Convert to a DataFrame and render.
import pandas as pd
df = pd.DataFrame.from_records(rows)

This seems to work fine, and the data looks correctly loaded when I print the DataFrame, but running max() returns obviously wrong results. For example:

print(df[0])
print("Max:", df[0].max())

Will output:

0     3.385
1      0.48
2      1.35
3       465
4     36.33
5     27.66
6     14.83
7      1.04
8      4.19
9     0.425
10    0.101
11     0.92
12        1
13    0.005
14     0.06
15      3.5
16        2
17      1.7
18     2547
19    0.023
20    187.1
21      521
22    0.785
23       10
24      3.3
25      0.2
26     1.41
27      529
28      207
29       85
      ...  
32     6654
33      3.5
34      6.8
35       35
36     4.05
37     0.12
38    0.023
39     0.01
40      1.4
41      250
42      2.5
43     55.5
44      100
45    52.16
46    10.55
47     0.55
48       60
49      3.6
50    4.288
51     0.28
52    0.075
53    0.122
54    0.048
55      192
56        3
57      160
58      0.9
59     1.62
60    0.104
61    4.235
Name: 0, Length: 62, dtype: object
Max: 85

Obviously, the maximum value is way out -- it should be 6654, not 85.

What on earth am I doing wrong?

First StackOverflow post, so thanks in advance.


Solution

  • Look at the end of your print() output: it says dtype: object. You'll also notice that your pandas Series appears to mix integer-looking values with float-looking ones (e.g. 6654 and 3.5 in the same Series).

    These are good hints that you have a Series of strings, and that max() is comparing them as strings (character by character) rather than as numbers. What you want is a Series of numbers (specifically floats), compared numerically.

    Check the following reproducible example:

    >>> df = pd.DataFrame({'col': ['0.02', '9', '85']}, dtype=object)
    >>> df.col.max()
    '9'
    

    That happens because strings are compared character by character, and '9' sorts after '8':

    >>> '9' > '85'
    True
    

    You want these values to be treated as floats instead. Use pd.to_numeric:

    >>> df['col'] = pd.to_numeric(df.col)
    >>> df.col.max()
    85.0
    

    For more on str and int comparison, check this question
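
Applied to a DataFrame like the one in the question, the same fix might look like the sketch below. The rows list is a hypothetical stand-in for what worksheet.get_all_values() returns (as far as I know, gspread hands back every cell as a string), using a few values from the question's first column. The names string_max, numeric_df and numeric_max are made up for illustration, and errors="coerce" is an optional choice, not something the original answer used.

```python
import pandas as pd

# Hypothetical stand-in for worksheet.get_all_values(): every cell is a string,
# so the resulting column has dtype object.
rows = [["3.385"], ["465"], ["6654"], ["85"]]
df = pd.DataFrame.from_records(rows)

# max() on strings compares them lexicographically, so '85' beats '6654'.
string_max = df[0].max()

# Convert every column to a numeric dtype. errors="coerce" (an assumption here)
# turns cells that cannot be parsed as numbers into NaN instead of raising.
numeric_df = df.apply(pd.to_numeric, errors="coerce")
numeric_max = numeric_df[0].max()

print(string_max)
print(numeric_max)
```

If the sheet has a header row, it would come through as the first data row and be coerced to NaN here, so it may be worth slicing it off (or passing it as the column names) before converting.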