I want to create a new column and calculate the value based on the other column values of the dataframe. The input data is a large .csv file containing temperature values for each hour of the day, for multiple years. The dataframe looks like this:
HH T date t_min t_max
0 8 94.0 1991-04-01 81 110
1 9 90.0 1991-04-01 81 110
2 10 95.0 1991-04-01 81 110
3 11 108.0 1991-04-01 81 110
4 12 110.0 1991-04-01 81 110
5 13 109.0 1991-04-01 81 110
6 14 81.0 1991-04-01 81 110
7 15 85.0 1991-04-01 81 110
8 16 85.0 1991-04-01 81 110
9 17 87.0 1991-04-01 81 110
HH = hours; T = temperature of the hour; t_min = lowest day temp; t_max = highest day temp
I tried calculating a new column "HTD" (hourly temp deviation) with the following code:
import pandas as pd
df_t = pd.read_csv('file.csv')
# calculation = ( Tu - Tmin ) / ( Tmax - Tmin ) * (Tmax - Tmin)
df_t[ 'HTD'] = (df_t.T - df_t.t_min) / (df_t.t_max - df_t.t_min) * (df_t.t_max - df_t.t_min)
This results in a TypeError: unsupported operand type(s) for -: 'str' and 'int' at the last line. The problem seems to be the T column, for the code runs when I use df_t.t_min instead of df_t.T. I checked the data of the T column:
#First check: 0 values
#Second check: non-numerical values:
result = df_t.applymap(np.isreal)
for value in result['T']:
if value == False:
Which showed no Null values, and reported no non-numerical values. I also tried using .astype() to make sure the data is the right type.
What is my best course of action to try to solve this issue? (Apologies if my question is incomplete or unclear, this is my first time.)
Normally you can use df.<colname> as a shortcut for df['colname']; however, df.T is a property in pandas that returns the transpose of the dataframe (rows become columns, columns become rows).
As a simple example:
df = pd.DataFrame({"Something": [1,2,3], 'T': [35,36,37]})
df.T # returns the transpose, not the 'T' column
You can fix this by simply accessing the column using square brackets:
df["T"] # returns the column you want
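Applied to the calculation from the question, the bracket form could look like the sketch below. The small inline dataframe is made up from the first few rows shown in the question and stands in for the real file.csv; temp_range is just a helper name introduced here.

```python
import pandas as pd

# Stand-in for pd.read_csv('file.csv'); values copied from the question's sample rows
df_t = pd.DataFrame({
    "HH": [8, 9, 10],
    "T": [94.0, 90.0, 95.0],
    "t_min": [81, 81, 81],
    "t_max": [110, 110, 110],
})

# Same formula as in the question, but every column is accessed with brackets,
# so "T" refers to the temperature column and not the transpose
temp_range = df_t["t_max"] - df_t["t_min"]
df_t["HTD"] = (df_t["T"] - df_t["t_min"]) / temp_range * temp_range

print(df_t)
```

Using brackets for all the columns, not just T, keeps the expression consistent and avoids having to remember which names are safe.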
Or by renaming your column to one that doesn't clash with a built-in pandas dataframe property or method (other examples include df.shape, df.size, etc.; use dir(df) to see all the potential clashes!)
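If you would rather check programmatically than read through dir(df), something along these lines should work. It is only a rough sketch: the example dataframe and the name clashes are made up for illustration, and it only checks string column names against attributes defined on the DataFrame class.

```python
import pandas as pd

# Made-up example: "T" and "size" collide with DataFrame attributes, "Something" does not
df = pd.DataFrame({"Something": [1, 2, 3], "T": [35, 36, 37], "size": [1, 2, 3]})

# A column name clashes if the DataFrame class already defines an attribute with that name
clashes = [
    col for col in df.columns
    if isinstance(col, str) and hasattr(pd.DataFrame, col)
]

print(clashes)
```

Any column that shows up in this list needs df["name"] rather than df.name.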
In general the square bracket access is safer, if a bit less convenient, as you are guaranteed never to clash with an existing name. I would stick to the .attribute access only as a shortcut or for names I am certain will not clash (most pandas attributes are lowercase, so an uppercase column name is usually safe, though as T shows, even that is not guaranteed).