I must be making a very basic mistake. I am trying to select only certain columns from a dataframe and drop the rows that contain NA values. I am also supposed to reset the row index after removing those rows.
This is what my dataset looks like
CRIM ZN INDUS CHAS NOX ... TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 ... 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 ... 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 ... 242 17.8 392.83 4.03 34.7
This is what I have tried so far
F = HousingData.dropna(subset = ['CRIM', 'ZN', 'INDUS'])
this first attempt just gives no output
HousingData.select("CRIM").show("CRIM")
this one gives the error message AttributeError: 'DataFrame' object has no attribute 'select'
cheers!
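There are a few problems. First, dropna returns a new DataFrame and leaves the original untouched, so you can either pass the parameter inplace=True or work with the output of the method, which in your code you named F. Because that line is only an assignment, it prints nothing by itself.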
Second, I believe you are used to R rather than python. In R (dplyr) you pick columns using select, but a pandas DataFrame has no such method; you can use either HousingData.loc[:, "my_column"] or HousingData["my_column"] instead.
The pandas documentation on DataFrame indexing has more information.
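As a rough sketch of the two selection styles, here is a toy three-row frame standing in for HousingData (the name df is only for illustration, and the values are copied from the sample rows in the question):

```python
import pandas as pd

# Toy stand-in for HousingData; values copied from the question's sample rows.
df = pd.DataFrame({
    "CRIM": [0.00632, 0.02731, 0.02729],
    "ZN": [18.0, 0.0, 0.0],
    "INDUS": [2.31, 7.07, 7.07],
})

crim_series = df["CRIM"]        # a single label returns a Series
crim_loc = df.loc[:, "CRIM"]    # the same column through .loc, also a Series
subset = df[["CRIM", "ZN"]]     # a list of labels returns a DataFrame

print(crim_series.head())
print(subset.head())
```

Passing a list of labels is probably the closest equivalent to selecting several columns at once.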
Finally, I'm not sure what you want to do with show(), but it is not a pandas method either. Depending on what you need, you can use plot, head, or values, among others:
# Option 1: drop the rows in place (modifies HousingData itself)
HousingData.dropna(subset=['CRIM', 'ZN', 'INDUS'], inplace=True)
HousingData["CRIM"].plot()  # plot the CRIM column (needs matplotlib)
# HousingData["CRIM"].head()  # show the first 5 values
# Option 2: without inplace=True, work with the returned DataFrame
F = HousingData.dropna(subset=['CRIM', 'ZN', 'INDUS'])
F["CRIM"].plot()
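Since the question also asks about resetting the row index, here is a minimal sketch that chains the three steps together. The frame housing and its missing values are made up for illustration, and the names cols and clean are hypothetical; reset_index(drop=True) is the standard pandas call for renumbering rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for HousingData, with NaN values inserted for illustration only.
housing = pd.DataFrame({
    "CRIM": [0.00632, np.nan, 0.02729, 0.03237],
    "ZN": [18.0, 0.0, np.nan, 0.0],
    "INDUS": [2.31, 7.07, 7.07, 2.18],
    "MEDV": [24.0, 21.6, 34.7, 33.4],
})

cols = ["CRIM", "ZN", "INDUS"]

clean = (
    housing[cols]              # keep only the columns of interest
    .dropna()                  # drop rows with NaN in any of those columns
    .reset_index(drop=True)    # renumber rows from 0 and discard the old index
)

print(clean)
```

Without drop=True, reset_index would keep the old row labels as an extra column named index, which is usually not what you want here.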