I have a dataset (an Excel file) with three fields: District (string), Land Use (string), and Temperature (numeric). The number of distinct District and Land Use values is limited, while the Temperature values vary widely.
The dataset contains thousands of records.
Part of it looks like the table below:
| District| Land Use | Temperature |
|---------|-------------|-------------|
| B | Desert | 43.3 |
| A | Residential | 23.1 |
| C | Forest | 14.6 |
| B | Forest | 18.3 |
| A | Wetland | 15.8 |
| B | Residential | 25.9 |
| C | Agricultural| 37.0 |
| A | Residential | 29.1 |
| B | Desert | 44.5 |
| C | Residential | 31.6 |
| A | Forest | 17.4 |
| B | Residential | 23.2 |
| A | Forest | 18.8 |
| C | Agricultural| 36.7 |
| A | Residential | 29.2 |
| C | Forest | 17.6 |
| A | Agricultural| 36.9 |
| B | Desert | 15.5 |
....
| H | Residential | 26.9 |
| I | Agricultural| 27.0 |
| N | Residential | 22.1 |
| B | Desert | 47.5 |
Is there an automatic way to group the entire dataset so that each district is described statistically by its own land uses (mean, median, standard deviation, etc.)?
I would like to get something like this:
Temperature District A
Residential mean = xxx , Std. = xxx
Agricultural mean = xxx , Std. = xxx
Forest mean = xxx , Std. = xxx
Wetland mean = xxx , Std. = xxx
Temperature District B
Residential mean = xxx , Std. = xxx
Agricultural mean = xxx , Std. = xxx
Forest mean = xxx , Std. = xxx
Desert mean = xxx , Std. = xxx
Temperature District C
Residential mean = xxx , Std. = xxx
Agricultural mean = xxx , Std. = xxx
Forest mean = xxx , Std. = xxx
....
Temperature District N
Residential mean = xxx , Std. = xxx
Agricultural mean = xxx , Std. = xxx
Forest mean = xxx , Std. = xxx
Although it's not exactly in the format you specified, you can get the mean and std for every district and land use combination and save them to a DataFrame with `groupby()` and `agg()`. `agg()` supports multiple summary functions at once.
import pandas as pd

# A few sample rows from the question
data = {'District': ['B', 'A', 'C', 'B', 'A', 'B', 'C'],
        'Land Use': ['Desert', 'Residential', 'Forest', 'Forest', 'Wetland', 'Residential', 'Agricultural'],
        'Temperature': [43.3, 23.1, 14.6, 18.3, 15.8, 25.9, 37.0]
        }
df = pd.DataFrame(data)
# Group by district and land use, then compute mean and std of the temperature
df_stats = df.groupby(['District', 'Land Use'])['Temperature'].agg(['mean', 'std'])
Output:
mean std
District Land Use
A Residential 23.1 ...
Wetland 15.8 ...
B Desert 43.3 ...
Forest 18.3 ...
Residential 25.9 ...
C Agricultural 37.0 ...
Forest 14.6 ...
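If you want something closer to the layout in the question, the same grouped result can be looped over and printed district by district. The following is only a minimal sketch: the `format_report` helper and the file name in the comment are hypothetical names I made up for illustration, and the sample rows are copied from the question. It also adds `median` and `count` to the list passed to `agg()`.

```python
import pandas as pd

# Sample rows copied from the question. With the real file this would be
# something like df = pd.read_excel("temperatures.xlsx") (hypothetical file name).
data = {
    "District": ["B", "A", "C", "B", "A", "B", "C", "A", "B"],
    "Land Use": ["Desert", "Residential", "Forest", "Forest", "Wetland",
                 "Residential", "Agricultural", "Residential", "Desert"],
    "Temperature": [43.3, 23.1, 14.6, 18.3, 15.8, 25.9, 37.0, 29.1, 44.5],
}
df = pd.DataFrame(data)

# One row per (District, Land Use) pair, one column per statistic
stats = df.groupby(["District", "Land Use"])["Temperature"].agg(
    ["mean", "median", "std", "count"]
)


def format_report(stats):
    """Hypothetical helper: render the grouped statistics as plain text."""
    lines = []
    # Iterate over the first index level so each district gets its own header
    for district, block in stats.groupby(level="District"):
        lines.append(f"Temperature District {district}")
        for (_, land_use), row in block.iterrows():
            lines.append(
                f"{land_use} mean = {row['mean']:.2f} , Std. = {row['std']:.2f}"
            )
    return "\n".join(lines)


report = format_report(stats)
print(report)
```

Note that pandas computes the sample standard deviation by default (`ddof=1`), so any district and land use combination with only one record gets `NaN` for `std`. The `count` column makes those cases easy to spot.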