
Statistical and hierarchical analysis across three fields


I have a dataset (an Excel file) with three fields: District (string), Land Use (string), and Temperature (numeric). The number of distinct District and Land Use values is limited, while the Temperature values vary widely. The file contains thousands of records.

A partial sample looks like the table below:

| District| Land Use    | Temperature |
|---------|-------------|-------------|
| B       | Desert      | 43.3        |
| A       | Residential | 23.1        |
| C       | Forest      | 14.6        |
| B       | Forest      | 18.3        |
| A       | Wetland     | 15.8        |
| B       | Residential | 25.9        |
| C       | Agricultural| 37.0        |
| A       | Residential | 29.1        |
| B       | Desert      | 44.5        |
| C       | Residential | 31.6        |
| A       | Forest      | 17.4        |
| B       | Residential | 23.2        |
| A       | Forest      | 18.8        |
| C       | Agricultural| 36.7        |
| A       | Residential | 29.2        |
| C       | Forest      | 17.6        |
| A       | Agricultural| 36.9        |
| B       | Desert      | 15.5        |
| ...     | ...         | ...         |
| H       | Residential | 26.9        |
| I       | Agricultural| 27.0        |
| N       | Residential | 22.1        |
| B       | Desert      | 47.5        |

Is there an automatic way to group (cluster) the entire dataset so that each district is described statistically by its own land use (mean, median, standard deviation, etc.)?

I want to get something like this:

Temperature District A
                  Residential   mean = xxx , Std. = xxx
                  Agricultural  mean = xxx , Std. = xxx
                  Forest        mean = xxx , Std. = xxx
                  Wetland       mean = xxx , Std. = xxx
Temperature District B
                  Residential   mean = xxx , Std. = xxx
                  Agricultural  mean = xxx , Std. = xxx
                  Forest        mean = xxx , Std. = xxx
                  Desert        mean = xxx , Std. = xxx
Temperature District C
                  Residential   mean = xxx , Std. = xxx
                  Agricultural  mean = xxx , Std. = xxx
                  Forest        mean = xxx , Std. = xxx
....
Temperature District N
                  Residential   mean = xxx , Std. = xxx
                  Agricultural  mean = xxx , Std. = xxx
                  Forest        mean = xxx , Std. = xxx

Solution

  • Although it's not exactly the format you specified, you can compute the mean and standard deviation for every District/Land Use combination and store the result in a DataFrame with groupby() and agg(). agg() accepts several summary functions at once.

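    import pandas as pd

    # Small sample taken from the first rows of the question's table
    data = {'District': ['B', 'A', 'C', 'B', 'A', 'B', 'C'],
            'Land Use': ['Desert', 'Residential', 'Forest', 'Forest', 'Wetland', 'Residential', 'Agricultural'],
            'Temperature': [43.3, 23.1, 14.6, 18.3, 15.8, 25.9, 37.0]
           }

    df = pd.DataFrame(data)

    # Group by district, then by land use, and summarize the temperatures
    df_stats = df.groupby(['District', 'Land Use'])['Temperature'].agg(['mean', 'std'])
    print(df_stats)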
    

    Output:

                           mean   std
    District Land Use                
    A        Residential   23.1   ...
             Wetland       15.8   ...
    B        Desert        43.3   ...
             Forest        18.3   ...
             Residential   25.9   ...
    C        Agricultural  37.0   ...
             Forest        14.6   ...