python pandas analysis

Split numeric values into given range


I have a data frame with a column named 'age'. The ages range from 6 to 90. Is there a way to group the ages into intervals such as '5-9', '10-14', etc., so that we can display the age ranges on a graph instead of individual ages?


Solution

  • I am not an expert, but this is what I put together:

    • The pandas cut function is one way to go (see the example code below)
    • When I tried to add labels I ran into an issue: pd.cut ignores its labels argument when bins is an IntervalIndex. This post helped, which is why I used map
    • The code below assumes all age values are integers
    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Sample data (to have something to work with in this example)
    data = {'age': [6, 10, 12, 15, 20, 22, 25, 30, 35, 52, 53, 54, 55, 60, 65, 70, 75, 84, 85, 90]}
    df = pd.DataFrame(data)
    
    # Define the age ranges 
    ## List of tuples, each holding the min and max value of one range
    ## (the stop value 91 makes the last range 90-94, so an age of 90 is covered)
    age_ranges = [(start, start + 4) for start in range(5, 91, 5)]
    ## [(5, 9), (10, 14), (15, 19), .....]
    
    # Widen each range by 0.1 on both sides so the integer endpoints fall inside (4.9 < 5, 9 < 9.1, etc.)
    # This only works for integer ages: a float value such as 9.5 falls into the gap between two bins and becomes NaN
    ranges_adjusted = [(low - 0.1, high + 0.1) for low, high in age_ranges]
    # [(4.9, 9.1), (9.9, 14.1), (14.9, 19.1), .....]
    
    # Define the bins
    bins = pd.IntervalIndex.from_tuples(ranges_adjusted)
    
    # Define "nice-looking" labels (otherwise the x axis would read "(4.9, 9.1] (9.9, 14.1] .....")
    labels = [f"{start}-{end}" for start, end in age_ranges]
    
    # Use pd.cut to group ages into ranges, then map each interval to its label
    df['age_range'] = pd.cut(df['age'], bins=bins).map(dict(zip(bins, labels)))
    
    # Count the occurrences of each age range
    age_counts = df['age_range'].value_counts().sort_index()
    
    # Plotting the data
    age_counts.plot(kind='bar', rot=0)
    plt.xlabel('Age Range')
    plt.ylabel('Count')
    plt.title('Age Distribution')
    plt.show()
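
    If the ages might not be integers, a possibly simpler variant is to pass plain integer edges to pd.cut with right=False, so each bin is the half-open interval [start, start + 5) and the labels can be passed directly. This is only a sketch: the sample values and the names edges and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical sample that mixes integer and float ages
df = pd.DataFrame({'age': [6, 9.5, 10, 14.9, 15, 52, 89, 90]})

# Integer bin edges 5, 10, ..., 95
edges = list(range(5, 96, 5))

# One label per bin: "5-9", "10-14", ..., "90-94"
labels = [f"{start}-{start + 4}" for start in edges[:-1]]

# right=False closes each bin on the left and opens it on the right,
# so 9.5 lands in "5-9" and 10 lands in "10-14"
df['age_range'] = pd.cut(df['age'], bins=edges, right=False, labels=labels)

# Count per range, kept in bin order (ranges with no rows show a count of 0)
age_counts = df['age_range'].value_counts().sort_index()
print(age_counts)
```

    Because the bins are given as edges rather than an IntervalIndex, the labels argument is honoured here and the extra map step is not needed. age_counts can be plotted with the same plot call as above.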
    
    

    Again, not an expert here. I will be more careful next time posting an answer! See you around :)