Cumulative count of unique values over time


I have a dataframe mydf like this:

| Country    | Year |
| ---------- | ---- |
| Bahamas    | 1982 |
| Chile      | 1817 |
| Cuba       | 1960 |
| Finland    | 1918 |
| Kazakhstan | 1993 |

etc., with many more rows.

Is there an easy way to plot the cumulative number of unique countries over time? In other words,

  • x-axis = Year (a timeline), and
  • y-axis = cumulative number of countries that have already been mentioned

I tried stat_ecdf(), but the y-axis does not show the absolute count of countries:

ggplot(mydf, aes(x = Year)) + stat_ecdf()

Here is a sample of mydf:

> dput(mydf)

structure(list(Country = c("Moldova", "Aragon", "Abu Dhabi", 
"Uzbekistan", "Sweden", "Anhalt", "Saudi Arabia", "Montenegro", 
"Central African Republic", "Bulgaria", "Argentina", "Senegal", 
"Sri Lanka", "Cambodia", "Benin", "Colombia", "Algeria", "Iraq", 
"DPRK", "Italy"), Year = c(1992L, 1223L, 1966L, 1993L, 1748L, 
1835L, 1955L, 1841L, 1959L, 1993L, 1806L, 1960L, 1955L, 1995L, 
1892L, 1914L, 1981L, 1958L, 1948L, 1900L)), row.names = c(NA, 
-20L), class = c("data.table", "data.frame"))

Solution

  • Sort by year, then give each country an ID number based on its first appearance. The cumulative count of unique countries is then the cumulative max of that ID:

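    # Sort chronologically so IDs are assigned in order of first appearance
    mydf = mydf[order(mydf$Year, mydf$Country), ]
    # unique() keeps first-appearance order, so a country's ID is its rank of first appearance
    mydf$country_id = as.integer(factor(mydf$Country, levels = unique(mydf$Country)))
    # The running max of the ID equals the number of distinct countries seen so far
    mydf$cum_n_country = cummax(mydf$country_id)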
    

    If a year appears more than once, aggregate/summarize to the max cum_n_country per year before plotting.

    library(dplyr)
    library(ggplot2)
    mydf %>%
      group_by(Year) %>%
      summarize(cum_n_country = max(cum_n_country)) %>%
      ggplot(aes(x = Year, y = cum_n_country)) + 
      geom_line()
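
As a cross-check on the approach above, the same running count can probably also be computed in base R with cumsum(!duplicated(...)), which adds 1 only the first time a country appears. This is a minimal sketch, not the answer's method. The toy data frame below is hypothetical (it reuses a few country names from the question but adds repeated countries and repeated years on purpose), and the names toy and per_year are made up for illustration.

```r
# Hypothetical toy data: some countries and years are deliberately repeated
toy <- data.frame(
  Country = c("Chile", "Cuba", "Chile", "Finland", "Cuba", "Bahamas"),
  Year    = c(1817L, 1960L, 1960L, 1918L, 1982L, 1982L)
)

# Sort chronologically (ties broken by country name), as in the answer above
toy <- toy[order(toy$Year, toy$Country), ]

# duplicated() is FALSE on the first occurrence of each country,
# so the cumulative sum of its negation counts distinct countries seen so far
toy$cum_n_country <- cumsum(!duplicated(toy$Country))

# Same result via the ID / cummax approach, for comparison
country_id <- as.integer(factor(toy$Country, levels = unique(toy$Country)))
stopifnot(all(cummax(country_id) == toy$cum_n_country))

# One row per year: keep the largest running count reached within that year
per_year <- aggregate(cum_n_country ~ Year, data = toy, FUN = max)
print(per_year)
```

The resulting per_year data frame has the same two columns (Year, cum_n_country) as the summarized data in the answer, so it can be handed to the same ggplot(aes(x = Year, y = cum_n_country)) + geom_line() call.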