Search code examples
rchartsbar-charthistogramfrequency

Create a histogram of aging


I am looking to create a histogram of aging. By aging, I'm referring to the accounting report of tracking an asset like accounts payable into bins of 30 days, 60 days, 90 days, and 120+ days old.

In my case, however, I need to track how many days has it been since statistics were last created for a table in a database using the same bins as accountants do. I hope that makes sense.

The data collected by my company's script has 2 variables of unpredictable amounts of observations. The two variables are NUM_TABLES (this is the number of tables updated) and STATS_DATE (this is the YYYY-MM-DD that statistics were last updated on that amount of tables).

> head(df)
  STATS_DATE NUM_TABLES
1   20210908          5
2   20240814        193
3   20240815        746

I would like to report this information in a histogram using the bins 30, 60, 90, 120+. The final chart should look something like the following idealized graph that doesn't represent the sample data above:

enter image description here

I am able to calculate the number of days from the target date.

# target date formatted to database's timezone  
DATE <- parse_date_time("2024-08-16", "ymd", tz = "US/Central")

# calculate difference between target date and date column in days
df$DAYS <- as.numeric(DATE - df$STATS_DATE)

What I can't seem to do is bring it all together using NUM_TABLES as the frequency.

Any help is greatly appreciated.

I've tried using R base hist function as well as the ggplot2 function. I've researched sites like StackOverflow, Statology, and others for various features and key words. Though, being new I do struggle with my lack of understanding for the R terminology.

UPDATE: Here is the output from my data as requested by Edward.

> dput(df)
structure(list(STATS_DATE = c(20210908L, 20240814L, 20240815L
), NUM_TABLES = c(5L, 193L, 746L)), class = "data.frame", row.names = c(NA, 3L))
> sapply(df, mode)
STATS_DATE NUM_TABLES 
 "numeric"  "numeric" 

I did apply the tag bar chart to this post because I was not sure if a histogram would be possible. Edward's suggestion is perfectly valid, and I really appreciate the input.

At this point I'm having some trouble with applying the histogram.

I believe my issue is with the STATS_DATE column.

When I use the line stats_dates <- stats_dates[stats_dates <= ref_date] it deletes all the data. When I remove that line I get the following error message: Error in hist.default(unclass(x), unclass(breaks), plot = FALSE, warn.unused = FALSE, : some 'x' not counted; maybe 'breaks' do not span range of 'x'.

Here is the code based on the responses, I know I'm doing something wrong prior to ref_date <- as.Date("2024-08-16").

raw <- read.csv("data/stats.del", header=FALSE, sep=",")

df <- data.frame(na.omit(raw))  ## remove rows from temp tables

colnames(df) <- c('STATS_DATE','NUM_TABLES')

# convert date column to correct timezone and format
df$STATS_DATE <- parse_date_time(df$STATS_DATE, "ymd", tz = "US/Central") ## I think my issue is here 

ref_date <- as.Date("2024-08-16")

stats_dates <- Map(rep, df$STATS_DATE, df$NUM_TABLES) |> unlist() |> as.Date()

# create the bins
bins <- seq.int(0, 120, length.out=5L)
names(bins) <- replace(bins, length(bins), paste0(bins[length(bins)], '+'))

# check bins
bins

# create the histogram
h <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)], 
             ref_date - bins[length(bins)]) |> 
  hist(breaks=ref_date - bins, freq=TRUE, xlab='Days Old', ylab='Num. Tables',
       xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat', rev=TRUE),
       main='')
mtext(text=names(bins)[-1L], side=1, line=1, at=h$mids)

# check bin counts
h$counts |> setNames(names(bins)[-1L])

Solution

  • First, you could use rep in Map to expand the date tables in a date vector stats_dates.

    > ref_date <- as.Date("2024-08-16")
    > 
    > stats_dates <- Map(rep, df$STATS_DATE, df$NUM_TABLES) |> unlist() |> as.Date()
    > stats_dates <- stats_dates[stats_dates <= ref_date]  ## delete everything after ref_date
    

    Define the bins.

    > bins <- seq.int(0, 120, length.out=5L)
    > ## naming the bins makes life easier:
    > names(bins) <- replace(bins, length(bins), paste0(bins[length(bins)], '+'))
    > bins
       0   30   60   90 120+ 
       0   30   60   90  120 
    

    Next, use replace to censor dates before ref_date - 120 days at ref_date - 120 to get'em in the same bin (aka breaks). As histogram-breaks= then use ref_date - bins. When using hist() on a "Date" object, graphics:::hist.Date is dispatched.

    > h <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)], 
    +              ref_date - bins[length(bins)]) |> 
    +   hist(breaks=ref_date - bins, freq=TRUE, xlab='Days Old', ylab='Num. Tables',
    +        xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat'),
    +        main='')
    > mtext(text=rev(names(bins)[-1L]), side=1, line=1, at=h$mids)
    

    enter image description here

    hist() invisibly throws stats, which is e.g. useful to extract counts to see how the bins are filled. We already used the 'mids' in mtext above.

    > h$counts |> setNames(rev(names(bins)[-1L]))
    120+   90   60   30 
    6182 2695 2264 4393  
    

    Inverse:

    > g <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)], 
    +              ref_date - bins[length(bins)]) |> 
    +   as.integer() |> base::`*`(-1L) |> 
    +   hist(breaks=-as.numeric(ref_date - bins), freq=TRUE, xlab='Days Old', ylab='Num. Tables',
    +        xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat', rev=TRUE),
    +        main='')
    > mtext(text=names(bins)[-1L], side=1, line=1, at=g$mids)
    > g$counts |> setNames(names(bins)[-1L])
      30   60   90 120+ 
    4393 2552 2407 6182 
    

    enter image description here


    Data:

    set.seed(42)
    n <- 50
    df <- data.frame(
      STATS_DATE=sample(seq(as.Date("2024-03-01"), as.Date("2024-10-03"), 
                            by="day"), n, replace=TRUE),
      NUM_TABLES=round(runif(n, 5, 746))
    ) |> sort_by(~STATS_DATE) |> `rownames<-`(NULL)