I am looking to create a histogram of aging. By aging, I'm referring to the accounting report of tracking an asset like accounts payable into bins of 30 days, 60 days, 90 days, and 120+ days old.
In my case, however, I need to track how many days has it been since statistics were last created for a table in a database using the same bins as accountants do. I hope that makes sense.
The data collected by my company's script has 2 variables of unpredictable amounts of observations. The two variables are NUM_TABLES (this is the number of tables updated) and STATS_DATE (this is the YYYY-MM-DD that statistics were last updated on that amount of tables).
> head(df)
STATS_DATE NUM_TABLES
1 20210908 5
2 20240814 193
3 20240815 746
I would like to report this information in a histogram using the bins 30, 60, 90, 120+. The final chart should look something like the following idealized graph that doesn't represent the sample data above:
I am able to calculate the number of days from the target date.
# target date formatted to database's timezone
DATE <- parse_date_time("2024-08-16", "ymd", tz = "US/Central")
# calculate difference between target date and date column in days
df$DAYS <- as.numeric(DATE - df$STATS_DATE)
What I can't seem to do is bring it all together using NUM_TABLES as the frequency.
Any help is greatly appreciated.
I've tried using R base hist function as well as the ggplot2 function. I've researched sites like StackOverflow, Statology, and others for various features and key words. Though, being new I do struggle with my lack of understanding for the R terminology.
UPDATE: Here is the output from my data as requested by Edward.
> dput(df)
structure(list(STATS_DATE = c(20210908L, 20240814L, 20240815L
), NUM_TABLES = c(5L, 193L, 746L)), class = "data.frame", row.names = c(NA, 3L))
> sapply(df, mode)
STATS_DATE NUM_TABLES
"numeric" "numeric"
I did apply the tag bar chart to this post because I was not sure if a histogram would be possible. Edward's suggestion is perfectly valid, and I really appreciate the input.
At this point I'm having some trouble with applying the histogram.
I believe my issue is with the STATS_DATE column.
When I use the line stats_dates <- stats_dates[stats_dates <= ref_date]
it deletes all the data. When I remove that line I get the following error message: Error in hist.default(unclass(x), unclass(breaks), plot = FALSE, warn.unused = FALSE, : some 'x' not counted; maybe 'breaks' do not span range of 'x'
.
Here is the code based on the responses, I know I'm doing something wrong prior to ref_date <- as.Date("2024-08-16")
.
raw <- read.csv("data/stats.del", header=FALSE, sep=",")
df <- data.frame(na.omit(raw)) ## remove rows from temp tables
colnames(df) <- c('STATS_DATE','NUM_TABLES')
# convert date column to correct timezone and format
df$STATS_DATE <- parse_date_time(df$STATS_DATE, "ymd", tz = "US/Central") ## I think my issue is here
ref_date <- as.Date("2024-08-16")
stats_dates <- Map(rep, df$STATS_DATE, df$NUM_TABLES) |> unlist() |> as.Date()
# create the bins
bins <- seq.int(0, 120, length.out=5L)
names(bins) <- replace(bins, length(bins), paste0(bins[length(bins)], '+'))
# check bins
bins
# create the histogram
h <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)],
ref_date - bins[length(bins)]) |>
hist(breaks=ref_date - bins, freq=TRUE, xlab='Days Old', ylab='Num. Tables',
xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat', rev=TRUE),
main='')
mtext(text=names(bins)[-1L], side=1, line=1, at=h$mids)
# check bin counts
h$counts |> setNames(names(bins)[-1L])
First, you could use rep
in Map
to expand the date tables in a date vector stats_dates
.
> ref_date <- as.Date("2024-08-16")
>
> stats_dates <- Map(rep, df$STATS_DATE, df$NUM_TABLES) |> unlist() |> as.Date()
> stats_dates <- stats_dates[stats_dates <= ref_date] ## delete everything after ref_date
Define the bins.
> bins <- seq.int(0, 120, length.out=5L)
> ## naming the bins makes life easier:
> names(bins) <- replace(bins, length(bins), paste0(bins[length(bins)], '+'))
> bins
0 30 60 90 120+
0 30 60 90 120
Next, use replace
to censor dates before ref_date - 120 days at ref_date - 120
to get'em in the same bin (aka breaks). As histogram-breaks=
then use ref_date - bins
. When using hist()
on a "Date"
object, graphics:::hist.Date
is dispatched.
> h <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)],
+ ref_date - bins[length(bins)]) |>
+ hist(breaks=ref_date - bins, freq=TRUE, xlab='Days Old', ylab='Num. Tables',
+ xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat'),
+ main='')
> mtext(text=rev(names(bins)[-1L]), side=1, line=1, at=h$mids)
hist()
invisibly throws stats, which is e.g. useful to extract counts to see how the bins are filled. We already used the 'mids'
in mtext
above.
> h$counts |> setNames(rev(names(bins)[-1L]))
120+ 90 60 30
6182 2695 2264 4393
Inverse:
> g <- replace(stats_dates, stats_dates <= ref_date - bins[length(bins)],
+ ref_date - bins[length(bins)]) |>
+ as.integer() |> base::`*`(-1L) |>
+ hist(breaks=-as.numeric(ref_date - bins), freq=TRUE, xlab='Days Old', ylab='Num. Tables',
+ xaxt='n', las=1, col=hcl.colors(length(bins) - 1L, 'heat', rev=TRUE),
+ main='')
> mtext(text=names(bins)[-1L], side=1, line=1, at=g$mids)
> g$counts |> setNames(names(bins)[-1L])
30 60 90 120+
4393 2552 2407 6182
Data:
set.seed(42)
n <- 50
df <- data.frame(
STATS_DATE=sample(seq(as.Date("2024-03-01"), as.Date("2024-10-03"),
by="day"), n, replace=TRUE),
NUM_TABLES=round(runif(n, 5, 746))
) |> sort_by(~STATS_DATE) |> `rownames<-`(NULL)