Search code examples
daterangefilteringzoo

Filter out a date range from a yearmon variable in R


The yearmon class from the zoo package seems particularly useful for working with data that only has a year and month, but no day or time. However, I find it difficult to work with, particularly when trying to filter out a range of dates.

yearmon is kind of a weird data class. It displays dates like Jun 2002. This is more aesthetically pleasing, but (correct me if I'm wrong) underneath it's the year and a fraction of the month, with January being 0 and December being 11. So, Jun 2002, for example, would actually be 2002.417 (i.e., 5/12) under the hood.

Let's assume now that you have monthly data over a ten year period (2000-2009), kinda like this.

       date  value
   <yearmon> <dbl>
 1 Jan 2000  10000
 2 Feb 2000  30000
 3 Mar 2000    250
 4 Apr 2000     20
 5 May 2000  50000
 6 Jun 2000 -90042
 7 Jul 2000  73400
 8 Aug 2000   4317
 9 Sep 2000   1000
10 Oct 2000    -22

You want to filter out rows that fall between June 2002 and November 2003 and keep everything else in your data set.

%in% is good for keeping only a specific range of values, but I often find taking the inverse of this function just as useful, like the following.

%!in% <- Negate(%in%)

My understanding is that you can't work with what yearmon displays. Meaning, you can't actually reference dates like "Jun 2003", but instead you need to work with what's stored under the hood (i.e., 2002.417). I would think it'd be sufficient to use filter() from dplyr to filter out the date range through something like the following

my_data |>
  filter(date %!in% c(2002.417:2003.833))

However, I find that this doesn't seem to work, as the months from June 2002 to November 2003 won't be removed. I think it might have something to do with the colon not working with yearmon somehow? I tried only removing June 2002 (i.e., removed the colon and 2003.833) and it was able to remove that single date. However, I find it doesn't like it when you try to specify a range of dates to remove. If the dates were numeric, say, written as 200206 and 200311, I've found that running my above code works in removing that period.

Does anyone know how to filter out a range of dates in class yearmon? I'm surprised that there doesn't seem to be much stuff about filtering out date ranges (usually I find online tutorials only show how to keep a specific range, rather than exclude).

Also, because I don't see a lot of discussion on it either, what are peoples thoughts on the yearmon class from zoo? Do you find it useful? Many tutorials only seem to use yearmon for extracting month or year into a separate column. Am I wrong from keeping dates with only a month and year in class yearmon? Is it better to store them as class date and just add 01 at the end for the day so it's like 2002-06-01?

I've tried using !=, instead of %!in%. I also tried feeding in c("Jun 2002": "Nov 2003") but I get an NA/NAN error message. I've also tried converting it to a date class and using c("2002-06-01:"2003-11-01") but I think I also got a NA/NAN error, even after dropping the quotation marks and swapping the hyphens for slashes c("2002/06/01:"2003/11/01"). I also tried dropping the concatenate sign and brackets in case the numeric date values were sufficient, but that didn't work either. I also tried playing with the Negate of the between() function, but found that it didn't work and was making the code more complicated than it ought to be.


Solution

  • I have never used yearmon before, but I believe your error is stemming from the use of :. Using a colon like 2000:2003 will produce integer values from 2000 to 2003. Using non-integer values increments by 1 until the end point is exceeded. This would not produce the desired matches in your situation.

    c(2002.417:2003.833)
    [1] 2002.417 2003.417
    

    The function you are looking for is seq().

     x <- seq(from = 2002.417, to = 2003.833, by = 1/12)
     x
     [1] 2002.417 2002.500 2002.584 2002.667 2002.750 2002.834 2002.917 2003.000 2003.084
    [10] 2003.167 2003.250 2003.334 2003.417 2003.500 2003.584 2003.667 2003.750
    
     as.yearmon(x)
     [1] "Jun 2002" "Jul 2002" "Aug 2002" "Sep 2002" "Oct 2002" "Nov 2002" "Dec 2002"
     [8] "Jan 2003" "Feb 2003" "Mar 2003" "Apr 2003" "May 2003" "Jun 2003" "Jul 2003"
    [15] "Aug 2003" "Sep 2003" "Oct 2003"
    

    You'll notice Nov 2003 (or 2003+((11-1)/12) = 2003.834) is missing. This is because seq() cannot exceed the to parameter. Increasing the to parameter by a small amount will remedy that issue.

    Using the appropriate seq() call in place of c() in your filter() should return the correct subset of data.