r dataframe design-patterns row analysis

How to recognize unknown patterns in a data frame by row?


I have a data frame holding agricultural use codes (1-5) for 15 consecutive years. Each row is a polygon representing a field. Ultimately I need R to go through the rows, recognize patterns of use, and tell me how often each one occurs. Unfortunately, my real data set has over 1 million features, so the possible patterns are not known in advance.

# Toy data: 500 fields, one use code per year for 2005-2019
a <- data.frame(replicate(15, sample(0:5, 500, replace = TRUE)))
colnames(a) <- paste0("use", 2005:2019)
id <- 1:500
a <- cbind(id, a)

id use2005 use2006 use2007 use2008 use2009 use2010 use2011 use2012 use2013 use2014 use2015 ...
1  1       1       1       1       1       2       2       1       4       4       4       ...
2  4       4       4       4       5       5       5       0       5       5       5       ...
3  1       4       3       2       3       2       4       5       1       1       1       ...
4  1       1       1       1       1       2       2       1       4       4       4       ...
5  4       2       2       2       2       5       3       3       3       3       3       ...

So in this arbitrary example, the code should recognize that ids 1 and 4 have the same pattern.

In the end I imagine the result to be some sort of frequency distribution to see if there are certain patterns in the agricultural use of my fields.

For example:

1 1 1 1 1 2 1 1 1 3 2 4 1 1 1

[50] - occurs 50 times

5 5 5 5 5 1 1 1 1 4 4 4 2 2 3

[35] - occurs 35 times

and so forth with all existing combinations...

Unfortunately I have no idea how to approach this. I have no experience with pattern recognition.

Thank you!


Solution

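  • Maybe this? Grouping by every use column and counting the rows in each group gives the frequency of each pattern: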

    library(tidyverse)
    # Drop the id column, group by all remaining columns, count rows per group
    a[, -1] %>% group_by_all() %>% count()
    #  use2005 use2006 use2007 use2008 use2009 use2010 use2011 use2012 use2013 use2014 use2015     n
    #     <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int> <int>
    # 1       1       1       1       1       1       2       2       1       4       4       4     2
    # 2       1       4       3       2       3       2       4       5       1       1       1     1
    # 3       4       2       2       2       2       5       3       3       3       3       3     1
    # 4       4       4       4       4       5       5       5       0       5       5       5     1
    

    Or, if you want to keep track of which fields share a pattern, switch to group_by_at, exclude id from the grouping, and paste the ids of each group together:

    a %>%
      group_by_at(vars(-id)) %>%
      summarise(n = n(), ids = paste(id, collapse= "," ))
    #   use2005 use2006 use2007 use2008 use2009 use2010 use2011 use2012 use2013 use2014 use2015     n ids  
    #     <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int> <int> <chr>
    # 1       1       1       1       1       1       2       2       1       4       4       4     2 1,4  
    # 2       1       4       3       2       3       2       4       5       1       1       1     1 3    
    # 3       4       2       2       2       2       5       3       3       3       3       3     1 5    
    # 4       4       4       4       4       5       5       5       0       5       5       5     1 2
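
The same counting idea can also be sketched in base R, which may be handy if loading the tidyverse is not an option. This is only a minimal sketch under some assumptions: the toy data follows the layout from the question, row 4 is deliberately overwritten with row 1 so that a known duplicate exists, and the names `pattern` and `freq` are made up for illustration.

```r
# Toy data in the question's layout (random values, for illustration only)
set.seed(42)
a <- data.frame(replicate(15, sample(0:5, 500, replace = TRUE)))
colnames(a) <- paste0("use", 2005:2019)
a <- cbind(id = 1:500, a)

# Assumption for the demo: make field 4 repeat the pattern of field 1
a[4, -1] <- a[1, -1]

# Collapse the 15 yearly codes of each row into one string key
pattern <- do.call(paste, c(a[, -1], sep = " "))

# Count how often each distinct pattern occurs, most frequent first
freq <- sort(table(pattern), decreasing = TRUE)
head(freq)

# Ids of the fields that share each pattern
ids_by_pattern <- split(a$id, pattern)
```

Here `freq[freq > 1]` would list only the patterns that occur more than once, and `ids_by_pattern[[pattern[1]]]` would return the ids of all fields that share the pattern of field 1.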