Tags: r, data-structures, nlp, tidyverse

How to find pattern match in the occurrence of two separate columns in R


I have a dataset with two columns, Name and Age. It is a very big dataset, and it looks something like this:

Name: A, A, A, B, B, E, E, E, E, E

Age: 10, 10, 10, 15, 14, 20, 20, 20, 19

I want to find out how many times these two columns, Name and Age, do not co-occur consistently. Basically, I want to check whether each person's name always matches the same age. For instance, the B who is 15 years old and the B who is 14 years old could be different people.


Solution

  • If I understand the question, you're looking to see how many different ages each name has in the data.

    One dplyr approach would be to identify those distinct combinations of age and name, and then count by name. This tells us A has only one age, while B and E each have two.

    library(dplyr)
    my_data %>%
      distinct(name, age) %>%  # one row per unique name/age pair
      count(name)              # number of distinct ages per name
    
      name n
    1    A 1
    2    B 2
    3    E 2
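
    If the goal is a single number for how many names show up with conflicting ages, one possible follow-up is to filter that count down to the names with more than one age. This is only a sketch: it assumes the sample data from the end of this answer, and the object name inconsistent is made up for illustration.

    ```r
    library(dplyr)

    # Sample data, copied from the "Sample data" section of this answer
    my_data <- data.frame(
      name = c("A", "A", "A", "B", "B", "E", "E", "E", "E", "E"),
      age  = c(10, 10, 10, 15, 14, 20, 20, 20, 19, 20)
    )

    # Keep only the names that appear with more than one distinct age
    # ("inconsistent" is a hypothetical name used for this example)
    inconsistent <- my_data %>%
      distinct(name, age) %>%
      count(name) %>%
      filter(n > 1)

    inconsistent

    # Number of names with conflicting ages: 2 for this sample (B and E)
    nrow(inconsistent)
    ```

    Whether a name with two ages means two different people or a data entry error is something only the data owner can decide; the code just flags the candidates.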
    

    If you want more info about what those combinations are, you could use add_count to keep all the combinations, plus the count by name.

    my_data %>%
      distinct(name, age) %>%  # one row per unique name/age pair
      add_count(name)          # keep each pair and add the per-name count
    
      name age n
    1    A  10 1
    2    B  15 2
    3    B  14 2
    4    E  20 2
    5    E  19 2
    

    Sample data

    Please note, it is best practice to include in your question the code to generate a specific sample data object. This reduces redundant work for people who want to help you, and reduces ambiguity (e.g. in your example there aren't as many ages as names).

    my_data <- data.frame(
      name = c("A", "A", "A", "B", "B", "E", "E", "E", "E", "E"),
      age  = c(10, 10, 10, 15, 14, 20, 20, 20, 19, 20)
    )
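
    As a cross-check that does not need dplyr, the same per-name counts can probably be reproduced in base R with tapply. This is a sketch under the assumption that the data looks like the sample above; ages_per_name is a name chosen here for illustration.

    ```r
    # Same sample data as above, repeated so this snippet runs on its own
    my_data <- data.frame(
      name = c("A", "A", "A", "B", "B", "E", "E", "E", "E", "E"),
      age  = c(10, 10, 10, 15, 14, 20, 20, 20, 19, 20)
    )

    # For each name, count how many distinct ages occur
    ages_per_name <- tapply(my_data$age, my_data$name,
                            function(x) length(unique(x)))
    ages_per_name

    # Names that occur with more than one age
    names(ages_per_name)[ages_per_name > 1]
    ```

    For this sample the counts agree with the dplyr output: one age for A, two each for B and E.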