Search code examples
rdataframemergedplyrrow-number

Creating a dataframe by using row numbers of other data frames in R


I would like to know that whether there is a way in R for using datasets with similar names. For example:

I have eight datasets named cars_1995, cars_new_1995, cars_1996, cars_new_1996, cars_1997, cars_new_1997, cars_1998 and cars_new_1998 and they have different dimensions but only thing that I want to use is their row numbers.

So I want to create a new data frame and rownames will be the years (1995,1996,1997,1998) and colnames will be the groups (cars, cars_new). the dimension of the data frame will be 4:2. The values of the dataframe will be the rownumbers such as:

       Cars  Cars_new
  1995  25  76 
  1996  38  35 
  1997  87  49 
  1998  34  50

First question is, in the R, is there a way to use the names of dataframes like I need to take row numbers of all data frames from 1995 to 1999 (one column for cars_YEAR, the other column for cars_YEAR_new). By the way only year indicator is the names of the data frames.

I'm OK with how to build a data frame or how to get the rownumbers (nrow) However, I haven't figured out how to pull all datasets by just using embed years of the dataframes' names consecutively.So it would be great how to get all cars_YEAR datasets row numbers by using the year extension in the datasets names.


Solution

  • Solution using stringr + dplyr + tidyr:

    library(stringr)
    library(dplyr)
    library(tidyr)
    
    ls() %>%
      str_extract("^cars_\\d{4}.*") %>%
      na.omit() %>%
      mget(envir = globalenv()) %>%
      lapply(nrow) %>%
      data.frame() %>%
      gather(key, value) %>%
      mutate(key = str_replace(key, "(\\d{4})_(.+)", "\\2_\\1")) %>%
      separate(key, c("key", "year"), "_(?=\\d)") %>%
      spread(key, value)
    

    Result:

      year cars cars_new
    1 1995  114      102
    2 1996  113       61
    3 1997  168      186
    4 1998  196       68
    

    Notes:

    1. ls() is handy for this example as it can extract all object names in the global environment.
    2. Using str_extract, I extracted only object names that starts with cars_[4 digits] then fed the result into mget
    3. mget takes a vector of character object names and returns the values from a specified environment in a named list
    4. lapply is used to extract row numbers in each element (dataset) of the list. This is then converted to a data.frame with each column corresponding to a row number.
    5. gather reshapes the data.frame from wide to long format.
    6. The regular expression in mutate and separate are respectively to change names from cars_1995_news to cars_new_1995 and separate into a column of cars_new and year
    7. The final spread reshapes the data.frame to the desired format.

    Note about data:

    I used rnorm and sample to create the toy datasets and did not set a seed for each of them. This means you will not get the same resulting data.frame as what I have here. This is ok as what the row numbers actually are is not important to solve OP's problem.

    Data:

    # Create sample datasets
    cars_1995 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1995_new = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1996 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1996_new = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1997 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1997_new = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1998 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1998_new = data.frame(A = rnorm(sample(50:200, 1)))
    

    Edits (dataset names from cars_YEAR_new to cars_new_YEAR):

    ls() %>%
      str_extract("^cars_(new_)*\\d{4}") %>%
      na.omit() %>%
      mget(envir = globalenv()) %>%
      lapply(nrow) %>%
      data.frame() %>%
      gather(key, value) %>%
      separate(key, c("key", "year"), "_(?=\\d)") %>%
      spread(key, value)
    

    With the new dataset names, the mutate + str_replace step can be removed, and the str_extract step also has to be changed, since the pattern is now different.

    New Data:

    # Create sample datasets
    cars_1995 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_new_1995 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1996 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_new_1996 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1997 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_new_1997 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_1998 = data.frame(A = rnorm(sample(50:200, 1)))
    cars_new_1998 = data.frame(A = rnorm(sample(50:200, 1)))