Search code examples
rdataframedplyrrbindcbind

Create a character vector column of predefined text and bind it to existing dataframe using rbind or bind_rows


Good day,

I will present two [likely] very puny problems for your excellent review.

Problem #1

I have a relatively tidy df (dat) with dim 10299 x 563. The 563 variables common to both datasets [that created] dat are 'subject' (numeric), 'label' (numeric), 3:563 (variable names from a text file). Observations 1:2947 are from a 'test' dataset whereas observations 2948:10299 are from a 'training' dataset.

I'd like to insert a column (header = 'type') into dat that is basically rows 1:2947 comprised of string test and rows 2948:10299 of string train that way I can group later on dataset or other similar aggregate functions in dplyr/tidyr.

I created a test df (testdf = 1:10299: dim(testdf) = 102499 x 1) and then:

testdat[1:2947 , "type"] <- c("test")
testdat[2948:10299, "type"] <- c("train")
> head(ds, 2);tail(ds, 2)
  X1.10299 type
1        1 test
2        2 test
      X1.10299  type
10298    10298 train
10299    10299 train

So I really don't like that there is now a column of X1.10299.

Questions:

  • Is there a better and more expedient way to create a column that has what I'm looking for based upon my use case above?
  • What is a good way to actually insert that column into 'dat' so that I can use it later for grouping with dplyr?

Problem #2

The way I arrived at my [nearly] tidy df (dat) from above was to two take dfs (test and train) of the form dim(2947 x 563 and 7352 x 563), respectively, and rbinding them together.

I confirm that all of my variable names are present after the binding effort by something like this:

test.names <- names(test)
train.names <- names(train)
identical(test.names, train.names)
> TRUE

What is interesting and of primary concern is that if I try to use the bind_rows function from 'dplyr' to perform the same binding exercise:

dat <- bind_rows(test, train)

It returns a dataframe that apparently keeps my all of my observations (x: 10299) but now my variable count is reduced from 563 to 470!

Question:

  • Does anyone know why my variables are being chopped?
  • Is this the best way to combine two dfs of the same structure for later slicing/dicing with dplyr/

tidyr?

Thank you for your time and consideration of these matters.

Sample test/train dfs for review (the left most numeric are df indices):

test df test[1:10, 1:5]

   subject labels tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1        2      5         0.2571778       -0.02328523       -0.01465376
2        2      5         0.2860267       -0.01316336       -0.11908252
3        2      5         0.2754848       -0.02605042       -0.11815167
4        2      5         0.2702982       -0.03261387       -0.11752018
5        2      5         0.2748330       -0.02784779       -0.12952716
6        2      5         0.2792199       -0.01862040       -0.11390197
7        2      5         0.2797459       -0.01827103       -0.10399988
8        2      5         0.2746005       -0.02503513       -0.11683085
9        2      5         0.2725287       -0.02095401       -0.11447249
10       2      5         0.2757457       -0.01037199       -0.09977589

train df train[1:10, 1:5]

   subject label tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1        1     5         0.2885845      -0.020294171        -0.1329051
2        1     5         0.2784188      -0.016410568        -0.1235202
3        1     5         0.2796531      -0.019467156        -0.1134617
4        1     5         0.2791739      -0.026200646        -0.1232826
5        1     5         0.2766288      -0.016569655        -0.1153619
6        1     5         0.2771988      -0.010097850        -0.1051373
7        1     5         0.2794539      -0.019640776        -0.1100221
8        1     5         0.2774325      -0.030488303        -0.1253604
9        1     5         0.2772934      -0.021750698        -0.1207508
10       1     5         0.2805857      -0.009960298        -0.1060652

Actual Code (ignore the function calls/I'm doing most of the testing via console).

[http://archive.ics.uci.edu/ml/machine-learning-databases/00240/]The data set I'm using with this code. 1

run_analysis <- function () {
    #Vars available for use throughout the function that should be preserved
    vars <- read.table("features.txt", header = FALSE, sep = "")
    lookup_table <- data.frame(activitynum = c(1,2,3,4,5,6), 
                               activity_label = c("walking", "walking_up", 
                                                  "walking_down", "sitting", 
                                                  "standing", "laying"))
    test <- test_read_process(vars, lookup_table)
    train <- train_read_process(vars, lookup_table)
}

test_read_process <- function(vars, lookup_table) {
    #read in the three documents for cbinding later
    test.sub <- read.table("test/subject_test.txt", header = FALSE)
    test.labels <- read.table("test/y_test.txt", header = FALSE)
    test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")

    #cbind the cols together and set remaining colNames to var names in vars
    test.dat <- cbind(test.sub, test.labels, test.obs)  
    colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))

    #Use lookup_table to set the "test_labels" string values that correspond
    #to their integer IDs
    #test.lookup <- merge(test, lookup_table, by.x = "labels", 
    #               by.y ="activitynum", all.x = T)

    #Remove temporary symbols from globalEnv/memory
    rm(test.sub, test.labels, test.obs)

    #return
    return(test.dat)
}

train_read_process <- function(vars, lookup_table) {
    #read in the three documents for cbinding
    train.sub <- read.table("train/subject_train.txt", header = FALSE)
    train.labels <- read.table("train/y_train.txt", header = FALSE)
    train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")

    #cbind the cols together and set remaining colNames to var names in vars
    train.dat <- cbind(train.sub, train.labels, train.obs)    
    colnames(train.dat) <- c("subject", "label", as.character(vars[,2]))

    #Clean up temporary symbols from globalEnv/memory
    rm(train.sub, train.labels, train.obs, vars)

    return(train.dat)
}

Solution

  • The problem that you're facing stems from the fact that you have duplicated names in the variable list that you're using to create your data frame objects. If you ensure that the column names are unique and shared between the objects the code will run. I've included a fully working example based on the code you used above (with fixes and various edits noted in the comments):

    vars <- read.table(file="features.txt", header=F, stringsAsFactors=F)
    
    ##  FRS: This is the source of original problem:
    duplicated(vars[,2])
    vars[317:340,2]
    duplicated(vars[317:340,2])
    vars[396:419,2]
    
    ##  FRS: I edited the following to both account for your data and variable
    ##    issues:
    test_read_process <- function() {
      #read in the three documents for cbinding later
      test.sub <- read.table("test/subject_test.txt", header = FALSE)
      test.labels <- read.table("test/y_test.txt", header = FALSE)
      test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")
    
      #cbind the cols together and set remaining colNames to var names in vars
      test.dat <- cbind(test.sub, test.labels, test.obs)  
      #colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
      colnames(test.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))
    
      return(test.dat)
    }
    
    train_read_process <- function() {
      #read in the three documents for cbinding
      train.sub <- read.table("train/subject_train.txt", header = FALSE)
      train.labels <- read.table("train/y_train.txt", header = FALSE)
      train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")
    
      #cbind the cols together and set remaining colNames to var names in vars
      train.dat <- cbind(train.sub, train.labels, train.obs)    
      #colnames(train.dat) <- c("subject", "labels", as.character(vars[,2]))
      colnames(train.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))
    
      return(train.dat)
    }
    
    
    test_df <- test_read_process()
    train_df <- train_read_process()
    
    identical(names(test_df), names(train_df))
    
    
    library("dplyr")
    
    ## FRS: These could be piped together but I've kept them separate for clarity:
    train_df %>%
      mutate(test="train") -> 
      train_df
    
    test_df %>%
      mutate(test="test") -> 
      test_df
    
    test_df %>% 
      bind_rows(train_df) -> 
      out_df
    
    head(out_df)
    out_df
    
    ##  FRS: You can set your column names to those of the original 
    ##    variable list but you still have duplicates to deal with:
    names(out_df) <- c("subject", "labels", as.character(vars[,2]), "test")
    
    duplicated(names(out_df))