Good day,
I will present two [likely] very puny problems for your excellent review.
Problem #1
I have a relatively tidy df (dat) with dim 10299 x 563. The 563 variables common to both datasets [that created] dat are 'subject' (numeric), 'label' (numeric), 3:563 (variable names from a text file). Observations 1:2947 are from a 'test' dataset whereas observations 2948:10299 are from a 'training' dataset.
I'd like to insert a column (header = 'type') into dat that is basically rows 1:2947 comprised of string test and rows 2948:10299 of string train that way I can group later on dataset or other similar aggregate functions in dplyr/tidyr.
I created a test df (testdf = 1:10299: dim(testdf) = 102499 x 1) and then:
testdat[1:2947 , "type"] <- c("test")
testdat[2948:10299, "type"] <- c("train")
> head(ds, 2);tail(ds, 2)
X1.10299 type
1 1 test
2 2 test
X1.10299 type
10298 10298 train
10299 10299 train
So I really don't like that there is now a column of X1.10299.
Questions:
Problem #2
The way I arrived at my [nearly] tidy df (dat) from above was to two take dfs (test and train) of the form dim(2947 x 563 and 7352 x 563), respectively, and rbinding them together.
I confirm that all of my variable names are present after the binding effort by something like this:
test.names <- names(test)
train.names <- names(train)
identical(test.names, train.names)
> TRUE
What is interesting and of primary concern is that if I try to use the bind_rows function from 'dplyr' to perform the same binding exercise:
dat <- bind_rows(test, train)
It returns a dataframe that apparently keeps my all of my observations (x: 10299) but now my variable count is reduced from 563 to 470!
Question:
tidyr?
Thank you for your time and consideration of these matters.
Sample test/train dfs for review (the left most numeric are df indices):
test df test[1:10, 1:5]
subject labels tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1 2 5 0.2571778 -0.02328523 -0.01465376
2 2 5 0.2860267 -0.01316336 -0.11908252
3 2 5 0.2754848 -0.02605042 -0.11815167
4 2 5 0.2702982 -0.03261387 -0.11752018
5 2 5 0.2748330 -0.02784779 -0.12952716
6 2 5 0.2792199 -0.01862040 -0.11390197
7 2 5 0.2797459 -0.01827103 -0.10399988
8 2 5 0.2746005 -0.02503513 -0.11683085
9 2 5 0.2725287 -0.02095401 -0.11447249
10 2 5 0.2757457 -0.01037199 -0.09977589
train df train[1:10, 1:5]
subject label tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1 1 5 0.2885845 -0.020294171 -0.1329051
2 1 5 0.2784188 -0.016410568 -0.1235202
3 1 5 0.2796531 -0.019467156 -0.1134617
4 1 5 0.2791739 -0.026200646 -0.1232826
5 1 5 0.2766288 -0.016569655 -0.1153619
6 1 5 0.2771988 -0.010097850 -0.1051373
7 1 5 0.2794539 -0.019640776 -0.1100221
8 1 5 0.2774325 -0.030488303 -0.1253604
9 1 5 0.2772934 -0.021750698 -0.1207508
10 1 5 0.2805857 -0.009960298 -0.1060652
Actual Code (ignore the function calls/I'm doing most of the testing via console).
[http://archive.ics.uci.edu/ml/machine-learning-databases/00240/]The data set I'm using with this code. 1
run_analysis <- function () {
#Vars available for use throughout the function that should be preserved
vars <- read.table("features.txt", header = FALSE, sep = "")
lookup_table <- data.frame(activitynum = c(1,2,3,4,5,6),
activity_label = c("walking", "walking_up",
"walking_down", "sitting",
"standing", "laying"))
test <- test_read_process(vars, lookup_table)
train <- train_read_process(vars, lookup_table)
}
test_read_process <- function(vars, lookup_table) {
#read in the three documents for cbinding later
test.sub <- read.table("test/subject_test.txt", header = FALSE)
test.labels <- read.table("test/y_test.txt", header = FALSE)
test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")
#cbind the cols together and set remaining colNames to var names in vars
test.dat <- cbind(test.sub, test.labels, test.obs)
colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
#Use lookup_table to set the "test_labels" string values that correspond
#to their integer IDs
#test.lookup <- merge(test, lookup_table, by.x = "labels",
# by.y ="activitynum", all.x = T)
#Remove temporary symbols from globalEnv/memory
rm(test.sub, test.labels, test.obs)
#return
return(test.dat)
}
train_read_process <- function(vars, lookup_table) {
#read in the three documents for cbinding
train.sub <- read.table("train/subject_train.txt", header = FALSE)
train.labels <- read.table("train/y_train.txt", header = FALSE)
train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")
#cbind the cols together and set remaining colNames to var names in vars
train.dat <- cbind(train.sub, train.labels, train.obs)
colnames(train.dat) <- c("subject", "label", as.character(vars[,2]))
#Clean up temporary symbols from globalEnv/memory
rm(train.sub, train.labels, train.obs, vars)
return(train.dat)
}
The problem that you're facing stems from the fact that you have duplicated names in the variable list that you're using to create your data frame objects. If you ensure that the column names are unique and shared between the objects the code will run. I've included a fully working example based on the code you used above (with fixes and various edits noted in the comments):
vars <- read.table(file="features.txt", header=F, stringsAsFactors=F)
## FRS: This is the source of original problem:
duplicated(vars[,2])
vars[317:340,2]
duplicated(vars[317:340,2])
vars[396:419,2]
## FRS: I edited the following to both account for your data and variable
## issues:
test_read_process <- function() {
#read in the three documents for cbinding later
test.sub <- read.table("test/subject_test.txt", header = FALSE)
test.labels <- read.table("test/y_test.txt", header = FALSE)
test.obs <- read.table("test/X_test.txt", header = FALSE, sep = "")
#cbind the cols together and set remaining colNames to var names in vars
test.dat <- cbind(test.sub, test.labels, test.obs)
#colnames(test.dat) <- c("subject", "labels", as.character(vars[,2]))
colnames(test.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))
return(test.dat)
}
train_read_process <- function() {
#read in the three documents for cbinding
train.sub <- read.table("train/subject_train.txt", header = FALSE)
train.labels <- read.table("train/y_train.txt", header = FALSE)
train.obs <- read.table("train/X_train.txt", header = FALSE, sep = "")
#cbind the cols together and set remaining colNames to var names in vars
train.dat <- cbind(train.sub, train.labels, train.obs)
#colnames(train.dat) <- c("subject", "labels", as.character(vars[,2]))
colnames(train.dat) <- c("subject", "labels", paste0("V", 1:nrow(vars)))
return(train.dat)
}
test_df <- test_read_process()
train_df <- train_read_process()
identical(names(test_df), names(train_df))
library("dplyr")
## FRS: These could be piped together but I've kept them separate for clarity:
train_df %>%
mutate(test="train") ->
train_df
test_df %>%
mutate(test="test") ->
test_df
test_df %>%
bind_rows(train_df) ->
out_df
head(out_df)
out_df
## FRS: You can set your column names to those of the original
## variable list but you still have duplicates to deal with:
names(out_df) <- c("subject", "labels", as.character(vars[,2]), "test")
duplicated(names(out_df))