I'm reading in data from a csv file where each row contains some number of individual strings:
e.g.
data.csv ->
x,f,t,h,b,g
d,g,h
g,h,a,s,d
f
q,w,e,r,t,y,u,i,o
data <- read.csv("data.csv", header = FALSE)
I want to transform this input into a data frame where the columns are the set of unique strings present in the input. In this case, the columns would be the set of strings {x,f,t,h,b,g,d,a,s,q,w,e,r,y,u,i,o}
. Additionally, the new data frame should contain a row for each row in the input data frame such that a column will have the value 1
if the column's name was present in that row in the input data frame, or 0
if the column's name was not present in that input row.
In this example, the desired output would be the following:
x f t h b g d a s q w e r y u i o
----------------------------------
1 | 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
2 | 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 | 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
4 | 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 | 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
The code below is what I currently have. However, the output df
ends up being a data frame with what appear to be the correct columns, but 0 rows.
I'm very inexperienced at R, and this was my attempt at putting together something that works. It seems to work as expected up until the call to apply()
, which unexpectedly doesn't add anything to df
.
data <- read.csv("data.csv", header = FALSE)
columnNames = c()
for (row in data) {
for (eventName in row) {
if (!(eventName %in% columnNames)) {
columnNames = c(columnNames, eventName)
}
}
}
columnNames = t(columnNames)
df = data.frame(columnNames)
colnames(df) = columnNames
df = df[-1,]
apply(data, 1, function(row, df) {
dat = data.frame(columnNames)
colnames(dat) = columnNames
dat = dat[-1,]
for (eventName in row) {
if (eventName != "") {
dat[1,eventName] = 1
}
}
df = rbind(df, dat)
}, df)
After the script finishes it tells me there were many warnings of the following two forms:
9: In `[<-.factor`(`*tmp*`, iseq, value = 1) : invalid factor level, NA generated
10: In `[<-.factor`(`*tmp*`, iseq, value = 1) :
invalid factor level, NA generated
We can use mtabulate
after split
ting the column by ,
library(qdapTools)
mtabulate(strsplit(as.character(df1[,1]), ","))
Or with base R
methods by split
ting the column by ,
, set the names of the list
output as the sequence of rows, convert the list
to data.frame
(stack
), change the 'values' column to factor
with levels
specified and then get the frequency with table
.
table(transform(stack(setNames(strsplit(as.character(df1[,1]), ","), 1:nrow(df1)))[2:1],
values = factor(values, levels = unique(values))))
#
# x f t h b g d a s q w e r y u i o
# 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
# 3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
# 4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
If this is not a single column,
mtabulate(apply(df2, 1, FUN = function(x) x[x!=""]))
Or
as.data.frame.matrix(table(transform(stack(setNames(apply(df2, 1,
FUN = function(x) x[x!=""]),
1:nrow(df2)))[2:1], values = factor(values, levels = unique(values)))))
#
# x f t h b g d a s q w e r y u i o
# 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
# 3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
# 4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
df1 <- structure(list(V1 = c("x,f,t,h,b,g", "d,g,h", "g,h,a,s,d", "f",
"q,w,e,r,t,y,u,i,o")), .Names = "V1", class = "data.frame",
row.names = c(NA, -5L))
df2 <- structure(list(v1 = c("x", "d", "g", "f", "q"), v2 = c("f", "g",
"h", "", "w"), v3 = c("t", "h", "a", "", "e"), v4 = c("h", "",
"s", "", "r"), v5 = c("b", "", "d", "", "t"), v6 = c("g", "",
"", "", "y"), v7 = c("", "", "", "", "u"), v8 = c("", "", "",
"", "i"), v9 = c("", "", "", "", "o")), .Names = c("v1", "v2",
"v3", "v4", "v5", "v6", "v7", "v8", "v9"), row.names = c(NA,
-5L), class = "data.frame")