I´ve got a data frame with a column consisting of Strings (project_skills) which denotes the skills a certain job (job_id) affords. I want to split this string for every job to get an vector of the skills a job affords and then create a Document Term Matrix to denote which skills (out of every possible skill) a certain job affords.
I´ve got the following data frame:
job_id project_skills
107182 CSS,HTML,Joomla,PHP
108169 XTCommerce,Magento,Prestashop,VirtueMart,osCommerce
112969 Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#
114660 Marketing,Email Marketing
118686 PHP
The result should look anything like this (which is basically a Document Term Matrix with comma separated phrases:
project_skills
job_id CSS HTML PHP Google Search Console Google Analytics Java ...
107182 1 0 0 ...
108169 0 0 0 0 0
112969 0 0 0 1 1 ...
114660 0 0 0 ...
118686 0 0 1 ...
I´ve tried the following:
df <- data.frame(job_id = c(107182, 108169, 112969, 114660, 118686), project_skills = c("CSS,HTML,Joomla,PHP", "XTCommerce,Magento,Prestashop,VirtueMart,osCommerce", "Google Search Console,Google Analytics,Google Webmaster Central,C++,Java,C#", "Marketing,Email Marketing", "PHP"))
corpus <- Corpus(VectorSource(df$project_skills))
corpus <- tm_map(corpus, function(x) {
PlainTextDocument(
strsplit(x,"\\,")[[1]],
id=ID(x)
)
})
inspect(corpus)
dtm <- DocumentTermMatrix(corpus)
as.matrix(dtm)
But unfortunately this splits all words and not at the commas (e.g. Google Search Console should be treated as one Term in the DTM).
tm (or some other text mining packages) split on words (spaces) and if you don't check tend to remove punctuation like + and #. The simplest option is just to use strsplit
. I show an option below using tidyr and dplyr. First group by job_id, next split the columns. This will create a nesting, which when unnested creates a long data.frame. Here I add the value 1 for every entry which acts like a 1 in a document term matrix. Then spread into a wide format to get your expected output. If you look at the resulting structure the column names are what you expect, not showing the tilde (~).
library(tidyr)
library(dplyr)
outcome <- df1 %>%
group_by(job_id) %>%
mutate(project_skills = strsplit(project_skills, ",")) %>%
unnest() %>%
mutate(value = 1) %>% # add 1 for every value
spread(key = project_skills, value = value) # use fill = 0 if you don't want NA's
head(outcome)
# A tibble: 5 x 18
# Groups: job_id [5]
job_id `C#` `C++` CSS `Email Marketin~ `Google Analyti~ `Google Search ~ `Google Webmast~ HTML Java Joomla Magento Marketing
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 107182 NA NA 1 NA NA NA NA 1 NA 1 NA NA
2 108169 NA NA NA NA NA NA NA NA NA NA 1 NA
3 112969 1 1 NA NA 1 1 1 NA 1 NA NA NA
4 114660 NA NA NA 1 NA NA NA NA NA NA NA 1
5 118686 NA NA NA NA NA NA NA NA NA NA NA NA
# ... with 5 more variables: osCommerce <dbl>, PHP <dbl>, Prestashop <dbl>, VirtueMart <dbl>, XTCommerce <dbl>