My data frame Expenses is shown below:
date name expenditure type
23MAR2013 KOSH ENTRP 4000 COMPANY
23MAR2013 JOHN DOE 800 INDIVIDUAL
24MAR2013 S KHAN 300 INDIVIDUAL
24MAR2013 JASINT PVT LTD 8000 COMPANY
25MAR2013 KOSH ENTRPRISE 2000 COMPANY
25MAR2013 JOHN S DOE 220 INDIVIDUAL
25MAR2013 S KHAN 300 INDIVIDUAL
26MAR2013 S KHAN 300 INDIVIDUAL
Earlier, I identified the repeated names and patterns in the name column and stored them in a vector, NameVector, which is shown below.
KOSH JOHN DOE KHAN JASINT
My question is: how do I match each string in Expenses$name against the patterns in NameVector and add the matched pattern as a category column in the main data frame, like this?
date name expenditure type category
23MAR2013 KOSH ENTRP 4000 COMPANY KOSH
23MAR2013 JOHN DOE 800 INDIVIDUAL JOHN DOE
24MAR2013 S KHAN 300 INDIVIDUAL KHAN
24MAR2013 JASINT PVT LTD 8000 COMPANY JASINT
25MAR2013 KOSH ENTRPRISE 2000 COMPANY KOSH
25MAR2013 JOHN S DOE 220 INDIVIDUAL JOHN DOE
25MAR2013 S KHAN 300 INDIVIDUAL KHAN
26MAR2013 S KHAN 300 INDIVIDUAL KHAN
I tried splitting the name column on every possible delimiter (spaces, |, *, commas, etc.) using strsplit() to get the different parts of each name into separate columns, and then matching the patterns using agrep(), but I am not getting the desired output. On closer inspection of the data, I noticed leading whitespace and removed it, but I still have no clue why I am not getting the output shown above.
The CSV for the above table:
"Date","name","expenditure","type"
"23MAR2013","KOSH ENTRP",4000,"COMPANY"
"23MAR2013 ","JOHN DOE",800,"INDIVIDUAL"
"24MAR2013","S KHAN",300,"INDIVIDUAL"
"24MAR2013","JASINT PVT LTD",8000,"COMPANY"
"25MAR2013","KOSH ENTRPRISE",2000,"COMPANY"
"25MAR2013","JOHN S DOE",220,"INDIVIDUAL"
"25MAR2013","S KHAN",300,"INDIVIDUAL"
"26MAR2013","S KHAN",300,"INDIVIDUAL"
and the names vector that was identified is
NameVector <- c("KOSH","JOHN DOE","KHAN","JASINT")
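To reproduce this, here is a minimal sketch that loads the CSV above into Expenses. It assumes the data is pasted inline via read.csv(text = ...) rather than read from a file; csv_text and chr_cols are names made up for this illustration. It also trims stray whitespace from the character columns with trimws(), since one date carries a trailing space.

```r
# Inline copy of the CSV shown above (csv_text is a hypothetical name).
csv_text <- '"Date","name","expenditure","type"
"23MAR2013","KOSH ENTRP",4000,"COMPANY"
"23MAR2013 ","JOHN DOE",800,"INDIVIDUAL"
"24MAR2013","S KHAN",300,"INDIVIDUAL"
"24MAR2013","JASINT PVT LTD",8000,"COMPANY"
"25MAR2013","KOSH ENTRPRISE",2000,"COMPANY"
"25MAR2013","JOHN S DOE",220,"INDIVIDUAL"
"25MAR2013","S KHAN",300,"INDIVIDUAL"
"26MAR2013","S KHAN",300,"INDIVIDUAL"'

Expenses <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Trim leading/trailing whitespace in every character column.
chr_cols <- vapply(Expenses, is.character, logical(1L))
Expenses[chr_cols] <- lapply(Expenses[chr_cols], trimws)

NameVector <- c("KOSH", "JOHN DOE", "KHAN", "JASINT")
```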
You could try
library(stringi)
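# Build one alternation from every word in NameVector: "KOSH|JOHN|DOE|KHAN|JASINT"
pat <- paste(unlist(strsplit(NameVector, ' ')), collapse="|")
# Extract all matching words from each name and join them with a space
Expenses$category <- vapply(stri_extract_all_regex(Expenses$name, pat),
                            paste, character(1L), collapse=' ')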
Expenses
# date name expenditure type category
#1 23MAR2013 KOSH ENTRP 4000 COMPANY KOSH
#2 23MAR2013 JOHN DOE 800 INDIVIDUAL JOHN DOE
#3 24MAR2013 S KHAN 300 INDIVIDUAL KHAN
#4 24MAR2013 JASINT PVT LTD 8000 COMPANY JASINT
#5 25MAR2013 KOSH ENTRPRISE 2000 COMPANY KOSH
#6 25MAR2013 JOHN S DOE 220 INDIVIDUAL JOHN DOE
#7 25MAR2013 S KHAN 300 INDIVIDUAL KHAN
#8 26MAR2013 S KHAN 300 INDIVIDUAL KHAN
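If you would rather avoid the stringi dependency, a base R variant is sketched below. It is an alternative technique, not the same algorithm as above: for each name it returns the first NameVector entry whose words all occur in that name, and NA when nothing matches. The helper match_category is a made-up name for this sketch, and matching is by plain substring, so "KHAN" would also match inside a longer word.

```r
# Names and patterns copied from the question.
Expenses <- data.frame(
  name = c("KOSH ENTRP", "JOHN DOE", "S KHAN", "JASINT PVT LTD",
           "KOSH ENTRPRISE", "JOHN S DOE", "S KHAN", "S KHAN"),
  stringsAsFactors = FALSE
)
NameVector <- c("KOSH", "JOHN DOE", "KHAN", "JASINT")

# Hypothetical helper: first candidate whose words all appear in `name`,
# or NA_character_ if no candidate matches.
match_category <- function(name, candidates) {
  hit <- vapply(
    strsplit(candidates, " ", fixed = TRUE),
    function(words) {
      all(vapply(words, grepl, logical(1L), x = name, fixed = TRUE))
    },
    logical(1L)
  )
  if (any(hit)) candidates[which(hit)[1L]] else NA_character_
}

Expenses$category <- vapply(Expenses$name, match_category, character(1L),
                            candidates = NameVector, USE.NAMES = FALSE)
Expenses
```

Unlike the regex approach, this never combines words from two different NameVector entries, and unmatched names come back as a real NA rather than the string "NA".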