I have the following kind of data set (ds1) in my CSV file, which contains class names and their corresponding fault counts. I want to extract the package name from each row whose number of faults equals 2, using an R script.
Class Faults
org.apache.tools.ant.taskdefs.Definer 2
org.apache.tools.ant.taskdefs.Definer 2
org.apache.tools.ant.taskdefs.Delete 1
org.apache.tools.ant.taskdefs.Deltree 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.DependSet 2
org.apache.tools.ant.taskdefs.Ear 2
org.apache.tools.ant.taskdefs.Ear 2
org.apache.tools.ant.taskdefs.Echo 1
org.apache.tools.ant.Exec 2
org.apache.tools.ant.Exec 2
I have written the following code, but it does not produce the desired output:
dschanged<- subset(ds1, grep( "/^([^\\.]+)/", class) & Faults==2 )
Technically, I need a proper regular expression that pulls out the string before the last dot (.) to generate the following output:
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant.taskdefs 2
org.apache.tools.ant 2
org.apache.tools.ant 2
grep (and grepl) are inappropriate for this: you aren't filtering based on textual content. You are (a) filtering based on Faults, and (b) changing the text in Class.
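To illustrate the point, here is a minimal sketch on a made-up vector (x is a hypothetical stand-in, not your data). grep returns the positions of the matching elements and grepl returns a logical vector, so neither one hands back modified text; a substitution function such as sub does that.

```r
# Hypothetical example vector, not the asker's data
x <- c("org.apache.tools.ant.Exec", "no_dots_here")

# grep() returns the integer positions of the elements that match
idx <- grep("\\.", x)

# grepl() returns one TRUE/FALSE per element
hit <- grepl("\\.", x)

# Neither extracts or changes text; sub() rewrites it instead.
# Elements without a dot are left unchanged.
pkg <- sub("\\.[^.]*$", "", x)
```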
Your data:
ds1 <- structure(list(Class = c("org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Delete", "org.apache.tools.ant.taskdefs.Deltree", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Echo", "org.apache.tools.ant.Exec", "org.apache.tools.ant.Exec"),
Faults = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L)),
.Names = c("Class", "Faults"), class = "data.frame", row.names = c(NA, -12L))
Filter on Faults (you already had this). You only need one of these two commands; they both do the same thing. The major differences are in readability (personal preference) and performance (the second one, in this case, takes about 35% less time, though since both are measured in microseconds, it seems silly to compete).
ds2 <- subset(ds1, Faults == 2)
ds2 <- ds1[ds1$Faults == 2,]
Update Class to remove the last word (and dot):
ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class)
# Class Faults
# 1 org.apache.tools.ant.taskdefs 2
# 2 org.apache.tools.ant.taskdefs 2
# 4 org.apache.tools.ant.taskdefs 2
# 5 org.apache.tools.ant.taskdefs 2
# 6 org.apache.tools.ant.taskdefs 2
# 7 org.apache.tools.ant.taskdefs 2
# 8 org.apache.tools.ant.taskdefs 2
# 9 org.apache.tools.ant.taskdefs 2
# 11 org.apache.tools.ant 2
# 12 org.apache.tools.ant 2
Note: this can also be done with sub instead of gsub, but the latter is my first-reach since most of my uses deal with larger and repeating regexes. The major (only?) difference between the two is that:
'sub' and 'gsub' perform replacement of the first and all matches respectively
(from ?sub).
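A small sketch of that difference, using a hypothetical single string x. With the pattern anchored by $ there is only one possible match, so sub and gsub give the same result; with an unanchored pattern they diverge.

```r
# Hypothetical example string
x <- "org.apache.tools.ant.Exec"

# Anchored pattern: "$" permits a single match, so both functions agree
a1 <- sub("\\.[^.]*$", "", x)
a2 <- gsub("\\.[^.]*$", "", x)

# Unanchored pattern: sub() replaces only the first dot,
# gsub() replaces every dot
s1 <- sub("\\.", "/", x)
s2 <- gsub("\\.", "/", x)
```

Here a1 and a2 are both "org.apache.tools.ant", while s1 is "org/apache.tools.ant.Exec" and s2 is "org/apache/tools/ant/Exec".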
I know of no tool that does both the filtering and changing in a single command (though perhaps data.table does, I don't know).
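That said, the two steps can be nested into a single expression in base R. This is only a sketch, not a dedicated one-step tool: it wraps subset in transform, and ds is a small hypothetical stand-in for ds1.

```r
# Small stand-in for ds1 (hypothetical, three rows only)
ds <- data.frame(
  Class  = c("org.apache.tools.ant.taskdefs.Delete",
             "org.apache.tools.ant.taskdefs.Ear",
             "org.apache.tools.ant.Exec"),
  Faults = c(1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# subset() filters the rows, then transform() rewrites Class on the result
res <- transform(subset(ds, Faults == 2),
                 Class = sub("\\.[^.]*$", "", Class))
```

It is still two function calls under the hood, so the gain is brevity rather than speed.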
Similar to @egnha's solution (that uses magrittr), here's one using dplyr, which many people allege is very easy to read and adapt (at the potential cost of performance):
library(dplyr)
ds2 <- ds1 %>%
  filter(Faults == 2) %>%
  mutate(Class = gsub("\\.[^.]*$", "", Class))
Since I mentioned performance, here's a comparison:
library(microbenchmark)
microbenchmark(indexing = { ds2 <- ds1[ds1$Faults == 2,]; ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
               subset = { ds2 <- subset(ds1, Faults == 2) ; ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
               dplyr = { ds1 %>% filter(Faults == 2) %>% mutate(Class = gsub("\\.[^.]*$", "", Class)) })
# Unit: microseconds
# expr min lq mean median uq max neval
# indexing 71.841 87.7045 109.4496 104.2975 120.7075 269.493 100
# subset 102.473 115.6020 147.0108 139.1230 165.5620 287.726 100
# dplyr 1067.030 1156.3745 1323.1174 1225.4805 1351.2920 4270.308 100
For the record, dplyr used in this way is not often this speed-poor in comparison to other methods. It is not commonly faster, but it is not often an order-of-magnitude slower.