
Extracting the Package name from fully defined class names using R scripting


I have the following kind of data set (ds1), read from a CSV file, that lists class names and their corresponding fault counts. Using an R script, I want to keep only the rows where the number of faults equals 2 and extract the package name from each class name.

Class                                    Faults

org.apache.tools.ant.taskdefs.Definer    2
org.apache.tools.ant.taskdefs.Definer    2
org.apache.tools.ant.taskdefs.Delete     1
org.apache.tools.ant.taskdefs.Deltree    2
org.apache.tools.ant.taskdefs.DependSet  2
org.apache.tools.ant.taskdefs.DependSet  2
org.apache.tools.ant.taskdefs.DependSet  2
org.apache.tools.ant.taskdefs.Ear        2
org.apache.tools.ant.taskdefs.Ear        2
org.apache.tools.ant.taskdefs.Echo       1
org.apache.tools.ant.Exec                2
org.apache.tools.ant.Exec                2

I have written the following code, but it does not produce the desired output:

dschanged<- subset(ds1, grep( "/^([^\\.]+)/", class) & Faults==2 )

Specifically, I need a regular expression that keeps everything before the last dot (.), so that I get the following output:

org.apache.tools.ant.taskdefs       2
org.apache.tools.ant.taskdefs       2
org.apache.tools.ant.taskdefs       2
org.apache.tools.ant.taskdefs       2
org.apache.tools.ant.taskdefs       2
org.apache.tools.ant.taskdefs       2
org.apache.tools.ant.taskdefs       2
org.apache.tools.ant.taskdefs       2
org.apache.tools.ant                2
org.apache.tools.ant                2

Solution

  • grep (and grepl) are inappropriate for this: you aren't filtering based on textual content. You are (a) filtering based on Faults, and (b) changing the text in Class.
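
    To see why the original attempt returns nothing useful, here is a minimal sketch on a two-element vector (x is a made-up sample, not the full ds1). The surrounding slashes are treated as literal characters in an R pattern, grep returns integer positions rather than a logical vector, and even a cleaned-up ^([^.]+) only captures the first segment:

    ```r
    # Hypothetical sample of class names, for illustration only
    x <- c("org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.Exec")

    # R has no /.../ regex delimiters: the slashes must match literally,
    # and these strings contain none, so nothing matches
    bad <- grep("/^([^\\.]+)/", x)

    # grep() gives integer positions; grepl() gives one logical per element
    idx <- grep("^org", x)
    lgl <- grepl("^org", x)

    # Without the slashes, ^([^.]+) captures only the text before the FIRST dot
    first <- sub("^([^.]+).*$", "\\1", x)

    print(bad)
    print(idx)
    print(lgl)
    print(first)
    ```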

    Your data:

    ds1 <- structure(list(Class = c("org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Definer", "org.apache.tools.ant.taskdefs.Delete", "org.apache.tools.ant.taskdefs.Deltree", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.DependSet", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Ear", "org.apache.tools.ant.taskdefs.Echo", "org.apache.tools.ant.Exec", "org.apache.tools.ant.Exec"),
                          Faults = c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L)),
                     .Names = c("Class", "Faults"), class = "data.frame", row.names = c(NA, -12L))
    

    Filter on Faults (you already had this). You only need one of these two commands; they do the same thing here. The main differences are readability (personal preference) and performance (the second one, in this case, takes about 35% less time, though since both are measured in microseconds, it seems silly to compete).

    ds2 <- subset(ds1, Faults == 2)
    ds2 <- ds1[ds1$Faults == 2,]
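
    One caveat, in case the real CSV has missing values: the two forms treat an NA in Faults differently. A small sketch with a made-up data frame (demo is hypothetical and not part of the question's data):

    ```r
    # Hypothetical data with one missing fault count
    demo <- data.frame(Class = c("a.B", "a.C", "a.D"),
                       Faults = c(2L, NA, 1L),
                       stringsAsFactors = FALSE)

    # subset() silently drops rows where the condition is NA
    by_subset <- subset(demo, Faults == 2)

    # Logical indexing keeps an all-NA row for each NA in the condition
    by_index <- demo[demo$Faults == 2, ]

    # Wrapping the condition in which() makes indexing drop those rows too
    by_which <- demo[which(demo$Faults == 2), ]

    print(nrow(by_subset))
    print(nrow(by_index))
    print(nrow(by_which))
    ```

    With the sample data in the question there are no missing values, so either command is fine.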
    

    Update Class to remove the last word (and dot):

    ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class)
    ds2
    #                            Class Faults
    # 1  org.apache.tools.ant.taskdefs      2
    # 2  org.apache.tools.ant.taskdefs      2
    # 4  org.apache.tools.ant.taskdefs      2
    # 5  org.apache.tools.ant.taskdefs      2
    # 6  org.apache.tools.ant.taskdefs      2
    # 7  org.apache.tools.ant.taskdefs      2
    # 8  org.apache.tools.ant.taskdefs      2
    # 9  org.apache.tools.ant.taskdefs      2
    # 11          org.apache.tools.ant      2
    # 12          org.apache.tools.ant      2
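
    For anyone unsure what the pattern is doing, this sketch pulls out the part that gets matched (and therefore deleted). The vector x is a made-up sample; "Main" is a hypothetical class with no package, included to show the no-dot case:

    ```r
    x <- c("org.apache.tools.ant.taskdefs.Definer",
           "org.apache.tools.ant.Exec",
           "Main")

    # \\.     a literal dot
    # [^.]*   any run of non-dot characters (the simple class name)
    # $       end of string, which pins the match to the LAST dot
    pattern <- "\\.[^.]*$"

    # The text the pattern matches; elements with no match are dropped
    matched <- regmatches(x, regexpr(pattern, x))

    # What remains once the match is removed
    pkg <- sub(pattern, "", x)

    print(matched)
    print(pkg)
    ```

    Note that a name without any dot is returned unchanged rather than as an empty string, which may or may not be what you want for classes in the default package.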
    

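    Note: this can also be done with sub instead of gsub; gsub is simply my first reach, since most of my uses involve larger, repeating regexes. Because the pattern is anchored with $, it can match at most once per string, so the two give the same result here. The major (only?) difference between the two is that: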

    'sub' and 'gsub' perform replacement of the first and all matches respectively

    (from ?sub).
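
    A quick sketch of that difference, using one class name from the question. With the $-anchored pattern the two functions agree; with an unanchored pattern they do not (the "/" replacement is only there to make the effect visible):

    ```r
    x <- "org.apache.tools.ant.Exec"

    # Anchored at the end of the string: at most one match, so sub == gsub
    anchored_sub  <- sub("\\.[^.]*$", "", x)
    anchored_gsub <- gsub("\\.[^.]*$", "", x)

    # Unanchored: sub replaces the first dot only, gsub replaces every dot
    first_only <- sub("\\.", "/", x)
    every_one  <- gsub("\\.", "/", x)

    print(anchored_sub)
    print(anchored_gsub)
    print(first_only)
    print(every_one)
    ```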

    I know of no tool that does both the filtering and changing in a single command (though perhaps data.table does, I don't know).
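
    If a single expression is all that is wanted, the two steps can at least be nested in base R. This is only a sketch: it is still two function calls (subset, then transform), and ds_demo is a cut-down, hypothetical stand-in for ds1:

    ```r
    # Hypothetical three-row stand-in for ds1
    ds_demo <- data.frame(
      Class  = c("org.apache.tools.ant.taskdefs.Definer",
                 "org.apache.tools.ant.taskdefs.Delete",
                 "org.apache.tools.ant.Exec"),
      Faults = c(2L, 1L, 2L),
      stringsAsFactors = FALSE
    )

    # Filter first, then rewrite Class on the filtered rows
    out <- transform(subset(ds_demo, Faults == 2),
                     Class = sub("\\.[^.]*$", "", Class))

    print(out)
    ```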

    Similar to @egnha's solution (that uses magrittr), here's one using dplyr, which many people allege is very easy to read and adapt (at the potential cost of performance):

    library(dplyr)
    ds2 <- ds1 %>%
      filter(Faults == 2) %>%
      mutate(Class = gsub("\\.[^.]*$", "", Class))
    

    Since I mentioned performance, here's a comparison:

    library(microbenchmark)
    microbenchmark(indexing = { ds2 <- ds1[ds1$Faults == 2,]; ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
                   subset   = { ds2 <- subset(ds1, Faults == 2) ; ds2$Class <- gsub("\\.[^.]*$", "", ds2$Class) },
                   dplyr    = { ds1 %>% filter(Faults == 2) %>% mutate(Class = gsub("\\.[^.]*$", "", Class)) })
    # Unit: microseconds
    #      expr      min        lq      mean    median        uq      max neval
    #  indexing   71.841   87.7045  109.4496  104.2975  120.7075  269.493   100
    #    subset  102.473  115.6020  147.0108  139.1230  165.5620  287.726   100
    #     dplyr 1067.030 1156.3745 1323.1174 1225.4805 1351.2920 4270.308   100
    

    For the record, dplyr used in this way is not often this speed-poor in comparison to other methods. It is not commonly faster, but it is not often an order-of-magnitude slower.