
Combining factor levels in data frame column


I have a data frame, data, with a column named "Project License" that represents a categorical variable and is therefore, in R terminology, a factor. I'm trying to create a new column in which open source software licenses are combined into larger categories according to my own classification. However, when I try to combine (merge) levels of that factor, I end up with a column in which all levels are lost, a column that is unchanged, or an error message such as the following:

Error in factor(data[["Project License"]], levels = classification, labels = c("Highly Restrictive", : invalid 'labels'; length 4 should be 1 or 6

Here's my code for this functionality (extracted from a function):

myLevels <- c('gpl', 'lgpl', 'bsd',
              'other', 'artistic', 'public')
myLabels <- c('GPL', 'LGPL', 'BSD',
              'Other', 'Artistic', 'Public')

licenses <- factor(data[["Project License"]],
                   levels = myLevels, labels = myLabels)

data[["Project License"]] <- licenses

classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))

restrictiveness <- 
  factor(data[["Project License"]],
         levels = classification,
         labels = c('Highly Restrictive', 'Restrictive',
                    'Permissive', 'Unknown'))

data[["License Restrictiveness"]] <- restrictiveness

I have also tried some other approaches (including the ones described in section 8.2.5 of "R Inferno"), but without success so far.

What am I doing wrong, and how can I solve this problem? Thank you!

UPDATE (Data):

> head(data, n=20)
   Project ID Project License
1       45556            lgpl
2       41636             bsd
3       95627             gpl
4       66930             gpl
5       51103             gpl
6       65637             gpl
7       41834             gpl
8       70998             gpl
9       95064             gpl
10      48810            lgpl
11      95934             gpl
12      90909             gpl
13       6538         website
14      16439             gpl
15      41924             gpl
16      78987             gpl
17      58662            zlib
18       1904             bsd
19      93838          public
20      90047            lgpl

> str(data)
'data.frame':   45033 obs. of  2 variables:
 $ Project ID     : chr  "45556" "41636" "95627" "66930" ...
 $ Project License: chr  "lgpl" "bsd" "gpl" "gpl" ...
 - attr(*, "SQL")=Class 'base64'  chr "ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDMxNC5ncm91cHMKV0hFUkUgZ3JvdXBfaWQgPCAxMDAwMDA="
 - attr(*, "indicatorName")=Class 'base64'  chr "cHJqTGljZW5zZQ=="
 - attr(*, "resultNames")=Class 'base64'  chr "UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl"

UPDATE 2 (Data):

> unique(data[["Project License"]])
 [1] "lgpl"       "bsd"        "gpl"        "website"    "zlib"
 [6] "public"     "other"      "ibmcpl"     "rpl"        "mpl11"
[11] "mit"        "afl"        "python"     "mpl"        "apache"
[16] "osl"        "w3c"        "iosl"       "artistic"   "apsl"
[21] "ibm"        "plan9"      "php"        "qpl"        "psfl"
[26] "ncsa"       "rscpl"      "sunpublic"  "zope"       "eiffel"
[31] "nethack"    "sissl"      "none"       "opengroup"  "sleepycat"
[36] "nokia"      "attribut"   "xnet"       "eiffel2"    "wxwindows"
[41] "motosoto"   "vovida"     "jabber"     "cvw"        "historical"
[46] "nausite"    "real"

Solution

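  • The problem is that factor() requires labels to have either the same length as levels or length 1, and here neither holds: c() flattens classification into a plain character vector of six elements, while labels has only four.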

    From ?factor:

    labels  
      either an optional character vector of labels for the levels (in the same order as
      levels after removing those in exclude), or a character string of length 1.
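A quick way to see where the "6" in the error message comes from is to inspect classification as defined in the question. This is only a sketch of that check:

```r
# classification exactly as written in the question
classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))

# c() does not keep the grouping: the result is a flat character vector
length(classification)   # 6, not 4
names(classification)    # "highly" "restrictive1" "restrictive2" ...
```

So factor() receives six levels and four labels, which is the mismatch it reports.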
    

    You need to make these agree. The names in classification do not tell factor() to combine the levels.

    For example:

    factor(..., levels=classification, labels=c('Highly Restrictive',
                                                'Restrictive.1',
                                                'Restrictive.2',
                                                'Permissive.1',
                                                'Permissive.2',
                                                'Unknown'))
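If the aim is to collapse several levels into one rather than to relabel them one by one, base R's `levels<-` replacement function also accepts a named list, where each name is a new level and each element lists the old levels to fold into it. A minimal sketch, using a small made-up vector (`toy`, a hypothetical name) in place of the question's data:

```r
# Hypothetical toy data: only licenses that appear in the classification
toy <- factor(c('lgpl', 'bsd', 'gpl', 'public', 'artistic', 'other'))

# Each list name is a new level; its value lists the old levels to merge
levels(toy) <- list('Highly Restrictive' = 'gpl',
                    'Restrictive'        = c('lgpl', 'public'),
                    'Permissive'         = c('bsd', 'artistic'),
                    'Unknown'            = 'other')

levels(toy)
as.character(toy)
```

The new levels come out in the order of the list, which is convenient when the categories have a natural ordering.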
    

    To map the factor to another with fewer levels, you can index a vector by name. Turning the classification vector around as a lookup:

     classification <- c(gpl='Highly Restrictive',
                         lgpl='Restrictive', 
                         public='Restrictive',
                         bsd='Permissive',
                         artistic='Permissive',
                         other='Unknown')
    

    To use this as a lookup table:

    data[["License Restrictiveness"]] <- 
        as.factor(classification[as.character(data[['Project License']])])
    
    head(data)
    ##   Project ID Project License License Restrictiveness
    ## 1      45556            lgpl             Restrictive
    ## 2      41636             bsd              Permissive
    ## 3      95627             gpl      Highly Restrictive
    ## 4      66930             gpl      Highly Restrictive
    ## 5      51103             gpl      Highly Restrictive
    ## 6      65637             gpl      Highly Restrictive
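One thing to watch for: the question's data contains licenses such as website and zlib that are not in the lookup table, and indexing a named vector by a name it does not contain returns NA. The sketch below shows one way to handle that. Treating unmapped licenses as 'Unknown' is an assumption on my part, and the vector `licenses` is a small stand-in for data[['Project License']]:

```r
# Stand-in for data[['Project License']] (values taken from head(data))
licenses <- c('lgpl', 'bsd', 'gpl', 'website', 'zlib', 'public')

# Lookup table from the answer: names are licenses, values are categories
classification <- c(gpl = 'Highly Restrictive',
                    lgpl = 'Restrictive',
                    public = 'Restrictive',
                    bsd = 'Permissive',
                    artistic = 'Permissive',
                    other = 'Unknown')

# Look up each license; names not present in the table give NA
mapped <- unname(classification[licenses])

# Assumption: anything without a mapping is treated as 'Unknown'
mapped[is.na(mapped)] <- 'Unknown'

# Fix the level order explicitly instead of relying on alphabetical order
restrictiveness <- factor(mapped,
                          levels = c('Highly Restrictive', 'Restrictive',
                                     'Permissive', 'Unknown'))
table(restrictiveness)
```

If you would rather keep unmapped licenses as missing values, skip the is.na() replacement step.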