r dataframe split tm

Split rownames from data frame


For a text mining project I have to investigate the development of a word list over time. For this, I need to split the rownames so that I have one column with the company name and one column with the year. This is an extract from my data frame:

                abs access allow analysis application approach base  big business challenge company
Adidas_2010.txt  13     25    26       11          41      132    1  266       13       115       1
Adidas_2011.txt   1      3     1        0           0        8    0   11        2        10       0
Adidas_2012.txt  29     35    37       22         110      181    7  384       31       136       3
Adidas_2013.txt  28     47    38       32         180      184    4  451       30       129       3
Adidas_2014.txt  12     42    38       27         159      207    6  921       32       128       6
Adidas_2016.txt  30     47    50       47         162      251    9 1061       32       171      13
Nike_2009.txt    16     15    17       12          33      177    9  346       93       196       1
Nike_2011.txt    10     30     0        3           0        0    0   81        7        31       0
Nike_2012.txt    21     22    12       57         199      300    7  214       11       107       3
Nike_2013.txt    20     32    30       11         123      321    4  331       90       239       3
Nike_2014.txt    33     43    30       33         119      137    6  441       67       318       6
Nike_2015.txt    51     42    41       27         102      151    9 1061       32       221      13

And this is my code:

dtm <- DocumentTermMatrix(corpus, control=list(dictionary = word_list))
df1 <- data.frame(as.matrix(dtm), row.names = filenames_annualreports) 

I've tried this:

 names_plus_year <- rownames(df1)
 names_plus_year_split <- strsplit(names_plus_year, "_")
 rownames(df1) <- sapply(names_plus_year_split, "[", 1)

But I receive the following error:

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed

Is there another way to split the rownames? Thanks a lot! :)


Solution

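  • Row names of a data frame must be unique, so assigning only the company names (several "Adidas", several "Nike") fails. Instead, you can strip the file extension, split the rownames at the underscore, bind the pieces rowwise into a matrix, and then bind that columnwise to your data frame, i.e.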

     cbind.data.frame(df, do.call(rbind, strsplit(sub('\\..*', '', rownames(df)), '_')))
    

    which gives,

                    abs access allow analysis application approach base  big business challenge company      1    2
    Adidas_2010.txt  13     25    26       11          41      132    1  266       13       115       1 Adidas 2010
    Adidas_2011.txt   1      3     1        0           0        8    0   11        2        10       0 Adidas 2011
    Adidas_2012.txt  29     35    37       22         110      181    7  384       31       136       3 Adidas 2012
    Adidas_2013.txt  28     47    38       32         180      184    4  451       30       129       3 Adidas 2013
    Adidas_2014.txt  12     42    38       27         159      207    6  921       32       128       6 Adidas 2014
    Adidas_2016.txt  30     47    50       47         162      251    9 1061       32       171      13 Adidas 2016
    Nike_2009.txt    16     15    17       12          33      177    9  346       93       196       1   Nike 2009
    Nike_2011.txt    10     30     0        3           0        0    0   81        7        31       0   Nike 2011
    Nike_2012.txt    21     22    12       57         199      300    7  214       11       107       3   Nike 2012
    Nike_2013.txt    20     32    30       11         123      321    4  331       90       239       3   Nike 2013
    Nike_2014.txt    33     43    30       33         119      137    6  441       67       318       6   Nike 2014
    Nike_2015.txt    51     42    41       27         102      151    9 1061       32       221      13   Nike 2015
    

    The two new columns are named 1 and 2 by default and hold character values; you can rename them as usual, e.g. with names().
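
    The new columns can also be created with explicit names instead of being renamed afterwards. The following is a minimal sketch, not part of the original answer: the column names company and year are my own choice, the small df is a stand-in for the real data, and it assumes that company names contain no underscore and that every file ends in .txt.

    ```r
    # Small stand-in for the data frame in the question (two term columns only).
    df <- data.frame(
      abs    = c(13L, 1L, 16L, 10L),
      access = c(25L, 3L, 15L, 30L),
      row.names = c("Adidas_2010.txt", "Adidas_2011.txt",
                    "Nike_2009.txt", "Nike_2011.txt")
    )

    # Drop the ".txt" extension, then split "Company_Year" at the underscore.
    # Assumption: the company part itself never contains "_".
    parts <- strsplit(sub("\\.txt$", "", rownames(df)), "_", fixed = TRUE)

    # Each element of `parts` is c(company, year); pick the pieces by position.
    df$company <- vapply(parts, `[`, character(1), 1)
    df$year    <- as.integer(vapply(parts, `[`, character(1), 2))

    # The row names are left untouched (still the unique file names),
    # so the duplicate row.names error cannot occur.
    df[order(df$company, df$year), c("company", "year", "abs", "access")]
    ```

    Storing the year as an integer rather than as text makes it straightforward to sort or plot the word counts over time for each company.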

    DATA

    dput(df)
    structure(list(abs = c(13L, 1L, 29L, 28L, 12L, 30L, 16L, 10L, 
    21L, 20L, 33L, 51L), access = c(25L, 3L, 35L, 47L, 42L, 47L, 
    15L, 30L, 22L, 32L, 43L, 42L), allow = c(26L, 1L, 37L, 38L, 38L, 
    50L, 17L, 0L, 12L, 30L, 30L, 41L), analysis = c(11L, 0L, 22L, 
    32L, 27L, 47L, 12L, 3L, 57L, 11L, 33L, 27L), application = c(41L, 
    0L, 110L, 180L, 159L, 162L, 33L, 0L, 199L, 123L, 119L, 102L), 
        approach = c(132L, 8L, 181L, 184L, 207L, 251L, 177L, 0L, 
        300L, 321L, 137L, 151L), base = c(1L, 0L, 7L, 4L, 6L, 9L, 
        9L, 0L, 7L, 4L, 6L, 9L), big = c(266L, 11L, 384L, 451L, 921L, 
        1061L, 346L, 81L, 214L, 331L, 441L, 1061L), business = c(13L, 
        2L, 31L, 30L, 32L, 32L, 93L, 7L, 11L, 90L, 67L, 32L), challenge = c(115L, 
        10L, 136L, 129L, 128L, 171L, 196L, 31L, 107L, 239L, 318L, 
        221L), company = c(1L, 0L, 3L, 3L, 6L, 13L, 1L, 0L, 3L, 3L, 
        6L, 13L)), row.names = c("Adidas_2010.txt", "Adidas_2011.txt", 
    "Adidas_2012.txt", "Adidas_2013.txt", "Adidas_2014.txt", "Adidas_2016.txt", 
    "Nike_2009.txt", "Nike_2011.txt", "Nike_2012.txt", "Nike_2013.txt", 
    "Nike_2014.txt", "Nike_2015.txt"), class = "data.frame")