Tags: r, text, numbers, data-cleaning, latin

Latin numbers (such as "xxv," "xxxv," "iii," and "ii") in R


How to convert all the Latin numbers (such as "xxv," "xxxv," "iii," and "ii") into numerical values in text data with R?

I need to convert all the Latin (Roman) numerals in my text data into numerical values. Is there a function in R that can convert them all at once?

In addition, if I replace them one by one, what happens to words that contain letters such as "ii" or "i"? For example, would the word "still" be changed into "st1ll"?


Solution

    txt <- 'How to convert all the Latin numbers (such as "xxv," "xxxv," "iii," and "ii") into numerical values in text data with R?
      
    I need to convert all the Latin numbers in a text data into numerical values. Is there any function in R can convert all the Latin numbers at once?
      
    In addition, when I replace one by one, what if I have some words contains letters like "ii", "i"? For example, would the world "still" be changed into "st1ll"?'
    

    First, build a vector of Roman numerals. Note that if this vector is too large, gregexpr will throw an error. I didn't test exactly where the limit is, but it is somewhere between 1e2 and 1e3 numerals.

    Exclude "I", since a standalone "I" is more likely to be the pronoun than a numeral. Then build the pattern and treat the task like any other string find/replace:

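    # Roman numerals for 1..100 as character strings, minus the ambiguous "I"
    rom <- as.character(as.roman(1:1e2))
    rom <- setdiff(rom, 'I')
    
    # One alternation of all numerals, anchored at word boundaries
    p <- sprintf('\\b(%s)\\b', paste0(na.omit(rom), collapse = '|'))
    m <- gregexpr(p, txt, ignore.case = TRUE)
    # Replace every match with its numeric value
    regmatches(txt, m) <- lapply(regmatches(txt, m), function(x) as.numeric(as.roman(x)))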
    
    cat(txt)
    
    # How to convert all the Latin numbers (such as "25," "35," "3," and "2") into numerical values in text data with R?
    #   
    # I need to convert all the Latin numbers in a text data into numerical values. Is there any function in R can convert all the Latin numbers at once?
    #   
    # In addition, when I replace one by one, what if I have some words contains letters like "2", "i"? For example, would the world "still" be changed into "st1ll"?
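
    On the second part of the question: the word boundaries (\\b) around the alternation are what keep a word such as "still" intact, because a numeral is only matched when it stands as a whole token. The following is a small check of that behaviour, rebuilding the same pattern. The words vector is made up for illustration. Note that standalone single-letter numerals other than "I" (for example "C") still match, so they may need to be excluded as well, depending on the text.

```r
# Same pattern construction as in the answer: numerals 1..100 minus "I".
rom <- setdiff(as.character(as.roman(1:100)), "I")
p <- sprintf("\\b(%s)\\b", paste0(rom, collapse = "|"))

# Illustrative inputs: words containing numeral-like letters, real numerals,
# and standalone single letters.
words <- c("still", "mix", "xxv", "ii", "I", "C")

# TRUE where the pattern would trigger a replacement.
hits <- grepl(p, words, ignore.case = TRUE)
print(setNames(hits, words))
```

    Here "still" and "mix" are left alone, "xxv" and "ii" are matched, "I" is skipped because it was removed from the vector, and a standalone "C" is matched as 100.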
    

    The same steps wrapped in a function and applied to a data frame column:

    dd <- data.frame(
      texts = rep(txt, 5)
    )
    
    rom_to_num <- function(text, rom = 1:1e2, exclude = 'I') {
      rom <- as.character(as.roman(rom))
      rom <- setdiff(rom, exclude)
      
      p <- sprintf('\\b(%s)\\b', paste0(na.omit(rom), collapse = '|'))
      m <- gregexpr(p, text, ignore.case = TRUE)
      regmatches(text, m) <- lapply(regmatches(text, m), function(x) as.numeric(as.roman(x)))
      
      text
    }
    
    dd <- within(dd, {
      texts_new <- rom_to_num(texts)
    })
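
    If the pattern-size limit mentioned earlier becomes a problem, one possible workaround is to avoid the long alternation altogether: match any token made only of numeral letters with a short pattern, then let as.roman decide which tokens are real numerals. This is a sketch of a different technique, not the method used in the answer above. The name rom_to_num2 and the max_value argument are hypothetical and introduced here only for illustration.

```r
# Hypothetical helper (not part of the original answer): find candidate tokens
# with a short character-class pattern, then keep only those that as.roman()
# parses, that round-trip to the same canonical numeral, that do not exceed
# max_value, and that are not listed in exclude.
rom_to_num2 <- function(text, max_value = 100L, exclude = "I") {
  m <- gregexpr("\\b[ivxlcdm]+\\b", text, ignore.case = TRUE)
  regmatches(text, m) <- lapply(regmatches(text, m), function(tok) {
    up  <- toupper(tok)
    # Invalid numerals give NA (with a warning, which is silenced here).
    val <- suppressWarnings(as.integer(as.roman(up)))
    ok  <- !is.na(val) & val <= max_value &
      as.character(as.roman(val)) == up & !(up %in% exclude)
    # Converted value where the token qualifies, original token otherwise.
    as.character(ifelse(ok, val, tok))
  })
  text
}

x <- c('Chapters "xxv" and ii of the mix', "I am still here, vol. XIV")
out <- rom_to_num2(x)
print(out)
```

    With the default max_value of 100, "mix" (which is also the valid numeral 1009) is left untouched. Raising max_value widens the range without growing the pattern, but it also admits more ordinary words, so those would have to go into exclude (in upper case, since tokens are compared after toupper).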