r, string, vector, numeric, spell-checking

Identify which of two vectors is numeric and which is strings in R (but more generally for other platforms as well)


I need to write a function that identifies which of two vectors it receives is (the most likely to be) the numeric vector and which is (most likely to be) the character vector.

The two vectors might look something like this:

vec1 <- c("2", "3", "14", "7")
vec2 <- c("Arctic tern", "Blue tit", "bald eagle", "Cassowary")

But this is intended for use by people who are not necessarily computer literate so it may get the odd...

vec1 <- c("2", "3", "fourteen", "7")

...instead, so it has to be flexible.

The text could be full sentences or single characters, and it may have digits mixed in, as in "2for1" or "world war 2", so this must be accounted for. That's why I'm looking for a function that picks what it thinks is the "most likely" numeric vector of the two.

Any ideas? I think the "Levenshtein distance" might be helpful but it's hard to say how. I'm working specifically in R but a general purpose algorithm / solution would be fine.

EDIT: The solution posed does not answer the question. Of course I am familiar with basic data formatting. The issue here is that there are two vectors, and I need an algorithm (however rough) that will guess which of the two is more likely to be numeric. The data that goes into it could be quite messy and might not fall neatly within the bounds of a numeric vector, and setting both vectors to "strings" is not an acceptable outcome. Please re-open my question.


Solution

  • Something like this:

    library(english)
    
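    foo <- function(...) {
      vecs <- list(...)
      stopifnot("input vectors must have identical lengths" = 
                 length(unique(lengths(vecs))) == 1L)
      # lookup table from number words ("one", "two", ...) to their values
      numwords <- setNames(1:100, as.character(english(1:100)))
      # replace spelled-out numbers with their values; keep everything else as is
      nums <- lapply(vecs,
                     function(x) ifelse(unname(is.na(numwords[x])), 
                            x, 
                            numwords[x])
      )
      # coerce to numeric; entries that are not numbers become NA
      nums <- suppressWarnings(lapply(nums, as.numeric))
      # the vector with the fewest NAs is the most likely numeric one
      # (on ties, which.min() returns the first)
      which.min(vapply(nums, \(x) sum(is.na(x)), integer(1)))
    }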
    
    vec1 <- c("2", "3", "14", "7")
    vec2 <- c("Arctic tern", "Blue tit", "bald eagle", "Cassowary")
    foo(vec1, vec2)
    #[1] 1
    
    vec3 <- c("apple", "orange", "three", "moon")
    foo(vec2, vec3)
    #[1] 2
    
    foo(vec1, vec2, vec3)
    #[1] 1
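
The question also mentions entries such as "2for1" or "world war 2". In the function above these simply count as non-numeric, which is usually what you want. If a softer score is useful, for example as a tie-breaker, one option is to compare the share of digit characters in each vector. The sketch below is only an illustration in base R: `digit_share()` and `most_numeric()` are hypothetical names, not part of any package, and spelled-out numbers like "fourteen" are not converted here.

```r
# Hypothetical helper: share of digit characters among the
# non-whitespace characters of each element
digit_share <- function(x) {
  chars    <- gsub("[[:space:]]", "", x)          # drop whitespace
  n_digits <- nchar(gsub("[^0-9]", "", chars))    # digits per element
  n_total  <- nchar(chars)                        # characters per element
  ifelse(n_total == 0, 0, n_digits / n_total)     # empty strings score 0
}

# Hypothetical helper: index of the vector with the highest mean digit share
most_numeric <- function(...) {
  which.max(vapply(list(...), function(x) mean(digit_share(x)), numeric(1)))
}

vec1 <- c("2", "3", "fourteen", "7")
vec2 <- c("2for1", "world war 2", "bald eagle", "Cassowary")
most_numeric(vec1, vec2)
#[1] 1
```

Here `vec1` has a mean digit share of 0.75, while `vec2` scores much lower because its digits are diluted by letters. On its own this score would rank a vector of spelled-out numbers as text, so it probably works best combined with the number-word lookup above rather than as a replacement for it.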