Search code examples
rdataframesortingnumeric

Is it possible to treat numbers with multiple decimal points as numeric to check for letter characters present in column and sort column in R?


I have a dataframe, where one column has numbers that has a mix of integer, numeric, and multiple decimal points (e.g., 3, 1.1, 1.2.1, 1.2.2.3). I have two questions:

  1. How would I check to make sure that only numbers are being used in the column (i.e., no characters are present)? If there are characters present, then which rows?
  2. Is there a way in R to treat these as numeric in order to sort the column?

Minimal Example

Data

library(dplyr)

df <-
  structure(list(
    index = 1:15,
    section = c("2.1.1", "2.1.1", "2.1.2.4", "2.1.3", "2.1.2.9", "2.1.4", 
                "2.1.4", "2.1.4", "3", "3", "4", "1.1", "1.5", "1.5", "b.1")),
  class = "data.frame",
  row.names = c(NA,-15L))

What I Have Tried

If I try to make the column numeric, then it coerces the numbers with multiple decimal points to NA.

> as.numeric(df$section)
 [1]  NA  NA  NA  NA  NA  NA  NA  NA 3.0 3.0 4.0 1.1 1.5 1.5

Warning message: NAs introduced by coercion

Then, for testing for characters in a column, I know that if I just had integers or regular numeric values that I could do this to test for which rows have characters (excluding NAs):

# Check for which (if any) rows have NAs.
na.index <- which(is.na(df$section))

# Find any rows that are character, excluding NAs (hence the setdiff).
index <- which(is.na(as.numeric(as.character(df$section)))) %>%
  setdiff(na.index)

# Output
index

[1]  1  2  3  4  5  6  7  8 15

Here, it is treating any number with multiple decimals as character (and the one with the letter). So, I would like to be able to treat the multiple decimal numbers as numeric and then flag b.1 as character. I am fine to create a new column for distinguishing these.

For sorting, it seems like base sort will arrange them properly, but unsure if this works all the time.

sort(df$section)

#Output
[1] "1.1"     "1.5"     "1.5"     "2.1.1"   "2.1.1"   "2.1.2.4" "2.1.2.9" "2.1.3"   
"2.1.4"   "2.1.4"   "2.1.4"   "3"       "3"       "4"       "b.1" 

Here is my expected output (taking into account sorting and checking for letter characters). When sorting, if the section numbers are the same, then they can be sorted by the index column (lower value sorted first).

Expected Output

   index section      type
1     12     1.1   numeric
2     13     1.5   numeric
3     14     1.5   numeric
4      1   2.1.1   numeric
5      2   2.1.1   numeric
6      3 2.1.2.4   numeric
7      5 2.1.2.9   numeric
8      4   2.1.3   numeric
9      6   2.1.4   numeric
10     7   2.1.4   numeric
11     8   2.1.4   numeric
12     9       3   numeric
13    10       3   numeric
14    11       4   numeric
15    15     b.1 character

I have seen some discussion on SO in other languages like in Java (here), but unsure how to handle it in R, especially since there are not the same number of decimals in each row.


Solution

  • We could use str_detect to check if there are any letters in the string. To sort the column use arrange:

    library(dplyr)
    library(stringr)
    df %>% 
        mutate(test = str_detect(section, "[a-zA-Z]")) %>% 
        arrange(section)
    

    output:

      index section  test
    1     12     1.1 FALSE
    2     13     1.5 FALSE
    3     14     1.5 FALSE
    4      1   2.1.1 FALSE
    5      2   2.1.1 FALSE
    6      3 2.1.2.4 FALSE
    7      5 2.1.2.9 FALSE
    8      4   2.1.3 FALSE
    9      6   2.1.4 FALSE
    10     7   2.1.4 FALSE
    11     8   2.1.4 FALSE
    12     9       3 FALSE
    13    10       3 FALSE
    14    11       4 FALSE
    15    15     b.1  TRUE