I have a dataframe, where one column has numbers that has a mix of integer, numeric, and multiple decimal points (e.g., 3, 1.1, 1.2.1, 1.2.2.3). I have two questions:
Minimal Example
Data
library(dplyr)
df <-
structure(list(
index = 1:15,
section = c("2.1.1", "2.1.1", "2.1.2.4", "2.1.3", "2.1.2.9", "2.1.4",
"2.1.4", "2.1.4", "3", "3", "4", "1.1", "1.5", "1.5", "b.1")),
class = "data.frame",
row.names = c(NA,-15L))
What I Have Tried
If I try to make the column numeric, then it coerces the numbers with multiple decimal points to NA
.
> as.numeric(df$section)
[1] NA NA NA NA NA NA NA NA 3.0 3.0 4.0 1.1 1.5 1.5
Warning message: NAs introduced by coercion
Then, for testing for characters in a column, I know that if I just had integers or regular numeric values that I could do this to test for which rows have characters (excluding NAs):
# Check for which (if any) rows have NAs.
na.index <- which(is.na(df$section))
# Find any rows that are character, excluding NAs (hence the setdiff).
index <- which(is.na(as.numeric(as.character(df$section)))) %>%
setdiff(na.index)
# Output
index
[1] 1 2 3 4 5 6 7 8 15
Here, it is treating any number with multiple decimals as character (and the one with the letter). So, I would like to be able to treat the multiple decimal numbers as numeric and then flag b.1
as character. I am fine to create a new column for distinguishing these.
For sorting, it seems like base sort
will arrange them properly, but unsure if this works all the time.
sort(df$section)
#Output
[1] "1.1" "1.5" "1.5" "2.1.1" "2.1.1" "2.1.2.4" "2.1.2.9" "2.1.3"
"2.1.4" "2.1.4" "2.1.4" "3" "3" "4" "b.1"
Here is my expected output (taking into account sorting and checking for letter characters). When sorting, if the section numbers are the same, then they can be sorted by the index column (lower value sorted first).
Expected Output
index section type
1 12 1.1 numeric
2 13 1.5 numeric
3 14 1.5 numeric
4 1 2.1.1 numeric
5 2 2.1.1 numeric
6 3 2.1.2.4 numeric
7 5 2.1.2.9 numeric
8 4 2.1.3 numeric
9 6 2.1.4 numeric
10 7 2.1.4 numeric
11 8 2.1.4 numeric
12 9 3 numeric
13 10 3 numeric
14 11 4 numeric
15 15 b.1 character
I have seen some discussion on SO in other languages like in Java (here), but unsure how to handle it in R, especially since there are not the same number of decimals in each row.
We could use str_detect
to check if there are any letters in the string. To sort the column use arrange
:
library(dplyr)
library(stringr)
df %>%
mutate(test = str_detect(section, "[a-zA-Z]")) %>%
arrange(section)
output:
index section test
1 12 1.1 FALSE
2 13 1.5 FALSE
3 14 1.5 FALSE
4 1 2.1.1 FALSE
5 2 2.1.1 FALSE
6 3 2.1.2.4 FALSE
7 5 2.1.2.9 FALSE
8 4 2.1.3 FALSE
9 6 2.1.4 FALSE
10 7 2.1.4 FALSE
11 8 2.1.4 FALSE
12 9 3 FALSE
13 10 3 FALSE
14 11 4 FALSE
15 15 b.1 TRUE