I am absolutely new to coding so please forgive me if this should be very easy to solve or to find - maybe it's so simple that nobody has bothered explaining so far or I just haven't been searching with the right keywords.
I have a column in my dataset that contains the letters f, n, i in all possible combinations. Now I want to find only those rows that contain either f or n, but not both of them. So that could be f, or fi, or n, or ni. Then I want to compare those two sets of rows to each other in a boxplot. So ideally I would have two boxes: one with all the data points belonging to group f, including fi, and one with all the data points belonging to group n, including ni.
Example of my dataset:
df <- data.frame(D = c("f", "f", "fi", "n", "ni", "ni", "fn", "fn"), y = c(1, 0.8, 1.1, 2.1, 0.9, 8.8, 1.7, 5.4))
   D   y
1  f 1.0
2  f 0.8
3 fi 1.1
4  n 2.1
5 ni 0.9
6 ni 8.8
7 fn 1.7
8 fn 5.4
Now what I want to get is this subset:
   D   y
1  f 1.0
2  f 0.8
3 fi 1.1
4  n 2.1
5 ni 0.9
6 ni 8.8
and then somehow have 1,2,3 and 4,5,6 in a group each, to plot in a boxplot.
So far I have only succeeded in getting a subset that has only entries with either f or n, but not fi, ni etc, which is not what I want, with this code:
df2<-df[df$D==c("f","n"),]
and in creating a subset that has all different groups with f and n:
df2 <- df[grepl("f", df$D) | grepl("n", df$D),]
I read about the "exclusive or" operator xor but when I try to use that like this:
df2 <- df[xor(match("n", df$D), match("f", df$D)),]
it just gives me a dataframe full of NAs. But even if that did work, I guess I would only be able to make a boxplot with four groups, f, n, fi and ni, where I want only two groups. So how can I get that code to work, and how do I go on from there?
I hope this is not too terrible for a first question! I am kind of bleary eyed after spending far too much time on this. Any help, about my problem, on where to look for the answer or on how to improve the question is very much appreciated!
We all cut our teeth on R at some point, so I'll try to construct an example for you that fits the question. How about:
# simulate a data.frame with "all possible combinations" of singles and pairs
df <- data.frame(txt = as.character(outer(c("i", "f", "n"), c("", "i", "f", "n"), paste0)),
                 stringsAsFactors = FALSE)
# create an empty factor variable to contain the result
df$has_only <- factor(rep(NA, nrow(df)), levels = 1:2, labels = c("f", "n"))
# assign "f" or "n" where txt contains exactly one of the two letters, not both
df$has_only[which(grepl("f", df$txt) & !grepl("f.*n|n.*f", df$txt))] <- "f"
df$has_only[which(grepl("n", df$txt) & !grepl("f.*n|n.*f", df$txt))] <- "n"
df
##    txt has_only
## 1    i     <NA>
## 2    f        f
## 3    n        n
## 4   ii     <NA>
## 5   fi        f
## 6   ni        n
## 7   if        f
## 8   ff        f
## 9   nf     <NA>
## 10  in        n
## 11  fn     <NA>
## 12  nn        n
plot(df$has_only)
Note that this is a bar plot, not a box plot, because plot() on a factor just counts the rows in each level. A box plot summarises the distribution of a continuous variable within each group, and you have not specified what the continuous values are or what they would look like. But if you did have such a variable, say df$myvalue, then you could produce a box plot with:
# simulate some continuous data
set.seed(50)
df$myvalue <- runif(nrow(df))
boxplot(myvalue ~ has_only, data = df)
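Applied to the example data from the question, the same idea might look like the sketch below. It uses xor() on two grepl() results instead of the regular expression above. xor() expects one TRUE/FALSE per row, which grepl() provides; match() returns only the position of the first exact match, which is probably why the attempt in the question did not give a per-row selection. The names has_f, has_n and the column group are made up for this illustration.

```r
# The example data from the question
df <- data.frame(D = c("f", "f", "fi", "n", "ni", "ni", "fn", "fn"),
                 y = c(1, 0.8, 1.1, 2.1, 0.9, 8.8, 1.7, 5.4),
                 stringsAsFactors = FALSE)

has_f <- grepl("f", df$D)  # TRUE for every row whose D contains an "f"
has_n <- grepl("n", df$D)  # TRUE for every row whose D contains an "n"

# xor() is TRUE where exactly one of the two vectors is TRUE,
# so rows containing both letters ("fn") are dropped
df2 <- df[xor(has_f, has_n), ]

# 'group' is a hypothetical column name: "f" covers f and fi, "n" covers n and ni
df2$group <- factor(ifelse(grepl("f", df2$D), "f", "n"), levels = c("f", "n"))

# one box per group, using the y column as the continuous variable
boxplot(y ~ group, data = df2)
```

With this data, df2 keeps rows 1 to 6, and the box plot shows two boxes: one for f and fi together, one for n and ni together.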