I have a file from which I want to extract the number that follows "segsites:" on each matching line, and then plot a histogram of those numbers in bins. I've written some code that checks whether a line contains "segsites:", extracts that line, and puts it in a data frame.
However, it's not doing what it's supposed to. It extracts some numbers, but they do not correspond to the values in the file. I've attached a screenshot to show what the file looks like. It's an example, not the actual file.
library(dplyr)
library(ggplot2)
txt <- readLines("file.msOut")
lns <- (data.frame((beg=which(grepl("segsites:",txt)))))
output <- cut(lns, breaks = seq(0,1000, by= 100), labels = c("<100","100-200","200-300","300-400","400-500",
"600-700","700-800,800-900","900-100"))
table(output) %>%
as.data.frame() %>%
ggplot(aes(x = output, y = Freq)) +
geom_col()
[Screenshot: sample data from txt]
Using regex, and supposing txt contains the data from the image:
txt <- c('segsites: 10','test')
as.numeric(gsub('\\D', '', grep('segsites\\:', txt, value = TRUE), perl = TRUE))
# [1] 10
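For completeness, here is a rough sketch of how the extracted values could be carried through to the binned plot. Note that which(grepl(...)) in the question returns the positions of the matching lines, not the numbers on them, which is probably why the values did not match the file. The sample vector below is made up for illustration (it stands in for readLines("file.msOut")), and the bin labels are generated from the breaks so that their count always matches the number of intervals; seq(0, 1000, by = 100) gives 11 breaks and therefore 10 bins. This assumes the segsites values are non-negative integers no larger than 1000.

```r
# Hypothetical sample lines standing in for readLines("file.msOut")
txt <- c("//", "segsites: 10", "positions: 0.1 0.2",
         "//", "segsites: 250", "segsites: 120", "test")

# Keep only the lines that start with "segsites:", then drop every non-digit
seg_lines <- grep("^segsites:", txt, value = TRUE)
segsites  <- as.numeric(gsub("\\D", "", seg_lines, perl = TRUE))
# segsites is now c(10, 250, 120)

# Build labels from the breaks: "0-100", "100-200", ..., "900-1000"
breaks <- seq(0, 1000, by = 100)
labels <- paste(head(breaks, -1), tail(breaks, -1), sep = "-")
binned <- cut(segsites, breaks = breaks, labels = labels, include.lowest = TRUE)

# One row per bin, with columns `bin` and `Freq`
counts <- as.data.frame(table(bin = binned))

# Build the plot only if ggplot2 is installed; print(p) draws it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(counts, ggplot2::aes(x = bin, y = Freq)) +
    ggplot2::geom_col()
}
```

cut() is applied to the numeric vector directly rather than to a data frame, since cut() expects a numeric vector.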