Extract specific lines and make a list of those in R


I have a file, from which I want to extract the number after segsites: and make a histogram with bins. I've written some code that checks if a line begins with the word "segsites", then extracts that line and puts it in a data frame.

However, it's not doing what it's supposed to. It extracts some numbers but they do not correspond to the values I have in the file. I've attached a screenshot to show what the file looks like. It's an example and not the actual file.

library(dplyr)
library(ggplot2)

txt <- readLines("file.msOut")

lns <- (data.frame((beg = which(grepl("segsites:", txt)))))

output <- cut(lns, breaks = seq(0, 1000, by = 100),
              labels = c("<100", "100-200", "200-300", "300-400", "400-500",
                         "600-700", "700-800,800-900", "900-100"))

table(output) %>% 
  as.data.frame() %>% 
  ggplot(aes(x = output, y = Freq)) + 
  geom_col()

Sample data from txt (screenshot)


Solution

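  • Using a regex, and supposing txt contains the data from the image. Note that which(grepl(...)) returns the positions of the matching lines, not the numbers in them; grep(..., value = TRUE) returns the lines themselves, and gsub then strips every non-digit character: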

    txt <- c('segsites: 10','test')
    as.numeric(gsub('\\D', '', grep('segsites\\:', txt, value = TRUE), perl = TRUE))
    # [1] 10
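
    To go from the extracted numbers to the binned plot the question asks for, something like the following might work. It is only a sketch: the contents of txt below are made up to stand in for readLines("file.msOut"), and the bin labels are generated from the breaks so that their count always matches the number of intervals (ten bins need ten labels). Note also that cut() expects a numeric vector, not a data frame.

```r
library(ggplot2)

# Hypothetical stand-in for txt <- readLines("file.msOut")
txt <- c("//", "segsites: 10", "positions: 0.1 0.2", "",
         "//", "segsites: 250", "positions: 0.3",
         "//", "segsites: 990")

# which(grepl(...)) gives the line positions of the matches, not the values
which(grepl("segsites:", txt))
# [1] 2 6 9

# Keep the matching lines themselves, drop every non-digit, convert to numeric
segsites <- as.numeric(gsub("\\D", "", grep("^segsites:", txt, value = TRUE),
                            perl = TRUE))

# Build one label per interval directly from the breaks
breaks <- seq(0, 1000, by = 100)
labels <- paste(head(breaks, -1), tail(breaks, -1), sep = "-")

# cut() needs a plain numeric vector; include.lowest keeps a value of 0
bins <- cut(segsites, breaks = breaks, labels = labels, include.lowest = TRUE)

# Count the values per bin and draw the bar chart
counts <- as.data.frame(table(bins))
p <- ggplot(counts, aes(x = bins, y = Freq)) +
  geom_col()
print(p)
```

    If the bins are only needed for the plot, passing the raw values to geom_histogram(binwidth = 100) would be a shorter alternative; the version above keeps the explicit cut() step from the question.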