r, parsing, text-parsing, rscript

Parsing a text file by a delimiter and outputting multiple files with R


I’m trying to break my server log into multiple files so I can run some metrics on each one. A cron job appends a marker line with a timestamp to the log on the first of every month; the line looks like ‘Monthly Breakpoint, March 1 2020’. The idea is to split the large log file into separate files at each marker line, then run the metrics on each file. I’m trying to write a script that creates these output files, but I’m struggling with it. So far I can read the file, loop through the lines, and find the delimiter, but I’m not sure what the best approach is. Maybe I shouldn't be using R and there's an easier way?

# server log
serverLog <- "server-out.log"

# Process file
conn <- file(serverLog, open = "r")
linn <- readLines(conn)
close(conn)
# seq_along() avoids the 1:0 trap when the file is empty
for (i in seq_along(linn)) {
  print(linn[i])
  test <- grepl("Monthly", linn[i])
  # print(paste("test: ", test, sep = ""))
  if (test) {
    print("Found Monthly Breakpoint")
  }
}

# Example of the server-out.log file 

[0mGET /notifications [36m304 [0m9.439 ms - -[0m
[0mGET /user/status [36m304 [0m2.137 ms - -[0m
[0mGET /user/status [36m304 [0m5.675 ms - -[0m
[0mPOST /user/login [32m200 [0m19.960 ms - 30[0m
[0mGET /user/status [36m304 [0m9.518 ms - -[0m
[0mGET /user/status [32m200 [0m2.364 ms - 16[0m
[0mGET /user/status [36m304 [0m1.396 ms - -[0m
[0mGET /user/status [36m304 [0m1.087 ms - -[0m
[0mPOST /user/login [32m200 [0m300.214 ms - 30[0m
[0mGET /user/status [36m304 [0m4.374 ms - -[0m
[0mGET /localUser [32m200 [0m2.260 ms - 1045[0m

 Monthly Breakpoint, March 1 2020

[0mGET /user/status [32m200 [0m5.284 ms - 16[0m
[0mGET /user/status [36m304 [0m2.101 ms - -[0m
[0mGET /users [32m200 [0m2.387 ms - 36[0m
[0mGET /notifications [32m200 [0m30.395 ms - 2624[0m
[0mGET /user/status [36m304 [0m2.172 ms - -[0m
[0mGET /user/status [36m304 [0m1.424 ms - -[0m
[0mGET /user/status [36m304 [0m2.074 ms - -[0m
[0mGET /user/status [36m304 [0m0.920 ms - -[0m
[0mGET /users [36m304 [0m2.471 ms - -[0m
[0mGET /notifications [36m304 [0m8.416 ms - -[0m
[0mGET /user/status [36m304 [0m1.757 ms - -[0m
[0mGET /user/status [36m304 [0m1.114 ms - -[0m
[0mGET /favicon.ico [33m404 [0m2.218 ms - 150[0m
[0mGET /user/status [36m304 [0m2.003 ms - -[0m
[0mPOST /user/login [32m200 [0m175.473 ms - 30[0m
[0mGET /user/status [36m304 [0m3.893 ms - -[0m
  • Update: I tried csplit because it sounds like a good fit for this problem, but I can't get it working either. Can you provide an example?
csplit -z server-out.min /Monthly/ '{*}'

csplit: illegal option -- z
usage: csplit [-ks] [-f prefix] [-n number] file args ...

Solution

  • This isn't the most elegant answer, but it got me what I needed. I'll try out the other answer too: it's a good idea to keep the data in my R environment so I can run all my metrics without reading in unnecessary files. Thanks @Till

    #~~~~~~~~~~~~~~~~~~~~~~#
    #~~ Parse Server Log ~~#
    #~~~~~~~~~~~~~~~~~~~~~~#
    
    # Read file
    serverLog <- "server-out.min"
    conn <- file(serverLog, open = "r")
    linn <- readLines(conn)
    close(conn)
    num <- 1

    # Loop through the lines
    for (i in seq_along(linn)) {
      # Current output file (append = TRUE below, so delete old output before re-running)
      outFile <- paste0("server-log-", num)
      # Write the line to the current file
      write(linn[i], file = outFile, append = TRUE)

      # Check for the monthly delimiter; the delimiter line itself stays at the
      # end of the file it closes, and the next line starts a new file
      if (grepl("Monthly", linn[i])) {
        print("Found Monthly Breakpoint")
        num <- num + 1
      }
    }
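
For keeping the data in the R environment, the same split can be done without a loop. This is only a minimal sketch, not necessarily what the other answer does: `split_log` is a hypothetical helper name, the sample lines are a made-up stand-in for `readLines("server-out.log")` with the colour codes left out, and the `.log` suffix and `tempdir()` output location are assumptions. Note that, unlike the loop above, each delimiter line here becomes the first line of the segment it opens.

```r
# Hypothetical helper: split log lines into segments at each delimiter line.
split_log <- function(lines, pattern = "Monthly Breakpoint") {
  is_break <- grepl(pattern, lines, fixed = TRUE)
  # cumsum() goes up by one at every delimiter, giving each line a segment id;
  # the delimiter line becomes the first line of the segment it opens.
  segment <- cumsum(is_break) + 1L
  unname(split(lines, segment))
}

# Made-up sample standing in for readLines("server-out.log")
sample_log <- c(
  "GET /notifications 304 9.439 ms - -",
  "POST /user/login 200 19.960 ms - 30",
  " Monthly Breakpoint, March 1 2020",
  "GET /user/status 200 5.284 ms - 16",
  "GET /users 200 2.387 ms - 36"
)

segments <- split_log(sample_log)

# Metrics can be computed per segment without touching the disk,
# e.g. the number of lines in each segment
lines_per_segment <- lengths(segments)

# Optionally write each segment to its own file (assumed naming scheme)
out_files <- file.path(tempdir(),
                       paste0("server-log-", seq_along(segments), ".log"))
for (k in seq_along(segments)) {
  writeLines(segments[[k]], out_files[k])
}
```

Because `segments` is a plain list of character vectors, something like `lapply(segments, function(s) sum(grepl("POST", s)))` runs a metric on every month in one call.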