
Parsing a file in R: Tracking events over time with no timestamp


I have a txt file which I parse in R to get some statistical information out of it. It looks like this:

**New Session**
Event A
Event B
Event B
Event C
Event A
Event C
...
**New Session**
...
**New Session**
...

What I need to do is track, for certain events, the session in which they occur. I would like to end up with a table like this:

Event A | Session 1
Event A | Session 1
Event A | Session 2
Event A | Session 3

I have no trouble with the parsing itself, but I have no idea how to connect the individual events to the session they happened in. There are no timestamps I could use.

One approach might be to split the file into individual text files, each containing one session. But I bet there is a way to count up the sessions while parsing the file for a certain event?

If I had to split it up: how do I make R parse all the files in sequence for a certain string?


Solution

  • It is not uncommon for different kinds of data to be mixed in one column of a data file. As long as the different kinds can be told apart in some way, e.g., by a regular expression, the contents of the rows can be moved to separate columns. Here, the packages data.table and zoo are used:

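    library(data.table)
    # dt is the one-column data.table (column V1) built under "Reproducible data"

    # 1. Number the session marker rows in order of appearance
    dt[V1 == "**New Session**", session := paste("Session", seq_len(.N))]
    # 2. Carry the last session label forward into the event rows;
    #    na.rm = FALSE keeps the vector length unchanged if the file
    #    does not start with a session marker
    dt[, session := zoo::na.locf(session, na.rm = FALSE)]
    # 3. Drop the marker rows, rename V1 to event, and sort
    dt[V1 != "**New Session**", .(event = V1, session)][order(event, session)]
    #      event   session
    # 1: Event A Session 1
    # 2: Event A Session 1
    # 3: Event A Session 2
    # 4: Event A Session 2
    # 5: Event A Session 3
    # 6: Event B Session 1
    # 7: Event B Session 1
    # ...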
    

    Explanation

    • First, the rows indicating the beginning of a new session are identified. Only in those rows is the column session filled with a string indicating the session number. Sessions are numbered consecutively as they appear in the source file. No timestamp is needed.
    • Next, all subsequent rows where the session column is empty (NA) are filled with the session number from the nearest row above (locf means last observation carried forward).
    • Finally, the rows that indicated the beginning of a new session are dropped, leaving only events in the first column. This column is renamed accordingly, and the whole data.table is ordered by event first and session number second.
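
The same three steps can also be sketched in base R without any packages. This is not the approach used above but an alternative: a running count of the marker rows (cumsum) stands in for the fill-forward step. The input vector lines is made-up sample data, and the names is_marker, session_id, events and event_a are chosen only for illustration.

```r
# Hypothetical input: one element per line of the log file
lines <- c("**New Session**", "Event A", "Event B", "Event A",
           "**New Session**", "Event A", "Event C",
           "**New Session**", "Event B", "Event A")

# Step 1: flag the rows that start a new session
is_marker <- lines == "**New Session**"

# Step 2: a running count of the markers gives every row its session
# number, which has the same effect as filling the NAs forward
session_id <- cumsum(is_marker)

# Step 3: drop the marker rows and keep only the events
events <- data.frame(
  event   = lines[!is_marker],
  session = paste("Session", session_id[!is_marker]),
  stringsAsFactors = FALSE
)

# Keep only the event of interest
event_a <- events[events$event == "Event A", ]
print(event_a, row.names = FALSE)
```

With a real file, lines could be obtained with readLines(); the rest of the sketch stays the same.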

    Reproducible data

    dt <- fread("**New Session**
                Event A
                Event B
                Event B
                Event C
                Event A
                Event C
                **New Session**
                Event A
                Event B
                Event B
                Event C
                Event A
                Event B
                **New Session**
                Event A
                Event B
                Event D
                Event D
                Event B
                Event C
                ", header = FALSE, sep = "\n")