Search code examples
rdataframetextdata-preprocessing

R: splitting a textfile into a workable dataframe


I am working with a text file containing localization data. Every 5 minutes there are multiple reports which can result into a calculated zone. If it resolves the zone it outputs a identified room ID (4260 and 4256 in the example):

[08/14/2021 05:05:59 600] - TagId: 4194912 Identified RoomId:4260
[08/14/2021 05:05:59 616] - TagId: 4194912 Last Monitorid:4195283
[08/14/2021 05:05:59 631] - TagId: 4194912 After RoomId:2199
[08/14/2021 05:05:59 631] - Localization RoomId: 2199
[08/14/2021 05:05:59 663] - TagId: 4194912 Reporting RoomId:2199
[08/14/2021 05:05:59 663] - MacId: F0_5C_19_C6_88_A4 RSSI: -72
[08/14/2021 05:05:59 678] - MacId: F0_5C_19_C7_86_54 RSSI: -82
[08/14/2021 05:05:59 678] - MacId: F0_5C_19_C6_89_3C RSSI: -45
[08/14/2021 05:05:59 694] - MacId: F0_5C_19_C6_88_22 RSSI: -80
[08/14/2021 05:05:59 709] - MacId: F0_5C_19_C6_88_12 RSSI: -60
[08/14/2021 05:05:59 709] - MacId: F0_5C_19_C6_88_A8 RSSI: -83
[08/14/2021 05:05:59 709] - MacId: F0_5C_19_C6_88_90 RSSI: -89
[08/14/2021 05:05:59 709] - MacId: F0_5C_19_C6_88_2E RSSI: -54
[08/14/2021 05:05:59 913] - MacId: 40_E3_D6_CA_56_5C RSSI: -92
[08/14/2021 05:05:59 913] - MacId: F0_5C_19_C6_88_52 RSSI: -92
[08/14/2021 05:05:59 928] - MacId: F0_5C_19_C6_88_B8 RSSI: -80
[08/14/2021 05:06:00 288] - MacId: F0_5C_19_C6_88_A4 RSSI: -72
[08/14/2021 05:06:00 288] - MacId: F0_5C_19_C7_86_54 RSSI: -82
[08/14/2021 05:06:00 288] - MacId: 40_E3_D6_CA_57_0A RSSI: -90
[08/14/2021 05:06:00 288] - MacId: F0_5C_19_C6_89_3C RSSI: -45
[08/14/2021 05:06:00 413] - MacId: F0_5C_19_C6_88_90 RSSI: -90
[08/14/2021 05:06:00 413] - MacId: F0_5C_19_C6_88_12 RSSI: -60
[08/14/2021 05:06:00 428] - MacId: F0_5C_19_C6_88_22 RSSI: -80
[08/14/2021 05:06:00 428] - MacId: F0_5C_19_C6_88_A8 RSSI: -83
[08/14/2021 05:06:00 428] - MacId: F0_5C_19_C6_88_2E RSSI: -55
[08/14/2021 05:11:00 974] - MacId: F0_5C_19_C6_88_A4 RSSI: -72
[08/14/2021 05:11:01 006] - TagId: 4194912 Identified RoomId:4256
[08/14/2021 05:11:01 021] - TagId: 4194912 Last Monitorid:4195283
[08/14/2021 05:11:01 037] - TagId: 4194912 After RoomId:2199
[08/14/2021 05:11:01 052] - Localization RoomId: 2199
[08/14/2021 05:11:01 084] - TagId: 4194912 Reporting RoomId:2199
[08/14/2021 05:11:01 084] - MacId: F0_5C_19_C7_86_54 RSSI: -83
[08/14/2021 05:11:01 084] - MacId: F0_5C_19_C6_88_78 RSSI: -90
[08/14/2021 05:11:01 099] - MacId: F0_5C_19_C6_89_3C RSSI: -45
[08/14/2021 05:11:01 349] - MacId: F0_5C_19_C6_88_12 RSSI: -60
[08/14/2021 05:11:01 349] - MacId: F0_5C_19_C6_88_2E RSSI: -55
[08/14/2021 05:11:01 349] - MacId: F0_5C_19_C6_88_A8 RSSI: -84
[08/14/2021 05:11:01 349] - MacId: F0_5C_19_C6_88_90 RSSI: -89
[08/14/2021 05:11:01 365] - MacId: F0_5C_19_C6_88_22 RSSI: -80
[08/14/2021 05:11:01 474] - MacId: 40_E3_D6_CA_56_5C RSSI: -93
[08/14/2021 05:11:01 490] - MacId: F0_5C_19_C6_88_52 RSSI: -90
[08/14/2021 05:11:01 490] - MacId: F0_5C_19_C6_88_BE RSSI: -89
[08/14/2021 05:11:01 802] - MacId: F0_5C_19_C6_88_A4 RSSI: -72
[08/14/2021 05:11:01 802] - MacId: 40_E3_D6_CA_57_0A RSSI: -90
[08/14/2021 05:11:01 802] - MacId: F0_5C_19_C6_89_3C RSSI: -45
[08/14/2021 05:11:01 802] - MacId: F0_5C_19_C6_88_78 RSSI: -89
[08/14/2021 05:11:01 802] - MacId: F0_5C_19_C7_86_54 RSSI: -82
[08/14/2021 05:11:02 006] - MacId: F0_5C_19_C6_88_90 RSSI: -89
[08/14/2021 05:11:02 006] - MacId: F0_5C_19_C6_88_22 RSSI: -80
[08/14/2021 05:11:02 021] - MacId: F0_5C_19_C6_88_A8 RSSI: -84
[08/14/2021 05:11:02 021] - MacId: F0_5C_19_C6_88_2E RSSI: -55
[08/14/2021 05:11:02 021] - MacId: F0_5C_19_C6_88_12 RSSI: -60
[08/14/2021 05:11:02 115] - MacId: F0_5C_19_C6_88_52 RSSI: -91
[08/14/2021 05:11:02 115] - MacId: F0_5C_19_C6_88_BE RSSI: -88

I would like to end up with data in the form of:

Image of desired dataframe

In case the RoomId is not resolved in the 5 minute timeframe (from the raw text file) the column RoomId can just be NA.

A very helpful member already showed how to split the columns in the right manner (R: Splitting a string into multiple columns)

So the main question is: How can I structure this raw text file in a way similar to the image where it is a workable dataframe although not all the lines from the raw text file are similar?


Solution

  • Assuming the data is stored in a text file called 'temp.txt', you can read it using readLines. Keep only the rows which has MacId and RSSI values and to get the RoomId keep the 'Identified RoomId' row as well. Split the data in sets and from each set extract Datetime, MacId and RSSI using the code from previous post and extract the room id by removing everything until RoomId. You can combine the output into one dataframe.

    data <- readLines('temp.txt')
    req_data <- grep('MacId.*RSSI|Identified RoomId', data, value = TRUE)
    result <- do.call(rbind, by(req_data, cumsum(grepl('Identified', req_data)), 
              function(x) {
      room_id <- sub('.*RoomId:\\s*', '', x[1])
      cbind(strcapture('\\[(.*)\\] - MacId: (.*) RSSI: (.*)', x[-1], 
                 proto = list(Datetime = character(), MacId = character(), 
                              RSSI = numeric())), RoomId = room_id)
    }))
    rownames(result) <- NULL
    

    For the text data shared I get output as -

    result
                      Datetime             MacId RSSI RoomId
    1  08/14/2021 05:05:59 663 F0_5C_19_C6_88_A4  -72   4260
    2  08/14/2021 05:05:59 678 F0_5C_19_C7_86_54  -82   4260
    3  08/14/2021 05:05:59 678 F0_5C_19_C6_89_3C  -45   4260
    4  08/14/2021 05:05:59 694 F0_5C_19_C6_88_22  -80   4260
    5  08/14/2021 05:05:59 709 F0_5C_19_C6_88_12  -60   4260
    6  08/14/2021 05:05:59 709 F0_5C_19_C6_88_A8  -83   4260
    7  08/14/2021 05:05:59 709 F0_5C_19_C6_88_90  -89   4260
    8  08/14/2021 05:05:59 709 F0_5C_19_C6_88_2E  -54   4260
    9  08/14/2021 05:05:59 913 40_E3_D6_CA_56_5C  -92   4260
    10 08/14/2021 05:05:59 913 F0_5C_19_C6_88_52  -92   4260
    11 08/14/2021 05:05:59 928 F0_5C_19_C6_88_B8  -80   4260
    12 08/14/2021 05:06:00 288 F0_5C_19_C6_88_A4  -72   4260
    13 08/14/2021 05:06:00 288 F0_5C_19_C7_86_54  -82   4260
    14 08/14/2021 05:06:00 288 40_E3_D6_CA_57_0A  -90   4260
    15 08/14/2021 05:06:00 288 F0_5C_19_C6_89_3C  -45   4260
    16 08/14/2021 05:06:00 413 F0_5C_19_C6_88_90  -90   4260
    17 08/14/2021 05:06:00 413 F0_5C_19_C6_88_12  -60   4260
    18 08/14/2021 05:06:00 428 F0_5C_19_C6_88_22  -80   4260
    19 08/14/2021 05:06:00 428 F0_5C_19_C6_88_A8  -83   4260
    20 08/14/2021 05:06:00 428 F0_5C_19_C6_88_2E  -55   4260
    21 08/14/2021 05:11:00 974 F0_5C_19_C6_88_A4  -72   4260
    22 08/14/2021 05:11:01 084 F0_5C_19_C7_86_54  -83   4256
    23 08/14/2021 05:11:01 084 F0_5C_19_C6_88_78  -90   4256
    24 08/14/2021 05:11:01 099 F0_5C_19_C6_89_3C  -45   4256
    25 08/14/2021 05:11:01 349 F0_5C_19_C6_88_12  -60   4256
    26 08/14/2021 05:11:01 349 F0_5C_19_C6_88_2E  -55   4256
    27 08/14/2021 05:11:01 349 F0_5C_19_C6_88_A8  -84   4256
    28 08/14/2021 05:11:01 349 F0_5C_19_C6_88_90  -89   4256
    29 08/14/2021 05:11:01 365 F0_5C_19_C6_88_22  -80   4256
    30 08/14/2021 05:11:01 474 40_E3_D6_CA_56_5C  -93   4256
    31 08/14/2021 05:11:01 490 F0_5C_19_C6_88_52  -90   4256
    32 08/14/2021 05:11:01 490 F0_5C_19_C6_88_BE  -89   4256
    33 08/14/2021 05:11:01 802 F0_5C_19_C6_88_A4  -72   4256
    34 08/14/2021 05:11:01 802 40_E3_D6_CA_57_0A  -90   4256
    35 08/14/2021 05:11:01 802 F0_5C_19_C6_89_3C  -45   4256
    36 08/14/2021 05:11:01 802 F0_5C_19_C6_88_78  -89   4256
    37 08/14/2021 05:11:01 802 F0_5C_19_C7_86_54  -82   4256
    38 08/14/2021 05:11:02 006 F0_5C_19_C6_88_90  -89   4256
    39 08/14/2021 05:11:02 006 F0_5C_19_C6_88_22  -80   4256
    40 08/14/2021 05:11:02 021 F0_5C_19_C6_88_A8  -84   4256
    41 08/14/2021 05:11:02 021 F0_5C_19_C6_88_2E  -55   4256
    42 08/14/2021 05:11:02 021 F0_5C_19_C6_88_12  -60   4256
    43 08/14/2021 05:11:02 115 F0_5C_19_C6_88_52  -91   4256
    44 08/14/2021 05:11:02 115 F0_5C_19_C6_88_BE  -88   4256