
How to manipulate long one-string text file from front to end and store variables in a list in R?


I have one text file which has around 1,000 sets of NAME and N as shown below (I picked only two sets for simplicity).

    NAME="2 B11101001",
     N=5049, 20016, 5163, 20081, 5161, 20431, 5023, 5219, 5221,
       5225, 5223, 5227, 5003, 5105, 20623, 5107, 5109, 5111, 5113, 5121, 5007
    NAME="1 A2110111 >",
     N=12034, 2195, 2197, 2199, 2201, 2109, 2032, 20295, 2203, 2205,
       2207, 2107, 2177, 20546, 11528, 20196, 2105, 21031, 11526,
       11011, 11013, 11512, 11225, 11227, 11229, 13169, 13171,
       13173, 11231, 21128, 11233, 10502, 10500, 10498, 10496,
       10494, 11912, 11778, 10492, 11946, 10490, 10488, 11802,
       10486, 11834, 10484, 11844, 10482, 10478, 11694, 11037,
       12087, 12965, 12957, 12953, 12089, 12091, 12481, 12549,
       12941, 12483, 12101, 12103, 12933, 11800, 12927, 11810,
       12923, 12105, 12111, 12113, 12731, 12739, 20806, 12745,
       12117, 12119, 12503, 10264, 11079, 10262, 12505, 12499,
       14431, 14423, 11649, 11677, 14421, 11081, 14461

I need to load this text file and convert it into list format which looks like this:

$ NAME
[1] 2 B11101001
[2] 1 A2110111 >
$ N
[1] 5049 20016 5163 20081 ... 
[2] 12034 2195 2197 ...

NAME identifies a set of N. N lists the nodes in sequential order.

I have another set of NAME values with other attributes in a data.frame, which looks like this:

NAME          FARE     FREQUENCY
2 B11101001   1000     10
1 A2110111 >  2000     5  

These will be merged with the loaded text file like this:

$ NAME
[1] 2 B11101001
[2] 1 A2110111 >
$ N
[1] 5049 20016 5163 20081 ... 
[2] 12034 2195 2197 ...
$FARE
[1] 1000
[2] 2000
$FREQUENCY
[1] 10
[2] 5

I think I can merge these two data sets; however, I have no idea how to load a text file that does not follow an ordinary comma-separated format.

Currently I load the text file with the readChar function, but I cannot find a way to convert the result into a list.


As explained, NAME is an identifier that marks the beginning of a set of N. N is a simple set of numbers delimited by commas (but their order must be kept). The next pair starts at the next NAME. Is there any way to implement this?

Your suggestions are highly appreciated.


Solution

  • One way is to read the text into R with the readLines() function.
    Then identify the rows beginning with "NAME=" and "N=", split the lines into one group per set, strip the unwanted parts from each line, and combine the pieces into the appropriate vectors.
    See the comments in the script for more information.

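    # Read the text file into R, one element per line
    text <- readLines("Stackquestion.txt")

    # Find the NAME rows, then strip the NAME=" prefix and the trailing ",
    namerows <- grep("NAME=", text, fixed = TRUE)
    namelist <- gsub('",', "", trimws(gsub('NAME="', "", text[namerows])))

    # Find the rows where each N block starts (anchored so only lines starting with N= match)
    Nrows <- grep("^[[:space:]]*N=", text)

    # Each N block ends on the line before the next NAME row;
    # the last block runs to the end of the file
    ranges <- c(namerows[-1] - 1, length(text))

    # Join the continuation lines of each N block into a single string
    Nlist <- vapply(seq_along(Nrows), function(i) {
      cleantext <- trimws(text[Nrows[i]:ranges[i]])
      cleantext <- sub("^N=", "", cleantext)
      paste(cleantext, collapse = "")
    }, character(1))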
    
    
    > namelist
    [1] "2 B11101001"  "1 A2110111 >"
    > Nlist
    [1] "5049, 20016, 5163, 20081, 5161, 20431, 5023, 5219, 5221,5225, 5223, 5227, 5003, 5105, 20623, 5107, 5109, 5111, 5113, 5121, 5007"
    [2] "12034, 2195, 2197, 2199, 2201, 2109, 2032, 20295, 2203, 2205,2207, 2107, 2177, 20546, 11528, 20196, 2105, 21031, 11526,11011, 11013, 11512, 11225, 11227, 11229, 13169, 13171,13173, 11231, 21128, 11233, 10502, 10500, 10498, 10496,10494, 11912, 11778, 10492, 11946, 10490, 10488, 11802,10486, 11834, 10484, 11844, 10482, 10478, 11694, 11037,12087, 12965, 12957, 12953, 12089, 12091, 12481, 12549,12941, 12483, 12101, 12103, 12933, 11800, 12927, 11810,12923, 12105, 12111, 12113, 12731, 12739, 20806, 12745,12117, 12119, 12503, 10264, 11079, 10262, 12505, 12499,14431, 14423, 11649, 11677, 14421, 11081, 14461"
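
Nlist above holds one comma-separated string per set, while the question asks for numeric vectors merged with the FARE and FREQUENCY attributes. A minimal sketch of that last step follows. It is an assumption about how the pieces could be combined, not part of the original answer: the object names `attrs`, `Nnum`, `idx` and `result` are made up for illustration, and `namelist` and `Nlist` are re-created here with only the first five nodes of each set so that the block runs on its own.

```r
# Shortened stand-ins for the parsed results (first five nodes of each set only)
namelist <- c("2 B11101001", "1 A2110111 >")
Nlist <- c("5049, 20016, 5163, 20081,5161",
           "12034, 2195, 2197, 2199,2201")

# Split each string on commas and convert to numeric; node order is preserved
Nnum <- lapply(strsplit(Nlist, ","), function(x) as.numeric(trimws(x)))

# Attribute table from the question (hypothetical object name `attrs`),
# deliberately listed in a different row order than namelist
attrs <- data.frame(
  NAME      = c("1 A2110111 >", "2 B11101001"),
  FARE      = c(2000, 1000),
  FREQUENCY = c(5, 10),
  stringsAsFactors = FALSE
)

# Align the attribute rows to the parsed names; unmatched names give NA
idx <- match(namelist, attrs$NAME)

result <- list(
  NAME      = namelist,
  N         = Nnum,
  FARE      = attrs$FARE[idx],
  FREQUENCY = attrs$FREQUENCY[idx]
)

str(result)
```

Using match() instead of relying on row position keeps each FARE and FREQUENCY value attached to the right NAME even when the data.frame is sorted differently from the text file.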