I have one text file which has around 1,000 sets of NAME and N, as shown below (I picked only two sets for simplicity).
NAME="2 B11101001",
N=5049, 20016, 5163, 20081, 5161, 20431, 5023, 5219, 5221,
5225, 5223, 5227, 5003, 5105, 20623, 5107, 5109, 5111, 5113, 5121, 5007
NAME="1 A2110111 >",
N=12034, 2195, 2197, 2199, 2201, 2109, 2032, 20295, 2203, 2205,
2207, 2107, 2177, 20546, 11528, 20196, 2105, 21031, 11526,
11011, 11013, 11512, 11225, 11227, 11229, 13169, 13171,
13173, 11231, 21128, 11233, 10502, 10500, 10498, 10496,
10494, 11912, 11778, 10492, 11946, 10490, 10488, 11802,
10486, 11834, 10484, 11844, 10482, 10478, 11694, 11037,
12087, 12965, 12957, 12953, 12089, 12091, 12481, 12549,
12941, 12483, 12101, 12103, 12933, 11800, 12927, 11810,
12923, 12105, 12111, 12113, 12731, 12739, 20806, 12745,
12117, 12119, 12503, 10264, 11079, 10262, 12505, 12499,
14431, 14423, 11649, 11677, 14421, 11081, 14461
I need to load this text file and convert it into a list that looks like this:
$ NAME
[1] 2 B11101001
[2] 1 A2110111 >
$ N
[1] 5049 20016 5163 20081 ...
[2] 12034 2195 2197 ...
NAME is an identifier of a set of N. N indicates the sequential order of nodes.
I have another set of NAME values and other attributes in a data.frame, which looks like this:
NAME          FARE  FREQUENCY
2 B11101001   1000  10
1 A2110111 >  2000  5
These will be merged with the loaded text file like this:
$ NAME
[1] 2 B11101001
[2] 1 A2110111 >
$ N
[1] 5049 20016 5163 20081 ...
[2] 12034 2195 2197 ...
$FARE
[1] 1000
[2] 2000
$FREQUENCY
[1] 10
[2] 5
I think I can merge these two data sets; however, I have no idea how to load a text file that does not follow an ordinary comma-separated format.
Currently I load the text file using the readChar function, but I cannot find a way to convert it into a list.
As explained, NAME is an identifier that marks the beginning of a set of N. N is a simple set of numbers delimited by commas (their order must be kept). The next pair starts at the next NAME. Is there any way to implement this?
Your suggestions are highly appreciated.
One way is to use the readLines() function to read the text into R.
Then identify the rows beginning with "NAME=" and "N=", separate the lines into groups, remove the unwanted parts from each line, and combine the results into the appropriate vectors.
See the comments in the script for more information.
# read the text into R, one element per line
text <- readLines("Stackquestion.txt")
# find the rows containing NAME= and strip the prefix, closing quote and comma
namerows <- grep("NAME=", text)
namelist <- gsub('",', "", trimws(gsub('NAME="', "", text[namerows])))
# find the rows where each N= block starts
Nrows <- grep("N=", text)
# each block ends on the line before the next NAME= row (or at the end of the file)
ranges <- c(namerows[-1] - 1, length(text))
# remove the line breaks within each N= block and combine into one string
Nlist <- sapply(seq_along(Nrows), function(i) {
  cleantext <- trimws(text[Nrows[i]:ranges[i]])
  cleantext <- gsub("N=", "", cleantext)
  paste(cleantext, collapse = "")
})
> namelist
[1] "2 B11101001" "1 A2110111 >"
> Nlist
[1] "5049, 20016, 5163, 20081, 5161, 20431, 5023, 5219, 5221,5225, 5223, 5227, 5003, 5105, 20623, 5107, 5109, 5111, 5113, 5121, 5007"
[2] "12034, 2195, 2197, 2199, 2201, 2109, 2032, 20295, 2203, 2205,2207, 2107, 2177, 20546, 11528, 20196, 2105, 21031, 11526,11011, 11013, 11512, 11225, 11227, 11229, 13169, 13171,13173, 11231, 21128, 11233, 10502, 10500, 10498, 10496,10494, 11912, 11778, 10492, 11946, 10490, 10488, 11802,10486, 11834, 10484, 11844, 10482, 10478, 11694, 11037,12087, 12965, 12957, 12953, 12089, 12091, 12481, 12549,12941, 12483, 12101, 12103, 12933, 11800, 12927, 11810,12923, 12105, 12111, 12113, 12731, 12739, 20806, 12745,12117, 12119, 12503, 10264, 11079, 10262, 12505, 12499,14431, 14423, 11649, 11677, 14421, 11081, 14461"
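Note that Nlist holds one comma-separated string per set, not numbers. To get closer to the list format in the question, one possible follow-up is to split each string and look up the extra attributes by NAME. This is only a sketch: the two input vectors are shortened stand-ins for the real namelist and Nlist, and attrs, Nvalues, idx and result are names I made up, with attrs mimicking the data.frame from the question.

```r
# Shortened stand-ins for the namelist and Nlist vectors built by the script
namelist <- c("2 B11101001", "1 A2110111 >")
Nlist <- c("5049, 20016, 5163,20081", "12034, 2195,2197")

# Split each string on commas (with optional surrounding spaces) and
# convert to integers; the order of the nodes is preserved
Nvalues <- lapply(strsplit(Nlist, "\\s*,\\s*"), as.integer)

# Hypothetical attribute table, modelled on the example in the question
# (rows deliberately in a different order than namelist)
attrs <- data.frame(
  NAME = c("1 A2110111 >", "2 B11101001"),
  FARE = c(2000, 1000),
  FREQUENCY = c(5, 10),
  stringsAsFactors = FALSE
)

# Position of each parsed NAME in the attribute table (NA if it is missing)
idx <- match(namelist, attrs$NAME)

# Assemble the combined list, aligned on the order of namelist
result <- list(
  NAME = namelist,
  N = Nvalues,
  FARE = attrs$FARE[idx],
  FREQUENCY = attrs$FREQUENCY[idx]
)
str(result)
```

Using match() rather than merge() keeps the sets in the order they appear in the text file, and any NAME that is absent from the attribute table simply gets NA for FARE and FREQUENCY.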