I have a problem when I try to retrieve data from a JSON structure. I can see a pattern but I don't see a clear way of doing it at the same time general and efficient.
The data I am using is the one corresponding to TMDB Box Office Prediction from a challenge from Kaggle and among the many columns it has, some of them are formated in a JSON-ish way. Because I don't see any value on most of the data you can extract from it, I actually wanted to get just a list of names or name of each of this JSON since each JSON column has an attribute that is named "name".
I give some examples so that we all can see a bit what I mean:
[{'id': 34055, 'name': 'Pokémon Collection', 'poster_path': '/j5te0YNZAMXDBnsqTUDKIBEt8iu.jpg', 'backdrop_path': '/iGoYKA0TFfgSoZpG2u5viTJMGfK.jpg'}]
[{'id': 12, 'name': 'Adventure'}, {'id': 16, 'name': 'Animation'}, {'id': 10751, 'name': 'Family'}, {'id': 14, 'name': 'Fantasy'}]
[{'name': 'Woolner Brothers Pictures Inc.', 'id': 9233}]
The code I did which I wanted to be as generic as possible, looks like the following which chooses the columns I am interested in and expects to retrieve the name and give a list of the names can be find.
#trainSet is the data from the train.csv from the challenge
trainingNames <- as.data.frame(names(trainSet))
trainingNames1 <- trainingNames[c(2,4,12,13,16,20),]
for (column in trainingNames1){
for (i in 1:length(trainSet[[column]])) {
keywords <- str_extract_all(toString(trainSet[[column]][[i]]), "'name': '(\\s*\\w+)+'")[[1]]
keywords <- gsub("'", "", keywords)
trainSet[[column]][i] <- list(gsub("'name':", "", keywords))
#print(trainSet[[column]][i] )
}
}
This regex if I test in under a certain case, it works but because it has the wildcards of + I understand that is iterating way too much over the JSON and that mainly is doing way too many combinations. Prior to this regex, I had this another regex but the main problem is that it was not considering the cases with "phrases" so anything that would be more than one word, would not read it. (e.g name : The Lord of the Rings)
I've seen a guy that does the following but it is too me too specific and I wanted to save time on not to programatically give the patterns that the data is providing, so even it is a good solution, I do not like it because it is attached to a particular column:
train$collection_name <- str_extract(train$belongs_to_collection, "(?<=name\\'\\:\\s{1}\\').+(?=\\'\\,\\s{1}\\'poster)")
So what I would like to know, is if it is possible to do it in a similar way I am thinking but considering that for instance, has to end with " ' " or another pattern that is not the next JSON element.
Maybe I am also approaching it in a wrong way so I appreciate any kind of comment/orientation you provide me.
Thanks!
If you need the group, then:
'name': '(\\s*[^\\s']*)*'
else
'name': '([^']*)'
The first regex would be faster since it matches spaces up to the next non space and non quote. If you don't need the grouping in the way it had it, you can use the second regex. This matches from the first quote till the next quote, matching all non quotes.