Text-mining/word correlation in R


I'm trying to make text mining or rather word correlation work in R.

The bigger picture: I query the entire exported OpenStreetMap database for all features that lie within a specific distance of various longitude-latitude locations. So far this works like a charm, and I now have a data frame column of type character in which each row represents one longitude-latitude location and contains all features within that distance. The data frame column can be found in this csv, and a catalogue of all possible features can be found in this csv.

My next step would now be to categorise the locations depending on their surrounding features. To do this, I would like to use a text mining/word correlation algorithm that is able to create categories based on features that are often present at the same locations.

So in short: I have a column of type character (words separated by commas) where each row contains all features in the vicinity of one longitude-latitude location. I would like to categorise my locations based on which of those surrounding features tend to occur together.
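To make the goal concrete, here is a minimal base R sketch of one possible approach, not a definitive method. The sample data, the column name `features`, and the choice of two clusters are all assumptions made for illustration. It splits the comma-separated column into a binary location-by-feature matrix, correlates the feature columns, and groups the locations by hierarchical clustering on Jaccard distance.

```r
# Hypothetical sample data: one row per location, features separated by commas
locations <- data.frame(
  id = 1:4,
  features = c("restaurant,cafe,bar",
               "restaurant,cafe,bank",
               "school,playground,park",
               "school,park,bar"),
  stringsAsFactors = FALSE
)

# Split every row into its individual features and trim stray whitespace
feature_list <- lapply(strsplit(locations$features, ",", fixed = TRUE), trimws)

# All distinct features that occur anywhere in the column
vocab <- sort(unique(unlist(feature_list)))

# Binary matrix: one row per location, one column per feature (1 = present)
presence <- t(vapply(feature_list,
                     function(f) as.integer(vocab %in% f),
                     integer(length(vocab))))
dimnames(presence) <- list(as.character(locations$id), vocab)

# Pairwise correlation between features across locations
feature_cor <- cor(presence)

# Group locations: Jaccard distance on the binary rows, then average linkage.
# k = 2 is an arbitrary choice for this toy data.
location_dist <- dist(presence, method = "binary")
clusters <- cutree(hclust(location_dist, method = "average"), k = 2)

print(round(feature_cor, 2))
print(clusters)
```

Everything here is vectorised over the whole column, so no row has to be handled by hand. A feature that appears at every location (or at none) has zero variance, and `cor` returns `NA` for it, so such columns may need to be dropped first.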

I have tried findAssocs from the tm package, which unfortunately doesn't accept objects of type list, data.frame or character. I have also found this wonderful documentation that guides through basic text mining in R. The problem here is that it seems I would have to convert each row of my data frame column into a document to prepare a corpus for further processing. While this might be feasible for my test case of 61 locations, it won't be for my final analysis of several tens of thousands of locations.
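For reference, the per-row conversion does not have to be done by hand: `VectorSource` in tm treats every element of a character vector as one document, so the whole column can be turned into a corpus in a single call. The sketch below assumes the tm package is installed, uses made-up sample data, and assumes that feature names contain no spaces (the commas are replaced by spaces so that each feature becomes one token).

```r
library(tm)

# Hypothetical sample data: one string of comma-separated features per location
features <- c("restaurant,cafe,bar",
              "restaurant,cafe,bank",
              "school,playground,park",
              "school,park,bar")

# Replace the commas with spaces so each feature is a whitespace-delimited token
docs <- gsub("\\s*,\\s*", " ", features)

# VectorSource turns each vector element into one document of the corpus
corpus <- VCorpus(VectorSource(docs))

# findAssocs needs a document-term matrix, not a list, data.frame or character.
# wordLengths = c(1, Inf) keeps short feature names that the default would drop.
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Features whose presence correlates with "restaurant" at 0.5 or above
assoc <- findAssocs(dtm, "restaurant", corlimit = 0.5)
print(assoc)
```

The same three calls apply unchanged to a vector of tens of thousands of rows; only the construction of `features` (for example `df$features`, a hypothetical column name) would differ.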

Can anyone prod me in the right direction here? Preferably without relying on third-party software like RapidMiner. Having everything in one R script would be a lot better for my use case.

Thank you in advance. If you require any additional information, please let me know.


Solution

  • I have found a step-by-step guide to convert data from my format to one that can be used for text mining. The guide can be found here. This does answer my problem for now. I do apologise for the post.