I am new to NLP and related technologies. I have been researching how to decompose folksonomy terms such as hashtags into their individual words (e.g. #harrypotterworld as harry potter world) in order to carry out Named-Entity Recognition.
However, I have not come across any available library or previous work I could use for this. Is this achievable, or am I going about it the wrong way? If it is achievable, are there any available libraries or algorithmic techniques I could use?
What you are looking for is a compound splitter. As far as I know, there are a few implementations of this, some of which work reasonably well.
Unfortunately, most research I know of has been done on languages that tend to compound nouns (e.g. German). Fun fact: "hashtag" is a compound word itself.
I once used this one: http://ilps.science.uva.nl/resources/compound-splitter-nl/ It is an algorithm that works on Dutch. It basically uses a dictionary of uncompounded words and assumes a very simple grammar for compounding, along the lines of: infixes such as n and s are allowed, and a compound is always a combination of two or more uncompounded words from the dictionary.
I think you could use that implementation for compounded hashtags if you provided an English dictionary and adapted the assumed grammar somewhat (you might not want infixes, for example).
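To illustrate the core idea (this is a toy sketch of dictionary-based segmentation, not the linked Dutch splitter), here is a dynamic-programming split of a hashtag into dictionary words. The tiny word set is made up for the example; in practice you would load a real English word list:

```python
def split_hashtag(tag, dictionary):
    """Return a list of dictionary words covering `tag`, or None if no split exists."""
    tag = tag.lstrip("#").lower()
    n = len(tag)
    # best[i] holds some segmentation of tag[:i], or None if unreachable
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and tag[j:i] in dictionary:
                best[i] = best[j] + [tag[j:i]]
                break
    return best[n]

# Toy dictionary for illustration only
words = {"harry", "potter", "world"}
print(split_hashtag("#harrypotterworld", words))  # ['harry', 'potter', 'world']
```

With a full English dictionary you would also need a way to rank competing splits (e.g. by word frequency), since real hashtags often have more than one valid segmentation.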