Tags: algorithm, text, grammar, lexical-analysis, linguistics

Finding language patterns in random text


I've written a script that generates possible Twitter handles and checks them for availability. It simply iterates through combinations of the allowed symbols: a-z, 0-9 and _. So far it has checked 1926220 combinations, i.e. every one containing 1-5 symbols. Brief results: 0 free accounts for 1, 2 and 3 symbols, 750 free for 4, and 442711 for 5.
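
A minimal sketch of that enumeration, assuming the alphabet is exactly a-z, 0-9 and the underscore. The availability check is left out, and the names `candidate_handles` and `candidate_count` are illustrative, not taken from the original script:

```python
import itertools
import string

# Allowed symbols as listed in the question: a-z, 0-9 and "_" (37 in total).
ALPHABET = string.ascii_lowercase + string.digits + "_"


def candidate_handles(max_length):
    """Yield every string of length 1..max_length over ALPHABET, shortest first."""
    for length in range(1, max_length + 1):
        for symbols in itertools.product(ALPHABET, repeat=length):
            yield "".join(symbols)


def candidate_count(max_length):
    """Count the candidates without enumerating them: 37 + 37**2 + ..."""
    return sum(len(ALPHABET) ** length for length in range(1, max_length + 1))
```

Because `candidate_handles` is a generator, the candidates can be fed to an availability check one at a time without holding the whole list in memory.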

I'm wondering whether it is possible to write an algorithm that analyzes these lists and finds human-readable words among them. Here is an example:

elnsv
elnt8
eloq4
elosu
elq0_
elq15
elq46

The word elosu stands out from the others, and it turns out there is even a town in Spain called Elosu. How do humans distinguish such words? I think I could build a dictionary of syllables from different languages and compare the candidates against it. Can you help me with a formula or with other ideas?
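
One way the syllable-dictionary idea could be sketched: accept a candidate only if it can be split entirely into known syllables. The `SYLLABLES` set below is a tiny hypothetical inventory made up for illustration; a real one would have to be collected from dictionaries of several languages, and the function name is likewise an assumption:

```python
# Hypothetical, tiny syllable inventory for illustration only; a real one
# would be collected from dictionaries of several languages.
SYLLABLES = {"e", "el", "lo", "su", "o", "na", "ta", "ri"}


def can_split_into_syllables(word, syllables=SYLLABLES):
    """Return True if word can be written as a sequence of known syllables."""
    # reachable[i] is True when the prefix word[:i] can be segmented.
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            if reachable[start] and word[start:end] in syllables:
                reachable[end] = True
                break
    return reachable[len(word)]


sample = ["elnsv", "elnt8", "eloq4", "elosu", "elq0_", "elq15", "elq46"]
readable = [w for w in sample if can_split_into_syllables(w)]
```

With this toy inventory only elosu survives from the sample list, but the result depends entirely on which syllables the dictionary contains.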

Update: for those who want to try implementing it, here is the link to the 5-symbol handles.


Solution

  • I'd try to use the wisdom of the crowd to solve this.

    1. Google shows an approximate number of pages matching a query. For me, the query elnsv from your example (ignoring the "did you mean" suggestion) gives ~60k results, elq0_ gives ~23k, and the "real" word elosu gives ~330k. This is a strong indication that elosu is more likely to be meaningful than the others. So the approach is: query a search engine and use its result counts to decide what is meaningful and what isn't.

    2. The word elosu has a Wikipedia article; it is not the Elosu you meant, but it still helps. Note that the Wikipedia approach is very accurate at confirming that a term is a meaningful word, but it is unreliable for eliminating terms, since a missing article does not prove a term is meaningless. So I'd use it as the first-level 'judge' in a pipeline and feed the rest to other judges.
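
The two judges above could be chained roughly as follows. This is only a sketch: the lookups are stubbed with the approximate counts quoted in this answer instead of calling a real search engine or Wikipedia, and `HIT_THRESHOLD` and all function names are assumptions chosen for the example:

```python
# Approximate hit counts quoted in this answer; a real run would query a
# search engine instead of reading from this stub.
STUB_HIT_COUNTS = {"elnsv": 60_000, "elq0_": 23_000, "elosu": 330_000}

# Stub for "has a Wikipedia article"; a real run would look the title up.
STUB_WIKIPEDIA_TITLES = {"elosu"}

# Assumed cut-off, chosen only so that it separates the three samples.
HIT_THRESHOLD = 100_000


def wikipedia_judge(term):
    """Return True when an article exists, None when this judge cannot decide."""
    return True if term in STUB_WIKIPEDIA_TITLES else None


def search_hits_judge(term):
    """Return True or False by comparing the hit count with HIT_THRESHOLD."""
    return STUB_HIT_COUNTS.get(term, 0) >= HIT_THRESHOLD


def is_meaningful(term, judges=(wikipedia_judge, search_hits_judge)):
    """Ask each judge in order and return the first definite verdict."""
    for judge in judges:
        verdict = judge(term)
        if verdict is not None:
            return verdict
    return False
```

The Wikipedia judge returns None rather than False when it finds nothing, which reflects the point that it can confirm a term but not eliminate one; undecided terms simply fall through to the next judge.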