Tags: c++, artificial-intelligence, machine-learning, linguistics, nlp

machine-learning, artificial-intelligence and computational-linguistics


I would love to hear from people with experience in machine learning, computational linguistics, or artificial intelligence in general, using the following example as a starting point:

• Which existing software would you use for a manageable attempt at building something like Google Translate with statistical linguistics and machine learning? (Don't get me wrong: I don't want to actually build this. I am only trying to sketch a conceptual framework for one of the most complex problems in this field. What would you consider if you had the chance to lead a team that set out to realize such a system?)

• Which existing database(s)? Which database technology would you use to store the results when they amount to terabytes of data?

• Which programming languages besides C++?

• Apache Mahout?

• And how would those software components work together to power the effort as a whole?


Solution

  • The best techniques available for automated translation are based on statistical methods. In computer science this is known as statistical machine translation (SMT). The core idea, called the noisy-channel model, is to treat the text to be translated as a corrupted signal and to use error correction to recover the original. For example, suppose you are translating English to French. Assume the English sentence was originally French but was garbled into English along the way; your job is to restore the French. To do that you build two statistical models: a language model for the target language (French) and a model of the errors. Errors can include dropped words, moved words, misspelled words, and added words.

    More can be found at http://www.statmt.org/

    Regarding the database: an MT system does not need a typical database. Everything should be done in memory.

    The best language to use for this specific task is the fastest one. C would be ideal for this problem because it is fast and gives you direct control over memory access. But any high-level language could be used, such as Perl, C#, Java, or Python.
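
To make the noisy-channel idea above concrete, here is a minimal sketch in Python. It is not how a production system works: the corpus, the translation table, the candidate list, and every function name (`train_bigram_lm`, `lm_logprob`, `channel_logprob`, `decode`) are hypothetical and chosen only for illustration. It assumes a bigram language model with add-one smoothing for French and an order-insensitive word translation table as the "error" model, and it picks the candidate French sentence with the highest combined score. Both models live in plain in-memory dictionaries, in line with the point about not needing a typical database.

```python
import math
from collections import Counter

# Toy French corpus for the target-language model (hypothetical data).
FRENCH_CORPUS = [
    "le chat noir dort",
    "le chat mange",
    "le chien noir mange",
    "le chien dort",
]

# Toy "error" model: probability of an English word given a French word.
# Hypothetical values; a real table is estimated from parallel text.
TRANSLATION_TABLE = {
    "le": {"the": 1.0},
    "chat": {"cat": 1.0},
    "chien": {"dog": 1.0},
    "noir": {"black": 1.0},
    "dort": {"sleeps": 1.0},
}

# Candidate French sentences. A real decoder searches for these itself;
# here they are supplied by hand to keep the sketch short.
CANDIDATES = [
    "le noir chat dort",   # word-for-word order (moved word not repaired)
    "le chat noir dort",   # natural French word order
    "le chien noir dort",  # fluent French, but the wrong animal
]


def train_bigram_lm(corpus):
    """Count bigrams and their contexts; everything stays in memory."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)


def lm_logprob(sentence, lm):
    """Log-probability of a French sentence, with add-one smoothing."""
    bigrams, contexts, vocab_size = lm
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
        for prev, word in zip(tokens, tokens[1:])
    )


def channel_logprob(english, french, table, floor=1e-6):
    """Log-probability that the French sentence was 'garbled' into the
    English one. Word order is ignored, so moved words cost nothing here;
    unseen word pairs get a small floor probability."""
    f_words = french.split()
    total = 0.0
    for e_word in english.split():
        mass = sum(table.get(f, {}).get(e_word, floor) for f in f_words)
        total += math.log(mass / len(f_words))
    return total


def decode(english, candidates, lm, table):
    """Pick the candidate maximizing language-model plus channel score."""
    return max(
        candidates,
        key=lambda f: lm_logprob(f, lm) + channel_logprob(english, f, table),
    )


if __name__ == "__main__":
    lm = train_bigram_lm(FRENCH_CORPUS)
    print(decode("the black cat sleeps", CANDIDATES, lm, TRANSLATION_TABLE))
    # prints: le chat noir dort
```

The two models divide the work. The error model rules out "le chien noir dort" because nothing in it explains the English word "cat", and the language model prefers "le chat noir dort" over "le noir chat dort" because the bigrams "chat noir" and "noir dort" occur in the French corpus while "le noir" and "noir chat" do not.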