I was looking on the internet for chatbots. It was only fun. But now, I love this subject so much that I want to develop my own chatbot.
But the first thing is to look for a good way to manage the "brain" of my chatbot. I think that it's the best solution to save everything in a XML file, isn't it?
So the file type is clear. Comes to the relationship between different nouns etc. When I have a noun, e.g. a tree. How do I save best that a tree has leaves, branches and roots. And that a tree needs water and sunlight to survive?
Should I save it like that or otherwise?
This would be my XML for this tree-example:
<nouns>
<noun id="noun_0">
<name>tree</name>
<relationship>
<has>noun_1</has>
<has>noun_2</has>
<has>noun_3</has>
<need>noun_4</need>
<need>noun_5</need>
</relationship>
</noun>
<noun id="noun_1">
<name>root</name>
</noun>
<noun id="noun_2">
<name>branch</name>
<relationship>
<has>noun_3</has>
</relationship>
</noun>
<noun id="noun_3">
<name>leaf</name>
</noun>
<noun id="noun_4">
<name>water</name>
</noun>
<noun id="noun_5">
<name>light</name>
</noun>
. . .
</nouns>
Data Storage Choices: It Depends
Simple, non-learning bots: XML is fine
It looks like you already have a basic XML structure worked out. For just starting out, I'd say that's fine, especially for AI support-chat kind of bots (if userMsg.contains('lega') then print('TOS & Copyright...'
).
Of course, switching to any new format will take time and overhead.
Learning, Complicated bots: database!
If you're looking to do something much larger, especially if you have CleverBot in mind, I think you're going to need a database. This is because when your file .. is a file and is gigantic and trying to keep it all available in memory is resource intensive. For this kind of project, I'd recommend a database.
Why? English is Complicated
A while back I wrote a nieve bayes spam sorter. It took about 10,000 pieces of spam to "train" it at a 7% accuracy rate, which took about 6 hours and 1.5GB of RAM to hold the data in memory. That's a lot of data. English is very hard and can't really be broken into if 'pony' then 'saddle'
, so for a bot to "learn" the best responses, your database is going to become massive and very quickly.