xml, chatbot

How best to store data for a chatbot?


I started looking at chatbots on the internet just for fun, but I now enjoy the subject so much that I want to develop my own chatbot.
The first step is to find a good way to manage the "brain" of my chatbot. I think the best solution is to save everything in an XML file, isn't it?
So the file type is clear. That brings me to the relationships between different nouns. Take a noun such as "tree": what is the best way to store that a tree has leaves, branches and roots, and that a tree needs water and sunlight to survive?
Should I store it as in the XML below, or in some other way?

This would be my XML for the tree example:

<nouns>
    <noun id="noun_0">
        <name>tree</name>
        <relationship>
            <has>noun_1</has>
            <has>noun_2</has>
            <has>noun_3</has>
            <need>noun_4</need>
            <need>noun_5</need>
        </relationship>
    </noun>
    <noun id="noun_1">
        <name>root</name>
    </noun>
    <noun id="noun_2">
        <name>branch</name>
        <relationship>
            <has>noun_3</has>
        </relationship>
    </noun>
    <noun id="noun_3">
        <name>leaf</name>
    </noun>
    <noun id="noun_4">
        <name>water</name>
    </noun>
    <noun id="noun_5">
        <name>light</name>
    </noun>

    . . .

</nouns>

Solution

  • Data Storage Choices: It Depends

    Simple, non-learning bots: XML is fine

    It looks like you already have a basic XML structure worked out. For just starting out, I'd say that's fine, especially for AI support-chat kinds of bots that only match keywords (if userMsg.contains('lega') then print('TOS & Copyright...')).

    Keep in mind that switching to a different format later will cost time and effort.
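
    As a rough illustration of the simple case, the sketch below reads a trimmed copy of the XML from the question with Python's standard xml.etree.ElementTree module and resolves the noun ids to names. The function names load_nouns and reply, and the keyword-matching logic, are assumptions made for this example; they are not part of the question.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML from the question (the ". . ." placeholder is left out).
XML = """
<nouns>
    <noun id="noun_0">
        <name>tree</name>
        <relationship>
            <has>noun_1</has>
            <has>noun_2</has>
            <has>noun_3</has>
            <need>noun_4</need>
            <need>noun_5</need>
        </relationship>
    </noun>
    <noun id="noun_1"><name>root</name></noun>
    <noun id="noun_2">
        <name>branch</name>
        <relationship><has>noun_3</has></relationship>
    </noun>
    <noun id="noun_3"><name>leaf</name></noun>
    <noun id="noun_4"><name>water</name></noun>
    <noun id="noun_5"><name>light</name></noun>
</nouns>
"""


def load_nouns(xml_text):
    """Return {name: {"has": [...], "need": [...]}} with ids resolved to names."""
    root = ET.fromstring(xml_text)
    id_to_name = {n.get("id"): n.findtext("name") for n in root.findall("noun")}
    brain = {}
    for noun in root.findall("noun"):
        relations = {"has": [], "need": []}
        rel = noun.find("relationship")
        if rel is not None:
            for child in rel:
                # Only the two relation kinds used in the question are handled.
                if child.tag in relations:
                    relations[child.tag].append(id_to_name[child.text.strip()])
        brain[id_to_name[noun.get("id")]] = relations
    return brain


def reply(brain, user_msg):
    """Naive keyword match: answer with the first known noun found in the message."""
    text = user_msg.lower()
    for name, relations in brain.items():
        if name in text:
            return f"{name}: has {relations['has']}, needs {relations['need']}"
    return "I don't know that noun yet."


brain = load_nouns(XML)
print(reply(brain, "What does a tree need?"))
```

    The whole file is parsed into a dictionary up front, which is acceptable for a small hand-written knowledge base but is exactly what stops scaling once the file grows.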

    Learning, complicated bots: database!

    If you're looking to do something much larger, especially if you have something like CleverBot in mind, I think you're going to need a database. A single file that has grown gigantic has to be loaded and kept available in memory, which is resource intensive, whereas a database lets you query only the parts you need. For this kind of project, I'd recommend a database.
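
    One possible way to move the same noun relationships into a database is sketched below, using Python's built-in sqlite3 module. The table names (noun, relation), the column names and the helper functions add_relation and related are all assumptions chosen for this example; a real bot would probably need a richer schema.

```python
import sqlite3

# In-memory database for the demo; a real bot would pass a file path instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE noun (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE relation (
    subject_id INTEGER NOT NULL REFERENCES noun(id),
    kind       TEXT    NOT NULL,            -- e.g. 'has' or 'need'
    object_id  INTEGER NOT NULL REFERENCES noun(id),
    PRIMARY KEY (subject_id, kind, object_id)
);
""")


def add_relation(conn, subject, kind, obj):
    """Store 'subject kind obj', creating either noun if it is missing."""
    for name in (subject, obj):
        conn.execute("INSERT OR IGNORE INTO noun (name) VALUES (?)", (name,))
    conn.execute(
        "INSERT OR IGNORE INTO relation (subject_id, kind, object_id) "
        "SELECT s.id, ?, o.id FROM noun s, noun o "
        "WHERE s.name = ? AND o.name = ?",
        (kind, subject, obj),
    )


def related(conn, subject, kind):
    """Return the names related to 'subject' by 'kind', sorted alphabetically."""
    rows = conn.execute(
        "SELECT o.name FROM relation r "
        "JOIN noun s ON s.id = r.subject_id "
        "JOIN noun o ON o.id = r.object_id "
        "WHERE s.name = ? AND r.kind = ? ORDER BY o.name",
        (subject, kind),
    )
    return [row[0] for row in rows]


for part in ("root", "branch", "leaf"):
    add_relation(conn, "tree", "has", part)
for need in ("water", "light"):
    add_relation(conn, "tree", "need", need)
add_relation(conn, "branch", "has", "leaf")
conn.commit()

print(related(conn, "tree", "has"))
print(related(conn, "tree", "need"))
```

    With this layout a lookup only touches the rows for one noun, so the full data set does not have to sit in memory, and new relationship kinds can be added without changing the schema.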

    Why? English is Complicated

    A while back I wrote a naive Bayes spam sorter. Training it on about 10,000 pieces of spam, to a 7% accuracy rate, took about 6 hours and 1.5 GB of RAM to hold the data in memory. That's a lot of data. English is very hard and can't really be broken down into rules like if 'pony' then 'saddle', so for a bot to "learn" the best responses, your database is going to become massive, and very quickly.
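
    To give a feel for where that data comes from, here is a minimal naive Bayes sketch in Python. It is not the spam sorter described above: the class name, the "spam"/"ham" labels, the tiny training set and the add-one smoothing are all assumptions made for illustration. The point is that the model keeps a count for every distinct word per label, so its size grows with the vocabulary it has seen.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy word-count naive Bayes classifier with add-one (Laplace) smoothing."""

    def __init__(self, labels=("spam", "ham")):
        self.word_counts = {label: Counter() for label in labels}
        self.doc_counts = Counter()

    def train(self, label, text):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def vocabulary(self):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        return vocab

    def classify(self, text):
        # Assumes every label has seen at least one training document.
        vocab_size = len(self.vocabulary())
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods per word.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)


model = NaiveBayes()
model.train("spam", "win free money now")
model.train("spam", "free prize click now")
model.train("ham", "lunch meeting at noon")
model.train("ham", "see you at the meeting")

print(model.classify("free money"))
print(len(model.vocabulary()), "distinct words stored after 4 messages")
```

    Even four short messages produce more than a dozen stored words; thousands of real messages in free-form English push that number up quickly.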