Search code examples
pythonmysqldatabase-designmemory-efficient

Is this an efficient MySQL database design?


I am working on a project wherein I have a set of keywords [abc, xyz, klm]`. I also have a bunch of text files with content [1.txt, 2.txt, 3.txt].

What I am doing is bumping the keywords against the text files to find the line where the keyword occurs and it can do so multiple times. So I want to store the ID (text file name without .txt), Extracted_Data, Line_Number, Spwaned_Across (keyword may be spread across 2 lines) for each occurence.

I decided to create a table for each keyword to store this data.

Tables : abc, xyz, klm

Table abc sample data :

ID Extracted_Data                         Line_Number Spawned_Across
12 MySQL is wonderful. What is 'abc'      34          1

So I end up with a table for each keyword. In my project, there are about 150 keywords and it can grow. So 150 tables.

Why did I choose to do this way?

For now I am required to find if the keyword exists in a file and I am sure in the future I will be asked to show where or how it occurred in the file. I am planning on creating a table automatically for each new keyword, this way I don't have to manually created each one of them or a giant table with 100s of columns.

Did I make the right decision? Your input is highly appreciated.


Solution

  • Don't do that. No database library is optimized for dynamic table names and you'll end up having to create your query from scratch each time you want to access a table. Also, how would you answer questions like "what data did I find on line 34 of file 12"?

    You'll want three tables. In PostgreSQL syntax [*], that'd be:

    CREATE TABLE source (sourceid SERIAL, filename VARCHAR NOT NULL);
    CREATE TABLE keyword (keywordid SERIAL, keyword VARCHAR NOT NULL);
    CREATE TABLE location (locationid SERIAL,
        sourceid INTEGER NOT NULL REFERENCES source(sourceid),
        keyword INTEGER NOT NULL REFERENCES keyword(keywordid),
        data VARCHAR NOT NULL,
        line INTEGER NOT NULL,
        span INTEGER NOT NULL);
    

    When you start processing a new text file, create a new source tuple and remember its sourceid. When you encounter a keyword, either insert a new record for it and remember its keywordid or look up the old record. Then insert that sourceid, keywordid, and other relevant data into location.

    To answer the question I posed earlier:

    SELECT * FROM
        location JOIN source ON location.sourceid = source.sourceid
        JOIN keyword ON location.keywordid = keyword.keywordid
    WHERE
        source.filename = 'foo.txt' AND
        location.line = 34;
    

    Yes, it's more work up front to do it the "right" way but you'll be paid back a million times over in performance, ease of maintenance, and ease of using the results.

    [*] The MySQL syntax will be similar but I don't remember it off the top of my head and you can figure out the differences pretty easily.