I am working on a project where I have a set of keywords [abc, xyz, klm]. I also have a bunch of text files [1.txt, 2.txt, 3.txt].
What I am doing is matching the keywords against the text files to find the lines where each keyword occurs, which can happen multiple times per file. For each occurrence I want to store the ID (text file name without .txt), Extracted_Data, Line_Number, and Spanned_Across (a keyword may be spread across 2 lines).
I decided to create a table for each keyword to store this data.
Tables: abc, xyz, klm
Table abc sample data:
ID | Extracted_Data                    | Line_Number | Spanned_Across
12 | MySQL is wonderful. What is 'abc' | 34          | 1
So I end up with a table for each keyword. There are about 150 keywords in my project, and that number can grow. So, 150 tables.
Why did I choose to do it this way?
For now I am only required to find whether a keyword exists in a file, but I am sure that in the future I will be asked to show where or how it occurred in the file. I am planning on creating a table automatically for each new keyword; that way I don't have to create each one manually, or maintain a giant table with hundreds of columns.
Did I make the right decision? Your input is highly appreciated.
Don't do that. No database library is optimized for dynamic table names and you'll end up having to create your query from scratch each time you want to access a table. Also, how would you answer questions like "what data did I find on line 34 of file 12"?
You'll want three tables. In PostgreSQL syntax [*], that'd be:
CREATE TABLE source (sourceid SERIAL PRIMARY KEY, filename VARCHAR NOT NULL);
CREATE TABLE keyword (keywordid SERIAL PRIMARY KEY, keyword VARCHAR NOT NULL);
CREATE TABLE location (locationid SERIAL PRIMARY KEY,
    sourceid INTEGER NOT NULL REFERENCES source(sourceid),
    keywordid INTEGER NOT NULL REFERENCES keyword(keywordid),
    data VARCHAR NOT NULL,
    line INTEGER NOT NULL,
    span INTEGER NOT NULL);
When you start processing a new text file, create a new source tuple and remember its sourceid. When you encounter a keyword, either insert a new record for it and remember its keywordid or look up the existing record. Then insert that sourceid, keywordid, and the other relevant data into location.
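As a rough sketch of that flow (PostgreSQL syntax; the file name '12.txt', keyword 'abc', and the literal ids below are just placeholder values):

-- New file: create a source row and capture its generated id.
INSERT INTO source (filename) VALUES ('12.txt') RETURNING sourceid;

-- New keyword: insert it and capture its id, or look it up if it already exists.
INSERT INTO keyword (keyword) VALUES ('abc') RETURNING keywordid;
SELECT keywordid FROM keyword WHERE keyword = 'abc';

-- Each occurrence: record it against those ids.
INSERT INTO location (sourceid, keywordid, data, line, span)
VALUES (1, 1, 'MySQL is wonderful. What is ''abc''', 34, 1);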
To answer the question I posed earlier:
SELECT * FROM
    location JOIN source ON location.sourceid = source.sourceid
             JOIN keyword ON location.keywordid = keyword.keywordid
WHERE
    source.filename = '12.txt' AND
    location.line = 34;
Yes, it's more work up front to do it the "right" way but you'll be paid back a million times over in performance, ease of maintenance, and ease of using the results.
[*] The MySQL syntax will be similar, but I don't remember it off the top of my head; you can figure out the differences pretty easily.
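If it helps, here is roughly what I'd expect the MySQL equivalent of the table definitions to look like (an untested sketch: MySQL uses AUTO_INCREMENT rather than SERIAL and wants an explicit length on VARCHAR):

CREATE TABLE source (
    sourceid INT AUTO_INCREMENT PRIMARY KEY,
    filename VARCHAR(255) NOT NULL
);
CREATE TABLE keyword (
    keywordid INT AUTO_INCREMENT PRIMARY KEY,
    keyword VARCHAR(255) NOT NULL
);
CREATE TABLE location (
    locationid INT AUTO_INCREMENT PRIMARY KEY,
    sourceid INT NOT NULL,
    keywordid INT NOT NULL,
    data TEXT NOT NULL,
    line INT NOT NULL,
    span INT NOT NULL,
    FOREIGN KEY (sourceid) REFERENCES source(sourceid),
    FOREIGN KEY (keywordid) REFERENCES keyword(keywordid)
);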