Search code examples
programming-languagesrdbms

Data retrieval - Database VS Programming language


I have been working with databases recently and before that I was developing standalone components that do not use databases. With all the DB work I have a few questions that sprang up. Why is a database query faster than a programming language data retrieval from a file.

To elaborate my question further -

Assume I have a table called Employee, with fields Name, ID, DOB, Email and Sex. For reasons of simplicity we will also assume they are all strings of fixed length and they do not have any indexes or primary keys or any other constraints.

Imagine we have 1 million rows of data in the table. At the end of the day this table is going to be stored somewhere on the disk. When I write a query Select Name,ID from Employee where DOB="12/12/1985", the DBMS picks up the data from the file, processes it, filters it and gives me a result which is a subset of the 1 million rows of data.

Now, assume I store the same 1 million rows in a flat file, each field similarly being fixed length string for simplicity. The data is available on a file in the disk. When I write a program in C++ or C or C# or Java and do the same task of finding the Name and ID where DOB="12/12/1985", I will read the file record by record and check for each row of data if the DOB="12/12/1985", if it matches then I store present the row to the user.

This way of doing it by a program is too slow when compared to the speed at which a SQL query returns the results.

I assume the DBMS is also written in some programming language and there is also an additional overhead of parsing the query and what not.

So what happens in a DBMS that makes it faster to retrieve data than through a programming language?

If this question is inappropriate on this forum, please delete but do provide me some pointers where I may find an answer.

I use SQL Server if that is of any help.


Solution

  • There are lots of techniques to speed up various kinds of access. As @Oded says, indexing is the big solution to your specific example: if the database has been set up to maintain an index by date, it can go directly to the entries for that date, instead of reading through the entire file. (Note that maintaining an index does take up space and time, though -- it's not free!)

    On the other hand, if such an index has not been set up, and the database has not been stored in date order, then a query by date will need to go through the entire database, just like your flat-file program.

    Of course, you can write your own programs to maintain and use a date index for your file, which will speed up date queries just like a database. And, you might find that you want to add other indices, to speed up other kinds of queries -- or remove an index that turns out to use more resources than it is worth.

    Eventually, managing all the features you've added to your file manager may become a complex task; you may want to store this kind of configuration in its own file, rather than hard-coding it into your program. At the minimum, you'll need features to make sure that changing your configuration will not corrupt your file...

    In other words, you will have written your own database.