Search code examples
javadatabasepostgresqlplpgsql

Is a good idea do processing of a large amount of data directly on database?


I have a database with a lot of web pages stored.

I will need to process all the data I have so I have two options: recover the data to the program or process directly in database with some functions I will create.

What I want to know is:

  • do some processing in the database, and not in the application is a good idea?
  • when this is recommended and when not?
  • are there pros and cons?
  • is possible to extend the language to new features (external APIs/libraries)?

I tried retrieving the content to application (worked), but was to slow and dirty. My preoccupation was that can't do in the database what can I do in Java, but I don't know if this is true.

ONLY a example: I have a table called Token. At the moment, it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know if a word between two token classified as `Proper Name´ is part of name or not.

I will need to process all the data. In this case, doing directly on database is better than retrieving to application?


Solution

  • My preoccupation was that can't do in the database what can I do in Java, but I don't know if this is true.

    No, that is not a correct assumption. There are valid circumstances for using database to process data. For example, if it involves calling a lot of disparate SQLs that can be combined in a store procedure then you should do the processing the in the stored procedure and call the stored proc from your java application. This way you avoid making several network trips to get to the database server.

    I do not know what are you processing though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery and a lot of the modern databases support it.

    ONLY an example: I have a table called Token. At the moment, it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to know if a word between two token classified as `Proper Name´ is part of name or not.

    Is there some indicator in the data that tells it's a proper name? Fetching 10 million rows (highly susceptible to OutOfMemoryException) and then going through them is not a good idea. If there are certain parameters about the data that can be put in a where clause in a SQL to limit the number of data being fetched is the way to go in my opinion. Surely you will need to do explains on your SQL, check the correct indices are in place, check index cluster ratio, type of index, all that will make a difference. Now if you can't fully eliminate all "improper names" then you should try to get rid of as many as you can with SQL and then process the rest in your application. I am assuming this is a batch application, right? If it is a web application then you definitely want to create a batch application to do the staging of the data for you before web applications query it.

    I hope my explanation makes sense. Please let me know if you have questions.