Search code examples
sqldatabaseanalysis

Is it good practice to do analysis on data within a database?


I was recently asked this question in an interview: You have an SQL table containing a column text, each row of which contains multiple words delimited by spaces. How would you write an SQL query to sort the words by their frequency of occurrence across the table?

Being a beginner in SQL, I was stumped to say the least. I could think of an easy solution via a Python script. So I suggested this to him, but the interviewer wanted a solution purely using SQL, stating that they often need to constantly do such quick analysis on real-time streamed data, and that it's overkill to take frequent data dumps, load them in Python, and then do the analysis.

I don't really have much experience working with huge data in a production environment, so could someone tell me if the interviewer was justified in his reasoning? How would you solve this problem?


Solution

  • No doubt I am biased having written a book called "Data Analysis Using SQL and Excel" and teaching a course at Columbia University entitled "Data Analysis Using SQL and Relational Databases". Clearly, a lot of analysis and good analysis can be done using relational databases.

    Text analysis is definitely on the edge of what can be done. A string consisting of words is not traditional relational data, and databases do not have very good string processing functions. Most databases have some way of turning such data into an inverted index (which Bob Kaufman aptly describes in the comment). But this gets much more complicated if you have to deal with punctuation, word forms, and synonyms -- much less sentiment analysis, disambiguation, and so on.

    I suspect that the interviewer was asking more about what you know about databases and what you know about text analysis, and not for a particular solution ("Split the columns into an inverted index and use that. How you create the inverted index depends on the database."). This is a way of seeing whether you really understand both topics.