mysql mysql-python frequency-distribution large-data

Efficient way to get frequency distribution of values in a large MySql table

I have two tables viz. Total_Data and Distinct_S1.

Total_Data has 3.5 million rows. Fields: "S1", "S2", "S3", "S4"

Distinct_S1 has 1 million rows. Fields: "S1", "frequency". "S1" of Distinct_S1 consists of all distinct values which occur in "S1" of Total_Data.

Task at hand: "frequency" of Distinct_S1 for S1 = 'xxyy' (suppose) should consist of the number of times 'xxyy' occurs in Total_Data.

I used the following query from within a python script(MySQLdb imported) to accomplish the Task at hand:

cur.execute("update Distinct_S1 set frequency=(select count(*) from Total_Data where S1='%s') where S1='%s'"%(S1_val, S1_val)

The above query works fine but it seems to take a lot of time. Is there any other faster way to achieve the same?

Solution

You can use the multi-table update syntax with a join to update all rows at once. See the manual for the syntax.

But a much easier way is to also insert all the values in the table directly without first populating it with the distinct values.

INSERT INTO Distinct_S1 (S1, frequency)
SELECT S1, COUNT(*)
FROM Total_Data
GROUP BY S1