Generating large amounts of test data using sets of known values?

I am using Mockaroo and Excel to create and edit some CSV files that will be inserted into a MySQL database as test data. I have a file named student_takes_module with columns module_ID and student_ID.

There are 10000 student IDs and 500 modules. The spec for the system this is part of requires that between 1 and 400 students are registered to each module (hence student_takes_module).

Is there a way I can pair up random student_ID and module_ID values in keeping with this spec?

Solution

Here is one method, that is approximate and computationally painful. It starts by assigning each module a number of students. Then it chooses students for that module randomly.

insert into student_takes_module(module_id, student_id)
    select m.module_id, s.student_id
    from (select m.*, 10 + rand() * 350 as numstudents
          from modules m
         ) m cross join
         students s cross join
         (select count(*) as totalstudents) const
    where rand() < m.numstudents/const.totalstudents;

The 350 instead of 400 is because the use of rand() in this context is approximate. The use of 10 is because I think that if you have a minimum of 10 students, then you will probably get at least one student for that class as you cycle through the data.

This approach will be processing 10,000*500 = 5,000,000 rows to generate the test data. However, the calculations are not so bad (rand() has a reputation for bad performance but that is because of confusion between the function call and order by rand()). You can test the performance by putting limit at the end to see how long it takes it to generate 10 rows, then 1000 rows, then 10000, then all the rows you need.