I am using Mockaroo and Excel to create and edit some CSV files that will be inserted into a MySQL database as test data. I have a file named student_takes_module
with columns module_ID
and student_ID
.
There are 10000 student IDs and 500 modules. The spec for the system this is part of requires that between 1 and 400 students are registered to each module (hence student_takes_module
).
Is there a way I can pair up random student_ID
and module_ID
values in keeping with this spec?
Here is one method, that is approximate and computationally painful. It starts by assigning each module a number of students. Then it chooses students for that module randomly.
insert into student_takes_module(module_id, student_id)
select m.module_id, s.student_id
from (select m.*, 10 + rand() * 350 as numstudents
from modules m
) m cross join
students s cross join
(select count(*) as totalstudents) const
where rand() < m.numstudents/const.totalstudents;
The 350
instead of 400
is because the use of rand()
in this context is approximate. The use of 10
is because I think that if you have a minimum of 10 students, then you will probably get at least one student for that class as you cycle through the data.
This approach will be processing 10,000*500 = 5,000,000 rows to generate the test data. However, the calculations are not so bad (rand()
has a reputation for bad performance but that is because of confusion between the function call and order by rand()
). You can test the performance by putting limit
at the end to see how long it takes it to generate 10 rows, then 1000 rows, then 10000, then all the rows you need.