I have an Dummy Excel Sheet of 879 students of a University with which I am working with to do some data analysis in python. This excel sheet has various columns like:
But the names of the students are not correctly arranged alphabetically.
I want to equally distribute students "Section" wise such that in each Section has equal number of students with each starting alphabet of their name.
I've tried alphabetically sorting the data in Excel but it orders the whole column alphabetically which I don't desire. Instead I want the data to be arranged "Section" wise so that each section has students whose name starts with each alphabets or none if all names of that particular alphabets are already exhausted in previous sections. All sections would have equal number (or nearly equal) of students.
For example:
the dataset 12 sections with 879 students:
Section A has 74 students
Section B has 74 students
Section C has 74 students
Section D has 73 students
Section E has 73 students
Section F has 73 students
Section G has 73 students
Section H has 73 students
Section I has 73 students
Section J has 73 students
Section K has 73 students
Section L has 73 students
Number of students having first character A is 89
Number of students having first character B is 47
Number of students having first character C is 7
: : : :
: : : :
Number of students having first character Y is 1
Number of students having first character Z is 2
My goal:
Section A will have:
- 7 students whose name start with A (89/12 = 7 students)
- 3 students whose name start with B (47/12 = 3 students)
- 1 student whose name start with C (As 7<12, so cant put all students in all sections, so 1)
: : :
: : :
- 1 student whose name start with Y
- 1 student whose name start with Z
Section B will have:
- 7 students whose name start with A
: : : : : :
: : : : : :
- 0 students whose name start with Y (As no student left with alphabet starting with Y)
- 1 student whose name start with Z
similarly other sections will have distributions like this:
Is there a way to achieve this using Excel or Python Pandas Library query?
Here is my Excel Sheet
Here is a tentative solution. Assign a section number based on the occurrence number of a particular name within all names starting with the same letter, compared to the number of names starting with that letter that should be in each section. :
=LET(names,E5:E883,
rows,ROWS(names),
sections,12,
seq,SEQUENCE(rows,1,0),
startrow,XLOOKUP(LEFT(names,1)&"*",names,seq,,2),
counts,COUNTIF(names,LEFT(names,1)&"*"),
countspersection,counts/sections,
occurrence,seq-startrow,
section,QUOTIENT(occurrence+0.5,countspersection),
section)
Then sort on the assigned section number:
=LET(names,E5:E883,
rows,ROWS(names),
sections,12,
seq,SEQUENCE(rows,1,0),
startrow,XLOOKUP(LEFT(names,1)&"*",names,seq,,2),
counts,COUNTIF(names,LEFT(names,1)&"*"),
countspersection,counts/sections,
occurrence,seq-startrow,
section,QUOTIENT(occurrence+0.5,countspersection),
SORTBY(names,section,1,names,1))
Because the number of students can only be a whole number, there will be a variation in the size of each section depending how many of the numbers assigned are underestimates and how many overestimates compared with the theoretical value ( e.g. there should be 7.4 A's in each section, but in practice there can only be 7 or 8). I've done a bit of analysis on this, and the group sizes are falling out like this:
It's quite difficult to get both the group sizes and numbers per letter exactly right - you would need an iterative method to take this further I think.