Tags: python, split, sampling, training-data

Capturing all data in non-whole train, test, and validate splits


Just wondering if a better solution exists for this sort of problem.

We know that for an X/Y percentage split of an even number we can get an exact split of the data - for example, for data size 10:

10 * .6 = 6
10 * .4 = 4
          10 
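
For instance, in Python this is plain slicing:

data = list(range(10))
first, second = data[:int(10 * 0.6)], data[int(10 * 0.6):]
print(len(first), len(second))  # 6 4 - every element accounted for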

Splitting data this way is easy, and we can guarantee we have all of the data and nothing is lost. However, where I am struggling is with less friendly numbers - take 11:

11 * .6 = 6.6
11 * .4 = 4.4
          11

However, we can't index into an array at i = 6.6, for example, so we have to decide how to handle this. If we take JUST the integer portion, we lose 1 data point -

First set = 0..6
Second set = 6..10

This would be the same case if we floored the numbers.
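
Truncating with int() shows the loss directly:

data = list(range(11))                       # 11 points, indices 0..10
split = int(11 * 0.6)                        # 6.6 truncated to 6
first = data[:split]                         # 6 elements
second = data[split:split + int(11 * 0.4)]   # 4 elements, indices 6..9
print(len(first) + len(second))              # 10 - index 10 is lost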

However, if we take the ceiling of the numbers:

First set = 0..7
Second set = 7..12

And we've read past the end of our array.

This gets even worse when we throw in a 3rd or 4th split (30,30,20,20 for example).

Is there a standard splitting procedure for these kinds of problems? Is data loss accepted? It seems like data loss would be unacceptable for dependent data, such as time series.

Thanks!

EDIT: The values .6 and .4 are chosen by me. They could be any two numbers that sum to 1.


Solution

  • First of all, notice that your problem is not limited to odd-sized arrays, as you claim, but applies to arrays of any size. How would you make a 56%-44% split of a 10-element array? Or a 60%-40% split of a 4-element array?

    There is no standard procedure. In many cases, programmers do not care that much about an exact split: they floor or round one quantity (the size of the first set) and take the complement (array length - rounded size) as the other (the size of the second).
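
    As a rough sketch of that round-one-take-the-complement approach (the helper name here is mine, not from any library):

    def two_way_split(data, first_ratio):
        # Round the first size, then hand whatever is left to the second
        # set, so nothing is lost and we never read past the end.
        first_size = round(len(data) * first_ratio)
        return data[:first_size], data[first_size:]

    first, second = two_way_split(list(range(11)), 0.6)
    print(len(first), len(second))  # 7 4, since round(6.6) = 7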

    This might be OK in most cases, when it is a one-off calculation and accuracy is not required. You have to ask yourself what your requirements are. For example: are you taking thousands of 10-element arrays, splitting each one 56%-44%, doing some calculations, and returning a result? Then you have to ask yourself what accuracy you want. Do you care if your result ends up reflecting a 60%-40% split, or a 50%-50% split?

    As another example, imagine that you are doing a 4-way equal split of 25%-25%-25%-25%. If you have 10 elements and apply the rounding technique, you end up with 3, 3, 3, and 1 elements. Surely this will mess up your results.
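
    A quick sketch of that failure mode (Python's built-in round() does banker's rounding, so half-up rounding is spelled out with math.floor here):

    import math

    n = 10
    ratios = [0.25, 0.25, 0.25, 0.25]
    # Round each quota half-up, then give the leftover to the last set.
    sizes = [math.floor(n * r + 0.5) for r in ratios[:-1]]
    sizes.append(n - sum(sizes))
    print(sizes)  # [3, 3, 3, 1] - nowhere near an equal 4-way split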

    If you do care about all these inaccuracies, then the first step is to consider whether you can adjust either the array size and/or the split ratio(s).

    If these are set in stone, then the only way to have an accurate split of arbitrary ratios of an arbitrarily sized array is to make it probabilistic. You have to split multiple arrays for this to work (meaning you have to apply the same split ratio to same-sized arrays multiple times). The more arrays the better (or you can use the same array multiple times).

    So imagine that you have to make a 56%-44% split of a 10-element array. This means that you need to split it into 5.6 and 4.4 elements on average.

    There are many ways you can achieve a 5.6-element average. The easiest one (and the one with the smallest variance across the sequence of tries) is to take a set with 6 elements 60% of the time and a set with 5 elements 40% of the time:

    0.6*6 + 0.4*5 = 5.6

    In terms of code, this is what you can do to decide on the size of the set each time:

    import random
    
    array_size = 10
    first_split = 0.56
    avg_split_size = array_size * first_split   # 5.6: the average target size
    floored_split_size = int(avg_split_size)    # 5
    
    if avg_split_size > floored_split_size:
        # The fractional part (here 0.6) is the probability of taking
        # the larger size: 6 elements 60% of the time, 5 elements 40%.
        if random.uniform(0, 1) > avg_split_size - floored_split_size:
            this_split_size = floored_split_size
        else:
            this_split_size = floored_split_size + 1
    else:
        # The quota is a whole number, so the split is exact and we can
        # use the integer size directly.
        this_split_size = floored_split_size
    

    You could make the code more compact (see the sketch below); I just made an outline here so you get the idea. I hope this helps.
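
    A more compact version of the same idea, with a quick sanity check that the average size converges to 5.6:

    import random

    def random_split_size(n, ratio):
        # On average this returns n * ratio: take the floor of the quota,
        # then add 1 with probability equal to its fractional part.
        quota = n * ratio
        base = int(quota)
        return base + (random.random() < quota - base)

    sizes = [random_split_size(10, 0.56) for _ in range(100_000)]
    print(sum(sizes) / len(sizes))  # ~5.6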