
Which is the best Data Mining model to extrapolate known values to missing values in a table? (General question)


I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.

So, let's say we have a table with three columns and around 4000 rows:

YEAR COLOR NAME
1900 Green David
1901 Yellow Sarah
1902 Green ???
1902 Red Sarah
2020 Purple John

Any value for any field can be repeated in the dataset (also Year values).

In the first two columns we don't have missing values, but we only have around 20% of the Name values in the third column. The Name value depends somewhat on the first two columns (not a causal relation).

My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each Name value (for example, in a boxplot).

I have imagined a process like the following, although I am not sure whether it makes sense statistically (any objections and suggestions are appreciated):

  1. For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen if the input values for Year and Color are "1900, Purple".

  2. When the above process ends, the number of occurrences for each name is counted.

  3. The above process is applied 30 times and the results for each name are displayed in a boxplot.
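
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not a recommended model: the toy rows come from the example table, and `year_band`, `name_weights`, `simulate`, the 50-year band width, and the additive weighting with smoothing are all hypothetical choices made for the sketch.

```python
import random
from collections import Counter

# Toy rows (year, color, name) from the example table; None marks a missing name.
rows = [
    (1900, "Green", "David"),
    (1901, "Yellow", "Sarah"),
    (1902, "Green", None),
    (1902, "Red", "Sarah"),
    (2020, "Purple", "John"),
]


def year_band(year, width=50):
    """Bucket years so nearby years share evidence (the width is an assumption)."""
    return year // width


def name_weights(known, year, color, smoothing=1.0):
    """Weight each known name by its co-occurrence with this year band and color.

    The additive score is an arbitrary illustrative choice, not a fitted model.
    """
    names = sorted({n for _, _, n in known})
    weights = []
    for name in names:
        by_year = sum(
            1 for y, _, n in known
            if n == name and year_band(y) == year_band(year)
        )
        by_color = sum(1 for _, c, n in known if n == name and c == color)
        weights.append(smoothing + by_year + by_color)
    return names, weights


def simulate(rows, runs=30, seed=0):
    """Repeat the random fill-in `runs` times and collect per-name counts."""
    rng = random.Random(seed)
    known = [r for r in rows if r[2] is not None]
    all_names = sorted({n for _, _, n in known})
    counts = {name: [] for name in all_names}
    for _ in range(runs):
        # Step 1: draw a name for every row whose name is missing.
        tally = Counter(n for _, _, n in known)
        for year, color, name in rows:
            if name is None:
                names, weights = name_weights(known, year, color)
                tally[rng.choices(names, weights=weights, k=1)[0]] += 1
        # Step 2: record the number of occurrences of each name in this run.
        for n in all_names:
            counts[n].append(tally[n])
    return counts


# Step 3: 30 runs; each list in `results` holds one count per run.
results = simulate(rows)
```

Each list in `results` can then be passed to a boxplot function to show the spread of counts per name across the 30 runs.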

However, I don't know which is the best model to implement an idea similar to this. I have drawn the process in a simple paint drawing:

Possible output for the task

What do you think would be a good approach to this task? I appreciate any help.


Solution

  • I think you have the process down; converting the data may be the first hurdle.

    I would look at using OrdinalEncoder (from sklearn.preprocessing import OrdinalEncoder) to convert the categorical data to numeric codes.

    You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.

    Loop through this 30 times with a for loop to achieve the result.

    It also looks like you will need to define the ranking values for year and colour before building out your code. From there you would just define bands (for example, if year > 1985) within your for loop to determine which names can be chosen.
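
    A minimal sketch of the encoding and random-draw idea, assuming scikit-learn and NumPy are installed. The toy names come from the question's table; `n_missing`, `n_runs`, and the uniform draw are assumptions made for illustration. The year and colour bands are not implemented here; they could be added by drawing with `rng.choice(n_categories, p=...)` using band-specific probabilities instead of `rng.integers`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Known names only, as a 2D column (OrdinalEncoder expects 2D input).
known_names = np.array([["David"], ["Sarah"], ["Sarah"], ["John"]])

# Encode each distinct name as an integer code 0..k-1 (categories are sorted).
encoder = OrdinalEncoder()
known_codes = encoder.fit_transform(known_names).ravel().astype(int)
n_categories = len(encoder.categories_[0])

rng = np.random.default_rng(0)
n_missing = 1   # rows without a name in the toy table (assumption)
n_runs = 30     # number of repetitions, as in the question

# counts[run, code] = occurrences of that name (known + drawn) in that run.
counts = np.zeros((n_runs, n_categories), dtype=int)
for run in range(n_runs):
    # Uniform draw over the encoded range; swap in a weighted draw for bands.
    drawn = rng.integers(0, n_categories, size=(n_missing, 1))
    all_codes = np.concatenate([known_codes, drawn.ravel()])
    counts[run] = np.bincount(all_codes, minlength=n_categories)

# Decode the last run's drawn codes back to names.
last_drawn_names = encoder.inverse_transform(drawn.astype(float)).ravel()
```

    Each column of `counts` holds the 30 per-run totals for one name, in the order given by `encoder.categories_[0]`, which is the shape a boxplot needs.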