machine-learning, imputation

How to impute inhomogeneously missing data


I have a dataframe of shape 2701x128 with a lot of missing values. The missingness is very uneven: some rows are 95% filled and others only 5%. Let me try to visualize it:

X-axis: row index (after sorting); y-axis: number of non-zero values in that row (sorted, histogram-like).


X-axis: column index (after sorting); y-axis: number of rows in which that column is non-zero (sorted, histogram-like).


What I need: to impute the data as accurately as I can, because this is the problem I have to solve. The problem: I can't fill everything in with means, medians, or other summary statistics, because that is too rough. I also can't build a usual learning model, because there is NO structure in the missing data.

Can you please suggest something as accurate as a learning model, one that can model the distribution but can also cope with completely random missing values? Apparently the main problem is how to build a training dataset out of these unstructured gaps. I haven't found a solution so far.


Solution

  • I think the first problem is that you are treating your data as row-structured. Try to think of it as column-based.

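    There is a Japanese game called sudoku, and I suggest you follow its strategy: start where the most information is already available, and use each part you fill in to help fill the next.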

    First of all, find the most filled column that is not 100% filled. Let's call it the B-column. What percentage of its data is missing? If it is only a small part, build a histogram and look at the distribution: maybe a simple mean or median will do?

    Is there any 100% filled column? Let's call it a G-column. Check whether any partially filled column is strongly correlated with a fully filled one. If so, impute its missing values based on that correlation. You can also use several filled columns together as predictors in a basic regression.

    You can even restore one part of the B-column from one set of other partially filled columns and another part from a different set of partially filled columns, and you can repeat this many times.

    Of course, you will end up with a kind of Frankenstein's monster, but it is worth a try, and you can always assess how well it worked using cross-validation (CV).

    However, this is just a short sketch.
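
The column-by-column idea above could look roughly like the following NumPy sketch. It is an illustration under several assumptions, not a definitive implementation: missing cells are encoded as NaN (if zeros mark the gaps in your data, convert them first), the function names `impute_column_by_column` and `masked_rmse` and the thresholds `min_abs_corr` and `max_predictors` are made up for this example, and only fully filled or already imputed columns are used as predictors, which is a simplification of the "several sets of partially filled columns" variant.

```python
import numpy as np


def _abs_corr(a, b):
    """Absolute Pearson correlation; 0.0 when either input is constant."""
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return abs(float(np.corrcoef(a, b)[0, 1]))


def impute_column_by_column(X, min_abs_corr=0.5, max_predictors=3):
    """Fill NaNs one column at a time, most complete column first.

    Hypothetical helper: name, thresholds and NaN encoding are assumptions.
    """
    X = np.array(X, dtype=float)  # copy, so the caller's array is untouched
    missing = np.isnan(X)
    n_cols = X.shape[1]
    # Fully filled columns ("G-columns") are the initial predictors.
    filled = [j for j in range(n_cols) if not missing[:, j].any()]
    # Partially filled columns, fewest gaps first (the "B-column" comes first).
    todo = sorted((j for j in range(n_cols) if missing[:, j].any()),
                  key=lambda j: missing[:, j].sum())
    for j in todo:
        obs = ~missing[:, j]
        if not obs.any():
            continue  # nothing observed at all: leave the column as NaN
        y = X[obs, j]
        # Rank available predictors by correlation on the observed rows.
        ranked = sorted(filled, key=lambda k: _abs_corr(X[obs, k], y),
                        reverse=True)
        preds = [k for k in ranked[:max_predictors]
                 if _abs_corr(X[obs, k], y) >= min_abs_corr]
        if preds and obs.sum() > len(preds) + 1:
            # Basic least-squares regression with an intercept term.
            A = np.column_stack([np.ones(obs.sum()), X[obs][:, preds]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            A_new = np.column_stack([np.ones((~obs).sum()), X[~obs][:, preds]])
            X[~obs, j] = A_new @ coef
        else:
            # No usable predictor: fall back to the median of observed values.
            X[~obs, j] = np.median(y)
        filled.append(j)  # the completed column can now help fill later ones
    return X


def masked_rmse(X, frac=0.1, seed=0, **kwargs):
    """Hide a fraction of the known cells, impute, and score the hidden ones."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)
    known = np.argwhere(~np.isnan(X))
    n_hide = max(1, int(frac * len(known)))
    pick = known[rng.choice(len(known), size=n_hide, replace=False)]
    X_masked = X.copy()
    X_masked[pick[:, 0], pick[:, 1]] = np.nan
    X_hat = impute_column_by_column(X_masked, **kwargs)
    err = X_hat[pick[:, 0], pick[:, 1]] - X[pick[:, 0], pick[:, 1]]
    return float(np.sqrt(np.nanmean(err ** 2)))
```

Calling `masked_rmse(df.to_numpy())` with a few different seeds and thresholds gives a rough, CV-like estimate of how well the stitched-together imputation recovers values that were actually known. Comparing that number against the same score for a plain median fill shows whether the extra machinery is paying off.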