Tags: machine-learning, data-mining, prediction, data-manipulation, data-cleaning

What can we do with a dataset in which 98 percent of the values in the columns are null?


I want to predict server downtime before it happens. To achieve this, I collected data from several different sources.

One of the data sources is metric data containing cpu-time, cpu-percentage, memory-usage, etc. However, most of the values in this dataset are null: for many of the columns, about 98% of the values are missing.

What kind of data preparation technique can be used to prepare this data before applying it to a prediction algorithm?

I appreciate any help.


Solution

  • If I were in your situation, my first option would be to ignore this data source. There is too much missing data for it to be a relevant source of information for any ML algorithm.

    That being said, if you still want to use this source of data, you will have to fill the gaps. Inferring the missing data from only 2% of available values is hardly possible, but with such a large proportion of missing data (more than 90%), I would advise having a look at Non-Negative Matrix Factorization (NMF) here.

    A few versions of this algorithm are implemented in R. To get better results when inferring such a large amount of missing data, you could also read this paper, which combines NMF with time-series information (which could be your case). I ran some tests with up to 95% missing data and the results were not so bad. Hence, as discussed earlier, you could discard some of your data so that only 80% or 90% is missing, then apply NMF for time series (a minimal sketch of that workflow is given below).
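
    As a rough illustration of that workflow, here is a minimal Python sketch. The answer itself points to R implementations and a time-series NMF paper, neither of which is used here; this sketch implements plain masked (weighted) NMF with multiplicative updates. The file name server_metrics.csv, the 90% column-drop threshold, the rank of 5, and the masked_nmf helper are all assumptions for illustration only.

    ```python
    import numpy as np
    import pandas as pd

    def masked_nmf(X, mask, rank=5, n_iter=500, eps=1e-9, seed=0):
        """Impute missing entries of a non-negative matrix X with NMF.

        Only observed entries (mask == True) drive the factorization;
        missing entries are ignored during the multiplicative updates
        and reconstructed from the learned factors W @ H at the end.
        """
        rng = np.random.default_rng(seed)
        n, m = X.shape
        W = rng.random((n, rank))
        H = rng.random((rank, m))
        Xz = np.where(mask, X, 0.0)   # observed values, zeros elsewhere
        M = mask.astype(float)

        for _ in range(n_iter):
            WH = W @ H
            # Multiplicative updates restricted to observed entries
            H *= (W.T @ Xz) / (W.T @ (M * WH) + eps)
            WH = W @ H
            W *= (Xz @ H.T) / ((M * WH) @ H.T + eps)

        return W @ H                  # dense reconstruction

    # --- hypothetical usage on the server-metrics table ---------------
    metrics = pd.read_csv("server_metrics.csv")   # assumed file name

    # 1) Drop the worst columns so the remaining matrix is "only" about
    #    80-90% missing, as suggested above (threshold is a free choice).
    metrics = metrics.loc[:, metrics.isna().mean() <= 0.90]

    # 2) Impute what is left with masked NMF (values must be
    #    non-negative, which holds for counters such as cpu-time
    #    or memory-usage).
    X = metrics.to_numpy(dtype=float)
    mask = ~np.isnan(X)
    X_hat = masked_nmf(X, mask, rank=5)

    # Keep the observed values; fill only the gaps with the
    # NMF reconstruction.
    metrics_imputed = pd.DataFrame(np.where(mask, X, X_hat),
                                   columns=metrics.columns)
    ```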