python machine-learning dataset data-science hub

How to handle NaN values in Activeloop Hub datasets?


I am converting a dataset into the Activeloop Hub format. The dataset contains NaN values, but I am not sure how to handle them in the Hub dataset format.

The NaN values appear in the labels of the dataset.

I know that a NaN value represents the absence of that value in the database. From some reading, I also know that algorithms implemented in sklearn can't run on datasets that contain such values. I was thinking of dropping the rows that have NaN values, but I don't want to lose any information in the dataset.

Is there a best-practice way to represent NaN values in the Activeloop Hub format?

I am using Hub version 2.3.1.


Solution

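  • It sounds like some samples have no labels. If so, upload an empty sample for each of those labels. Note that appending an empty sample is not the same as skipping a sample: an empty sample still occupies an index in the labels tensor, so the labels stay aligned with the rest of the data.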

    If the NaN values correspond to images, videos, etc. that do not have labels, upload them as empty samples like this: ds.labels.append(np.zeros((0,))).
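
    As a rough sketch of how this could look in a conversion loop: the helper names below (is_missing, to_label_samples, append_labels) are hypothetical and not part of Hub. The sketch assumes integer class labels and a dataset ds that already has a labels tensor. The only Hub call used is the ds.labels.append(...) shown above.

```python
import math


def is_missing(label):
    """Return True when a raw label is None or a float NaN."""
    return label is None or (isinstance(label, float) and math.isnan(label))


def to_label_samples(raw_labels):
    """Map raw labels to per-sample lists: [] if missing, [int(label)] otherwise."""
    return [[] if is_missing(x) else [int(x)] for x in raw_labels]


def append_labels(ds, raw_labels):
    """Append one labels sample per row, so no row is skipped.

    Assumption: `ds` is a Hub dataset that already has a `labels` tensor.
    """
    import numpy as np  # imported lazily; the helpers above do not need NumPy

    for sample in to_label_samples(raw_labels):
        if sample:
            ds.labels.append(np.array(sample))
        else:
            # Empty sample for a missing label, as suggested in the answer.
            ds.labels.append(np.zeros((0,)))


samples = to_label_samples([3.0, float("nan"), 1.0])
print(samples)  # [[3], [], [1]]
```

    Because every row produces exactly one appended sample, the labels tensor keeps the same length as the other tensors, and rows with missing labels can still be identified later by their empty shape.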