Tags: python, tensorflow, keras, conv-neural-network

How to train a CNN on an unlabeled dataset?


I want to train a CNN on my unlabeled data, and from what I read on Keras/Kaggle/TF documentation or Reddit threads, it looks like I will have to label my dataset beforehand. Is there a way to train the CNN in an unsupervised way?
I do not understand how to initialize y_train and y_test (where y_train and y_test have their usual meanings, i.e. the training and test labels).
The information about my dataset is as follows:

  1. I have 50,000 matrices of dimension 30 x 30.
  2. Each matrix is divided into 9 subareas (shown in the example images, separated by vertical and horizontal bars).
  3. A subarea is said to be active if it has at least one element equal to 1. If all elements for that subarea are equal to 0, the subarea is inactive.
  4. For the first example shown below, I should get as output the names of subareas that are active, so here, (1, 4, 5, 6, 7, 9).
  5. If no subarea is active, as in the second example, the output should be 0.
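
The rule in points 2-5 can be written directly as code. This is a minimal sketch, not the asker's code: it assumes the 9 subareas are equal 10 x 10 blocks numbered 1-9 from left to right, top to bottom, which the question does not state explicitly. The function name `active_subareas` is made up for illustration.

```python
import numpy as np

def active_subareas(matrix, grid=3):
    """Return the 1-based indices of active blocks, or 0 if no block is active.

    Assumes equal-sized blocks numbered row by row (an assumption, see above).
    """
    matrix = np.asarray(matrix)
    rows, cols = matrix.shape
    block_h, block_w = rows // grid, cols // grid
    # View the matrix as a grid x grid arrangement of blocks:
    # axes are (block row, row within block, block column, column within block).
    blocks = matrix.reshape(grid, block_h, grid, block_w)
    # A block is active if at least one of its elements equals 1.
    active = (blocks == 1).any(axis=(1, 3))
    indices = np.flatnonzero(active.ravel()) + 1
    return tuple(int(i) for i in indices) if indices.size else 0

example = np.zeros((30, 30), dtype=int)
example[0, 0] = 1      # top-left block (1)
example[15, 15] = 1    # centre block (5)
example[29, 29] = 1    # bottom-right block (9)
empty = np.zeros((30, 30), dtype=int)

print(active_subareas(example))  # (1, 5, 9)
print(active_subareas(empty))    # 0
```

If the subareas in the real data have different sizes or a different numbering, only the reshape and the index arithmetic would need to change.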

First example: Output - (1, 4, 5, 6, 7, 9) [image: first example]

Second example: Output - 0 [image: second example]

After creating these matrices, I did the following:

  1. I put these matrices in a CSV file after reshaping them into vectors of dimension 900 x 1.
  2. So basically, each row in the CSV contains 900 columns with values either 0 or 1.
  3. The classes for my classification problem are the numbers 0-9, where 0 represents the case in which no subarea is active (no element equals 1).
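
For reference, the flattening step described above might look like the sketch below. It assumes row-major order (NumPy's default); the question does not say which order was used, and the same order has to be used again when a row is turned back into a matrix.

```python
import numpy as np

matrix = np.zeros((30, 30), dtype=int)
matrix[2, 5] = 1                    # a single active element, for illustration

flat = matrix.reshape(-1)           # 900 values in row-major order
restored = flat.reshape(30, 30)     # undo the flattening with the same order

# Element (r, c) of the matrix ends up at position r * 30 + c of the vector.
print(flat.shape)                   # (900,)
print(flat[2 * 30 + 5])             # 1
```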

For my model, I want the following:

  • Input: a 900 x 1 vector as described above.
  • Output: one of the values from 0-9,
    where 1-9 represent the active subareas, and 0 represents no active subarea.

What I have done:
I am able to retrieve the data from the CSV file into a data frame and split the data frame into x_train and x_test. But I am unable to understand how to set my y_train and y_test values.
My problem seems very similar to the MNIST dataset, except I don't have the labels. Would it be possible for me to train the model without the labels?

My code currently looks like this:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Read the dataset from the CSV file into a dataframe
df = pd.read_csv("bci_dataset.csv")

# Split the dataframe into training and test dataset
train, test = train_test_split(df, test_size=0.2)

x_train = train.iloc[:, :]
x_test = test.iloc[:, :]

print(x_train.shape)
print(x_test.shape)

Thank you, in advance, for reading this whole thing and helping me out!


Solution

  • Can you tell us why you want to use a CNN specifically? Generally neural networks are used when there's some complication involved in going from feature to output - the artificial neurons are able to learn different behavior as a result of being exposed to the ground truth (i.e., the labels). Most of the time, the researcher using the neural network doesn't even know what features of the input data are being used by the network to come to its output conclusions.

    In the case you have given us, it looks a little bit more like you know what features are important (that is, the sum of a subarea has to be greater than 0 in order to be active). The neural network wouldn't need to really learn anything in particular to do its job. Although it doesn't seem necessary to use a neural network for this process, it does make sense for you to automate it, given the size of your input data! :)

    Let me know if I'm misunderstanding your situation, though.

    Edit: To contrast this with the MNIST dataset: for identifying handwritten digits, there's some ambiguity that the network has to learn to deal with. Not every kind of handwriting renders a 7 the same way. A neural network is able to figure out a couple of the features of a 7 (e.g., there is a high probability that a 7 will have a diagonal line going from top right to bottom left, which, depending on how you write, could be slightly curved or offset), as well as a couple of different versions of a 7 (some people put a horizontal slash through the middle of it, others don't). The utility of a neural network here is in handling all that ambiguity and probabilistically classifying an input as a 7 (because it has seen previous images that it "knows" are 7s). However, in your case, there's only one way for your answer to be rendered: if any element in a subarea is greater than 0, the subarea is active! So you don't need to train a network to do anything; you just need to write some code that automates the summing of the subareas.
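
As a rough illustration of "automating the summing of the subareas", the sketch below labels a whole batch of flattened rows at once. It rests on assumptions the question does not confirm: each row is a 30 x 30 matrix flattened in row-major order, and the subareas are equal 10 x 10 blocks numbered row by row. `label_rows` is a hypothetical helper name. Because several subareas can be active at once (as in the first example), it returns one 0/1 flag per subarea rather than a single class from 0-9.

```python
import numpy as np

def label_rows(flat_rows, side=30, grid=3):
    """Map an (n, side*side) array of 0/1 rows to an (n, grid*grid) array of 0/1 flags."""
    flat_rows = np.asarray(flat_rows)
    n = flat_rows.shape[0]
    block = side // grid
    # Axes: (sample, block row, row within block, block column, column within block).
    blocks = flat_rows.reshape(n, grid, block, grid, block)
    block_sums = blocks.sum(axis=(2, 4))           # shape (n, grid, grid)
    # A subarea is active when its sum is greater than 0.
    return (block_sums > 0).astype(int).reshape(n, grid * grid)

# Two stand-in rows; with the question's data this would be df.to_numpy(),
# assuming the CSV holds only the 900 value columns.
rows = np.zeros((2, 900), dtype=int)
rows[0, 0] = 1      # first element: subarea 1
rows[0, 899] = 1    # last element: subarea 9

labels = label_rows(rows)
print(labels[0])    # [1 0 0 0 0 0 0 0 1]
print(labels[1])    # [0 0 0 0 0 0 0 0 0]
```

An all-zero row of flags corresponds to the "0" output in the question. If a network is still wanted for some reason, an array like this could also serve as y_train and y_test, since it is exactly the ground truth the question is missing.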