csv, machine-learning, jupyter-notebook, data-cleaning, data-processing

What's the target in my data for training a stock price predictor?


I want to build a stock price predictor for an economic indicator in Venezuela. I've cleaned and structured the historical data I want to use (from the last 10 years), but this is my first machine learning project and I have some doubts. My CSV data, with 3000+ entries, looks like this:

11-28-2017;0.8823561
11-29-2017;0.9679446
11-30-2017;0.9719271
12-1-2017;1.0302427

As you can see, column 0 holds the date and column 1 holds the price for that date. In this case the training data (X) should be the price, but the methods I want to use expect both X and Y (supervised learning). Since this is my first time collecting my own data, I feel a bit lost. Here is my code so far: https://github.com/marcelodiaz558/Venezuela-dollar-price-predictor/blob/development/model.ipynb Once I've resolved my doubts about the data, I would like to train an LSTM, or perhaps start with a simple artificial neural network for testing. I don't know what Y should be.


Solution

  • Y, your target, is what you want to predict. X, your training data, is some vector representation of your prior knowledge that can be used to improve your predictions of the unknown quantity. For a simple time-series prediction with a simple regressor, X could be the prices from the past N days.

    Using your example data, if you want to predict the price one day ahead from the prices of the previous two days (N=2), your X and Y would be:

    X = [[0.8823561, 0.9679446], [0.9679446, 0.9719271]]
    Y = [0.9719271, 1.0302427]
    

    So to do machine learning on your data, you need to pre-process it according to exactly what you want to predict. Some algorithms are designed specifically for time-series prediction, so they either need no such pre-processing or do it automatically in their implementation.
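
The windowing described above can be sketched in a few lines of plain Python. This is only an illustration, not code from the linked notebook: the helper names `load_prices` and `make_windows` are hypothetical, and the CSV text is just the four sample rows from the question, assumed to be `;`-separated with the price in column 1.

```python
import csv
import io

# Sample rows copied from the question; the real file has 3000+ of them.
SAMPLE_CSV = """11-28-2017;0.8823561
11-29-2017;0.9679446
11-30-2017;0.9719271
12-1-2017;1.0302427
"""


def load_prices(text):
    """Parse 'date;price' rows and return the prices in file order."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [float(row[1]) for row in reader if row]


def make_windows(prices, n):
    """Build (X, Y): each X row holds n consecutive prices,
    and the matching Y entry is the price that follows them."""
    count = max(len(prices) - n, 0)  # number of complete windows
    X = [prices[i:i + n] for i in range(count)]
    Y = [prices[i + n] for i in range(count)]
    return X, Y


prices = load_prices(SAMPLE_CSV)
X, Y = make_windows(prices, n=2)
print(X)  # [[0.8823561, 0.9679446], [0.9679446, 0.9719271]]
print(Y)  # [0.9719271, 1.0302427]
```

With the full file you would read the text from disk instead of `SAMPLE_CSV` and probably choose a larger N. If you later move to an LSTM, recurrent layers in libraries such as Keras generally expect X shaped as (samples, timesteps, features), so each window here would become N timesteps of one feature.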