
Building a Stock Prediction Model with TensorFlow: A Step-by-Step Tutorial

Build your own dataset and predict stocks with TensorFlow: a tutorial (code included)

Introduction

The STATWORX team created a dataset from the Google Finance API containing S&P 500 index and component stock prices. Their goal was to use a deep learning model to predict the S&P 500 index using the prices of its 500 component stocks. While the original team used a simple fully connected network with four hidden layers, this dataset provides an excellent starting point for beginners to understand how to build a neural network with TensorFlow. This tutorial comprehensively demonstrates the core concepts and modules involved in constructing a TensorFlow model.

Note: This tutorial is based on the older TensorFlow 1.x API, which differs from the current mainstream TensorFlow 2.x. Readers should be aware of version differences. The dataset is still available for download. Experienced readers can experiment with more sequence-aware models like Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs).

Dataset URL: http://files.statworx.com/sp500.zip

Data Import and Preprocessing

The dataset is provided in CSV format, containing 41,266 minutes of records for 500 stocks and the S&P 500 index from April to August 2017. The data has been cleaned, with missing values handled using the Last Observation Carried Forward (LOCF) method, so there are no missing values.
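The dataset arrives already imputed, but it is worth seeing what LOCF actually does. The toy series below (made up for illustration) shows how pandas applies forward-filling: each gap is replaced by the last observed price.

```python
import pandas as pd
import numpy as np

# Toy price series with gaps; the real dataset has already been cleaned this way
prices = pd.Series([100.0, np.nan, 101.5, np.nan, np.nan, 102.0])

# Last Observation Carried Forward: propagate the last known price forward
filled = prices.ffill()

print(filled.tolist())  # [100.0, 100.0, 101.5, 101.5, 101.5, 102.0]
```

For minute-level price data this is a reasonable choice, since a missing minute usually means the price simply did not change.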

First, import the data and examine its dimensions:

import pandas as pd
import numpy as np

# Import data
data = pd.read_csv('data_stocks.csv')
# Drop the date column; only the prices are used as features
data = data.drop(['DATE'], axis=1)
# Dimensions of dataset
n = data.shape[0]  # Number of rows
p = data.shape[1]  # Number of columns

You can use Matplotlib to plot the S&P 500 time series for a visual understanding of the trend.

Train-Test Split

We split the dataset chronologically, using the first 80% for training and the last 20% for testing. This time-ordered split is crucial for time series forecasting to avoid using future information to predict the past.

# Training and test data
train_start = 0
train_end = int(np.floor(0.8 * n))
test_start = train_end  # iloc slicing is end-exclusive, so the test set starts exactly where training ends
test_end = n
data_train = data.iloc[train_start:train_end, :].values
data_test = data.iloc[test_start:test_end, :].values

For time series data, more rigorous methods like rolling window forecasting or time series cross-validation exist, but we use a simple static split for this tutorial.
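To make the rolling-window idea concrete, here is a minimal sketch (not part of the original tutorial) of a generator that walks forward through time, yielding train/test index pairs. The window sizes and step are illustrative values.

```python
import numpy as np

def rolling_splits(n_samples, train_size, test_size, step):
    """Yield (train_idx, test_idx) pairs that walk forward through time."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = np.arange(start, start + train_size)
        test_idx = np.arange(start + train_size, start + train_size + test_size)
        yield train_idx, test_idx
        start += step

# Example: 10 samples, windows of 4 train / 2 test, advancing by 2 each time
splits = list(rolling_splits(10, train_size=4, test_size=2, step=2))
for tr, te in splits:
    print(tr, te)
```

Each test window always lies strictly after its training window, so no future information leaks into the past, and you get several evaluation windows instead of a single static one.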

Data Standardization

Neural network inputs typically require standardization to help gradient descent converge more efficiently. Scaling is beneficial even with ReLU activation functions. We use sklearn's MinMaxScaler, but crucially, the scaler must be fit only on the training data and then used to transform the test data to prevent data leakage.

from sklearn.preprocessing import MinMaxScaler

# Scale data
scaler = MinMaxScaler()
scaler.fit(data_train)
data_train_scaled = scaler.transform(data_train)
data_test_scaled = scaler.transform(data_test)

# Build X and y
X_train = data_train_scaled[:, 1:]  # All stock price features (columns 1+)
y_train = data_train_scaled[:, 0]   # Target S&P 500 index (column 0)
X_test = data_test_scaled[:, 1:]
y_test = data_test_scaled[:, 0]
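Why fitting the scaler only on the training data matters can be seen on a toy example (the numbers here are made up): when prices trend upward, the test set can fall outside the range the scaler saw, and its scaled values legitimately exceed 1. Fitting on the full dataset would hide this and leak future information into training.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy "price" column that trends upward, as the index did over this period
train = np.array([[10.0], [12.0], [14.0]])
test = np.array([[15.0], [16.0]])   # later, higher prices

scaler = MinMaxScaler()
scaler.fit(train)                   # statistics come from the training data only
train_s = scaler.transform(train)
test_s = scaler.transform(test)

print(train_s.ravel())  # values in [0, 1]
print(test_s.ravel())   # values above 1: the test set exceeds the training range
```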

TensorFlow Fundamentals

TensorFlow is an open-source library for numerical computation using data flow graphs. "Tensor" refers to a multi-dimensional array, and "Flow" represents the movement of data through the computational graph. Graphs consist of nodes (operations) and edges (tensors).

A simple graph example adds two scalars:

import tensorflow as tf

# Define a and b as placeholders
a = tf.placeholder(dtype=tf.float32)
b = tf.placeholder(dtype=tf.float32)
# Define the addition
c = tf.add(a, b)

# Initialize and run the graph
with tf.Session() as sess:
    result = sess.run(c, feed_dict={a: 5.0, b: 4.0})
    print(result)  # Output: 9.0

In TensorFlow 1.x, you first define placeholders and operations to build the graph, then run it within a Session, feeding actual data via feed_dict.

Placeholders

Placeholders define input points for data when building the graph. For our model, we need two: one for input features X and one for target values Y.

# Placeholders
n_stocks = X_train.shape[1]  # 500 component stocks
X = tf.placeholder(dtype=tf.float32, shape=[None, n_stocks])
Y = tf.placeholder(dtype=tf.float32, shape=[None])

shape=[None, n_stocks] means the input is a 2D tensor with a variable number of rows (batch size) and n_stocks columns. shape=[None] means the target is a 1D tensor.

Variables and Network Parameters

Variables store model parameters (weights and biases) that are updated during training. We build a fully connected network with four hidden layers of 1024, 512, 256, and 128 neurons.

# Model architecture parameters
n_neurons_1 = 1024
n_neurons_2 = 512
n_neurons_3 = 256
n_neurons_4 = 128
n_target = 1

# Initializers
sigma = 1.0
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg", distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()

# Layer 1: Variables for hidden weights and biases
W_hidden_1 = tf.Variable(weight_initializer([n_stocks, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))
# Layer 2
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
# Layer 3
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))
# Layer 4
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))

# Output layer
W_out = tf.Variable(weight_initializer([n_neurons_4, n_target]))
bias_out = tf.Variable(bias_initializer([n_target]))

The weight shapes chain from layer to layer: each matrix's second dimension is the number of neurons in the layer it feeds, and it must equal the first dimension of the following layer's weight matrix.
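This chaining rule can be checked in plain NumPy (a sketch, independent of TensorFlow): if any adjacent pair of shapes failed to match, the matrix products below would raise an error.

```python
import numpy as np

n_stocks = 500
layer_sizes = [n_stocks, 1024, 512, 256, 128, 1]

# Randomly initialized weights whose shapes follow the chaining rule
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, k)) * 0.01
           for m, k in zip(layer_sizes[:-1], layer_sizes[1:])]

x = rng.standard_normal((32, n_stocks))  # a batch of 32 samples
for W in weights:
    x = x @ W                            # shapes chain: (32, m) @ (m, k) -> (32, k)

print(x.shape)  # (32, 1)
```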

Building the Network Architecture

Connect placeholders and variables through matrix multiplication and activation functions to form the forward propagation path. We use ReLU as the activation function for hidden layers.

# Hidden layer with ReLU activation
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2), bias_hidden_2))
hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3), bias_hidden_3))
hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4), bias_hidden_4))

# Output layer (transposed to match target shape)
out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))

Loss Function and Optimizer

For regression, we use Mean Squared Error (MSE) as the loss function. The optimizer is Adam, an adaptive learning rate algorithm.

# Cost function (MSE)
mse = tf.reduce_mean(tf.squared_difference(out, Y))
# Optimizer
opt = tf.train.AdamOptimizer().minimize(mse)
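The MSE itself is just the average of squared prediction errors; the small NumPy sketch below (with made-up numbers) computes the same quantity that `tf.reduce_mean(tf.squared_difference(...))` evaluates inside the graph.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error: average of squared prediction errors."""
    return np.mean((pred - target) ** 2)

pred = np.array([2330.0, 2332.5, 2331.0])
target = np.array([2331.0, 2332.0, 2330.0])
print(mse(pred, target))  # (1.0 + 0.25 + 1.0) / 3 = 0.75
```

Adam is used with its default learning rate of 0.001 here; tuning it is one of the hyperparameters worth experimenting with later.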

Training the Neural Network

The training process uses mini-batch gradient descent. In each epoch, we shuffle the training data, split it into batches, and perform forward and backward propagation to update weights and biases.

# Training parameters
epochs = 10
batch_size = 256

# Initialize all variables
init = tf.global_variables_initializer()

# Start training session
with tf.Session() as sess:
    sess.run(init)
    
    for e in range(epochs):
        # Shuffle training data
        shuffle_indices = np.random.permutation(np.arange(len(y_train)))
        X_train_shuffled = X_train[shuffle_indices]
        y_train_shuffled = y_train[shuffle_indices]
        
        # Minibatch training
        for i in range(0, len(y_train_shuffled) // batch_size):
            start = i * batch_size
            batch_x = X_train_shuffled[start:start + batch_size]
            batch_y = y_train_shuffled[start:start + batch_size]
            
            # Run optimizer with batch
            sess.run(opt, feed_dict={X: batch_x, Y: batch_y})
            
            # Periodically evaluate progress on the test set
            if np.mod(i, 50) == 0:
                mse_test = sess.run(mse, feed_dict={X: X_test, Y: y_test})
                print('Epoch {}, Batch {}, Test MSE: {:.6f}'.format(e, i, mse_test))

During training, you can periodically evaluate model performance on the test set and visualize predictions versus actual values. After training, the model should learn basic patterns in the time series.
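One detail to keep in mind when visualizing predictions: the network outputs values in the scaled space, so to report them in index points you must invert the scaling. Because the scaler was fit on all columns at once, a sketch (with toy numbers) is to pad a full-width array and invert only the index column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in: 2 columns (index + 1 stock), scaler fit on "training" data
data_train = np.array([[2300.0, 100.0],
                       [2350.0, 110.0],
                       [2400.0, 120.0]])
scaler = MinMaxScaler().fit(data_train)

# Suppose the network predicted these scaled index values for the test set
pred_scaled = np.array([0.5, 0.9])

# The scaler was fit on all columns, so build a full-width array and
# invert only column 0 (the S&P 500 index)
padded = np.zeros((len(pred_scaled), data_train.shape[1]))
padded[:, 0] = pred_scaled
pred_index = scaler.inverse_transform(padded)[:, 0]

print(pred_index)  # back in index points: [2350. 2390.]
```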

Summary and Improvements

This tutorial demonstrated how to build a fully connected neural network for time series prediction using TensorFlow 1.x. While the model is basic, it covers core steps: defining the computation graph, initializing parameters, setting the loss function and optimizer, and the training loop.

To improve model performance, consider:

  • Using more advanced architectures like LSTMs or GRUs, designed for sequence data.
  • Experimenting with different hyperparameters (layers, neurons, learning rate).
  • Introducing regularization techniques like Dropout to prevent overfitting.
  • Using earlier data splits or rolling window validation for realistic forecasting.
  • Migrating the code to TensorFlow 2.x using its Eager Execution and Keras high-level API for simpler development.

TensorFlow is a powerful framework offering great flexibility for deep learning. For beginners, understanding these fundamentals is the first step toward more complex models.
