A Neural Network For Regression On Small Data

4 minute read

In this post I build a basic neural network for regression problems on small datasets. The aim is to see how well a neural net can perform when 1,000 data points or fewer are used to train the model. Gaussian processes typically perform well on problems of this size, so I use one as the baseline against which to compare the neural net.

Requirements

The only packages used in this example are numpy, pandas and scikit-learn.

Dataset

I have chosen a relatively simple dataset to run these experiments on. I won't go into detail on how it was generated; it is sufficient to say that there are approximately 20,000 data points and 200 features, with each data point having a single target. The dataset I am using, example_data.csv, can be downloaded from the repository.

Loading Data

To get the data from the raw csv format, I use the following code:

  import numpy as np
  import pandas as pd

  from sklearn.preprocessing import scale

  # load and set up data
  data = pd.read_csv('example_data.csv')
  data = data.sample(data.shape[0])

  # preprocess
  X = scale(np.array(data.iloc[:, :-2]))
  Y = np.array(data.iloc[:, -2])

  # build training and test sets
  train_size = 1000
  X_train, Y_train = X[:train_size, :], Y[:train_size]
  X_test, Y_test = X[train_size:, :], Y[train_size:]

This reads in all the data, shuffles it and separates the features from the targets. All the features are then scaled. Scaling the training and test data together like this is not realistic in practice, since in a real application the test data would not be available when fitting the scaler, but it will do for this example (a sketch of the more conventional approach is shown below). Finally, the dataset is split into training and test sets, resulting in four arrays.

The train_size parameter defines the size of the training set; all of the remaining data is used as the test set.
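As an aside, the more realistic approach mentioned above is to fit the scaler on the training data only and then apply the same transformation to the test data. A minimal sketch of this using scikit-learn's StandardScaler (reusing the data and train_size variables from the code above):

  from sklearn.preprocessing import StandardScaler

  # split the raw features first, fit the scaler on the training portion only,
  # then apply the same transformation to the test portion
  X_raw = np.array(data.iloc[:, :-2])
  X_train_raw, X_test_raw = X_raw[:train_size, :], X_raw[train_size:, :]

  scaler = StandardScaler().fit(X_train_raw)
  X_train = scaler.transform(X_train_raw)
  X_test = scaler.transform(X_test_raw)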

General

In order to assess the various methods, I am using a root mean squared error (RMSE) cost function, defined in the following snippet of code:

  def cost(prediction, target):
    assert np.shape(prediction) == np.shape(target)
    r = prediction - target

    # root mean squared error over all predictions
    return np.sqrt(np.mean(r**2))

This gives a single error value over the whole set of predictions. For comparative purposes, the code to set up the Gaussian process baseline is as follows:

  from sklearn.gaussian_process import GaussianProcessRegressor
  from sklearn.gaussian_process.kernels import RBF

  # set up Gaussian process regression
  gp_kernel = 1. * RBF(1.)
  gp = GaussianProcessRegressor(
      kernel=gp_kernel, alpha=1e-1,
      n_restarts_optimizer=100).fit(X_train, Y_train)
  p = gp.predict(X_test)
  e = cost(p, Y_test)

  print('GP error: {}'.format(e))

A radial basis function (RBF) kernel is used, with the alpha parameter adding a small regularizing noise term to the diagonal of the covariance matrix. The kernel hyperparameters are optimized during fitting, with 100 restarts of the optimizer.
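Once fitted, the optimized kernel and its hyperparameters can be inspected, which is a useful sanity check that the length scale has settled somewhere sensible:

  # the fitted kernel (with optimized hyperparameters) and the corresponding
  # log-marginal likelihood
  print(gp.kernel_)
  print(gp.log_marginal_likelihood_value_)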

Neural Network

The starting setup of the neural network is as follows:

  from sklearn.neural_network import MLPRegressor

  # set up neural network for regression
  nn = MLPRegressor(
      hidden_layer_sizes=(100,)
      ).fit(X_train, Y_train)
  p = nn.predict(X_test)
  e = cost(p, Y_test)

  print('NN error: {}'.format(e))

This sets up a single hidden layer with 100 neurons (the default). All other parameters are initially left at their default values.

Results

With the default parameters of the neural net and a training set of 1,000 data points, it is possible to reach an RMSE of approximately 0.6, on a target range of approximately 20. This is pretty good, but there is room for improvement: a Gaussian process with an RBF kernel typically performs approximately 35% better than this default neural net.

The default number of neurons (100) in a single hidden layer gives good results; increasing the number of neurons or layers tends to make things worse. The default activation function, relu, also appears to be close to optimal: over repeated runs, the other activation functions rarely perform better.
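This kind of repeated sampling can be done with a simple loop. The sketch below averages the test error of each activation function over a handful of random seeds (the number of repeats is an arbitrary choice for illustration):

  # compare activation functions by averaging the test error over several
  # random initializations
  for activation in ['relu', 'tanh', 'logistic', 'identity']:
      errors = []
      for seed in range(5):
          nn = MLPRegressor(hidden_layer_sizes=(100,), activation=activation,
                            random_state=seed).fit(X_train, Y_train)
          errors.append(cost(nn.predict(X_test), Y_test))
      print('{}: {:.3f}'.format(activation, np.mean(errors)))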

To improve on this, the lbfgs solver can be used instead of the default adam optimizer. The sklearn documentation suggests that lbfgs can perform better on small datasets, and that is what I see here: switching to lbfgs gives a relatively consistent improvement of around 15% over the adam solver.
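The only change needed for this is the solver argument:

  # same single hidden layer, but trained with the lbfgs solver
  nn = MLPRegressor(
      hidden_layer_sizes=(100,), solver='lbfgs'
      ).fit(X_train, Y_train)
  print('NN (lbfgs) error: {}'.format(cost(nn.predict(X_test), Y_test)))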

The default L2 regularization (alpha) of 1e-4 is perhaps a little small; slightly larger values, in the range 1e-4 to 1e-1, can give slightly more accurate predictions, although in general this has relatively little impact on the accuracy. The max_iter parameter can also be increased when the regularization value is larger (e.g. 1e-1), presumably to give the solver more iterations to converge.
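A quick way to check this is to sweep over a few alpha values; a rough sketch (the particular grid of values is just an example):

  # sweep the L2 regularization strength, with a larger iteration budget
  for alpha in [1e-4, 1e-3, 1e-2, 1e-1]:
      nn = MLPRegressor(
          hidden_layer_sizes=(100,), solver='lbfgs',
          alpha=alpha, max_iter=500
          ).fit(X_train, Y_train)
      print('alpha={}: {:.3f}'.format(alpha, cost(nn.predict(X_test), Y_test)))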

Conclusions

After some optimization of the neural network, it now gives comparable results to the Gaussian process. The following code was used to obtain these results from the neural net:

  # set up neural network for regression
  nn = MLPRegressor(
      hidden_layer_sizes=(100,), solver='lbfgs',
      activation='relu', alpha=0.1,
      learning_rate='constant', max_iter=500
      ).fit(X_train, Y_train)
  p = nn.predict(X_test)
  e = cost(p, Y_test)

  print('NN error: {}'.format(e))

This process has resulted in a relatively good neural network, even if the setup is a little basic. In reality, there are many more steps that could be taken to produce a more optimized model, but this simple approach already achieves results comparable to the Gaussian process baseline. The final script, sklearn_nn.py, can be found in the repository.
