Timeseries forecasting

TensorFlow has an excellent tutorial on building forecasting models for timeseries data. This example implements the same model using ByteHub, with the following benefits:

  • The raw data can be stored, searched, and queried on-demand so that it can easily be reused between different models/projects.

  • The data preparation and feature engineering steps are implemented using feature transforms, allowing them to be decoupled from the rest of the model code.

The code for this tutorial is available as a Colab notebook.

The first part of the code deals with data preparation. We download the raw weather data, rename the columns to ByteHub feature names, then save them by running:

fs.save_dataframe(df)
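
For reference, the preparation step might look something like this: a sketch based on the download code in the original TensorFlow tutorial, assuming fs is a ByteHub FeatureStore and that save_dataframe accepts a dataframe with a time column plus one column per feature.

import os
import pandas as pd
import tensorflow as tf

# Download the Jena climate dataset used by the TensorFlow tutorial
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)
csv_path, _ = os.path.splitext(zip_path)
df = pd.read_csv(csv_path)

# Rename raw columns such as 'T (degC)' to feature names such as 'tutorial/raw.T'
# (the 'max. wv (m/s)' column needs slightly different handling, omitted here)
df['time'] = pd.to_datetime(df.pop('Date Time'), format='%d.%m.%Y %H:%M:%S')
df = df.rename(columns={c: 'tutorial/raw.' + c.split(' ')[0]
                        for c in df.columns if c != 'time'})

fs.save_dataframe(df)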

We can also store additional information about each feature as metadata. In this case we save the unit of each variable (degC, mbar, m/s, etc.) using:

fs.create_feature('tutorial/' + name, partition='year', meta={'unit': unit})
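
For example, the units can be parsed straight out of the raw column names. A sketch, assuming raw_columns holds the original Jena column names such as 'T (degC)':

# Split 'T (degC)' into name 'raw.T' and unit 'degC', then register the feature
for col in raw_columns:
  raw_name, _, unit = col.partition(' ')
  name = 'raw.' + raw_name
  unit = unit.strip('()')
  fs.create_feature('tutorial/' + name, partition='year', meta={'unit': unit})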

Adding extra information in this way makes the dataset more useful: all of the features and their metadata can be viewed by running:

fs.list_features()

The original TensorFlow tutorial includes:

  • Data preparation: removing bad values from the wind-speed data; and

  • Feature engineering: creating new features such as wind x- and y-speeds, and cyclical time variables.

Both of these can easily be created using ByteHub's feature transforms, allowing them to be stored and easily reused. For example, the bad wind-speed values can be removed using:

@fs.transform('tutorial/cleaned.wv', from_features=['tutorial/raw.wv'])
def clean_wind_data(df):
  # Bad wind-speed readings are encoded as -9999.0 in the raw data,
  # so any negative value can safely be zeroed out
  bad_data = df['tutorial/raw.wv'] < 0
  df.loc[bad_data, 'tutorial/raw.wv'] = 0
  return df

The feature engineering on the wind data involves converting the wind direction into radians, then using trigonometry to find the x- and y-components of wind speed:

@fs.transform('tutorial/transformed.wind-x', from_features=['tutorial/cleaned.wv', 'tutorial/raw.wd'])
def wind_x(df):
  radians = df['tutorial/raw.wd'] * np.pi / 180
  return df['tutorial/cleaned.wv'] * np.cos(radians)
  
@fs.transform('tutorial/transformed.wind-y', from_features=['tutorial/cleaned.wv', 'tutorial/raw.wd'])
def wind_y(df):
  radians = df['tutorial/raw.wd'] * np.pi / 180
  return df['tutorial/cleaned.wv'] * np.sin(radians)
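
The cyclical time features used later can be defined the same way. A sketch, assuming the transform receives a time-indexed dataframe (tutorial/raw.T is listed only to supply the timestamps):

day = 24 * 60 * 60  # seconds in a day

@fs.transform('tutorial/transformed.day.sine', from_features=['tutorial/raw.T'])
def day_sine(df):
  seconds = df.index.map(pd.Timestamp.timestamp)
  return pd.Series(np.sin(seconds * (2 * np.pi / day)), index=df.index)

@fs.transform('tutorial/transformed.day.cosine', from_features=['tutorial/raw.T'])
def day_cosine(df):
  seconds = df.index.map(pd.Timestamp.timestamp)
  return pd.Series(np.cos(seconds * (2 * np.pi / day)), index=df.index)

The year.sine and year.cosine features follow the same pattern, with a period of 365.2425 days.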

All of these newly transformed features are searchable, for example by using:

fs.list_features(regex=r'transformed\..')

With the features in place, we can build a training dataset simply by listing the features we want and calling fs.load_dataframe. In the same call we can filter by date range, or resample to get a consistently sampled dataset to feed into the model.

model_features = [
  'tutorial/raw.T', # We will be forecasting this variable
  'tutorial/raw.p', 'tutorial/raw.Tpot', 'tutorial/raw.Tdew',
  'tutorial/raw.rh', 'tutorial/raw.VPmax', 'tutorial/raw.VPact',
  'tutorial/raw.VPdef', 'tutorial/raw.sh', 'tutorial/raw.H2OC',
  'tutorial/raw.rho', 'tutorial/cleaned.wv', 'tutorial/cleaned.max.wv',
  'tutorial/transformed.wind-x', 'tutorial/transformed.wind-y',
  'tutorial/transformed.maxwind-x', 'tutorial/transformed.maxwind-y',
  'tutorial/transformed.day.sine', 'tutorial/transformed.day.cosine',
  'tutorial/transformed.year.sine', 'tutorial/transformed.year.cosine'
]
df = fs.load_dataframe(model_features, freq='1h', from_date='2009-01-01 01:00')

This returns all of the data we need, computing the transformations on-the-fly, so the results automatically reflect any updates to the underlying data. Now we can proceed with model training and evaluation.

The model is designed to forecast the temperature variable, tutorial/raw.T, one step into the future using 24 hours of past data. LSTM models are a common approach to this type of problem. In TensorFlow/Keras, the LSTM model can be built using:

def build_model(df):
  # Fit a normalisation layer on the input data
  norm = tf.keras.layers.experimental.preprocessing.Normalization()
  # Adapt the layer on data with an extra time axis, so it matches
  # the (batch, time, features) shape of the input sequences
  norm.adapt(df.values[:, np.newaxis, :])
  lstm_model = tf.keras.models.Sequential(
      [
       # Normalisation
       norm,
       tf.keras.layers.LSTM(32, return_sequences=False),
       tf.keras.layers.Dense(units=1)
      ]
  )
  return lstm_model

The data also needs to be windowed, i.e. converted into sequences of 24 steps to be fed as input:

def make_dataset(features, targets, sequence_length=24, shuffle=False, batch_size=1):
  # Offsetting the targets by sequence_length aligns each 24-step input
  # window with the value immediately following it (one step ahead)
  ds = tf.keras.preprocessing.timeseries_dataset_from_array(
      features[:-sequence_length], targets[sequence_length:], sequence_length,
      batch_size=batch_size, shuffle=shuffle
  )
  return ds
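
The train_df used below is a chronological split of the loaded dataframe. A minimal sketch, assuming the 70/20/10 split used in the original tutorial:

n = len(df)
train_df = df[:int(n * 0.7)]
val_df = df[int(n * 0.7):int(n * 0.9)]
test_df = df[int(n * 0.9):]

The windowed training set can then be built: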
  
train_ds = make_dataset(train_df, train_df[['tutorial/raw.T']], shuffle=True, batch_size=32)

Finally, we can train and evaluate the model. By keeping the data prep and feature engineering logic in the feature store, it becomes much more straightforward to reuse the same features in different models. All of the preparatory steps can be kept separate and decoupled from the model code, making each part simpler to understand and easier to maintain.
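
As a final sketch, training and evaluation might look like this; the loss, optimizer, and epoch count are illustrative choices following the original tutorial, not necessarily those in the notebook:

# Build validation and test sets with the same windowing as the training set
val_ds = make_dataset(val_df, val_df[['tutorial/raw.T']], batch_size=32)
test_ds = make_dataset(test_df, test_df[['tutorial/raw.T']], batch_size=32)

model = build_model(train_df)
model.compile(loss=tf.keras.losses.MeanSquaredError(),
              optimizer=tf.keras.optimizers.Adam(),
              metrics=[tf.keras.metrics.MeanAbsoluteError()])
model.fit(train_ds, epochs=5, validation_data=val_ds)
model.evaluate(test_ds)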
