Data Preparation

This tutorial covers how to use feature transforms to carry out data preparation. Transforms let you reformat your data in a clear, reusable way, creating features that can then be fed to your machine-learning models.

This tutorial is available as a Colab notebook.

For this demo we'll use the UK Carbon Intensity API, which provides data on the CO2 emissions associated with electricity supply. In common with many data sources, the API returns values in a JSON format containing timestamps, along with forecast and actual carbon intensity values. Because ByteHub supports complex data structures as timeseries values, we can load this raw data straight into the feature store:

import pandas as pd
import requests

# Date range to download
from_date = pd.Timestamp('2021-01-01')
to_date = pd.Timestamp('2021-01-31')

# Fetch half-hourly carbon intensity data from the API
response = requests.get(
    f'https://api.carbonintensity.org.uk/intensity/{from_date.strftime("%Y-%m-%dT%H:%MZ")}/{to_date.strftime("%Y-%m-%dT%H:%MZ")}',
)
response.raise_for_status()
data = response.json()['data']

# Build a dataframe of timestamps and raw JSON values
df_carbon = pd.DataFrame(
    {
        'time': pd.to_datetime([row['from'] for row in data]).tz_localize(None),
        'value': [row['intensity'] for row in data]
    }
)

# Load this data into the feature store
# (fs is the feature store client created earlier in the tutorial)
fs.create_feature('tutorial/rawdata.carbon', partition='year', serialized=True)
fs.save_dataframe(df_carbon, 'tutorial/rawdata.carbon')
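
If you inspect one of the raw values, you'll see it is still a dictionary straight from the API. The exact keys are defined by the Carbon Intensity API; at the time of writing each intensity object contains forecast, actual and index fields:

# Inspect a single raw value: a dictionary of the form
# {'forecast': <int>, 'actual': <int>, 'index': <str>}
df_carbon['value'].iloc[0]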

This raw JSON format will be hard to work with if we try to feed it straight into an analysis script or ML model, so instead we can turn it into simple, ready-to-use features. For example, we might want to unpack the actual and forecast carbon intensity values from each JSON object in the raw dataset. To do this, create a function that prepares the data, then decorate it using fs.transform:

# Extract carbon intensity forecast from dictionary
@fs.transform('tutorial/feature.carbon-forecast', from_features=['tutorial/rawdata.carbon'])
def unpack_carbon_forecast(df):
    return df['tutorial/rawdata.carbon'].apply(lambda x: x['forecast'])

# Extract carbon intensity actuals from dictionary
@fs.transform('tutorial/feature.carbon-actual', from_features=['tutorial/rawdata.carbon'])
def unpack_carbon_actual(df):
    return df['tutorial/rawdata.carbon'].apply(lambda x: x['actual'])

In these examples:

  • The decorator defines the name of the transformed feature, e.g. tutorial/feature.carbon-forecast, and the raw feature from which it is derived;

  • The function itself should accept a dataframe argument, df, which will contain the raw data, and return either a series or dataframe of transformed data.

Transform functions must be completely self-contained, so if your transform depends on a library, be sure to import it inside the function itself.
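
For example, a transform that log-scales the actual intensity values would import numpy inside the function body. This is just a minimal sketch; the feature name tutorial/feature.carbon-log is made up for illustration:

# Log-scale the actual carbon intensity values
@fs.transform('tutorial/feature.carbon-log', from_features=['tutorial/rawdata.carbon'])
def log_carbon_actual(df):
    import numpy as np  # imported inside so the transform is fully self-contained
    return df['tutorial/rawdata.carbon'].apply(lambda x: np.log(x['actual']))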

Having defined these transforms, we can now easily query them. The feature store computes each transform on the fly, so any changes to the underlying source data are reflected immediately in the transformed features.

# Query transformed features and resample to 1H frequency
df_prepared_data = fs.load_dataframe(
    ['tutorial/feature.carbon-forecast', 'tutorial/feature.carbon-actual'],
    from_date=from_date,
    freq='1h'
)
df_prepared_data.head()

Preparing your data in this way allows you to separate data prep from the rest of your model code. This has several advantages:

  • Your model definition and training code becomes simpler and easier to maintain/understand, because it is no longer cluttered by data preparation steps;

  • Your data prep is now saved in the feature store and easily reusable, for example between model training and production deployments, or between different models/projects that share the same datasets, as sketched below.
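
For instance, a production scoring job can load exactly the same prepared features without copying any of the unpacking logic. A minimal sketch, assuming a later date range and a hypothetical trained model:

# Re-use the same prepared features at prediction time
df_live = fs.load_dataframe(
    ['tutorial/feature.carbon-forecast', 'tutorial/feature.carbon-actual'],
    from_date=pd.Timestamp('2021-02-01'),
    to_date=pd.Timestamp('2021-02-07'),
    freq='1h'
)
# model.predict(df_live)  # feed into your trained model (hypothetical)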
