Quick-start guide

Getting started

This guide is available as a Colab notebook.

Install into your Python 3 environment using pip:

pip install bytehub

Then launch a Python session and run the following to create a new feature store:

import bytehub as bh

fs = bh.FeatureStore()

By default, ByteHub creates a local SQLite database called bytehub.db. To connect to a remote database hosting the feature store, pass a SQLAlchemy connection string instead.
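
For example, a shared PostgreSQL database could be used instead of the local SQLite file. The sketch below is illustrative only; the hostname, credentials and database name are placeholders, and the connection string is assumed to be accepted as the first argument:

# Hypothetical connection string; replace with your own database details
fs = bh.FeatureStore('postgresql://user:password@db.example.com:5432/bytehub')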

Next, create a namespace called tutorial to store some features in. Edit the url field to specify a local file storage location that you would like to use. Feature values will be saved in this folder in Parquet format.

fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)
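
To confirm the namespace was created, you can list the namespaces. This assumes a list_namespaces method analogous to the list_features call used at the end of this guide:

fs.list_namespaces()  # assumed counterpart to list_features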

Now create a new feature inside this namespace:

fs.create_feature('tutorial/numbers', description='Timeseries of numbers')

Now we can generate a Pandas dataframe with time and value columns to store:

import pandas as pd

dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

fs.save_dataframe(df, 'tutorial/numbers')
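
You can call save_dataframe again whenever new values arrive. As a sketch, the following appends a few extra days to the same feature (it assumes that saving later timestamps simply adds them to the stored timeseries):

extra_dts = pd.date_range('2021-02-10', '2021-02-14')
df_extra = pd.DataFrame({
    'time': extra_dts,
    'value': list(range(len(dts), len(dts) + len(extra_dts)))  # continue the sequence
})

fs.save_dataframe(df_extra, 'tutorial/numbers')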

Now for some feature engineering. Suppose we want to create another feature called tutorial/squared that contains the square of every value in tutorial/numbers. To do this, define a transform as follows:

@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    return df ** 2 # Square the input
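
Transforms can also combine several input features. As a hedged sketch (it assumes the input dataframe's columns are named after the source features), a feature summing the two series might look like:

@fs.transform('tutorial/total', from_features=['tutorial/numbers', 'tutorial/squared'])
def total_numbers(df):
    # Assumes the input dataframe's columns are named after the from_features
    return df['tutorial/numbers'] + df['tutorial/squared']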

A transform receives a dataframe containing all of the features listed in from_features and should return a series or dataframe of transformed timeseries values. We can now look at some of our timeseries data using:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)
print(df_query.head())

Using the load_dataframe method, we can easily join, resample and filter the features. For example, to get a monthly timeseries for 2020 we could run:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2020-01-01', to_date='2020-12-31', freq='1M'
)
print(df_query.head())

Up until now, all of our data has been stored locally at the url we set when creating the namespace. To store data on a cloud storage service, we need to create another namespace:

fs.create_namespace(
    'cloud',
    url='s3://bytehub-test-bucket/tutorial',
    description='Cloud tutorial',
    storage_options={ # Credentials to access S3 bucket
        'key': AWS_ACCESS_KEY_ID,
        'secret': AWS_SECRET_ACCESS_KEY
    }
)

See the Dask documentation for information on url formats and the different storage_options that you may want to set.
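
For example, a namespace backed by Google Cloud Storage might look something like this. This is a hedged sketch: the bucket name is hypothetical, and the 'token' option is the credential key used by gcsfs:

fs.create_namespace(
    'gcp',
    url='gcs://my-bytehub-bucket/tutorial',  # hypothetical bucket
    description='GCS tutorial',
    storage_options={ # Credentials to access the GCS bucket
        'token': '/path/to/service-account.json'
    }
)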

The simplest way to move our existing data to cloud storage is with the clone_feature method:

fs.clone_feature('cloud/numbers', from_name='tutorial/numbers')
fs.clone_feature('cloud/squared', from_name='tutorial/squared')

Now both of the features we created are available in the new, S3-backed namespace. View all of the features that we created in the tutorial by running:

fs.list_features()
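
As a final check, the cloned features can be loaded back from the cloud namespace in exactly the same way as before:

df_cloud = fs.load_dataframe(
    ['cloud/numbers', 'cloud/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)
print(df_cloud.head())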
