Quick-start guide
Install into your Python 3 environment using pip:
pip install bytehub
Then launch a Python session and run the following to create a new feature store:
import bytehub as bh
fs = bh.FeatureStore()
By default, ByteHub creates a local SQLite database called bytehub.db. Passing a SQLAlchemy connection string allows you to connect to a remote database hosting the feature store.
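For example, assuming the connection string can be passed as the first argument to FeatureStore (the Postgres host and credentials below are placeholders):

fs = bh.FeatureStore('postgresql://user:password@dbhost:5432/bytehub')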
Next, create a namespace called tutorial to store some features in. Edit the url field to specify a local file storage location that you would like to use. Feature values will be saved within this folder in Parquet format.

fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)
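To confirm the namespace was registered, you can list all namespaces (assuming the same naming convention as the list_features method used at the end of this guide):

fs.list_namespaces()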
Now create a new feature inside this namespace:
fs.create_feature('tutorial/numbers', description='Timeseries of numbers')
Now we can generate a Pandas dataframe with time and value columns to store:

import pandas as pd

dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})
fs.save_dataframe(df, 'tutorial/numbers')
Now for some feature engineering. Suppose we want to create another feature called tutorial/squared that contains the square of every value in tutorial/numbers. To do this, define a transform as follows:

@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    return df ** 2  # Square the input
The transform receives a dataframe of everything in from_features and should return a series/dataframe of transformed timeseries values.
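Transforms can also be built from multiple input features. As a sketch (assuming the input dataframe has one column per feature, named after each feature, as in load_dataframe results below):

@fs.transform('tutorial/total', from_features=['tutorial/numbers', 'tutorial/squared'])
def total(df):
    # Combine the two input columns into a single derived feature
    return df['tutorial/numbers'] + df['tutorial/squared']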
We can now look at some of our timeseries data using:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)
print(df_query.head())
Using the load_dataframe method, we can easily join, resample and filter the features. For example, to get a monthly timeseries for 2020 we could run:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2020-01-01', to_date='2020-12-31', freq='1M'
)
print(df_query.head())
Up until now, all of our data has been stored locally at the url we set when creating the namespace. To store data on a cloud storage service, we need to create another namespace:
fs.create_namespace(
    'cloud',
    url='s3://bytehub-test-bucket/tutorial',
    description='Cloud tutorial',
    storage_options={  # Credentials to access the S3 bucket
        'key': AWS_ACCESS_KEY_ID,
        'secret': AWS_SECRET_ACCESS_KEY
    }
)
See the Dask documentation for information on url formats and the different storage_options that you may want to set.
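For instance, s3fs supports anonymous access to public buckets via the anon option, so a namespace over public data might look like this (the bucket name below is a placeholder):

fs.create_namespace(
    'public',
    url='s3://some-public-bucket/data',
    description='Public data',
    storage_options={'anon': True}  # s3fs option for anonymous access
)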
The simplest way to move our existing data to cloud storage is with the clone_feature method:

fs.clone_feature('cloud/numbers', from_name='tutorial/numbers')
fs.clone_feature('cloud/squared', from_name='tutorial/squared')
Now both of the features we created are available in the new, S3-backed namespace. View all of the features that we created in the tutorial by running:
fs.list_features()
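As a final check, the cloned features can be loaded back from the S3-backed namespace using the same load_dataframe call as before:

df_cloud = fs.load_dataframe(
    ['cloud/numbers', 'cloud/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)
print(df_cloud.head())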