Using Cloud Storage

The simplest way to use ByteHub is to store feature data locally. However, cloud storage providers like AWS S3, Azure Blob Storage, and Google Cloud Storage are also supported, allowing large datasets to be stored and shared.

This tutorial is available as a Colab notebook.

ByteHub uses Dask to read and write feature data. To use a cloud storage service, create a namespace with url and storage_options configured for your provider. For example, using AWS S3:

fs.create_namespace(
    's3-demo',
    url='s3://my-bucket-name/demo',
    description='S3 tutorial',
    storage_options={
        'key': aws_access_key_id,
        'secret': aws_secret_access_key,
        'use_ssl': True
    }
)

The Dask remote storage documentation details the configuration required for different cloud providers. In summary:

Cloud   URL format                                  Storage options
AWS     's3://{bucket_name}/{folder_name}'          key, secret
Azure   'abfs://{container_name}/{folder_name}'     account_name, account_key
GCP     'gcs://{bucket_name}/{folder_name}'         token
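The table above can be expressed as a small helper that assembles the url and storage_options arguments to pass to fs.create_namespace. This is an illustrative sketch, not part of the ByteHub API; the make_cloud_config function and its parameter names are assumptions for this example.

```python
# Illustrative helper (not part of ByteHub): build the `url` and
# `storage_options` arguments for fs.create_namespace per provider.

def make_cloud_config(provider, container, folder, **credentials):
    """Return (url, storage_options) for the given cloud provider."""
    # URL scheme and required credential keys, per the table above
    schemes = {'aws': 's3', 'azure': 'abfs', 'gcp': 'gcs'}
    required = {
        'aws': {'key', 'secret'},
        'azure': {'account_name', 'account_key'},
        'gcp': {'token'},
    }
    if provider not in schemes:
        raise ValueError(f'Unknown provider: {provider}')
    missing = required[provider] - credentials.keys()
    if missing:
        raise ValueError(f'Missing credentials: {sorted(missing)}')
    url = f'{schemes[provider]}://{container}/{folder}'
    return url, credentials

url, opts = make_cloud_config(
    'azure', 'my-container', 'demo',
    account_name='my_account', account_key='my_key',
)
# url is 'abfs://my-container/demo'; pass url and storage_options=opts
# to fs.create_namespace as in the S3 example above.
```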
