The airt-client library has the following main classes:
Client for authenticating and accessing the airt service,
DataBlob for encapsulating data from sources such as CSV files, databases, or AWS S3 buckets, and
DataSource for managing datasources and training models in the airt service.
We import them from the airt.client module as follows:
from airt.client import Client, DataBlob, DataSource
To access the airt service, you must create a developer account. Please fill out the signup form below to get one:
Upon successful verification, you will receive the username/password for the developer account in an email.
Finally, you need an application token to access all the APIs in the airt service. Please call the
Client.get_token method with the username/password to get one.
You can either pass the username, password, and server address as parameters to the
Client.get_token method or store the same in the AIRT_SERVICE_USERNAME,
AIRT_SERVICE_PASSWORD, and AIRT_SERVER_URL environment variables.
After successful authentication, you will be able to access the airt services.
Additionally, you can store the database username and password in the environment variables AIRT_CLIENT_DB_USERNAME and AIRT_CLIENT_DB_PASSWORD, or pass them as parameters to the DataBlob.from_mysql method.
In the example below, the username, password, and server address are stored in the AIRT_SERVICE_USERNAME, AIRT_SERVICE_PASSWORD, and AIRT_SERVER_URL environment variables.
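A minimal sketch of that setup, with placeholder credentials and a placeholder server address (the get_token call itself requires a live airt service, so it is left commented out):

```python
import os

# Placeholder values -- replace with the credentials from your signup email
# and the server address you were given.
os.environ["AIRT_SERVICE_USERNAME"] = "johndoe"
os.environ["AIRT_SERVICE_PASSWORD"] = "s3cret"
os.environ["AIRT_SERVER_URL"] = "https://api.airt.ai"

# With the variables set, no parameters need to be passed:
# from airt.client import Client
# Client.get_token()
```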
1. Data Blob
DataBlob objects are used to encapsulate data access. Currently, we support:
access to local CSV files,
database access to MySql and ClickHouse, and
files stored in AWS S3 buckets.
We plan to support other databases and storage media in the future.
To create a DataBlob object, you can call one of the DataBlob class's static methods, which import the data from:
a local CSV file,
a MySql database,
a ClickHouse database, and
an AWS S3 bucket in the Parquet file format.
Two of these constructors, DataBlob.from_mysql and DataBlob.from_s3, are shown below.
data_blob = DataBlob.from_mysql(
    host="db.staging.airt.ai", database="test", table="events"
)

data_blob = DataBlob.from_s3(
    uri="s3://test-airt-service/ecommerce_behavior_csv"
)
The above methods automatically pull the data into the airt server. All calls to the library are asynchronous and return immediately.
To manage completion, each method returns a status object indicating the status of the operation. Alternatively, you can monitor the completion status interactively in a progress bar by calling the progress_bar method:
100%|██████████| 1/1 [00:35<00:00, 35.34s/it]
The next step is to preprocess the data. We currently support preprocessing of CSV and Parquet files. Please use the
DataBlob.from_csv and DataBlob.from_parquet methods for the same. Support for more file formats will be added in the future.
data_source = data_blob.from_csv(
    index_column="user_id", sort_by="event_time"
)

data_source.progress_bar()
100%|██████████| 1/1 [00:30<00:00, 30.31s/it]
After completion, you can display the head of the data to make sure everything is fine.
The prediction engine is specialized for predicting which clients are most likely to trigger a specified event in the future.
We assume the input data includes the following:
a column identifying a client, client_column (person, car, business, etc.),
a column specifying the type of event we will try to predict, target_column (buy, checkout, click on form submit, etc.), and
a timestamp column specifying the time of the occurred event.
Each row in the data might have additional columns of int, category, float, or datetime type, and they will be used to make predictions more accurate. E.g., there could be a city associated with each user, the type of credit card used for a transaction, the smartphone model used to access a mobile app, etc.
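To make that layout concrete, here is a tiny invented dataset with the three required columns (user_id as client_column, event_type as target_column, event_time as the timestamp) plus one optional categorical column:

```python
from datetime import datetime

# Invented sample rows: who, what, when, plus an optional "city" feature.
events = [
    {"user_id": 1, "event_type": "view",      "event_time": datetime(2020, 1, 1, 10, 0),  "city": "Berlin"},
    {"user_id": 1, "event_type": "*purchase", "event_time": datetime(2020, 1, 1, 10, 30), "city": "Berlin"},
    {"user_id": 2, "event_type": "view",      "event_time": datetime(2020, 1, 2, 9, 15),  "city": "Zagreb"},
]

# Every row must carry the three required columns.
required = {"user_id", "event_type", "event_time"}
assert all(required <= row.keys() for row in events)
```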
Finally, we need to know how far ahead we wish to make predictions. E.g., if we predict that a client is most likely to buy a product in the next minute, there is not much we can do anyway. We might be more interested in clients that are most likely to buy a product tomorrow, so we can send them a special offer or engage them in some other way. That lead time varies widely from application to application: it can be minutes for a web shop or even several weeks for a banking product such as a loan. In any case, the predict_after parameter allows you to specify the time period based on your particular needs.
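Since predict_after is a standard datetime.timedelta, the lead times mentioned above translate directly:

```python
from datetime import timedelta

# Lead times for the scenarios above: a web shop vs. a banking product.
next_minute = timedelta(minutes=1)
tomorrow = timedelta(days=1)
several_weeks = timedelta(weeks=3)

print(tomorrow > next_minute)  # True
print(several_weeks.days)      # 21
```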
The DataSource.train method is asynchronous and can take a few hours to finish, depending on your dataset size. You can check the status by calling the
Model.is_ready method or monitor the completion progress interactively by calling the Model.progress_bar method.
In the following example, we will train a model to predict which users will perform a purchase event (*purchase) 3 hours before they actually do it:
from datetime import timedelta

model = data_source.train(
    client_column="user_id",
    target_column="event_type",
    target="*purchase",
    predict_after=timedelta(hours=3),
)

model.progress_bar()
100%|██████████| 5/5 [00:00<00:00, 155.01it/s]
After training is complete, you can check the quality of the model by calling the Model.evaluate method.
Finally, you can run the predictions by calling the Model.predict method. The method is asynchronous and can take a few hours to finish, depending on your dataset size. You can check the status by calling the
Prediction.is_ready method or monitor the completion progress interactively by calling the Prediction.progress_bar method.
predictions = model.predict()

predictions.progress_bar()
100%|██████████| 3/3 [00:10<00:00, 3.38s/it]
If the dataset is small enough, you can download the prediction results as a Pandas DataFrame by calling the Prediction.to_pandas method.
In many cases, however, it is much better to push the prediction results to destinations like AWS S3 or a MySql database, or to download them to the local machine.
Below is an example of pushing the prediction results to an S3 bucket. For other available options, please check the documentation of the Prediction class.
status = predictions.to_s3(uri=TARGET_S3_BUCKET)

status.progress_bar()
100%|██████████| 1/1 [00:10<00:00, 10.12s/it]