
Tutorial

The airt-client library has the following two main classes:

  • Client for authenticating and accessing the airt service, and

  • DataSource for encapsulating data sources such as an S3 bucket or a database.

We import them from the airt.client module as follows:

from airt.client import Client, DataSource

Authentication

Before you can use the service, you must acquire a username and password for your developer account.

Once you have received a username/password pair, call the Client.get_token method to obtain an application token.

The username, password, and server address can either be passed explicitly when calling the Client.get_token method or stored in the environment variables AIRT_SERVICE_USERNAME, AIRT_SERVICE_PASSWORD, and AIRT_SERVER_URL.
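
If you prefer not to use environment variables, the same three values can be passed explicitly. A minimal sketch, assuming Client.get_token accepts username, password, and server keyword arguments (the parameter names mirror the environment variable names and are an assumption here):

Client.get_token(
    username="johndoe",                 # placeholder; otherwise read from AIRT_SERVICE_USERNAME
    password="your-password",           # placeholder; otherwise read from AIRT_SERVICE_PASSWORD
    server="https://your-airt-server",  # placeholder; otherwise read from AIRT_SERVER_URL
)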

Additionally, the database username and password can be stored in the environment variables AIRT_CLIENT_DB_USERNAME and AIRT_CLIENT_DB_PASSWORD, or passed as parameters to the DataSource.db method.

The example below assumes that the username, password, and server address required for getting an access token are stored in the environment variables AIRT_SERVICE_USERNAME, AIRT_SERVICE_PASSWORD, and AIRT_SERVER_URL, respectively.

Client.get_token()

1. Data Source

DataSource objects are used to encapsulate data access. Currently, we support:

  • database access for MySQL, and

  • files stored in an AWS S3 bucket in the Parquet file format.

We plan to add support for other databases and storage media in the future.

To create a data source object, call either the DataSource.db method or the DataSource.s3 method as follows:

data_source_db = DataSource.db(
    host="db.staging.airt.ai",
    database="test",
    table="events"
)

data_source_s3 = DataSource.s3(
    uri="s3://test-airt-service/ecommerce_behavior"
)
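
The database credentials mentioned earlier can also be passed directly instead of being read from AIRT_CLIENT_DB_USERNAME and AIRT_CLIENT_DB_PASSWORD. A minimal sketch, assuming DataSource.db accepts username and password keyword arguments (an assumption, not a confirmed signature):

data_source_db = DataSource.db(
    host="db.staging.airt.ai",
    database="test",
    table="events",
    username="db_user",      # placeholder; otherwise read from AIRT_CLIENT_DB_USERNAME
    password="db_password",  # placeholder; otherwise read from AIRT_CLIENT_DB_PASSWORD
)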

The objects created in this way are not checked yet. To check them, call the DataSource.pull method.

All calls to the library are asynchronous and return immediately. To manage completion, every method returns a status object indicating the state of the operation. Alternatively, you can monitor the completion status interactively as a progress bar by calling the ProgressStatus.progress_bar method:

status = data_source_s3.pull()

status.progress_bar()
100%|██████████| 1/1 [00:40<00:00, 40.45s/it]
assert status.is_ready()
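
If you are running in a script where an interactive progress bar is not practical, you can poll the status object instead. A simple sketch using only the is_ready method shown above:

import time

status = data_source_s3.pull()

# block until the pull completes, checking every few seconds
while not status.is_ready():
    time.sleep(5)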

After completion, you can display the head of the data to make sure everything is fine:

data_source_s3.head()
event_time event_type product_id category_id category_code brand price user_id user_session
0 2019-11-01T00:00:00+00:00 view 1003461 2053013555631882655 electronics.smartphone xiaomi 489.07 520088904 4d3b30da-a5e4-49df-b1a8-ba5943f1dd33
1 2019-11-01T00:00:00+00:00 view 5000088 2053013566100866035 appliances.sewing_machine janome 293.65 530496790 8e5f4f83-366c-4f70-860e-ca7417414283
2 2019-11-01T00:00:01+00:00 view 17302664 2053013553853497655 None creed 28.31 561587266 755422e7-9040-477b-9bd2-6a6e8fd97387
3 2019-11-01T00:00:01+00:00 view 3601530 2053013563810775923 appliances.kitchen.washer lg 712.87 518085591 3bfb58cd-7892-48cc-8020-2f17e6de6e7f
4 2019-11-01T00:00:01+00:00 view 1004775 2053013555631882655 electronics.smartphone xiaomi 183.27 558856683 313628f1-68b8-460d-84f6-cec7a8796ef2

2. Training

The prediction engine is specialized in predicting which clients are most likely to perform a specified event in the future.

We assume the input data includes the following:

  • a column identifying a client, client_column (a person, car, business, etc.),

  • a column specifying the type of event we will try to predict, target_column (buy, checkout, click on form submit, etc.), and

  • a timestamp column specifying the time at which an event occurred.

Each row in the data may have additional columns of int, category, float, or datetime type, and they will be used to make predictions more accurate. E.g., there could be a city associated with each user, the type of credit card used for a transaction, the smartphone model used to access a mobile app, etc.

Finally, we need to know how far ahead we wish to make predictions. E.g., if we predict that a client is most likely to buy a product in the next minute, there is not much we can do anyway. We might be more interested in clients that are most likely to buy a product tomorrow, so we can send them a special offer or engage them in some other way. That lead time varies widely from application to application: it might be minutes for a web shop but several weeks for a banking product such as a loan. In any case, the predict_after parameter allows you to specify the time period based on your particular needs.
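
Since predict_after is a standard datetime.timedelta (as the training example below shows), lead times from minutes to weeks are expressed uniformly:

from datetime import timedelta

# illustrative lead times for the scenarios mentioned above
lead_time_web_shop = timedelta(minutes=30)  # short lead time for a web shop
lead_time_loan = timedelta(weeks=3)         # several weeks for a banking product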

The DataSource.train method is asynchronous and can take a few hours to finish depending on your dataset size. You can check the status by calling the Model.is_ready method or monitor the completion progress interactively by calling the Model.progress_bar method.

In the following example, we will train a model to predict which users will perform a purchase event (*purchase) 3 hours before they actually do it:

from datetime import timedelta

model = data_source_s3.train(
    client_column="user_id",
    target_column="event_type",
    target="*purchase",
    predict_after=timedelta(hours=3),
)

model.progress_bar()
100%|██████████| 5/5 [00:00<00:00, 119.56it/s]
assert model.is_ready()

After training is complete, you can check the quality of the model by calling the Model.evaluate method.

model.evaluate()
eval
accuracy 0.985
recall 0.962
precision 0.934

3. Predictions

Finally, you can run the predictions by calling the Model.predict method.

The Model.predict method is asynchronous and can take a few hours to finish depending on your dataset size. You can check the status by calling the Prediction.is_ready method or monitor the completion progress interactively by calling the Prediction.progress_bar method.

predictions = model.predict()

predictions.progress_bar()
100%|██████████| 3/3 [00:00<00:00, 78.68it/s]
assert predictions.is_ready()

If the dataset is small enough, you can download the prediction results locally as a Pandas DataFrame object as follows:

predictions.to_pandas()
Score
user_id
520088904 0.979853
530496790 0.979157
561587266 0.979055
518085591 0.978915
558856683 0.977960
520772685 0.004043
514028527 0.003890
518574284 0.001346
532364121 0.001341
532647354 0.001139
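
Once the results are in a DataFrame, ordinary pandas operations apply. For example, to keep only the users most likely to perform the event (assuming the Score column shown above):

df = predictions.to_pandas()

# users with the highest predicted probability, e.g. for a targeted campaign
top_users = df[df["Score"] > 0.95].sort_values("Score", ascending=False)
print(top_users)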

In many cases, a much better way is to push the results directly to a data source, in the following case to an AWS S3 bucket:

data_source_pred = DataSource.s3(
    uri="s3://target-bucket"
)

predictions.push(data_source_pred)

# Alternatively, this should also work
# data_source_s3.push(predictions)
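
As noted above, all calls to the library are asynchronous and return a status object, so the completion of a push can be monitored the same way as a pull; a sketch under that assumption:

status = predictions.push(data_source_pred)

status.progress_bar()
assert status.is_ready()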