# ML Experiment Tracking and Model Management

ML experiment tracking is the process of organizing, recording, and analyzing the results of machine learning experiments. This document explains how to create a workflow that enables ML experiment tracking.

You can find the complete ML experiment tracking workflow code in [Treasure Boxes](https://github.com/treasure-data/treasure-boxes/blob/automl/machine-learning-box/automl/ml_experiment.dig).

**Table of Contents**

* [Track ML Experiments](#track-ml-experiments)
* [Record Evaluation Results for each Model](#record-evaluation-results-for-each-model)
* [Detect Drift in Model Performance over Time](#detect-drift-in-model-performance-over-time)

# Track ML Experiments

As a best practice, as part of an end-to-end data processing workflow, you should track each ML experiment using a *track_experiment* task that follows the train task. The *track_experiment* task issues a SQL query to record ML experiment information and the model name into a TD table named `automl_experiments`.

Sample workflow code is as follows:

```yaml
+create_db_tbl_if_not_exists:
  td_ddl>:
  create_databases:
    - '${output_database}'
  create_tables:
    - automl_experiments
    - automl_eval_results

+train:
  ml_train>:
  docker:
    task_mem: 128g
  notebook: gluon_train
  model_name: 'gluon_model_${session_id}'
  input_table: '${input_database}.${train_data_table}'
  target_column: '${target_column}'
  time_limit: '${fit_time_limit}'
  share_model: true
  export_leaderboard: '${output_database}.leaderboard_${train_data_table}'
  export_feature_importance: '${output_database}.feature_importance_${train_data_table}'

+track_experiment:
  td>: queries/track_experiment.sql
  insert_into: '${output_database}.automl_experiments'
  last_executed_notebook: '${automl.last_executed_notebook}'
  user_id: '${automl.last_executed_user_id}'
  user_email: '${automl.last_executed_user_email}'
  model_name: 'gluon_model_${session_id}'
  shared_model: '${automl.shared_model}'
  task_attempt_id: '${attempt_id}'
  session_time: '${session_local_time}'
  engine: presto
```

The above workflow code generates the following example content in the *automl_experiments* table:

| task_attempt_id | session_time | user_id | user_email | model_name | shared_model | notebook_url |
| --- | --- | --- | --- | --- | --- | --- |
| 849779333 | 2023-05-18 7:19:18 | 7776 | xxx@treasure-data.com | gluon_model_161722236 | b4a568da-e6f3-4057-b694-e2e19bf0e924 | https://console.treasuredata.com/app/workflows/automl/notebook/4a3c431b3aea4705b32a47d85ca46368 |
| 849772621 | 2023-05-18 7:08:30 | 7776 | xxx@treasure-data.com | gluon_model_161721046 | 94ad5d0e-89ac-4836-99c4-2bc8f975ccbe | https://console.treasuredata.com/app/workflows/automl/notebook/b390b932d4a64fd3a2dc3b75503430fb |
| 849768123 | 2023-05-18 7:01:13 | 7777 | yyy@treasure-data.com | gluon_model_161720337 | 4f2351a3-dd8c-418e-8057-4c8ec9a90cbe | https://console.treasuredata.com/app/workflows/automl/notebook/e8b3319c982345a48ff74db0003d7c9c |
| 849760942 | 2023-05-18 6:49:50 | 7776 | xxx@treasure-data.com | gluon_model_161718676 | 93e68b09-1a2f-4049-bb89-2bfe596ca9b3 | https://console.treasuredata.com/app/workflows/automl/notebook/b02959b1469e4b9c86ec6c6809acc5ff |
| 849753199 | 2023-05-18 6:36:36 | 7776 | xxx@treasure-data.com | gluon_model_161717236 | a7e456d3-8fcf-4173-afb7-f2d58bb985cd | https://console.treasuredata.com/app/workflows/automl/notebook/d3dcbbab99774bd594106a496ec2b2ab |

In the table, each record contains the model name, details of the user who created the model, the session time when the model was created, and a link to the generated notebook.
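The contents of `queries/track_experiment.sql` are not shown above. The following is a minimal sketch of what such a query could look like, assuming the column layout of the *automl_experiments* table shown above; the column types and the notebook URL expression in particular are assumptions, not the exact query from Treasure Boxes. Digdag substitutes the `${...}` parameters passed by the *track_experiment* task before running the Presto query, and `insert_into` appends the resulting row to the table.

```sql
-- Hypothetical sketch of queries/track_experiment.sql (Presto engine).
-- Each ${...} placeholder is filled in by Digdag from the parameters
-- passed by the +track_experiment task before the query is executed.
SELECT
  CAST('${task_attempt_id}' AS BIGINT) AS task_attempt_id,
  '${session_time}'                    AS session_time,
  CAST('${user_id}' AS BIGINT)         AS user_id,
  '${user_email}'                      AS user_email,
  '${model_name}'                      AS model_name,
  '${shared_model}'                    AS shared_model,
  -- The notebook URL format below is an assumption; adjust it to your console URL.
  'https://console.treasuredata.com/app/workflows/automl/notebook/${last_executed_notebook}'
                                       AS notebook_url
```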
# Record Evaluation Results for each Model

You can optionally record each model's quality using an evaluation dataset. The following workflow is an example of recording model quality using [AUROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic), a standard evaluation measure for classification problems. The `record_evaluation` task records evaluation results in the `automl_eval_results` table.

```yaml
+predict:
  ml_predict>:
  docker:
    task_mem: 64g
  notebook: gluon_predict
  model_name: 'gluon_model_${session_id}'
  input_table: '${input_database}.${test_data_table}'
  output_table: '${output_database}.predicted_${test_data_table}_${session_id}'

+evaluation:
  td>: queries/auc.sql
  table: '${output_database}.predicted_${test_data_table}_${session_id}'
  target_column: '${target_column}'
  positive_class: ' >50K'
  store_last_results: true
  engine: hive

+record_evaluation:
  td>: queries/record_evaluation.sql
  insert_into: '${output_database}.automl_eval_results'
  engine: presto
  model_name: 'gluon_model_${session_id}'
  test_table: '${input_database}.${test_data_table}'
  session_time: '${session_local_time}'
  auc: '${td.last_results.auc}'
```

Treasure Data's Hive execution engine supports Hivemall, which provides a number of evaluation measures. See the [Hivemall documentation](https://hivemall.github.io/eval/binary_classification_measures.md) for details.

Example content in the *automl_eval_results* table:

| session_time | model_name | test_table | auroc |
| --- | --- | --- | --- |
| 2023-06-06 6:21:40 | gluon_model_164947310 | ml_datasets.gluon_test | 0.9226243033 |
| 2023-06-14 6:49:22 | gluon_model_166350110 | ml_datasets.gluon_test | 0.9299335758 |
| 2023-06-15 7:35:30 | gluon_model_166532223 | ml_datasets.gluon_test | 0.9300292252 |
| 2023-05-18 7:19:18 | gluon_model_161722236 | ml_datasets.gluon_test | 0.9238149699 |

# Detect Drift in Model Performance over Time

"Drift" is a term used in machine learning to describe how the performance of a model gradually degrades or becomes stale over time. There are two main types of drift: data drift and [concept drift](https://en.wikipedia.org/wiki/Concept_drift). Both can lead to a decline in the performance of a machine learning model.

Using the following workflow tasks, you can record each model's accuracy and quality to detect drift in data and model performance. You can use a scheduled workflow job to keep track of model performance and issue a warning when performance drifts.

There are several schemes for drift detection. The following example workflow identifies a degradation in ML model performance using an evaluation measure and triggers an alert email when drift is detected:

```yaml
# timezone: PST
# schedule:
#   daily>: 07:00:00

+evaluation:
  td>: queries/auc.sql
  table: '${output_database}.predicted_${test_data_table}_${session_id}'
  target_column: '${target_column}'
  positive_class: ' >50K'
  store_last_results: true
  engine: hive

+alert_if_drift_detected:
  if>: '${td.last_results.auc < 0.93}'
  _do:
    mail>:
    data: 'Detected drift in model performance. AUC was ${td.last_results.auc}.'
    subject: Drift detected
    to:
      - me@example.com
    bcc:
      - foo@example.com
      - bar@example.com
```

You can [schedule workflow executions](https://docs.digdag.io/scheduling_workflow.md?highlight=schedule) for drift detection. When drift is detected, you can send an alert email or rebuild the model using a [conditional operator](https://docs.digdag.io/operators/if.md).
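Both the *evaluation* task above and the one in [Record Evaluation Results for each Model](#record-evaluation-results-for-each-model) reference `queries/auc.sql`, which is not shown in this document. The following is a minimal sketch of such a query using Hivemall's `auc()` aggregate, assuming the predicted table contains a `probability` column next to the original target column; adjust the column names to match the output of your predict task. Because the query returns a single column named `auc` and the task sets `store_last_results: true`, the result is available to later tasks as `${td.last_results.auc}`.

```sql
-- Hypothetical sketch of queries/auc.sql (Hive engine with Hivemall).
-- ${table}, ${target_column}, and ${positive_class} are substituted by Digdag
-- from the parameters of the +evaluation task. The `probability` column name
-- is an assumption about the schema written by the predict task.
SELECT
  auc(probability, label) AS auc
FROM (
  SELECT
    probability,
    -- Map the string target to a 0/1 label for the positive class.
    if(${target_column} = '${positive_class}', 1, 0) AS label
  FROM ${table}
  -- Hivemall's auc() expects rows ordered by score in descending order.
  ORDER BY probability DESC
) t
```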