Clustering

In Audience Studio, marketers can manually create segments using attribute-based rules that determine which customers belong to which segment. This is useful when a marketer knows exactly how to split the parent segment into groups for specific targeted campaigns.

In some cases, marketers may benefit from the automated creation of segments. This clustering notebook uses k-means clustering to group customers based on their attributes and form customer segments. The notebook makes multiple clustering attempts with cluster counts between the minimum and maximum number of clusters, then selects the number of clusters that maximizes the mean silhouette coefficient. Alternatively, you can provide a specific number of segments to override the automatic calculation.
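For reference, the selection criterion can be written out using the standard definition of the silhouette coefficient. This is the textbook formulation, not a description of the notebook's internals; the symbols a(i), b(i), k_min, and k_max are introduced here for illustration, with k_min and k_max standing for the minimum and maximum number of clusters.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
\bar{s}(k) = \frac{1}{n} \sum_{i=1}^{n} s(i), \qquad
k^{*} = \arg\max_{k_{\min} \le k \le k_{\max}} \bar{s}(k)
```

Here a(i) is the mean distance from row i to the other rows in its own cluster, and b(i) is the smallest mean distance from row i to the rows of any other cluster. Each s(i) lies between -1 and 1, and values closer to 1 indicate that a row sits well inside its cluster.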

While this solution notebook is primarily intended for customer segmentation, it performs general k-means clustering and can be applied to other kinds of segmentation, such as item segmentation.

Expected Inputs

This notebook automatically segments the rows of the input_table using k-means clustering. A cluster_id is assigned to each row in the input_table, and the augmented table can be exported to a Treasure Data table using the output_table option.

The system automatically derives an optimal number of clusters between min_clusters and max_clusters, and the default values of both options can be overridden. If the desired number of clusters is known in advance, you can set it explicitly with the num_clusters option.
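As a rough sketch of how these options might appear in a workflow, the two tasks below show a narrowed search range and a fixed cluster count. The task names and the numeric values are illustrative assumptions; the operator syntax and table name follow the sample workflow in this document.

```yaml
# Hypothetical task: search for the best cluster count within a narrower range.
+clustering_auto_range:
  ipynb>:
    notebook: clustering
    input_table: ml_datasets.gluon_train
    min_clusters: 3   # illustrative value
    max_clusters: 8   # illustrative value

# Hypothetical task: override the automatic calculation with a fixed count.
+clustering_fixed:
  ipynb>:
    notebook: clustering
    input_table: ml_datasets.gluon_train
    num_clusters: 3   # illustrative value
```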

Any kind of table can be used as the input_table, but it's generally recommended to use the ignore_columns option to exclude columns that carry no meaning for clustering, such as rowid and/or userid. The notebook automatically ignores columns that have a single value.
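A minimal sketch of excluding identifier-like columns is shown below. The task name is hypothetical, and it assumes the input table actually contains columns named time and rowid; ignore_columns takes a comma-separated string.

```yaml
# Hypothetical task: exclude columns that should not influence clustering.
+clustering_ignore_ids:
  ipynb>:
    notebook: clustering
    input_table: ml_datasets.gluon_train
    ignore_columns: time, rowid   # assumes these columns exist in the input table
```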

Expected Outputs

This notebook performs basic EDA (Exploratory Data Analysis) and XAI (eXplainable AI). For XAI, it computes Gini feature importance and Shapley values for the clustering labels using a RandomForest classifier.

The following plot shows the top three features for each cluster.

[Figure: top three features for each cluster]

The following plot shows mean SHAP values for each cluster. This graph illustrates which attributes contribute the most to the cluster assignment.

[Figure: mean SHAP values for each cluster]
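Besides the plots, the feature importance and SHAP values can be exported as TD tables through the export_feature_importance and export_shap_values parameters. The sketch below shows one way this might look; the task name is hypothetical, and the destination table names are example values, not required names.

```yaml
# Hypothetical task: export the explanation results alongside the clustered table.
+clustering_with_explanations:
  ipynb>:
    notebook: clustering
    input_table: ml_datasets.gluon_train
    output_table: ml_test.gluon_train_clustered_${session_id}
    export_feature_importance: ml_test.feature_importance   # example destination
    export_shap_values: ml_test.shap_values                 # example destination
```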

Workflow Example

A sample workflow is available in Treasure Boxes.

+clustering_gluon:  
  ipynb>:
    notebook: clustering  
    input_table: ml_datasets.gluon_train  
    output_table: ml_test.gluon_train_clustered_${session_id}

Parameters

| Parameter Name | Description | Required | Type | Default Value | Example Value |
|---|---|---|---|---|---|
| input_table | specify a TD table used for clustering as dbname.table_name | yes | string (dbname.table_name) | | ml_dataset.gluon_train |
| output_table | specify a TD table to export clustering results as dbname.table_name | no | string (dbname.table_name) | | ml_output.cluster |
| model_name | optionally specify a model name to save. No need to set in general. | no | string | | gluon_model |
| force_refit | force fitting even with an existing trained model. Note that setting force_refit to false is an experimental option. | no | boolean | true | true |
| output_mode | output mode for exporting output_table. Usually, there is no need to specify. | no | string (overwrite/replace or append) | overwrite | overwrite |
| min_clusters | specify a minimum number of clusters | no | integer | 2 | 5 |
| max_clusters | specify a maximum number of clusters | no | integer | 9 | 25 |
| num_clusters | specify a fixed number of clusters | no | integer | None | 3 |
| ignore_columns | columns to ignore for building a prediction model | no | string (comma separated) | time | time, rowid |
| dimension_reduction_threshold | threshold used for dimension reduction. | no | integer | 50 | 30 |
| export_feature_importance | export feature importance as a TD table if specified. | no | string ([dbname.]table_name) | None | ml_test.feature_importance |
| export_shap_values | export SHAP values for each cluster as a TD table | no | string ([dbname.]table_name) | None | ml_test.shap_values |
| hide_table_contents | suppress showing table contents | no | boolean | false | false |
| audience_name | Audience name to merge an attribute table | no | string | None | my_master_segment_name |
| foreign_key | foreign key column name of a master segment used for Audience integration. | no | string | None | td_canonical_id |
| rowid_column | rowid (primary key) column in the input_table. Required and used as an attribute table join key for Audience integration. | no | string | None | userid |

Audience integration
To add an attribute table to a CDP master segment, set all three options: audience_name, foreign_key, and rowid_column. The rowid_column is the key in the output_table that is used to join with the audience master table. Note that CDP segments are automatically generated for each cluster when those options are set. Marketers can further modify the cluster rules in the Audience Studio Rule Builder. For example, you can combine the generated segments with additional rules based on other attributes.
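A sketch of a task with all three Audience integration options set might look like the following. The task name is hypothetical, and the audience name, foreign key, and rowid column are example values that would need to match your own master segment and input table.

```yaml
# Hypothetical task: cluster and attach the result to a master segment.
+clustering_audience:
  ipynb>:
    notebook: clustering
    input_table: ml_datasets.gluon_train
    output_table: ml_test.gluon_train_clustered_${session_id}
    audience_name: my_master_segment_name   # example master segment name
    foreign_key: td_canonical_id            # example key column in the master segment
    rowid_column: userid                    # example join key column in the input_table
```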

Setting the force_refit parameter to false, which reuses pre-computed models (k-means centroids), is an experimental feature. Changing it from the default of true is not recommended.