ML Datasets

This notebook generates sample ML datasets in the specified output database.

Workflow Example

Find a sample workflow here in Treasure Boxes.

+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    datasets: all

Parameters

Parameter name	Parameter on Console	Description	Default Value
docker.task_mem	Docker Task Mem	Task memory size. Available values are 64g, 128g (default), 256g, 384g, or 512g depending on your contracted tiers.	128g
datasets	Datasets	Either "all" or a comma-separated list of the dataset names to set up.	all
output_database	Output Database	Name of the database in which the datasets are set up.	ml_datasets
replace_if_exists	Replace If Exists	Replace a table if it already exists. Set to false by default.	false
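
As a minimal sketch of how these parameters might be combined, the workflow below loads only two datasets and overwrites any existing tables. It assumes that `replace_if_exists` is passed as a key under `ipynb>:` in the same way as `datasets` and `output_database` in the workflow example, and that the comma-separated list uses the dataset names from the Dataset Description table without spaces; neither assumption is confirmed on this page.

```yaml
+load_datasets:
  ipynb>:
    notebook: ml_datasets
    # Assumed target database; same as the default value
    output_database: ml_datasets
    # Assumed format: dataset names from the table, comma-separated
    datasets: gluon,telco_churn
    # Assumed to be accepted here like the other parameters
    replace_if_exists: true
```

With `replace_if_exists` left at its default of false, tables that already exist would not be replaced.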

Dataset Description

Dataset	Description	Associated Tasks	Target Column	Number of Columns	Number of Rows
gluon	AutoGluon example dataset.	Binary / Multiclass classification	class (binary), occupation (multiclass)	15	39,073 (train), 9,769 (test)
bank_marketing	Bank marketing dataset. Predict if the client will subscribe to a term deposit.	Binary classification	y	21	28,831 (train), 12,357 (test)
vehicle_coupon	Vehicle coupon recommendation dataset. Recommend a coupon to a driver in different scenarios.	Multiclass classification	coupon	26	8,878 (train), 3,806 (test)
online_retail	Online retail transactional dataset. Predict LTV score for each customer.	Regression (CLTV prediction), RFM	cltv	11	2,230 (train), 956 (test)
telco_churn	Telco churn event dataset.	Binary classification (Churn prediction)	churn	21	4,930 (train), 2,113 (test)
california_house	House price dataset of California. Predict house prices.	Regression	median_house_value	10	14,448 (train), 6,192 (test)
transition_matrix	Sample transition dataset of web access. Analyze web access transitions.	Network Analysis	-	3	12
ts_airline	Time-series airline passenger dataset. Forecast the number of passengers.	Time-series Forecasting (Univariate)	number_of_airline_passengers	2	100 (train), 44 (test)
m4	Quarterly time series of M4 dataset.	Time-series Forecasting (Multivariate)	v7 (or any v?)	867	33,600 (train), 14,400 (test)
nba	Next-Best-Action dataset.	Next Best Action	-	6	43,196 (train), 12,829 (test)
mta	DP6 dataset for marketing attribution models.	Multi-Touch Attribution	-	4	500,000
dermatology	Dermatology diseases dataset. Determine 6 types of erythemato-squamous disease.	Multiclass classification, Clustering	class	35	366
creditcard	Credit card fraud dataset. Classify anonymized transactions as fraudulent or genuine.	Binary classification (Fraud detection)	fraud	29	199,364 (train), 85,443 (test)
cluto	Cluto dataset for clustering.	Clustering	class	3	10,000
covtype	Forest cover type dataset. Classification of pixels into 7 forest cover types.	Multiclass classification	target	55	406,708 (train), 174,304 (test)
20newsgroups	20 newsgroups documents dataset. The documents come from 20 different newsgroups.	Multiclass classification	target	301	11,314 (train), 7,532 (test), 4,871 (imbalanced train)
cosmetics_store	Cosmetics shop e-commerce events history dataset.	RFM analysis, Clustering	-	5	1,287,007
