
ML Datasets

This notebook generates sample ML datasets in the specified output database.

Workflow Example

A sample workflow is available in Treasure Boxes.

+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    datasets: all  

Parameters

| Parameter name | Parameter on Console | Description | Default Value |
| --- | --- | --- | --- |
| docker.task_mem | Docker Task Mem | Task memory size. Available values are 64g, 128g (default), 256g, 384g, or 512g, depending on your contracted tier. | 128g |
| datasets | Datasets | Either "all" or a comma-separated string naming the datasets to set up. | all |
| output_database | Output Database | Name of the database in which the datasets are set up. | ml_datasets |
| replace_if_exists | Replace If Exists | Replace a table if it already exists. | false |
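
As a sketch of how these parameters combine, the workflow below loads only two of the datasets and overwrites their tables if they already exist. The task name `+load_subset` is hypothetical. The example also assumes that the comma-separated list is written without spaces and that `replace_if_exists` accepts a YAML boolean; neither is confirmed by this page.

```yaml
# Hypothetical task name; the notebook and parameter names come from the tables on this page.
+load_subset:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    # Assumption: dataset names are separated by commas without spaces.
    datasets: gluon,telco_churn
    # Assumption: a YAML boolean is accepted here (the default is false).
    replace_if_exists: true
```

Leaving `replace_if_exists` at its default of false keeps any tables that were loaded earlier.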

Dataset Description

| Dataset | Description | Associated Tasks | Target Column | Number of Columns | Number of Rows |
| --- | --- | --- | --- | --- | --- |
| gluon | AutoGluon example dataset. | Binary / Multiclass classification | class (binary), occupation (multiclass) | 15 | 39,073 (train), 9,769 (test) |
| bank_marketing | Bank marketing dataset. Predict whether the client will subscribe to a term deposit. | Binary classification | y | 21 | 28,831 (train), 12,357 (test) |
| vehicle_coupon | Vehicle coupon recommendation dataset. Recommend a coupon to a driver in different scenarios. | Multiclass classification | coupon | 26 | 8,878 (train), 3,806 (test) |
| online_retail | Online retail transactional dataset. Predict the LTV score for each customer. | Regression (CLTV prediction), RFM | cltv | 11 | 2,230 (train), 956 (test) |
| telco_churn | Telco churn event dataset. | Binary classification (Churn prediction) | churn | 21 | 4,930 (train), 2,113 (test) |
| california_house | California house price dataset. Predict house prices. | Regression | median_house_value | 10 | 14,448 (train), 6,192 (test) |
| transition_matrix | Sample transition dataset of web access. Analyze web access transitions. | Network Analysis | - | 3 | 12 |
| ts_airline | Time-series airline passenger dataset. Forecast the number of passengers. | Time-series Forecasting (Univariate) | number_of_airline_passengers | 2 | 100 (train), 44 (test) |
| m4 | Quarterly time series of the M4 dataset. | Time-series Forecasting (Multivariate) | v7 (or any v?) | 867 | 33,600 (train), 14,400 (test) |
| nba | Next-Best-Action dataset. | Next Best Action | - | 6 | 43,196 (train), 12,829 (test) |
| mta | DP6 dataset for marketing attribution models. | Multi-Touch Attribution | - | 4 | 500,000 |
| dermatology | Dermatology diseases dataset. Determine 6 types of Erythemato-Squamous disease. | Multiclass classification, Clustering | class | 35 | 366 |
| creditcard | Credit card fraud dataset. Predict anonymized transactions as fraudulent or genuine. | Binary classification (Fraud detection) | fraud | 29 | 199,364 (train), 85,443 (test) |
| cluto | Cluto dataset for clustering. | Clustering | class | 3 | 10,000 |
| covtype | Forest cover type dataset. Classification of pixels into 7 forest cover types. | Multiclass classification | target | 55 | 406,708 (train), 174,304 (test) |
| 20newsgroups | 20 newsgroups documents dataset. The data comes from 20 different newsgroups. | Multiclass classification | target | 301 | 11,314 (train), 7,532 (test), 4,871 (imbalanced train) |
| cosmetics_store | Cosmetics shop e-commerce events history dataset. | RFM analysis, Clustering | - | 5 | 1,287,007 |