This notebook generates sample ML datasets in the specified output database.
Find a sample workflow here in Treasure Boxes.

    +load_datasets:
      ipynb>:
        notebook: ml_datasets
        output_database: ml_datasets
        datasets: all

| Parameter name | Parameter on Console | Description | Default Value |
|---|---|---|---|
| docker.task_mem | Docker Task Mem | Task memory size. Available values are 64g, 128g (default), 256g, 384g, or 512g, depending on your contracted tier. | 128g |
| datasets | Datasets | Either "all" or a comma-separated list of the datasets to set up. | all |
| output_database | Output Database | Name of the database in which the datasets are set up. | ml_datasets |
| replace_if_exists | Replace If Exists | Replace a table if it already exists. Set to false by default. | false |

| Dataset | Description | Associated Tasks | Target Column | Number of Columns | Number of Rows |
|---|---|---|---|---|---|
| gluon | AutoGluon example dataset. | Binary / Multiclass classification | class (binary), occupation (multiclass) | 15 | 39,073 (train), 9,769 (test) |
| bank_marketing | Bank marketing dataset. Predict if the client will subscribe to a term deposit. | Binary classification | y | 21 | 28,831 (train), 12,357 (test) |
| vehicle_coupon | Vehicle coupon recommendation dataset. Recommend a coupon to a driver in different scenarios. | Multiclass classification | coupon | 26 | 8,878 (train), 3,806 (test) |
| online_retail | Online retail transactional dataset. Predict an LTV score for each customer. | Regression (CLTV prediction), RFM | cltv | 11 | 2,230 (train), 956 (test) |
| telco_churn | Telco churn event dataset. | Binary classification (Churn prediction) | churn | 21 | 4,930 (train), 2,113 (test) |
| california_house | House price dataset of California. Predict house prices. | Regression | median_house_value | 10 | 14,448 (train), 6,192 (test) |
| transition_matrix | Sample transition dataset of web access. Analyze web access transitions. | Network Analysis | - | 3 | 12 |
| ts_airline | Time-series airline passenger dataset. Forecast the number of passengers. | Time-series Forecasting (Univariate) | number_of_airline_passengers | 2 | 100 (train), 44 (test) |
| m4 | Quarterly time series of M4 dataset. | Time-series Forecasting (Multivariate) | v7 (or any v?) | 867 | 33,600 (train), 14,400 (test) |
| nba | Next-Best-Action dataset. | Next Best Action | - | 6 | 43,196 (train), 12,829 (test) |
| mta | DP6 dataset for marketing attribution models. | Multi-Touch Attribution | - | 4 | 500,000 |
| dermatology | Dermatology diseases dataset. Determine 6 types of Erythemato-Squamous disease. | Multiclass classification, Clustering | class | 35 | 366 |
| creditcard | Credit card fraud dataset. Classify anonymized transactions as fraudulent or genuine. | Binary classification (Fraud detection) | fraud | 29 | 199,364 (train), 85,443 (test) |
| cluto | Cluto dataset for clustering. | Clustering | class | 3 | 10,000 |
| covtype | Forest cover type dataset. Classification of pixels into 7 forest cover types. | Multiclass classification | target | 55 | 406,708 (train), 174,304 (test) |
| 20newsgroups | 20 newsgroup documents dataset. This data set comes from data in 20 different newsgroups. | Multiclass classification | target | 301 | 11,314 (train), 7,532 (test), 4,871 (imbalanced train) |
| cosmetics_store | Cosmetics shop e-commerce events history dataset. | RFM analysis, Clustering | - | 5 | 1,287,007 |
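To set up only some of the datasets listed above, the `datasets` parameter can take a comma-separated list instead of `all`. The following is a minimal sketch rather than a definitive configuration: the task name `+load_selected_datasets` and the database name `my_ml_sandbox` are hypothetical, and the dataset names are taken from the table above.

```yaml
# Sketch only: the task name and database name below are hypothetical.
+load_selected_datasets:
  ipynb>:
    notebook: ml_datasets
    # Hypothetical target database; the default is ml_datasets.
    output_database: my_ml_sandbox
    # Comma-separated dataset names from the table above, instead of "all".
    datasets: gluon,telco_churn
    # Overwrite tables that already exist (the default is false).
    replace_if_exists: true
```

With `replace_if_exists` left at its default of `false`, existing tables are not replaced.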