# ML Datasets

This notebook generates sample ML datasets in the specified output database.

### Workflow Example

Find a sample workflow [here in Treasure Boxes](https://github.com/treasure-data/treasure-boxes/blob/automl/machine-learning-box/automl/ml_datasets.dig).


```yaml
+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    datasets: all
```

### Parameters

| Parameter name | Parameter on Console | Description | Default Value |
|  --- | --- | --- | --- |
| docker.task_mem | Docker Task Mem | Task memory size. Available values are 64g, 128g (default), 256g, 384g, or 512g, depending on your contracted tier. | 128g |
| datasets | Datasets | Either `all` or a comma-separated list of the datasets to set up. | all |
| output_database | Output Database | Name of the database in which to set up the datasets. | ml_datasets |
| replace_if_exists | Replace If Exists | Whether to replace a table if it already exists. | false |

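The parameters above can be combined to load only a subset of the datasets. The following is a minimal sketch, not a tested workflow: the task name `+load_selected_datasets` is hypothetical, and it assumes that the `datasets` parameter accepts the names listed in the Dataset column of the Dataset Description table.

```yaml
# Hypothetical task name; any valid digdag task name can be used.
+load_selected_datasets:
  ipynb>:
    notebook: ml_datasets
    # Database that receives the generated tables.
    output_database: ml_datasets
    # Assumption: names match the Dataset column of the Dataset Description table.
    datasets: telco_churn,online_retail
    # Overwrite tables left over from a previous run (default is false).
    replace_if_exists: true
```

Leaving `replace_if_exists` at its default of `false` is the safer choice when the output database may already contain tables you want to keep.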

### Dataset Description

| Dataset | Description | Associated Tasks | Target Column | Number of Columns | Number of Rows |
|  --- | --- | --- | --- | --- | --- |
| [gluon](https://auto.gluon.ai/stable/tutorials/tabular/tabular-indepth.html) | AutoGluon example dataset. | Binary / Multiclass classification | class (binary), occupation (multiclass) | 15 | 39,073 (train), 9,769 (test) |
| [bank_marketing](https://www.kaggle.com/datasets/ruthgn/bank-marketing-data-set) | Bank marketing dataset. Predict whether the client will subscribe to a term deposit. | Binary classification | y | 21 | 28,831 (train), 12,357 (test) |
| [vehicle_coupon](https://archive.ics.uci.edu/dataset/603/in+vehicle+coupon+recommendation) | Vehicle coupon recommendation dataset. Recommend a coupon to a driver in different scenarios. | Multiclass classification | coupon | 26 | 8,878 (train), 3,806 (test) |
| [online_retail](https://archive.ics.uci.edu/ml/datasets/Online+Retail) | Online retail transactional dataset. Predict the LTV score for each customer. | Regression (CLTV prediction), RFM | cltv | 11 | 2,230 (train), 956 (test) |
| [telco_churn](https://www.kaggle.com/blastchar/telco-customer-churn/data) | Telco churn event dataset. | Binary classification (Churn prediction) | churn | 21 | 4,930 (train), 2,113 (test) |
| [california_house](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) | House price dataset of California. Predict house prices. | Regression | median_house_value | 10 | 14,448 (train), 6,192 (test) |
| transition_matrix | Sample transition dataset of web access. Analyze web access transitions. | Network Analysis | - | 3 | 12 |
| [ts_airline](https://www.sktime.net/en/stable/api_reference/auto_generated/sktime.datasets.load_airline.html) | Time-series airline passenger dataset. Forecast the number of passengers. | Time-series Forecasting (Univariate) | number_of_airline_passengers | 2 | 100 (train), 44 (test) |
| [m4](https://www.kaggle.com/datasets/yogesh94/m4-forecasting-competition-dataset) | Quarterly time series of M4 dataset. | Time-series Forecasting (Multivariate) | v7 (or any v?) | 867 | 33,600 (train), 14,400 (test) |
| nba | Next-Best-Action dataset. | Next Best Action | - | 6 | 43,196 (train), 12,829 (test) |
| [mta](https://dp6.github.io/Marketing-Attribution-Models/) | DP6 dataset for marketing attribution models. | Multi-Touch Attribution | - | 4 | 500,000 |
| [dermatology](https://archive.ics.uci.edu/ml/datasets/dermatology) | Dermatology diseases dataset. Determine 6 types of Erythemato-Squamous disease. | Multiclass classification, Clustering | class | 35 | 366 |
| [creditcard](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) | Credit card fraud dataset. Classify anonymized transactions as fraudulent or genuine. | Binary classification (Fraud detection) | fraud | 29 | 199,364 (train), 85,443 (test) |
| [cluto](http://glaros.dtc.umn.edu/gkhome/views/cluto) | Cluto dataset for clustering. | Clustering | class | 3 | 10,000 |
| [covtype](https://archive.ics.uci.edu/dataset/31/covertype) | Forest cover type dataset. Classify pixels into 7 forest cover types. | Multiclass classification | target | 55 | 406,708 (train), 174,304 (test) |
| [20newsgroups](http://qwone.com/~jason/20Newsgroups/) | 20 newsgroup documents dataset. The data comes from 20 different newsgroups. | Multiclass classification | target | 301 | 11,314 (train), 7,532 (test), 4,871 (imbalanced train) |
| [cosmetics_store](https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-cosmetics-shop) | Cosmetics shop e-commerce events history dataset. | RFM analysis, Clustering | - | 5 | 1,287,007 |