# ML Datasets

This notebook generates sample ML datasets in the specified output database.

### Workflow Example

Find a sample workflow [here in Treasure Boxes](https://github.com/treasure-data/treasure-boxes/blob/automl/machine-learning-box/automl/ml_datasets.dig).

```yaml
+load_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    datasets: all
```

### Parameters

| Parameter name | Parameter on Console | Description | Default Value |
| --- | --- | --- | --- |
| docker.task_mem | Docker Task Mem | Task memory size. Available values are 64g, 128g (default), 256g, 384g, or 512g, depending on your contracted tier. | 128g |
| datasets | Datasets | `all`, or a comma-separated list of the datasets to set up. | all |
| output_database | Output Database | Name of the database in which the datasets are set up. | ml_datasets |
| replace_if_exists | Replace If Exists | Whether to replace a table if it already exists. | false |

### Dataset Description

| Dataset | Description | Associated Tasks | Target Column | Number of Columns | Number of Rows |
| --- | --- | --- | --- | --- | --- |
| [gluon](https://auto.gluon.ai/stable/tutorials/tabular/tabular-indepth.md) | AutoGluon example dataset. | Binary / Multiclass classification | class (binary), occupation (multiclass) | 15 | 39,073 (train), 9,769 (test) |
| [bank_marketing](https://www.kaggle.com/datasets/ruthgn/bank-marketing-data-set) | Bank marketing dataset. Predict whether the client will subscribe to a term deposit. | Binary classification | y | 21 | 28,831 (train), 12,357 (test) |
| [vehicle_coupon](https://archive.ics.uci.edu/dataset/603/in+vehicle+coupon+recommendation) | Vehicle coupon recommendation dataset. Recommend a coupon to a driver in different scenarios. | Multiclass classification | coupon | 26 | 8,878 (train), 3,806 (test) |
| [online_retail](https://archive.ics.uci.edu/ml/datasets/Online+Retail) | Online retail transactional dataset. Predict the LTV score for each customer. | Regression (CLTV prediction), RFM | cltv | 11 | 2,230 (train), 956 (test) |
| [telco_churn](https://www.kaggle.com/blastchar/telco-customer-churn/data) | Telco churn event dataset. | Binary classification (Churn prediction) | churn | 21 | 4,930 (train), 2,113 (test) |
| [california_house](https://scikit-learn.org/stable/datasets/real_world.md#california-housing-dataset) | House price dataset of California. Predict house prices. | Regression | median_house_value | 10 | 14,448 (train), 6,192 (test) |
| transition_matrix | Sample transition dataset of web access. Analyze web access transitions. | Network Analysis | - | 3 | 12 |
| [ts_airline](https://www.sktime.net/en/stable/api_reference/auto_generated/sktime.datasets.load_airline.md) | Time-series airline passenger dataset. Forecast the number of passengers. | Time-series Forecasting (Univariate) | number_of_airline_passengers | 2 | 100 (train), 44 (test) |
| [m4](https://www.kaggle.com/datasets/yogesh94/m4-forecasting-competition-dataset) | Quarterly time series of M4 dataset. | Time-series Forecasting (Multivariate) | v7 (or any v?) | 867 | 33,600 (train), 14,400 (test) |
| nba | Next-Best-Action dataset. | Next Best Action | - | 6 | 43,196 (train), 12,829 (test) |
| [mta](https://dp6.github.io/Marketing-Attribution-Models/) | DP6 dataset for marketing attribution models. | Multi-Touch Attribution | - | 4 | 500,000 |
| [dermatology](https://archive.ics.uci.edu/ml/datasets/dermatology) | Dermatology diseases dataset. Determine 6 types of Erythemato-Squamous disease. | Multiclass classification, Clustering | class | 35 | 366 |
| [creditcard](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) | Credit card fraud dataset. Classify anonymized transactions as fraudulent or genuine. | Binary classification (Fraud detection) | fraud | 29 | 199,364 (train), 85,443 (test) |
| [cluto](http://glaros.dtc.umn.edu/gkhome/views/cluto) | Cluto dataset for clustering. | Clustering | class | 3 | 10,000 |
| [covtype](https://archive.ics.uci.edu/dataset/31/covertype) | Forest cover type dataset. Classification of pixels into 7 forest cover types. | Multiclass classification | target | 55 | 406,708 (train), 174,304 (test) |
| [20newsgroups](http://qwone.com/~jason/20Newsgroups/) | 20 newsgroup documents dataset. This data set comes from data in 20 different newsgroups. | Multiclass classification | target | 301 | 11,314 (train), 7,532 (test), 4,871 (imbalanced train) |
| [cosmetics_store](https://www.kaggle.com/datasets/mkechinov/ecommerce-events-history-in-cosmetics-shop) | Cosmetics shop e-commerce events history dataset. | RFM analysis, Clustering | - | 5 | 1,287,007 |
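
As a variation on the workflow example, the sketch below loads only a subset of the datasets and overwrites any existing tables. It uses only the parameters listed in the Parameters table. The task name `+load_selected_datasets` and the choice of datasets are illustrative; the sketch also assumes that the comma-separated value is written without spaces and that `replace_if_exists` is passed at the same level as the other notebook parameters.

```yaml
# Hypothetical task name; the parameters come from the Parameters table.
+load_selected_datasets:
  ipynb>:
    notebook: ml_datasets
    output_database: ml_datasets
    # Assumption: comma-separated dataset names, written without spaces.
    datasets: telco_churn,bank_marketing
    # Assumption: passed alongside the other notebook parameters.
    replace_if_exists: true
```

Leaving `replace_if_exists` at its default of `false` is the safer choice when the output database may already hold tables you want to keep.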