# Time Series Forecasting

Time series forecasting is the task of fitting a model to historical, time-stamped data to predict future values. This notebook trains time-series models and forecasts future values using [FLAML](https://microsoft.github.io/FLAML/). The supported models are as follows:

* [Random Forest](https://en.wikipedia.org/wiki/Random_forest)
* [Extra Trees](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html)
* [LightGBM](https://lightgbm.readthedocs.io/)
* [XGBoost](https://en.wikipedia.org/wiki/XGBoost)
* [Prophet](https://facebook.github.io/prophet)
* [ARIMA](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average)
* [SARIMAX](https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html)


This notebook also runs additional exploratory data analysis (EDA) steps and hold-out tests.

### Assumed Input Table

This notebook assumes the following table format for the training input.

| **tstamp** | **..** | **value** |
|  --- | --- | --- |
| 2022/04/21 10:00 | .. | 50 |
| 2022/04/21 10:00 | .. | 30 |
| 2022/04/21 11:00 | .. | 70 |
| 2022/04/21 11:00 | .. | 30 |
| 2022/04/21 12:00 | .. | 100 |
| 2022/04/21 12:00 | .. | 30 |


By default, tstamp_column is "tstamp" and target_column is "value", but you can specify any column names for them.

Optionally, you can provide [exogenous variables](https://timeseriesreasoning.com/contents/exogenous-and-endogenous-variables/). When forecasting [daily store sales of a drug store chain](https://www.kaggle.com/competitions/rossmann-store-sales/), for instance, you can specify exogenous_columns: weather, promotions, store_type as auxiliary features that help explain daily sales.

| **tstamp** | **weather** | **promotions** | **store_type** | **sales** |
|  --- | --- | --- | --- | --- |
| 1960-12-01 | cloudy | 2 | city_large | 459 |
| 1961-01-01 | sunny | 1 | country_small | 935 |
| ... | ... | ... | ... | ... |
| 1965-12-01 | rainy | 0 | city_small | 886 |

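As a rough sketch, a workflow task for the sales table above might pass the exogenous columns as shown below. The table names are hypothetical placeholders, and supplying future values of the exogenous columns through test_table is an assumption rather than a documented requirement.

```yaml
+run_ts_forecast_sales:
  ipynb>:
    notebook: ts_forecast
    train_table: ml_datasets.store_sales          # hypothetical table name
    test_table: ml_datasets.store_sales_future    # hypothetical; assumed to carry tstamp and the exogenous columns
    target_column: sales                          # overrides the default "value"
    exogenous_columns: weather, promotions, store_type
    output_table: ml_test.store_sales_predicted   # hypothetical table name
```

Because tstamp_column is omitted here, the default "tstamp" applies.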

### Sample Output

If forecast_length=30 is specified, 30 records beyond the end of the training data are forecasted. If test_table is provided instead, forecasts are produced for the rows of the test data. The test_table must have at least the tstamp_column ("tstamp" by default). The target_column ("value" by default) is attached to the output_table.

Note that if tstamp_column does not contain valid datetime values, pesudo_tstamp is used instead and is also included in the output.

| **tstamp** | **value** |
|  --- | --- |
| 1960-12-01 | 0.29304519295692444 |
| 1961-01-01 | 0.00487339636310935 |
| ... | ... |
| 1965-12-01 | 0.5266873240470886 |


The forecasted results are visualized as follows:
![](/assets/72061491.f86a6dc8602721837e43b7c44d7179d7564b702ff905e73ef0ada9babe5af4c1.3cb60505.png)

![](/assets/72061490.61770e0241024ecabb344fea215601a93a19e55b47898848f5c0eda98eaaec88.3cb60505.png)

### Workflow Example

Find a sample workflow [here in Treasure Boxes](https://github.com/treasure-data/treasure-boxes/blob/automl/machine-learning-box/automl/ts_forecast.dig).

```yaml
+run_ts_forecast:
  ipynb>:
    notebook: ts_forecast
    train_table: ml_datasets.ts_airline
    tstamp_column: period
    forecast_length: 30
    output_table: ml_test.ts_airline_predicted
```

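The sample workflow above uses forecast_length. A variant that forecasts for the rows of a test table instead might look like the following sketch. The test table name is a hypothetical placeholder; per the Sample Output section, the table needs at least the timestamp column.

```yaml
+run_ts_forecast_with_test_table:
  ipynb>:
    notebook: ts_forecast
    train_table: ml_datasets.ts_airline
    test_table: ml_datasets.ts_airline_test   # hypothetical table name; must contain the "period" column
    tstamp_column: period
    output_table: ml_test.ts_airline_predicted
```
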
### Parameters

| Parameter name | Parameter on Console | Description | Default Value |
|  --- | --- | --- | --- |
| docker.task_mem | Docker Task Mem | Task memory size. Available values are 64g, 128g (default), 256g, 384g, or 512g, depending on your contracted tier. | 128g |
| train_table | Train Table | TD table used for training, specified as dbname.table_name. | - |
| forecast_length | Forecast Length | Length of the forecasting output. Either test_table or forecast_length is required. | - |
| forecast_freq | Forecast Freq | Explicit frequency for forecasting. Accepted values: D - daily, W - weekly, M - monthly, Q - quarterly, Y - yearly. If not specified, the value is inferred from the data. | - |
| test_table | Test Table | TD table name used for prediction. Either test_table or forecast_length is required. | - |
| tstamp_column | Tstamp Column | Timestamp column used to sort the time series data. | tstamp |
| target_column | Target Column | Column name used for the label. | value |
| output_table | Output Table | TD table name to export the prediction result to. | - |
| output_mode | Output Mode | Output mode for exporting output_table: overwrite/replace or append. Usually there is no need to specify it; use "append" for semi-realtime prediction. | overwrite |
| exogenous_columns | Exogenous Columns | Columns that can be used as prediction input. Use "*" to select all columns in the train_table. | - |
| ignore_columns | Ignore Columns | Columns to ignore as exogenous variables. | time |
| estimators | Estimators | Estimators used for time-series forecasting, given as a comma-separated list. Supported estimators include prophet, arima, and lgbm. | prophet,arima,lgbm,xgboost,xgb_limitdepth |
| time_limit | Time Limit | Soft limit on the training time budget, in seconds. | 60 * 60 |
| sampling_threshold | Sampling Threshold | Threshold used for sampling training data. | 10_000_000 |
| hide_table_contents | Hide Table Contents | Suppress showing table contents. | false |
| calibration | Calibration | If true, the output value will be calibrated. | false |
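
To illustrate how several of the optional parameters above combine, the following sketch narrows the estimator list, shortens the time budget, and sets an explicit monthly frequency. The specific values are illustrative assumptions, not recommendations, and the table names are reused from the sample workflow.

```yaml
+run_ts_forecast_tuned:
  ipynb>:
    notebook: ts_forecast
    train_table: ml_datasets.ts_airline
    tstamp_column: period
    forecast_length: 12          # illustrative value: 12 steps ahead
    forecast_freq: M             # monthly; inferred from the data if omitted
    estimators: prophet,arima    # subset of the default estimator list
    time_limit: 1800             # soft training budget in seconds (default is 60 * 60)
    output_table: ml_test.ts_airline_predicted
```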