The memory usage of a Gluon Train notebook is driven primarily by the size of the input dataset, in terms of both records (rows) and columns.
When choosing an instance for training, consider both dimensions and allocate an appropriate amount of memory.
To help you estimate memory requirements for different datasets, this page lists sample memory usage for various numbers of records and columns from a specific reference dataset.
- Input Dataset: Criteo 1TB Dataset (https://ailab.criteo.com/download-criteo-1tb-click-logs-dataset/)
- Problem Type: Classification
- Time Limit: 6 * 60 * 60 seconds (= 6 hours)
Specifying a large table (such as one containing 100 million records) as input may lead to an out-of-memory error. In that case, configure the sampling_threshold parameter in the Gluon Train notebook.
Treasure Data AutoML uses 10M (10 million records) for sampling_threshold by default, which is usually sufficient; you typically do not need the whole dataset for training.
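How sampling_threshold is exposed depends on your notebook version; the snippet below is a minimal sketch that assumes the parameter is set as a plain variable in a notebook cell, with the pandas-based downsampling shown only to illustrate the effect of the cap, not the notebook's internal implementation.

```python
# Illustrative sketch: capping the number of training records.
# `sampling_threshold` corresponds to the notebook parameter described above;
# the sampling logic here is an assumption for illustration only.
import pandas as pd

sampling_threshold = 10_000_000  # default: 10M records


def sample_if_needed(df: pd.DataFrame, threshold: int = sampling_threshold) -> pd.DataFrame:
    """Downsample the training table when it exceeds the threshold."""
    if len(df) > threshold:
        return df.sample(n=threshold, random_state=42)
    return df
```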
Memory usage by number of records (39 columns):

| number of records (millions) | number of columns | memory usage (GiB) |
|---|---|---|
| 1 | 39 | 27.7 |
| 10 | 39 | 122.3 |
| 20 | 39 | 135.3 |
| 30 | 39 | 199.8 |
| 40 | 39 | 266.5 |
| 50 | 39 | 338.3 |
Memory usage by number of columns (50 million records):

| number of records (millions) | number of columns | memory usage (GiB) |
|---|---|---|
| 50 | 20 | 206 |
| 50 | 39 | 338.3 |
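If your dataset falls between the measured points above, a simple interpolation over the first table (39 columns) can give a ballpark memory figure for instance sizing. The helper below is an illustrative sketch based only on the numbers in that table, not an official sizing formula.

```python
# Illustrative only: interpolate the measured memory usage (GiB) for the
# 39-column samples above to ballpark an instance size.
MEASURED_39_COLS = [  # (records in millions, memory usage in GiB)
    (1, 27.7), (10, 122.3), (20, 135.3), (30, 199.8), (40, 266.5), (50, 338.3),
]


def estimate_memory_gib(records_millions: float) -> float:
    """Linear interpolation between measured points; values outside the range are clamped."""
    points = MEASURED_39_COLS
    if records_millions <= points[0][0]:
        return points[0][1]
    if records_millions >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= records_millions <= x1:
            return y0 + (y1 - y0) * (records_millions - x0) / (x1 - x0)
    return points[-1][1]


print(estimate_memory_gib(25))  # ~167.6 GiB, between the 20M and 30M samples
```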