Article | Mathematics and Computer Science

Training on the edge

Living on the edge – to some, the words may bring to mind living life to the fullest, to others, taking unnecessary risks. Computing on the edge similarly implies benefits and challenges.

With edge computing, computing capability is distributed across a large number of small devices rather than concentrated at a central point, bringing the action – data, services, computation – closer to the user. Inference on the edge has thus become common.

But training on the edge has not.

The reasons are numerous. If the information to be learned is relevant to other edge nodes, the updated model must be transferred among them, potentially introducing excessive communication along with increased bandwidth demands and latency. And even if the new information is relevant only to training on the current node, the node’s limited capacity may be unable to meet the memory requirements.

To address these issues, a team of researchers from Imperial College London, Inria Bordeaux, and Argonne National Laboratory examined scenarios involving training on the edge and the kinds of strategy that can make such training worthwhile.

The researchers first looked at the so-called viewpoint problem.  

“When the images collected are from a set angle, say at eye level with the subject facing the camera, the ‘teacher’ model will be trained to recognize images taken only at similar angles,” said Nicola Ferrier, a senior scientist in the Mathematics and Computer Science (MCS) division at Argonne National Laboratory. If the subject appears in other parts of the same frame, however, an object-tracking model can set these images aside, and they can be used to train a new “student” model. The model running in each node can even be customized to its own viewpoint, and the approach has the advantage that no data beyond the original teacher model needs to be transferred to the node.
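The teacher-student idea can be sketched in miniature. In this toy example – entirely illustrative, not from the paper – the “teacher” is a fixed scoring function standing in for the pre-trained model, and the “student” is a one-parameter linear model fit by stochastic gradient descent to the teacher’s outputs on frames the tracker has set aside. The point is that the student learns from the teacher’s labels alone, so no ground-truth annotations need to be shipped to the edge node.

```python
import random

random.seed(0)

def teacher(x):
    # Stand-in for the pre-trained teacher model:
    # maps a scalar feature to a soft label.
    return 2.0 * x + 1.0

# Frames the object tracker set aside (off-angle views),
# represented here as scalar features.
frames = [random.uniform(-1, 1) for _ in range(200)]

# Student: y = w*x + b, trained by SGD against the teacher's
# soft labels rather than ground-truth annotations.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    x = random.choice(frames)
    err = (w * x + b) - teacher(x)
    w -= lr * err * x
    b -= lr * err

print(round(w, 2), round(b, 2))  # student converges toward the teacher
```

A real deployment would replace the scalar model with a neural network and the scoring function with the transferred teacher model, but the training loop has the same shape.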

But this student-teacher approach still poses computational challenges. For example, with larger batches, keeping the model in memory may be impossible even at standard resolution. And at higher resolutions, the memory problem worsens even for smaller batch sizes.

Another possible approach is checkpointing, which many neural networks use today. A checkpoint contains information, such as the trained model state and parameters, that can be saved and used to resume training from that point. Keeping the number of checkpoints low can reduce the peak memory footprint, but important states may be lost. Moreover, some programs have a lower bound below which memory cannot be reduced; for edge devices, this bound can be critical when large models are used.
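The trade-off can be made concrete with a back-of-the-envelope model. Assume – purely for illustration – a chain of layers whose activations each occupy one unit of memory, with a checkpoint kept every k layers; anything between two checkpoints must be recomputed during the backward pass. Fewer checkpoints mean lower peak memory but more repeated forward work:

```python
import math

def checkpoint_costs(n_layers, every_k):
    """Rough cost model for uniform activation checkpointing.

    Assumes each layer's activation takes one memory unit and that
    the segment between two checkpoints is recomputed on demand.
    """
    stored = math.ceil(n_layers / every_k)   # checkpoints kept resident
    peak = stored + every_k                  # plus one live segment
    recomputed = n_layers - stored           # layers run a second time
    return peak, recomputed

n = 100
for k in (1, 10, 50):
    peak, extra = checkpoint_costs(n, k)
    print(f"k={k:>2}: peak memory {peak:>3}, recomputed layers {extra}")
```

With k = 1 everything is stored and nothing is recomputed; with k = 50 peak memory roughly halves, at the cost of recomputing nearly the whole chain.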

The researchers then explored how edge computing can benefit from binomial checkpointing.

“A key consideration is the recompute factor, the ratio by which the time to solution is extended by the recomputations induced by memory-saving checkpointing,” said Paul Hovland, deputy director of the MCS division.

The researchers showed that for small recompute factors, the memory requirement is often prohibitively high, especially for an edge node. On the other hand, for a larger recompute factor, the lower memory consumption allows larger batch sizes, without prohibitively increasing the processing time.

For the full paper describing the research, see

N. Kukreja, A. Shilova, O. Beaumont, J. Hückelheim, N. Ferrier, P. Hovland, and G. Gorman, “Training on the Edge: The why and the how,” https://arxiv.org/pdf/1903.03051.pdf