To improve energy efficiency, most deep learning accelerators use 8-bit or even lower-bit-width computing units, especially on mobile platforms. With quantization techniques, such low-bit-width accelerators can meet the accuracy requirements of inference tasks, but they cannot be used for training, because training is far more numerically sensitive. How can the architecture be extended to enable efficient training on mobile platforms?

To address this problem, we propose Cambricon-Q.

Cambricon-Q introduces three new hardware modules:

  • the SQU performs on-the-fly statistics gathering and quantization (see the sketch after this list);
  • the QBC manages the mixed-precision data formats of the on-chip buffers;
  • the NDPO performs weight updates near memory.
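
As a software analogy, the minimal NumPy sketch below shows what on-the-fly statistics and quantization amount to: gather a statistic (here, the absolute maximum) from a tensor as it streams by, then map the values to int8. The abs-max scheme and the function name are illustrative assumptions, not necessarily the exact scheme the hardware implements.

```python
import numpy as np

def quantize_with_stats(x: np.ndarray, num_bits: int = 8):
    """Software analogy of SQU-style on-the-fly quantization:
    first gather a statistic from the data, then quantize with it.
    Symmetric abs-max is one common scheme (an assumption here)."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for int8
    scale = max(np.abs(x).max() / qmax, 1e-12)  # statistic -> scale factor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                             # low-bit payload + fp scale

# Dequantization for later use: x ≈ q.astype(np.float32) * scale
```

Note that the result is a low-bit payload plus a floating-point scale; tracking this kind of mixed-precision representation in the on-chip buffers is exactly the bookkeeping the QBC handles.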

The proposed architecture supports a variety of quantization-aware training algorithms. Experiments show that Cambricon-Q achieves efficient training with negligible accuracy loss.
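
Quantization-aware training algorithms of this kind typically fake-quantize the weights in the forward pass, estimate gradients with a straight-through estimator, and apply the update to a full-precision master copy; that last step is what moving the weight update near memory, as the NDPO does, targets. A minimal sketch under those assumptions, with a hypothetical `grad_fn` standing in for the forward/backward pass:

```python
import numpy as np

def fake_quantize(w: np.ndarray, num_bits: int = 8) -> np.ndarray:
    """Quantize then dequantize so w lies on the low-bit grid."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def qat_step(w: np.ndarray, grad_fn, lr: float = 0.01) -> np.ndarray:
    """One quantization-aware training step with a straight-through
    estimator. grad_fn (hypothetical) returns dL/dw computed with
    the fake-quantized weights."""
    w_q = fake_quantize(w)   # low-bit weights used for compute
    grad = grad_fn(w_q)      # gradients taken at the quantized point
    return w - lr * grad     # update the full-precision master copy
```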

Published at ISCA 2021. [DOI]