Zhao, Yongwei

Cambricon-Q: A Hybrid Architecture for Efficient Training

2021-06-24 talk paper conference isca

In pursuit of energy efficiency, most deep learning accelerators use 8-bit or even lower bit-width computing units, especially on mobile platforms. With special techniques, such low-bit-width accelerators can meet the accuracy requirements of inference tasks, but they cannot be used for training, because training is far more numerically sensitive. How can the architecture be extended to enable efficient training on mobile devices?

To address this problem, we proposed Cambricon-Q.

Cambricon-Q introduces three new modules:

SQU supports on-the-fly statistics and quantization;
QBC manages the mixed precision and data format for the on-chip buffers;
NDPO performs the weight update near memory.

The proposed architecture can support a variety of quantization-aware training algorithms. Experiments show that Cambricon-Q achieves efficient training with negligible accuracy loss.
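
As a rough software illustration of what "on-the-fly statistics and quantization" can mean, the Python sketch below gathers a max-abs statistic over a block of values and quantizes the block to 8-bit integers with one shared scale. The function names, the max-abs statistic, and the symmetric 8-bit scheme are assumptions chosen for illustration; they do not describe the actual SQU hardware or any specific algorithm from the paper.

```python
def quantize_block(values, bits=8):
    """Symmetric quantization using a scale derived from the block itself."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit
    # Statistic pass: largest magnitude in the block (assumed statistic).
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Quantization pass: scale, round, and clamp to the integer range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize_block(q, scale):
    """Map the integers back to approximate real values."""
    return [x * scale for x in q]


grads = [0.5, -1.0, 0.25, 0.0]   # hypothetical gradient block
q, scale = quantize_block(grads)
restored = dequantize_block(q, scale)
```

Each restored value differs from the original by at most half a quantization step, which is the kind of error a quantization-aware training algorithm has to tolerate.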

Published at ISCA 2021. [DOI]

Cambricon-FR: Fractal Reconfigurable ISA Machines (Universal Fractal Machines)

2020-07-01 paper journal tc

This work follows Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture.

Cambricon-F obtains programming scale-invariance via fractal execution, alleviating the programming productivity problem of machine learning computers. However, fractal execution on this computer is carried out by the hardware controller and supports only a few common basic operators (convolution, pooling, etc.). Other functions must be built as sequences of these operators. We found that inefficiency arises when a limited, fixed instruction set is used to support complex and variable workloads.

When supporting regular algorithms such as conventional CNNs, the machine can achieve optimal efficiency. However, in complex and variable application scenarios, inefficiency can occur even if the application itself conforms to the definition of a fractal operation. We define this inefficiency as suboptimal computational or communication complexity when certain applications are executed on a fractal computer. The paper uses TopK and 3DConv to illustrate it.

An intuitive example: suppose the user wants to run a Bayesian network, an application that conforms to the definition of a fractal operation and could be executed efficiently in a fractal manner. Because Cambricon-F has no “Bayesian” instruction, the application can only be decomposed into a series of basic operations that are then executed serially. If the instruction set could be extended with a BAYES fractal instruction, fractal execution could be maintained all the way down to the leaf nodes, significantly improving computational efficiency.
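
To make "conforms to the definition of fractal operation" more concrete, the Python sketch below takes it to mean that an operation can be evaluated by applying the same operation to sub-blocks of its input and then combining the partial results. That reading, the function name `topk_fractal`, and the `leaf_size` parameter are assumptions for illustration; the sketch shows only that TopK admits such a decomposition, not how the paper analyzes its inefficiency.

```python
import heapq


def topk_fractal(values, k, leaf_size=4):
    """Top-k computed by self-similar decomposition (illustrative only)."""
    if len(values) <= leaf_size:
        # Leaf: small enough to solve directly.
        return heapq.nlargest(k, values)
    mid = len(values) // 2
    # The same operation is applied to each half ...
    left = topk_fractal(values[:mid], k, leaf_size)
    right = topk_fractal(values[mid:], k, leaf_size)
    # ... and the partial results are combined with the same operation.
    return heapq.nlargest(k, left + right)


data = [7, 2, 9, 4, 1, 8, 3, 6, 5, 0]
top3 = topk_fractal(data, 3)
```

The combining step works because every element of the global top-k must also be in the top-k of whichever half contains it.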

Based on this, we improved the architecture of Cambricon-F and proposed Cambricon-FR, which has a fractal reconfigurable instruction set architecture. Analytically, Cambricon-F is a Fractal Machine, while Cambricon-FR can be seen as a Universal Fractal Machine: Cambricon-F achieves optimal efficiency on specific workloads, while Cambricon-FR achieves optimal efficiency on complex and variable workloads.

Published in “IEEE Transactions on Computers”. [DOI]

Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture

2019-06-26 paper conference isca

While working as a software architect at Cambricon Tech, I came to understand the pain points of software engineering deeply. When I took over in 2016, the core software was developed by WANG Yuqing and me and comprised 15,000 lines of code; when I left in 2018, the development team had grown to more than 60 people and the code base to 720,000 lines. Measured in lines of code, the complexity of the software doubled every 5 months. No matter how much manpower was added, the team remained under tremendous development pressure: customer needs were urgent and had to be dealt with immediately; new features had to be developed; accumulated old code had to be refactored; documentation had not yet been written; tests had not yet been set up…
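
The doubling figure can be checked with quick arithmetic. The sketch below uses only the two line counts quoted above; the variable names are illustrative, and the exact start and end months are not stated, so the implied time span is a derived estimate rather than a reported number.

```python
import math

loc_start = 15_000     # lines of code in 2016
loc_end = 720_000      # lines of code in 2018

growth = loc_end / loc_start          # 48x growth overall
doublings = math.log2(growth)         # about 5.6 doublings
implied_months = 5 * doublings        # span implied by one doubling per 5 months
```

A doubling every 5 months implies roughly 28 months of growth, which fits a period running from some point in 2016 to some point in 2018.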

I may not be a professional software architect, but who can guarantee that future changes are foreseen from the very beginning? Just imagine: the underlying hardware was single-core; a year later it became multi-core; another year later it became NUMA. With such rapid evolution, how can the same software keep up without thorough refactoring? The key to the problem is that as the scale of the hardware increases, the levels of abstraction that must be programmed and controlled also increase, making programming more complicated. We define this problem as programming scale-variance.

To solve this problem, which arose from engineering practice, we started the research that became Cambricon-F.

To address the scale-variance of programming, it is necessary to introduce some kind of scale invariant. The invariant we found is the fractal: geometric fractals are self-similar at different scales. We define the workload in a fractal manner, and the hardware architecture as well. Both can then be zoomed freely until scales at which they are compatible with each other are found.

Cambricon-F first proposed the Fractal von Neumann Architecture. The key features of this architecture are:

Sequential code, parallel execution adapted to the hardware scale automatically;
Programming scale-invariance: the hardware scale is not encoded in the program, so code can move freely between different Cambricon-F instances;
High efficiency retained by fractal pipelining.
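
The first two features can be sketched in a few lines of Python. In this toy model, which is an assumption-laden illustration and not the Cambricon-F design, a machine is a tree of identical nodes, and each controller splits an instruction's operands and hands the same instruction to its children until a leaf performs the arithmetic. `Machine`, `vec_add`, and `program` are hypothetical names; the loop over children runs sequentially here, and fractal pipelining is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical machine description: a tree of identical nodes."""
    levels: int   # controller levels above the leaf compute units
    fanout: int   # children per controller


def vec_add(machine, a, b, level=0):
    """One 'instruction': element-wise add, decomposed level by level."""
    if level == machine.levels or len(a) <= 1:
        # Leaf node: do the actual arithmetic.
        return [x + y for x, y in zip(a, b)]
    # Controller node: split the operands and issue the same instruction
    # to each child; on real hardware the children could run in parallel.
    chunk = -(-len(a) // machine.fanout)   # ceiling division
    out = []
    for i in range(0, len(a), chunk):
        out.extend(vec_add(machine, a[i:i + chunk], b[i:i + chunk], level + 1))
    return out


def program(machine):
    """Sequential user code: it never inspects the machine's scale."""
    a = list(range(8))
    b = [10] * 8
    return vec_add(machine, a, b)


small = Machine(levels=1, fanout=2)
large = Machine(levels=3, fanout=4)
result_small = program(small)
result_large = program(large)
```

The same `program` runs unchanged on both machine descriptions and produces the same result; only the decomposition performed below the instruction boundary differs.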

Published at ISCA 2019. [DOI]