Paper Notes: OPU - FPGA-Based Overlay Processor for CNNs
Last updated on August 7, 2023
Info
OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks
Authors: Yunxuan Yu, Chen Wu, Tiandong Zhao, Kun Wang, Lei He
Published in: IEEE TVLSI, vol. 28, no. 1, Jan. 2020
DOI: 10.1109/TVLSI.2019.2939726
Keywords: CNN overlay processor, FPGA acceleration, hardware-software codesign
Background & Problems
FPGA acceleration for CNNs -> automatic compilers that generate a network-specific FPGA accelerator for each CNN
- Cannot achieve the best performance
- Impractical for edge computing, where re-synthesizing and reconfiguring the FPGA for every new network is infeasible
Key Ideas
An RTL-based, hand-coded FPGA overlay domain-specific processor unit (OPU) with software programmability and fast compilation time, targeting general CNN acceleration.
Features:
- User-friendliness comparable to CPUs/GPUs
- Domain-specific ISA with optimized granularity: flexibility, efficiency and lower complexity
- FPGA-based high-performance microarchitecture: computation, data communication and reorganization
- Compiler with comprehensive optimization
Implementation
ISA
Conditional instructions (C-type)
Unconditional instructions (U-type)
```mermaid
flowchart LR
    A["1 C-type"]
    B["1 to n U-type"]
    PE["1 Processing Element (PE)<br>module"]
    subgraph IB [Instruction Block]
        subgraph IU [Instruction Unit]
            A
            B
        end
        subgraph IU1 [many instruction units]
            C[Instructions]
        end
    end
    IB --- PE
```
Each instruction block is fetched as a whole and distributed to one processing element (PE) module.
C-type
Specify the target operation and set its trigger condition
- Operation code (opcode): the target operation
- Trigger condition: when the operation is ready to execute
6 types in total:
- Memory Read
- Memory Write
- Data Fetch
- Compute
- Post Process: a combination of pooling, activation, data quantization, intermediate-result addition, and residual operations
- Instruction Read
Each one corresponds to a dedicated operation module in the PE.
U-type
Delivers the operation parameters for its paired C-type instruction
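The C-type/U-type pairing can be modeled in software. Below is a minimal Python sketch, using illustrative class and field names rather than the paper's actual encoding, of an instruction unit whose operation fires only once its trigger condition is met:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Op(Enum):
    """The six C-type operation types listed above."""
    MEMORY_READ = auto()
    MEMORY_WRITE = auto()
    DATA_FETCH = auto()
    COMPUTE = auto()
    POST_PROCESS = auto()
    INSTRUCTION_READ = auto()

@dataclass
class CType:
    op: Op          # target operation
    trigger: str    # symbolic trigger condition, e.g. "fetch_done"

@dataclass
class UType:
    params: dict    # operation parameters for the paired C-type

@dataclass
class InstructionUnit:
    c: CType        # 1 C-type ...
    u: list         # ... paired with 1-to-n U-type instructions

def dispatch(unit: InstructionUnit, flags: set) -> bool:
    """Run the unit's operation module only if its trigger condition holds."""
    if unit.c.trigger not in flags:
        return False  # condition not yet satisfied; keep waiting
    print(f"run {unit.c.op.name} with {[u.params for u in unit.u]}")
    return True

# Example: a Compute unit that waits for the data-fetch-done flag.
unit = InstructionUnit(CType(Op.COMPUTE, "fetch_done"),
                       [UType({"ker_size": 3, "stride": 1})])
dispatch(unit, flags={"mem_read_done"})  # False: still waiting
dispatch(unit, flags={"fetch_done"})     # True: trigger satisfied, executes
```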
Microarchitecture
Overlay: a reconfigurable architecture implemented on top of an FPGA. Overlays are regular designs described in structural HDL but retain reconfigurable capabilities; they can be viewed as "soft-core FPGA IPs" (semiconductor intellectual property cores).
Compiler
Given an input CNN configuration, the compiler performs:
- Operation fusion
- Network slicing
- Throughput optimization
Divided into two major stages: translation and optimization.
Extract the necessary information from the model definition files and reorganize it into a unified intermediate representation (IR); the compiler then:
- performs operation fusion to combine closely related operations
- performs data quantization (generating dynamic fixed-point representations)
- performs network slicing
- performs throughput optimization
- rearranges the processed weights
A minimal sketch of this flow is given below.
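The two-stage flow might look roughly like the following runnable sketch, assuming a simple dict-based IR and illustrative pass names (the paper's actual data structures are not detailed in these notes):

```python
def translate(model_def):
    """Translation: extract layer info into a unified IR and fuse operations."""
    ir = [{"type": t, "fused_with": []} for t in model_def]
    fused = []
    for layer in ir:
        # Illustrative p-fusion: attach a pooling layer to the preceding conv
        # so the intermediate result never leaves the chip.
        if layer["type"] == "pool" and fused and fused[-1]["type"] == "conv":
            fused[-1]["fused_with"].append("pool")
        else:
            fused.append(layer)
    return fused

def optimize(ir):
    """Optimization: quantize, slice, and schedule the fused IR."""
    for layer in ir:
        layer["precision"] = "int8"  # dynamic fixed point (see Data Quantization)
        layer["slices"] = 1          # placeholder for network slicing choices
    return ir

print(optimize(translate(["conv", "pool", "conv", "fc"])))
```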
Translation: Operation Fusion
Merge or concatenate related layer operations
- p-fusion: contributes only to off-chip memory access reduction
- r-fusion:
  - avoids communication latency
  - reduces the total number of operations and the inference time
Major layers: Convolution and Fully Connected (FC) layers
Affiliated layers: Pooling, Padding, Activation, Residual and Output Concatenation layers
r-fusion-I: batch normalization elimination, avoiding a separate computation of batch normalization (a folding sketch follows below)
r-fusion-II: input sharing, identifying layers that share the same input and reassembling them
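Batch normalization elimination is commonly done by folding the BN parameters into the preceding convolution's weights and bias; below is a minimal numpy sketch under that assumption (the paper's exact transformation may differ):

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN(gamma, beta, mean, var) into conv weights w and bias b.

    w: (out_ch, in_ch, kh, kw), b: (out_ch,); BN params are per output channel.
    """
    scale = gamma / np.sqrt(var + eps)         # per-channel scale factor
    w_folded = w * scale[:, None, None, None]  # scale each output channel
    b_folded = (b - mean) * scale + beta       # fold shift into the bias
    return w_folded, b_folded

# Sanity check on a single input patch: conv + BN == folded conv.
rng = np.random.default_rng(0)
out_ch = 4
w = rng.standard_normal((out_ch, 3, 3, 3))
b = rng.standard_normal(out_ch)
gamma, beta = rng.random(out_ch), rng.standard_normal(out_ch)
mean, var = rng.standard_normal(out_ch), rng.random(out_ch)

x = rng.standard_normal((3, 3, 3))
conv_out = np.tensordot(w, x, axes=3) + b
bn_out = gamma * (conv_out - mean) / np.sqrt(var + 1e-5) + beta

wf, bf = fold_bn(w, b, gamma, beta, mean, var)
assert np.allclose(bn_out, np.tensordot(wf, x, axes=3) + bf)
```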
Data Quantization
Uses a dynamic quantization scheme to obtain 8-bit fixed-point values.
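One common dynamic fixed-point scheme searches per-layer fraction lengths and keeps the one with the smallest reconstruction error; a minimal sketch under that assumption (the paper's exact selection rule may differ):

```python
import numpy as np

def quantize_dynamic_fixed(x, bits=8):
    """Quantize array x to `bits`-bit fixed point with a chosen fraction length."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    best_fl, best_err = 0, np.inf
    for fl in range(-8, 16):                        # candidate fraction lengths
        q = np.clip(np.round(x * 2.0 ** fl), qmin, qmax)
        err = np.mean((q / 2.0 ** fl - x) ** 2)     # reconstruction error
        if err < best_err:
            best_fl, best_err = fl, err
    q = np.clip(np.round(x * 2.0 ** best_fl), qmin, qmax).astype(np.int8)
    return q, best_fl                               # x is approx. q * 2**(-best_fl)

w = np.random.randn(64) * 0.1                       # e.g. one layer's weights
q, fl = quantize_dynamic_fixed(w)
print(f"fraction length {fl}, max abs error "
      f"{np.max(np.abs(q / 2.0 ** fl - w)):.4f}")
```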
Intermediate Representation
Experiment
MAC: multiply–accumulate operation (see Wikipedia: "Multiply–accumulate operation")
Limitation
Y. Yu, C. Wu, T. Zhao, K. Wang and L. He, “OPU: An FPGA-Based Overlay Processor for Convolutional Neural Networks,” in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 1, pp. 35-47, Jan. 2020, doi: 10.1109/TVLSI.2019.2939726.
Unfinished