Residue Number Systems Quantization for Deep Learning Inference
SERGEY SIVKOV
Electrical Engineering Faculty
Perm National Research Polytechnic University
614013, Perm, 7 Professora Pozdeeva Street, Office 225
RUSSIA
Abstract: Quantizing learned CNN weights to a Residue Number System (RNS) can improve inference latency by taking advantage of fast and exact low-bit integer arithmetic. In this paper we review the mathematical aspects of RNS operations on signed integer values and evaluate implementation choices for converting conventional floating-point PyTorch weights of CNN models to an RNS representation. We also present a workflow for converting the weights of PyTorch neural network layers typical of the computer vision domain to 4-bit RNS moduli-sets that maintain classification accuracy within 5% of an 8-bit quantization baseline.
Key-Words: FPGA, inference, RNS, Residue Number System, Verilog, PyTorch, Quantization
1 Introduction
At the present time, most neural networks are trained with weights represented in the 32-bit single-precision floating-point format. Pretrained weights can be converted (quantized) [1] to other floating-point formats, such as 16-bit floating point (IEEE fp16), without noticeable loss of classification accuracy. It is also possible to convert the weights to integer formats, at the cost of a greater loss of accuracy during inference.
The main advantage of quantization to integer formats is increased performance: for 8-bit integers this is, theoretically, a 16x increase in computation throughput and a 4x reduction in the required data bandwidth [2]. In addition, an inference engine implemented with low-bit integers can be used on low-power or wearable devices that have no hardware floating-point arithmetic.
Most of the inference latency of a neural network comes from matrix computations, which reduce to a sequence of multiplication and addition operations. Looking at the implementation of a ripple-carry adder and at a multiplier built as an array of adders, it is clear that the execution time of addition grows linearly with the operand bit width, O(N), while the execution time of multiplication grows quadratically, O(N²). Reducing the operand bit width also allows a circuit with a shorter critical path, so its clock frequency can be increased.
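To make this concrete, a rough delay estimate under the model just described (one full-adder delay t_FA per carry stage, and a multiplier built as a chain of ripple-carry additions; a simplified sketch, not a synthesis result) is

T_add(N) ≈ N · t_FA,    T_mul(N) ≈ N · T_add(N) ≈ N² · t_FA,

so for N = 8 the multiplication path is on the order of 64 full-adder delays versus 8 for addition, and shrinking the operand width shortens both.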
In addition to positional number systems, non-positional number systems are also known, for example Residue Number Systems (RNS) [3], in which independent and parallel operations on low-bit coprime bases make it possible to perform the same basic operations in O(N) time for addition and O(N) time for multiplication. We propose to use RNS for an FPGA-based implementation of the basic operations required by a neural network accelerator.
2 Quantization Fundamentals
Increasing the inference speed of neural networks is a pressing task that can be addressed by quantizing the weights into integers. Quantization is based on the affine transformation f(x) = s·x + z, which maps an input value x ∈ [β, α] into the range [−2^(b−1), 2^(b−1) − 1].
Because of the way signed numbers are formed in RNS, we consider only quantization onto a symmetric interval, with the offset z = 0 (only scaling the interval, without shifting its zero point). This transformation is defined for the interval x ∈ [−α, α], and the coefficient s is found as

s = 2^(b−1) / α                (1)
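As an illustration, a minimal PyTorch-style sketch of this symmetric scheme (the helper name quantize_symmetric is hypothetical and not part of the paper's code) could look as follows:

import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 8):
    # Symmetric quantization per Eq. (1): z = 0, s = 2^(b-1) / alpha,
    # where alpha is the largest absolute value of the tensor.
    alpha = w.abs().max()
    s = (2 ** (bits - 1)) / alpha
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(w * s), qmin, qmax).to(torch.int32)
    return q, s  # integer weights and the scale used for dequantization (q / s)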
This scheme is implemented, for example, in the PyTorch torch.ao.quantization module for quantization to 8-bit integers. If the source code of the model is available, this approach allows layer-by-layer quantization of the model. Unfortunately, the module does not yet support quantization to smaller data types, for example to 4-bit integers, or quantization with a single coefficient s for the entire model. Also, this approach assumes the use
of either higher-bit integers or floating-point arithmetic to renormalize the weights after each layer's forward pass, which introduces additional delay into inference on a low-bit CPU.
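Since the stock module does not cover that case, the following sketch shows how a single scale for the entire model could be applied to a PyTorch state_dict (the helper quantize_model_single_scale is hypothetical and only illustrates the idea; it is not the reference implementation used later in this paper):

import torch

def quantize_model_single_scale(state_dict, bits: int = 4):
    # One global alpha over all tensors gives a single scale s for the whole model.
    # Assumes every entry of state_dict is a floating-point weight or bias tensor.
    alpha = max(float(t.abs().max()) for t in state_dict.values())
    s = (2 ** (bits - 1)) / alpha
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    quantized = {name: torch.clamp(torch.round(t * s), qmin, qmax).to(torch.int8)
                 for name, t in state_dict.items()}
    # 4-bit values are stored in int8 containers; s is applied once to rescale outputs.
    return quantized, s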
2.1 Unsigned RNS Fundamentals
The limitations inherent in calculations on numbers represented in a positional number system can be overcome by using calculations on numbers represented in RNS. Let us show the basic operations of RNS arithmetic on a simple example with a moduli-set of two coprime bases:

p1 = 3 and p2 = 7

With this choice of moduli-set, the range of representable numbers is [0 .. 3·7 − 1], i.e. [0 .. 20].
Let us take the number 17 and convert it to RNS (hereinafter, % denotes the operation of taking the remainder modulo):

17 = (17 % 3; 17 % 7) = (2; 3)

This method requires repeated use of the remainder-of-integer-division operation. There is another way to convert quickly from a positional number system, such as decimal, hexadecimal or binary, based on the basic property of RNS: operations on the bases are performed in parallel and independently.
For example, consider the conversion from the decimal system to RNS:

17 = 1·10^1 + 7·10^0

Knowing the representation of all powers of 10 within the range allowed for the RNS:

10^0 = (1; 1), because 1 % 3 = 1 and 1 % 7 = 1
10^1 = (1; 3), because 10 % 3 = 1 and 10 % 7 = 3

and knowing the representation of each digit of the number in RNS:

1 = (1 % 3; 1 % 7) = (1; 1)
7 = (7 % 3; 7 % 7) = (1; 0)

we obtain the conversion as

17 = 1·10^1 + 7·10^0 = (1; 1)·(1; 3) + (1; 0)·(1; 1) = (1; 3) + (1; 0) = (2; 3)
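A small Python sketch (illustrative only; the helper names are ours, not part of the paper's code) reproduces both conversion routes for the moduli-set (3, 7):

MODULI = (3, 7)

def to_rns(x, moduli=MODULI):
    # Direct conversion: take the remainder modulo every base.
    return tuple(x % p for p in moduli)

def rns_add(a, b, moduli=MODULI):
    return tuple((x + y) % p for x, y, p in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    return tuple((x * y) % p for x, y, p in zip(a, b, moduli))

# Digit-wise route: 17 = 1*10^1 + 7*10^0, with the powers of ten precomputed in RNS.
assert rns_add(rns_mul(to_rns(1), to_rns(10)),
               rns_mul(to_rns(7), to_rns(1))) == (2, 3)
assert to_rns(17) == (2, 3)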
2.2 Signed RNS Fundamentals
When computing the layers of a neural network we need to be able to represent both positive and negative weights in RNS. There are several approaches to this problem; let us consider representing numbers with a range shift (a so-called artificial form), in which both negative and positive numbers are mapped onto non-negative numbers.
As an example, we can extend our moduli-set with the base p3 = 2 and use a range-shifted representation in which values in [0 .. p1·p2) encode negative numbers and values in [p1·p2 .. p1·p2·p3) encode non-negative numbers.
The offset representation of a number N is then

N′ = p1·p2 + N,

and the representation of its negation is

(−N)′ = p1·p2 − N.
The operations of addition and subtraction in this case can be derived as follows:
let N1′ = p1·p2 + N1 and N2′ = p1·p2 + N2;
then N1′ + N2′ = p1·p2 + N1 + p1·p2 + N2.
Considering that (N1 + N2)′ = N1 + N2 + p1·p2,
we get (N1 + N2)′ = N1′ + N2′ − p1·p2.
Taking into account that p1·p2 is (0; 0; 1) in the RNS with p3 = 2 added to the moduli-set, and that −1 ≡ 1 (mod 2),
we get (N1 + N2)′ = N1′ + N2′ − p1·p2 = N1′ + N2′ + p1·p2.
One can also note that the representation of the negated number, (−N)′, can be obtained from N′ by subtracting each of its residues from the corresponding base (p1; p2; p3), i.e. by taking the RNS complement of N′.
The multiplication operation requires similar reasoning:

N1′·N2′ = N1·N2 + p1·p2·N1 + p1·p2·N2 + p1·p2·p1·p2,

thus

(N1·N2)′ = N1′·N2′ + p1·p2 − p1·p2·(N1 + N2 + p1·p2).

Considering that p1·p2 is (0; 0; 1) in the RNS with p3 = 2 added, and that during the operation the numbers are given in the offset (primed) form, so that N1 + N2 ≡ N1′ + N2′ and −p1·p2 ≡ +p1·p2 modulo the full range, we get

(N1·N2)′ = N1′·N2′ + p1·p2·(1 + p1·p2 + N1′ + N2′).

Since p1·p2 is an odd number, we get

(N1·N2)′ = N1′·N2′ + p1·p2·(N1′ + N2′).

Note that if N1′ and N2′ have the same parity, then p1·p2·(N1′ + N2′) is (0; 0; 0), and if their parities differ, then p1·p2·(N1′ + N2′) is (0; 0; 1), i.e. p1·p2.
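A short Python sketch of this offset ("artificial form") arithmetic for the moduli-set (3, 7, 2) illustrates the derivation; the helper names are ours, and the decode routine is a brute-force check rather than an efficient reverse conversion:

P = (3, 7, 2)          # p1, p2, p3
OFFSET = 3 * 7         # p1*p2: values [0, 21) encode negatives, [21, 42) non-negatives

def encode(n):
    # Offset representation N' = p1*p2 + N, stored as residues.
    return tuple((OFFSET + n) % p for p in P)

def decode(r):
    # Exhaustive search over the dynamic range, for verification only.
    for n in range(-OFFSET, OFFSET):
        if encode(n) == tuple(r):
            return n
    raise ValueError("out of range")

def rns_signed_add(a, b):
    # (N1+N2)' = N1' + N2' + p1*p2, where p1*p2 = (0; 0; 1) in this RNS.
    c = tuple((x + y) % p for x, y, p in zip(a, b, P))
    return tuple((x + o) % p for x, o, p in zip(c, (0, 0, 1), P))

def rns_signed_mul(a, b):
    # (N1*N2)' = N1'*N2' + p1*p2*(N1' + N2'); the correction (0; 0; 1)
    # is needed only when N1' and N2' have different parity.
    c = tuple((x * y) % p for x, y, p in zip(a, b, P))
    if (a[2] + b[2]) % 2 == 1:
        c = tuple((x + o) % p for x, o, p in zip(c, (0, 0, 1), P))
    return c

assert decode(rns_signed_add(encode(-5), encode(3))) == -2
assert decode(rns_signed_mul(encode(-4), encode(3))) == -12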
In RNS it is also quite simple, using precomputed auxiliary bit vectors, to implement checking the sign of a number and division by the bases of the system, which is sufficient both for implementing the ReLU operation and for implementing approximate dequantization after matrix operations. More general comparison or division operations in RNS are also feasible, but require storing or computing additional parameters. Since these operations are not used to implement the basic functions of the RNS inference
accelerator of the neural network, any difficulties in implementing them do not affect our work.
2.3 Suggested Workflow Formulation
The objectives of this work are:
– selection of a family of NN models and their characteristic layers for implementation as Verilog code of an RNS computing unit;
– transfer of the model weights to a quantized representation in RNS;
– construction of a reference model against which the results of verifying the Verilog code of the RNS computing unit can be checked;
– implementation of the basic RNS blocks for FPGA;
– comparison of the obtained results with the inference accuracy of the original model's float32 and int8 weights.
3 Problem Solution
As a target family of neural network models, we con-
sider neural network architectures used in computer
vision, due to the huge number of devices such as
smart video cameras or wearable electronics that use
them. Also, we consider only those types of layers
that can be used in single-stage neural networks for
classification, localization or detection tasks. This
family includes convolutional neural networks with
the following layers (in PyTorch terms):
– Conv2d: layer providing the convolution operation;
– ReLU: layer introducing nonlinearity into the neural network;
– MaxPool2d: layer selecting the maximum value in a given window of the input tensor;
– AvgPool2d: layer computing the average value in a given window of the input tensor;
– Linear: fully connected layer (MLP, perceptron).
Neural networks (e.g. LeNet, AlexNet, VGG) built using only the listed layers have repeatedly achieved state-of-the-art results on standard computer vision datasets (MNIST, CIFAR, ImageNet).
We chose MNIST as the dataset to evaluate the obtained results.
Example code of a PyTorch NN model that achieves 92.67% accuracy with float32 weights on the MNIST dataset is presented in Fig.1.
An estimate of the number of parameters of the above model and of its computational complexity, obtained with Facebook's fvcore.nn.FlopCountAnalysis package for PyTorch, is presented in Fig.2.
# part of the class constructor
# responsible for the network structure:
self.conv1 = nn.Conv2d(1, 4, 3, 1, 1)
self.relu1 = nn.ReLU()
self.mp1 = nn.MaxPool2d(2)
self.conv2 = nn.Conv2d(4, 8, 3, 1, 1)
self.relu2 = nn.ReLU()
self.mp2 = nn.MaxPool2d(2)
self.conv3 = nn.Conv2d(8, 16, 3, 1, 1)
self.relu3 = nn.ReLU()
self.ap3 = nn.AvgPool2d(7)
self.out = nn.Linear(16, 10)

# method responsible for model inference:
def forward(self, x):
    x = self.conv1(x)
    x = self.relu1(x)
    x = self.mp1(x)
    x = self.conv2(x)
    x = self.relu2(x)
    x = self.mp2(x)
    x = self.conv3(x)
    x = self.relu3(x)
    x = self.ap3(x)
    x = x.view(x.size(0), -1)
    output = self.out(x)
    return output

Fig.1: CNN model structure and forward pass method
|module |#params or shape| #flops |
|:--------- |:----------- |:---- |
|model | 1.674K | 0.141M |
| conv1 | 40 | 28.224K|
| conv1.weight| (4, 1, 3, 3) | |
| conv1.bias | (4,) | |
| conv2 | 0.296K | 56.448K|
| conv2.weight| (8, 4, 3, 3) | |
| conv2.bias | (8,) | |
| conv3 | 1.168K | 56.448K|
| conv3.weight| (16, 8, 3, 3)| |
| conv3.bias | (16,) | |
| out | 0.17K | 0.16K |
| out.weight | (10, 16) | |
| out.bias | (10,) | |
Fig.2: CNN Model parameters and FLOPS
The reference model was implemented in C++ in five versions:
1. with weights represented by 32-bit floating-point (real) numbers;
2. with weights represented by 16-bit integers;
3. with weights represented by 8-bit integers;
4. with weights represented by 4-bit integers;
5. with weights represented by class objects implementing RNS operations, in which no base exceeds a 4-bit integer.
The floating-point weights of the reference model were taken directly from the corresponding PyTorch tensors of the NN model. The output vectors produced by the reference model with floating-point weights coincided with those produced by the PyTorch NN model.
The Verilog implementation of the RNS computing unit is based on a code generator parameterized by the given RNS bases and the required RNS functionality.
The capabilities of our code generator are clear from its help screen presented in Fig.3. An example of generated code for the multiplication operation for the RNS base P = 3 is presented in Fig.4, and an example of generated code for the multiplication operation for the RNS moduli-set (P1 = 2; P2 = 3; P3 = 7) is presented in Fig.5.
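The generator itself is not listed in this paper; purely as an illustration of the approach, a minimal Python sketch (hypothetical, not the actual gen.py) that emits such a lookup-table multiplier modulo a given base, in the style of Fig.4, might look like:

def emit_mod_mul(p: int) -> str:
    # Emit a SystemVerilog module that multiplies two residues modulo p
    # as a full case table, similar to the generated mulP3 shown in Fig.4.
    width = max(1, (p - 1).bit_length())
    lines = [f"module mulP{p}",
             f"  (input  logic[{width - 1}:0] a, b,",
             f"   output logic[{width - 1}:0] y);",
             "  always_comb",
             "    case ({a, b})"]
    for a in range(p):
        for b in range(p):
            if (a * b) % p:
                lines.append(f"      {{{width}'d{a}, {width}'d{b}}}: y = {(a * b) % p};")
    lines += ["      default: y = 0;", "    endcase", "endmodule"]
    return "\n".join(lines)

print(emit_mod_mul(3))   # prints a module functionally equivalent to Fig.4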
4 Conclusion
The PyTorch model of the convolutional neural network presented in Fig.1 was trained, reaching 92.67% accuracy after 10 epochs of training on the MNIST dataset with a batch size of 50, as shown in Fig.6.
The reference model implemented in C++ showed the same accuracy and the same output vectors.
Running the reference model with a single scaling factor on int16 weights gives 80.12% accuracy.
Running the reference model with a single scaling factor on int8 weights gives 74.30% accuracy.
Running the reference model with a single scaling factor on int4 weights gives 14.58% accuracy.
Running the reference model with a single scaling factor on RNS weights with the bases (P1 = 2, P2 = 3, P3 = 5, P4 = 7), none of which exceeds int4, gives 71.13% accuracy.
Table 1 compares the number of LEs and the delay required to implement the multiplication operation in the positional number system and in RNS. For each operand width we used an RNS moduli-set with approximately the same dynamic range as the corresponding 2^n × 2^n multiplier, as listed in Table 2.
gen.py --help
usage: g.py [-h] [--signed SIGNED]
[--op OP]
[--base BASE] [--rnsop RNSOP]
[--vecop VECOP]
[--bases BASES [BASES ...] ]
[--tests TESTS]
optional arguments:
-h, --help show this help message
and exit
--signed SIGNED Use artificial number
representation
--op OP Type of operation to generate:
mul, add, div2, neg (by default all
modules)
--base BASE Value of base to generate
--rnsop RNSOP Type of RNS operation to
generate: mul, add, div2, neg, lte
(by default all appropriate modules)
--vecop VECOP Type of operation to
generate: mul, add, div2, neg, lte,
k3x3, max_pool, gac_pool, relu
(by default all appropriate modules)
--bases BASES [BASES ...]
List of bases to use for vector
operations
--tests TESTS Generate coverage and
randomized tests for required
operations
(may be valued as [1..100])
Fig.3: Help screen of Verilog code generator
module mulP3
(input logic[1:0] a, b,
output logic[1:0] y);
logic[1:0] m;
always_comb
case ({a,b})
'b01_01: m <= 1;
'b01_10: m <= 2;
'b10_01: m <= 2;
'b10_10: m <= 1;
default: m <= 0;
endcase
assign y = m;
endmodule
Fig.4: Generated Verilog code for one RNS moduli
base operation
module mul2P2P3P7
  (input  logic[0:0] a1, b1,
   input  logic[1:0] a2, b2,
   input  logic[2:0] a3, b3,
   output logic[0:0] y1,
   output logic[1:0] y2,
   output logic[2:0] y3);
  // one independent modular multiplier per base
  mulP2 u_p2 (a1, b1, y1);
  mulP3 u_p3 (a2, b2, y2);
  mulP7 u_p7 (a3, b3, y3);
endmodule
Fig.5: Generated Verilog code for RNS moduli-set
operation
Fig.6: Training/test accuracy
| Bit width | Multiplier LEs | Multiplier delay | RNS moduli-set LEs | RNS moduli-set delay |
|:----------|:---------------|:-----------------|:-------------------|:---------------------|
| 8 x 8     | 46             | 17 ns            | 20                 | 11 ns                |
| 16 x 16   | 183            | 26 ns            | 245                | 15 ns                |
| 32 x 32   | 625            | 32 ns            | 3577               | 23 ns                |
| 64 x 64   | 2273           | 47 ns            | 13009              | 31 ns                |
Table 1: Circuit sizes and latency
| Bit width | RNS moduli-set bases                                |
|:----------|:----------------------------------------------------|
| 2^8       | 2; 3; 5; 7                                           |
| 2^16      | 11; 17; 19; 23                                       |
| 2^32      | 11; 17; 19; 23; 29; 31; 61                           |
| 2^64      | 11; 13; 17; 19; 23; 29; 31; 37; 41; 43; 47; 53; 59   |
Table 2: List of used RNS moduli-set bases
As we can see, the RNS scheme with four bases, two of which fit in 2 bits and the other two in no more than 4 bits, shows an accuracy that differs by 4.46% (relative) from the accuracy of the 8-bit integer scheme and exceeds the accuracy of the 4-bit integer scheme by 387.86%.
At the same time, the multiplication operation in RNS, which is the basic operation of matrix computations, takes only 64.7% of the delay of the 8-bit multiplier (11 ns versus 17 ns) and requires only 43.47% of the LEs (20 versus 46).
The paper [4] showed a nice example of the specific moduli-set (P1 = 2^n − 1, P2 = 2^n, P3 = 2^n + 1). Such a moduli-set allows a very concise design in terms of LEs. That paper also highlights the importance of measuring the power consumption of such a scheme; this question will be the subject of our future research.
From a practical point of view, it is important to be able to accept as input not quantized data converted from a positional number system to RNS, but the original stream from a CMOS sensor, for example in the R5G6B5 format, in order to reduce the load on the CPU and/or to connect the CMOS sensor directly to the FPGA and quantize the data in parallel with computing the convolutions of the first layer. Ideas for deriving a recurrence conversion formula for each base are outlined in Omondi's monograph [3]. The authors of [5] use another method to convert numbers to RNS and also show a good performance gain for their design.
We plan to implement a more compact version of this accelerator design based on the following idea: since all weights of all neural network layers are known in advance for inference, no multiplier or adder needs a two-operand implementation. One operand is already known, so the Verilog module for the corresponding vector operation can be made more compact as well.
A second idea for future research is to use multi-operand (polyadic) adders with only a few internal carry-propagation chains.
References:
[1] Benoit Jacob, Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, CVPR, 2018.
[2] Hao Wu, Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation, arXiv:2004.09602, 2020.
[3] Omondi A., Premkumar B., Residue Number Systems: Theory and Implementation, Imperial College Press, 2007.
[4] Salamat S., RNSnet: In-Memory Neural Network Acceleration Using Residue Number System, IEEE International Conference on Rebooting Computing, 2018.
[5] Nagornov N., RNS-Based FPGA Accelerators for High-Quality 3D Medical Image Wavelet Processing Using Scaled Filter Coefficients, IEEE Access, 2022.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
All work was done solely by the single author of this article.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflicts of Interest
The author has no conflicts of interest to declare that are relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International , CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US