Residue Number Systems Quantization for Deep Learning Inference
SERGEY SIVKOV
Electrical Engineering Faculty
Perm National Research Polytechnic University
614013, Perm, 7 Professora Pozdeeva Street, Office 225
RUSSIA
Abstract: Quantizing learned CNN weights to a Residue Number System (RNS) can improve inference latency by taking advantage of fast and exact low-bit integer arithmetic. In this paper we review the mathematical aspects of RNS operations on signed integer values and evaluate implementation choices for converting conventional floating-point PyTorch weights of CNN models to an RNS representation. We also present a workflow for converting the weights of PyTorch neural network layers typical of the computer vision domain to 4-bit RNS moduli-sets that maintain classification accuracy within 5% of an 8-bit quantization baseline.
Key-Words: FPGA, inference, RNS, Residue Number System, Verilog, PyTorch, Quantization
1 Introduction
At the present time, most neural networks are trained with weights represented in the 32-bit single-precision floating-point format. Pretrained weights can be converted (quantized) [1] to other floating-point formats, such as 16-bit floating point (IEEE fp16), without noticeable loss of classification accuracy. It is also possible to convert the weights to integer formats, at the cost of a greater loss of accuracy during inference.
The main advantage of quantization to integer formats is increased performance: for 8-bit integers this is, theoretically, a 16x increase in computation throughput and a 4x reduction in the required data bandwidth [2]. In addition, an inference engine implemented with low-bit integers can be used on low-power or wearable devices that have no hardware floating-point arithmetic.
Most of the inference latency of a neural network comes from matrix computations, which reduce to a sequence of multiplication and addition operations. Looking at the implementation of a ripple-carry adder and at a multiplier built as an array of adders, it is clear that the execution time of addition grows linearly with the operand bit width, O(N), while the execution time of multiplication grows quadratically, O(N²). Reducing the operand bit width also allows a circuit with a shorter critical path, so its clock frequency can be increased.
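To make this concrete, a rough delay estimate under the model just described (one full-adder delay t_FA per carry stage, and a multiplier built as a chain of ripple-carry additions; a simplified sketch, not a synthesis result) is

T_add(N) ≈ N · t_FA,    T_mul(N) ≈ N · T_add(N) ≈ N² · t_FA,

so for N = 8 the multiplication path is on the order of 64 full-adder delays versus 8 for addition, and shrinking the operand width shortens both.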
In addition to positional number systems, non-positional number systems are also known, for example Residue Number Systems (RNS) [3], in which independent and parallel operations on low-bit coprime bases make it possible to perform the same basic operations in O(N) time for addition and O(N) time for multiplication. We propose to use RNS for an FPGA-based implementation of the basic operations required by a neural network accelerator.
2 Quantization Fundamentals
Increasing the inference speed of neural networks is a pressing task that can be addressed by quantizing the weights into integers. Quantization is based on the affine transformation f(x) = s·x + z, which maps an input value x ∈ [β, α] into the range [−2^(b−1), 2^(b−1) − 1].
Because of the way signed numbers are formed in RNS, we consider only quantization onto a symmetric interval, with the offset z = 0 (only scaling the interval, without shifting its zero point). This transformation is defined for the interval x ∈ [−α, α], and the coefficient s is found as

s = 2^(b−1) / α                (1)
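As an illustration, a minimal PyTorch-style sketch of this symmetric scheme (the helper name quantize_symmetric is hypothetical and not part of the paper's code) could look as follows:

import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 8):
    # Symmetric quantization per Eq. (1): z = 0, s = 2^(b-1) / alpha,
    # where alpha is the largest absolute value of the tensor.
    alpha = w.abs().max()
    s = (2 ** (bits - 1)) / alpha
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(w * s), qmin, qmax).to(torch.int32)
    return q, s  # integer weights and the scale used for dequantization (q / s)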
This scheme is implemented, for example, in the PyTorch torch.ao.quantization module for quantization to 8-bit integers. If the source code of the model is available, this approach allows layer-by-layer quantization of the model. Unfortunately, the module does not yet support quantization to smaller data types, for example to 4-bit integers, or quantization with a single coefficient s for the entire model. Also, this approach assumes the use
of either higher-bit integers or floating-point arithmetic to renormalize the weights after each layer's forward pass, which introduces additional delay into inference on a low-bit CPU.
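Since the stock module does not cover that case, the following sketch shows how a single scale for the entire model could be applied to a PyTorch state_dict (the helper quantize_model_single_scale is hypothetical and only illustrates the idea; it is not the reference implementation used later in this paper):

import torch

def quantize_model_single_scale(state_dict, bits: int = 4):
    # One global alpha over all tensors gives a single scale s for the whole model.
    # Assumes every entry of state_dict is a floating-point weight or bias tensor.
    alpha = max(float(t.abs().max()) for t in state_dict.values())
    s = (2 ** (bits - 1)) / alpha
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    quantized = {name: torch.clamp(torch.round(t * s), qmin, qmax).to(torch.int8)
                 for name, t in state_dict.items()}
    # 4-bit values are stored in int8 containers; s is applied once to rescale outputs.
    return quantized, s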
2.1 Unsigned RNS Fundamentals
The limitations inherent in calculations on numbers represented in a positional number system can be overcome by using calculations on numbers represented in RNS. Let us show the basic operations of RNS arithmetic on a simple example with a moduli-set of two coprime bases:

p1 = 3 and p2 = 7

With this choice of moduli-set, the range of representable numbers is [0 .. 3·7 − 1], i.e. [0 .. 20].
Let us take the number 17 and convert it to RNS (hereinafter, % denotes the operation of taking the remainder modulo):

17 = (17 % 3; 17 % 7) = (2; 3)

This method requires repeated use of the remainder-of-integer-division operation. There is another way to convert quickly from a positional number system, such as decimal, hexadecimal or binary, based on the basic property of RNS: operations on the bases are performed in parallel and independently.
For example, consider the conversion from the decimal system to RNS:

17 = 1·10^1 + 7·10^0

Knowing the representation of all powers of 10 within the range allowed for the RNS:

10^0 = (1; 1), because 1 % 3 = 1 and 1 % 7 = 1
10^1 = (1; 3), because 10 % 3 = 1 and 10 % 7 = 3

and knowing the representation of each digit of the number in RNS:

1 = (1 % 3; 1 % 7) = (1; 1)
7 = (7 % 3; 7 % 7) = (1; 0)

we obtain the conversion as

17 = 1·10^1 + 7·10^0 = (1; 1)·(1; 3) + (1; 0)·(1; 1) = (1; 3) + (1; 0) = (2; 3)
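A small Python sketch (illustrative only; the helper names are ours, not part of the paper's code) reproduces both conversion routes for the moduli-set (3, 7):

MODULI = (3, 7)

def to_rns(x, moduli=MODULI):
    # Direct conversion: take the remainder modulo every base.
    return tuple(x % p for p in moduli)

def rns_add(a, b, moduli=MODULI):
    return tuple((x + y) % p for x, y, p in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    return tuple((x * y) % p for x, y, p in zip(a, b, moduli))

# Digit-wise route: 17 = 1*10^1 + 7*10^0, with the powers of ten precomputed in RNS.
assert rns_add(rns_mul(to_rns(1), to_rns(10)),
               rns_mul(to_rns(7), to_rns(1))) == (2, 3)
assert to_rns(17) == (2, 3)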
2.2 Signed RNS Fundamentals
When computing the layers of a neural network we need to be able to represent both positive and negative weights in RNS. There are several approaches to this problem; let us consider representing numbers with a range shift (a so-called artificial form), in which both negative and positive numbers are mapped onto non-negative numbers.
As an example, we can extend our moduli-set with the base p3 = 2 and use a range-shifted representation in which values in [0 .. p1·p2) encode negative numbers and values in [p1·p2 .. p1·p2·p3) encode non-negative numbers.
The offset representation of a number N is then

N′ = p1·p2 + N,

and the representation of its negation is

(−N)′ = p1·p2 − N.
The operations of addition and subtraction in this case can be derived as follows:
let N1′ = p1·p2 + N1 and N2′ = p1·p2 + N2;
then N1′ + N2′ = p1·p2 + N1 + p1·p2 + N2.
Considering that (N1 + N2)′ = N1 + N2 + p1·p2,
we get (N1 + N2)′ = N1′ + N2′ − p1·p2.
Taking into account that p1·p2 is (0; 0; 1) in the RNS with p3 = 2 added to the moduli-set, and that −1 ≡ 1 (mod 2),
we get (N1 + N2)′ = N1′ + N2′ − p1·p2 = N1′ + N2′ + p1·p2.
One can also note that the representation of the negated number, (−N)′, can be obtained from N′ by subtracting each of its residues from the corresponding base (p1; p2; p3), i.e. by taking the RNS complement of N′.
The multiplication operation requires similar reasoning:

N1′·N2′ = N1·N2 + p1·p2·N1 + p1·p2·N2 + p1·p2·p1·p2,

thus

(N1·N2)′ = N1′·N2′ + p1·p2 − p1·p2·(N1 + N2 + p1·p2).

Considering that p1·p2 is (0; 0; 1) in the RNS with p3 = 2 added, and that during the operation the numbers are given in the offset (primed) form, so that N1 + N2 ≡ N1′ + N2′ and −p1·p2 ≡ +p1·p2 modulo the full range, we get

(N1·N2)′ = N1′·N2′ + p1·p2·(1 + p1·p2 + N1′ + N2′).

Since p1·p2 is an odd number, we get

(N1·N2)′ = N1′·N2′ + p1·p2·(N1′ + N2′).

Note that if N1′ and N2′ have the same parity, then p1·p2·(N1′ + N2′) is (0; 0; 0), and if their parities differ, then p1·p2·(N1′ + N2′) is (0; 0; 1), i.e. p1·p2.
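A short Python sketch of this offset ("artificial form") arithmetic for the moduli-set (3, 7, 2) illustrates the derivation; the helper names are ours, and the decode routine is a brute-force check rather than an efficient reverse conversion:

P = (3, 7, 2)          # p1, p2, p3
OFFSET = 3 * 7         # p1*p2: values [0, 21) encode negatives, [21, 42) non-negatives

def encode(n):
    # Offset representation N' = p1*p2 + N, stored as residues.
    return tuple((OFFSET + n) % p for p in P)

def decode(r):
    # Exhaustive search over the dynamic range, for verification only.
    for n in range(-OFFSET, OFFSET):
        if encode(n) == tuple(r):
            return n
    raise ValueError("out of range")

def rns_signed_add(a, b):
    # (N1+N2)' = N1' + N2' + p1*p2, where p1*p2 = (0; 0; 1) in this RNS.
    c = tuple((x + y) % p for x, y, p in zip(a, b, P))
    return tuple((x + o) % p for x, o, p in zip(c, (0, 0, 1), P))

def rns_signed_mul(a, b):
    # (N1*N2)' = N1'*N2' + p1*p2*(N1' + N2'); the correction (0; 0; 1)
    # is needed only when N1' and N2' have different parity.
    c = tuple((x * y) % p for x, y, p in zip(a, b, P))
    if (a[2] + b[2]) % 2 == 1:
        c = tuple((x + o) % p for x, o, p in zip(c, (0, 0, 1), P))
    return c

assert decode(rns_signed_add(encode(-5), encode(3))) == -2
assert decode(rns_signed_mul(encode(-4), encode(3))) == -12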
In RNS it is also quite simple, using precomputed auxiliary bit vectors, to implement checking the sign of a number and division by the bases of the system, which is sufficient both for implementing the ReLU operation and for implementing approximate dequantization after matrix operations. More general comparison or division operations in RNS are also feasible, but require storing or computing additional parameters. Since these operations are not used to implement the basic functions of the RNS inference
accelerator of the neural network, any difficulties in implementing them do not affect our work.
2.3 Suggested Workflow Formulation
The objectives of this work are:
– selection of a family of NN models and their characteristic layers for implementation as Verilog code of an RNS computing unit;
– transfer of the model weights to a quantized representation in RNS;
– construction of a reference model against which the results of verifying the Verilog code of the RNS computing unit can be checked;
– implementation of the basic RNS blocks for FPGA;
– comparison of the obtained results with the inference accuracy of the original model's float32 and int8 weights.
3 Problem Solution
As a target family of neural network models, we con-
sider neural network architectures used in computer
vision, due to the huge number of devices such as
smart video cameras or wearable electronics that use
them. Also, we consider only those types of layers
that can be used in single-stage neural networks for
classification, localization or detection tasks. This
family includes convolutional neural networks with
the following layers (in PyTorch terms):
– Conv2d: layer providing the convolution operation;
– ReLU: layer introducing nonlinearity into the neural network;
– MaxPool2d: layer selecting the maximum value in a given window of the input tensor;
– AvgPool2d: layer computing the average value in a given window of the input tensor;
– Linear: fully connected layer (MLP, perceptron).
Neural networks (e.g. LeNet, AlexNet, VGG) built using only the listed layers have repeatedly achieved state-of-the-art results on standard computer vision datasets (MNIST, CIFAR, ImageNet).
We chose MNIST as the dataset to evaluate the obtained results.
Example code of a PyTorch NN model that achieves 92.67% accuracy with float32 weights on the MNIST dataset is presented in Fig.1.
An estimate of the number of parameters of the above model and of its computational complexity, obtained with Facebook's fvcore.nn.FlopCountAnalysis package for PyTorch, is presented in Fig.2.
# part of the class constructor
# responsible for the network structure:
self.conv1 = nn.Conv2d(1, 4, 3, 1, 1)
self.relu1 = nn.ReLU()
self.mp1 = nn.MaxPool2d(2)
self.conv2 = nn.Conv2d(4, 8, 3, 1, 1)
self.relu2 = nn.ReLU()
self.mp2 = nn.MaxPool2d(2)
self.conv3 = nn.Conv2d(8, 16, 3, 1, 1)
self.relu3 = nn.ReLU()
self.ap3 = nn.AvgPool2d(7)
self.out = nn.Linear(16, 10)

# method responsible for model inference:
def forward(self, x):
    x = self.conv1(x)
    x = self.relu1(x)
    x = self.mp1(x)
    x = self.conv2(x)
    x = self.relu2(x)
    x = self.mp2(x)
    x = self.conv3(x)
    x = self.relu3(x)
    x = self.ap3(x)
    x = x.view(x.size(0), -1)
    output = self.out(x)
    return output

Fig.1: CNN model structure and forward pass method
|module |#params or shape| #flops |
|:--------- |:----------- |:---- |
|model | 1.674K | 0.141M |
| conv1 | 40 | 28.224K|
| conv1.weight| (4, 1, 3, 3) | |
| conv1.bias | (4,) | |
| conv2 | 0.296K | 56.448K|
| conv2.weight| (8, 4, 3, 3) | |
| conv2.bias | (8,) | |
| conv3 | 1.168K | 56.448K|
| conv3.weight| (16, 8, 3, 3)| |
| conv3.bias | (16,) | |
| out | 0.17K | 0.16K |
| out.weight | (10, 16) | |
| out.bias | (10,) | |
Fig.2: CNN Model parameters and FLOPS
The reference model was implemented in C++ in five versions:
1. with weights represented by 32-bit floating-point (real) numbers;
2. with weights represented by 16-bit integers;
3. with weights represented by 8-bit integers;
4. with weights represented by 4-bit integers;
5. with weights represented by class objects implementing RNS operations, in which no base exceeds a 4-bit integer.
The floating-point weights of the reference model were taken directly from the corresponding PyTorch tensors of the NN model. The output vectors produced by the reference model with floating-point weights coincided with those produced by the PyTorch NN model.
The Verilog implementation of the RNS computing unit is based on a code generator parameterized by the given RNS bases and the required RNS functionality.
The capabilities of our code generator are clear from its help screen presented in Fig.3. An example of generated code for the multiplication operation for the RNS base P = 3 is presented in Fig.4, and an example of generated code for the multiplication operation for the RNS moduli-set (P1 = 2; P2 = 3; P3 = 7) is presented in Fig.5.
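The generator itself is not listed in this paper; purely as an illustration of the approach, a minimal Python sketch (hypothetical, not the actual gen.py) that emits such a lookup-table multiplier modulo a given base, in the style of Fig.4, might look like:

def emit_mod_mul(p: int) -> str:
    # Emit a SystemVerilog module that multiplies two residues modulo p
    # as a full case table, similar to the generated mulP3 shown in Fig.4.
    width = max(1, (p - 1).bit_length())
    lines = [f"module mulP{p}",
             f"  (input  logic[{width - 1}:0] a, b,",
             f"   output logic[{width - 1}:0] y);",
             "  always_comb",
             "    case ({a, b})"]
    for a in range(p):
        for b in range(p):
            if (a * b) % p:
                lines.append(f"      {{{width}'d{a}, {width}'d{b}}}: y = {(a * b) % p};")
    lines += ["      default: y = 0;", "    endcase", "endmodule"]
    return "\n".join(lines)

print(emit_mod_mul(3))   # prints a module functionally equivalent to Fig.4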
4 Conclusion
The PyTorch model of the convolutional neural network presented in Fig.1 was trained, reaching 92.67% accuracy after 10 epochs of training on the MNIST dataset with a batch size of 50, as shown in Fig.6.
The reference model implemented in C++ showed the same accuracy and the same output vectors.
Running the reference model with a single scaling factor on int16 weights gives 80.12% accuracy.
Running the reference model with a single scaling factor on int8 weights gives 74.30% accuracy.
Running the reference model with a single scaling factor on int4 weights gives 14.58% accuracy.
Running the reference model with a single scaling factor on RNS weights with the bases (P1 = 2, P2 = 3, P3 = 5, P4 = 7), none of which exceeds int4, gives 71.13% accuracy.
Table 1 compares the number of LEs and the delay required to implement the multiplication operation in the positional number system and in RNS. For each operand width we used an RNS moduli-set with approximately the same dynamic range as the corresponding 2^n × 2^n multiplier, as listed in Table 2.
gen.py --help
usage: g.py [-h] [--signed SIGNED]
[--op OP]
[--base BASE] [--rnsop RNSOP]
[--vecop VECOP]
[--bases BASES [BASES ...] ]
[--tests TESTS]
optional arguments:
-h, --help show this help message
and exit
--signed SIGNED Use artificial number
representation
--op OP Type of operation to generate:
mul, add, div2, neg (by default all
modules)
--base BASE Value of base to generate
--rnsop RNSOP Type of RNS operation to
generate: mul, add, div2, neg, lte
(by default all appropriate modules)
--vecop VECOP Type of operation to
generate: mul, add, div2, neg, lte,
k3x3, max_pool, gac_pool, relu
(by default all appropriate modules)
--bases BASES [BASES ...]
List of bases to use for vector
operations
--tests TESTS Generate coverage and
randomized tests for required
operations
(may be valued as [1..100])
Fig.3: Help screen of Verilog code generator
module mulP3
(input logic[1:0] a, b,
output logic[1:0] y);
logic[1:0] m;
always_comb
case ({a,b})
'b01_01: m <= 1;
'b01_10: m <= 2;
'b10_01: m <= 2;
'b10_10: m <= 1;
default: m <= 0;
endcase
assign y = m;
endmodule
Fig.4: Generated Verilog code for one RNS moduli
base operation
module mul2P2P3P7
  (input  logic[0:0] a1, b1,
   input  logic[1:0] a2, b2,
   input  logic[2:0] a3, b3,
   output logic[0:0] y1,
   output logic[1:0] y2,
   output logic[2:0] y3);
  // one independent modular multiplier per base
  mulP2 u_p2 (a1, b1, y1);
  mulP3 u_p3 (a2, b2, y2);
  mulP7 u_p7 (a3, b3, y3);
endmodule
Fig.5: Generated Verilog code for RNS moduli-set
operation
Fig.6: Training/test accuracy
| Bit width | Multiplier LEs | Multiplier delay | RNS moduli-set LEs | RNS moduli-set delay |
|:----------|:---------------|:-----------------|:-------------------|:---------------------|
| 8 x 8     | 46             | 17 ns            | 20                 | 11 ns                |
| 16 x 16   | 183            | 26 ns            | 245                | 15 ns                |
| 32 x 32   | 625            | 32 ns            | 3577               | 23 ns                |
| 64 x 64   | 2273           | 47 ns            | 13009              | 31 ns                |
Table 1: Circuit sizes and latency
| Bit width | RNS moduli-set bases                                |
|:----------|:----------------------------------------------------|
| 2^8       | 2; 3; 5; 7                                           |
| 2^16      | 11; 17; 19; 23                                       |
| 2^32      | 11; 17; 19; 23; 29; 31; 61                           |
| 2^64      | 11; 13; 17; 19; 23; 29; 31; 37; 41; 43; 47; 53; 59   |
Table 2: List of used RNS moduli-set bases
As we can see, the RNS scheme with four bases, two of which fit in 2 bits and the other two in no more than 4 bits, shows an accuracy that differs by 4.46% (relative) from the accuracy of the 8-bit integer scheme and exceeds the accuracy of the 4-bit integer scheme by 387.86%.
At the same time, the multiplication operation in RNS, which is the basic operation of matrix computations, takes only 64.7% of the delay of the 8-bit multiplier (11 ns versus 17 ns) and requires only 43.47% of the LEs (20 versus 46).
The paper [4] showed a nice example of the specific moduli-set (P1 = 2^n − 1, P2 = 2^n, P3 = 2^n + 1). Such a moduli-set allows a very concise design in terms of LEs. That paper also highlights the importance of measuring the power consumption of such a scheme; this question will be the subject of our future research.
From a practical point of view, it is important to be able to accept as input not quantized data converted from a positional number system to RNS, but the original stream from a CMOS sensor, for example in the R5G6B5 format, in order to reduce the load on the CPU and/or to connect the CMOS sensor directly to the FPGA and quantize the data in parallel with computing the convolutions of the first layer. Ideas for deriving a recurrence conversion formula for each base are outlined in Omondi's monograph [3]. The authors of [5] use another method to convert numbers to RNS and also show a good performance gain for their design.
We plan to implement a more compact version of this accelerator design based on the following idea: since all weights of all neural network layers are known in advance for inference, no multiplier or adder needs a two-operand implementation. One operand is already known, so the Verilog module for the corresponding vector operation can be made more compact as well.
A second idea for future research is to use multi-operand (polyadic) adders with only a few internal carry-propagation chains.
References:
[1] Benoit Jacob, Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, CVPR, 2018.
[2] Hao Wu, Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation, arXiv:2004.09602, 2020.
[3] Omondi A., Premkumar B., Residue Number Systems: Theory and Implementation, Imperial College Press, 2007.
[4] Salamat S., RNSnet: In-Memory Neural Network Acceleration Using Residue Number System, IEEE International Conference on Rebooting Computing, 2018.
[5] Nagornov N., RNS-Based FPGA Accelerators for High-Quality 3D Medical Image Wavelet Processing Using Scaled Filter Coefficients, IEEE Access, 2022.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
All work was done solely by the single author of this article.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
No funding was received for conducting this study.
Conflicts of Interest
The author has no conflicts of interest to declare that are relevant to the content of this article.
Creative Commons Attribution License 4.0
(Attribution 4.0 International , CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US