Network and performance monitoring is an important task
and responsibility for every Network Service Provider (NSP)
operator. Several methodologies exist for monitoring and
processing network traffic and the key parameters of devices
and network services, such as NetFlow/sFlow, SNMP (RMON),
model-driven telemetry (MDT)/streaming telemetry, command
line interface (CLI) outputs and synthetic network tests. What
these monitoring methods have in common is that they all aim
to establish whether the service level agreement (SLA) of the
desired network service has been fulfilled. For example, SLA
compliance data generated by synthetic network tests describes
the state of the network service as an end-to-end result;
however, synthetic test results give no insight into what could
be causing service degradation, since synthetic tests have no
awareness of the data path through the network. With legacy
methods such as NetFlow, SNMP and CLI it is usually possible to
retrieve operational data about network elements, but there is
very little or no information about the state of the network
services deployed on those elements. In recent years, advances
in streaming telemetry have provided even more data about
network elements and devices, yet still very little information
in the context of the network service that could help identify
the cause of service level degradation.
Overall, it is very difficult for a network operator to identify
the cause of network service degradation, since each
monitoring method operates in its own context: legacy device
operational data is focused on devices and contains no
information about services, synthetic tests provide high-level
service information but no insight into the data path, and
telemetry streams operational data with no service context.
Hence, there is a need to correlate all these data points in
order to provide insight in the service context at both a high
and a low level, and so help identify the cause of service
degradation.
To understand which data is available for monitoring and
how to monitor the network service in question, it is necessary
to perform service decomposition and establish the dependencies
between the different service components. Performing network
service decomposition on the configuration deployed by a
service orchestrator, such as NSO [4], ONAP [3] or similar,
involves analysing the configuration using heuristics in order
to map the service configuration all the way to the monitored
data, such as YANG paths, SNMP object identifiers or CLI
outputs. The analysis is performed in the following order:
network service configuration, service components,
expressions, metrics (YANG path).
Example of the assurance expression tree: the assurance
hierarchy between the following network service components
1. Service components: tunnel, VPN, ...
2. Service components: (sub)interface, device, interior
routing protocol
3. Service components can depend on other services
4. Service components are assured on the agents
The decomposition described above is illustrated in Fig. 1,
where a GRE tunnel network service has been decomposed into
its service components. The GRE tunnel network service
Network Service Assurance and Telemetry optimisation using
Heuristics
1MIOLJUB JOVANOVIC, 2MILAN CABARKAPA, 1,2DJURADJ BUDIMIR
1Wireless Communications Research Group University of Westminster London, UK
2Department of Telecommunications School of Electrical Engineering University of Belgrade Belgrade,
SERBIA
Abstract: This paper identifies a possible solution for reducing the amount of data retrieved for the purpose of
network monitoring while retaining the quality of information. Data retrieved by legacy network service
monitoring methods and data retrieved by synthetic tests are usually decoupled, describing different contexts. As
a result, there are too many monitored operational parameters on monitored devices and consequently too much
measurement data exported from network service probes. A novel approach is suggested: heuristic analysis of
the network service configuration and decomposition of the collected data at the network edge. The objective is
to reduce the amount of collected data while at the same time improving the quality of the presented information
by providing a clear correlation between the network service and telemetry data.
Keywords: intent based network; intent-aware monitoring agent; model-driven telemetry; service assurance
formatting.
Received: October 8, 2021. Revised: June 21, 2022. Accepted: August 12, 2022. Published: September 2, 2022.
1. Introduction
2. Decomposition of the Network Service Configuration
WSEAS TRANSACTIONS on COMMUNICATIONS
DOI: 10.37394/23204.2022.21.29
Mioljub Jovanovic,
Milan Cabarkapa, Djuradj Budimir
E-ISSN: 2224-2864
244
Volume 21, 2022
depends on the health of the tunnel interface, which in turn
depends on the underlying physical or logical interface.
Ultimately, the health of the underlying interface also depends
on the device on which that interface resides. Apart from the
GRE tunnel interface, the GRE tunnel as a service also depends
on Layer 3 connectivity, which in turn depends on the health of
the interior routing protocol on the device itself. Finally, the
health of the routing protocol depends on the health of the
egress interface, which forwards the network traffic along with
the routing protocol information.
Fig. 1. GRE tunnel network service assurance decomposition
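The dependency chain described for the GRE tunnel can be written down as a small directed graph. The following Python sketch is illustrative only: the component names are hypothetical labels chosen here, and the edges simply restate the dependencies listed in the text. It collects every component on which the tunnel service transitively depends.

```python
# Hypothetical component names; the edges restate the dependencies described in the text.
DEPENDS_ON = {
    "gre_tunnel_service": ["tunnel_interface", "l3_connectivity"],
    "tunnel_interface": ["underlying_interface"],
    "underlying_interface": ["device"],
    "l3_connectivity": ["routing_protocol"],
    "routing_protocol": ["egress_interface"],
    "egress_interface": [],
    "device": [],
}


def all_dependencies(component, graph=DEPENDS_ON):
    """Return every component reachable from `component`, depth first."""
    seen = []
    stack = list(reversed(graph[component]))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(reversed(graph[node]))
    return seen


deps = all_dependencies("gre_tunnel_service")
print(deps)
```

Only the components reachable from the service need to be monitored, which is the property the approach relies on to limit the collected telemetry.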
The main hypothesis of this research is that the amount of
telemetry data can be reduced by leveraging service awareness
by means of heuristics. Fig. 7 shows the experimental network
topology along with the deployed assurance agents.
Analysis Engine: maps abstract metrics to device-specific
metric implementations
Fig. 2. Applying rules on the configuration of a service instance results in an
assurance graph that connects service components according to their
dependencies.
The assurance graph, here with one service instance depending
on two service components, is then transformed into an
expression graph. The following step is to apply the
techniques described above to an ontology-based system, which
could be used as the foundation of a reasoning engine for
analysis and automated error detection.
The expression tree for a service component can be quite
complex, which makes it inherently difficult to navigate
through the large number of different computations and
dependencies.
By applying advanced data modelling techniques, such as
ontologies, we could establish inferred relationships between
raw data sources such as MDT and legacy protocols (SNMP,
Syslog, RMON or even CLI) to ensure a higher level of
precision and relevance in the calculated assurance value of
any specific service component assurance expression tree.
Following the ontology-based model and reasoning engine,
we could potentially insert inferred rules and determine causal
relationships using the reasoning engine instead of heuristic
packages built by Subject Matter Experts.
The notion of heuristic packages has been created and used to
encode human knowledge within rule files with the
objective to:
1. Automate the building and monitoring of assurance cases
for network services
2. Enable the exchange and reuse of assurance cases between
experts and operators
As the name heuristic suggests, the objective is to cover a large
part of the most common issues rather than every possible case.
However, in some cases:
A service reported as “Broken” might actually be functional
A service reported as “Healthy” might actually be non-operational
In such cases, heuristic packages can be extended to improve
coverage and accuracy.
Heuristic packages contain three hierarchical layers, with
higher layers depending on lower ones:
1. Metric Engine: Device-independent abstraction of the
metrics to fetch and process
2. Service Components: Compute the status of a specific
part of the networking system, based on metrics
3. Rules: Combine service components to assure a
whole network service, based on the service
configuration
The metric engine provides an abstraction of the metrics that
can be extracted from network devices and selects the
device-specific implementation of each metric.
Fig. 3 Metric Engine
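As a rough illustration of this abstraction, the sketch below keeps a lookup table from an abstract metric name and a platform to a device-specific retrieval method. It is an assumption-laden sketch, not the authors' implementation: the platform names, the table layout and the `resolve` helper are hypothetical, and the YANG path is only an illustrative OpenConfig-style path. The SNMP object identifier is ifAdminStatus from the standard IF-MIB.

```python
# Hypothetical registry: (abstract metric, platform) -> device-specific retrieval method.
METRIC_IMPLEMENTATIONS = {
    ("interface.administrative_status", "platform_a"): {
        "method": "yang",
        # Illustrative OpenConfig-style path; real paths depend on the device model.
        "path": "/interfaces/interface[name={interface}]/state/admin-status",
    },
    ("interface.administrative_status", "platform_b"): {
        "method": "snmp",
        "oid": "1.3.6.1.2.1.2.2.1.7",  # IF-MIB ifAdminStatus
    },
}


def resolve(metric, platform):
    """Select the device-specific implementation of an abstract metric."""
    try:
        return METRIC_IMPLEMENTATIONS[(metric, platform)]
    except KeyError:
        raise LookupError(f"no implementation of {metric} for {platform}") from None


print(resolve("interface.administrative_status", "platform_b"))
```

A service component then refers only to `interface.administrative_status`; which protocol answers the query is decided per device.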
A service component focuses on a very specific and well-scoped
part of the networking system. A service component
reports its status by performing an evaluation and returns a
health score in the interval from 0 (broken) to 1 (healthy),
together with a list of symptoms. Service components reuse
metrics from the metric engine.
Fig. 4 Service component
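The result returned by a service component can be pictured as a health score paired with the symptoms that explain it. The following minimal Python sketch assumes a hypothetical `Status` record and a hypothetical `interface_up` component; only the score range and the symptom text come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Status:
    """Hypothetical result record of a service component evaluation."""
    health: float                                  # 0.0 (broken) .. 1.0 (healthy)
    symptoms: list = field(default_factory=list)   # human-readable explanations


def interface_up(admin_status: str) -> Status:
    """Evaluate one narrowly scoped condition based on a single metric."""
    if admin_status == "UP":
        return Status(health=1.0)
    return Status(health=0.0, symptoms=["Interface is down"])


print(interface_up("DOWN"))
```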
Rules are expected to parse the configuration pushed by a
network configuration orchestrator (NSO, ONAP) to enable a
service, and to produce an assurance graph. The assurance
graph consists of service components, with parameters taken
from the configuration, and the dependencies between them.
Dependencies aggregate symptoms to explain a service
malfunction.
3. Analysis Engine
4. Heuristic Packages
Fig. 5 Rules
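A rule can be sketched as a function from a service configuration to an assurance graph. The example below is a simplified assumption: the configuration keys, the `TunnelHealthy` component name and both helper functions are hypothetical, while `InterfaceHealthy` and `DeviceHealthy` are component names used elsewhere in this paper.

```python
def gre_tunnel_rule(config):
    """Build an assurance graph (nodes plus dependency edges) for one tunnel instance."""
    device = config["device"]
    nodes = {
        "tunnel": ("TunnelHealthy", {"device": device, "interface": config["tunnel_interface"]}),
        "source": ("InterfaceHealthy", {"device": device, "interface": config["source_interface"]}),
        "device": ("DeviceHealthy", {"device": device}),
    }
    edges = [("tunnel", "source"), ("source", "device")]  # (parent, dependency)
    return {"nodes": nodes, "edges": edges}


def collect_symptoms(graph, symptoms_by_node, root):
    """Aggregate the symptoms of `root` and of everything it depends on."""
    collected, stack, seen = [], [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        collected.extend(symptoms_by_node.get(node, []))
        stack.extend(child for parent, child in graph["edges"] if parent == node)
    return collected


graph = gre_tunnel_rule(
    {"device": "ce-1", "tunnel_interface": "Tunnel0", "source_interface": "GigabitEthernet0/0"}
)
explanation = collect_symptoms(graph, {"source": ["Interface is down"]}, "tunnel")
print(explanation)
```

Here a symptom raised by the source interface surfaces at the tunnel node, which is how dependencies explain a malfunction of the service as a whole.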
Heuristic packages are created with input from Subject Matter
Experts (SMEs) through the following steps, aligned with the
heuristic package structure defined above:
Metric Engine: the SME specifies the method for retrieving a
given metric for the device platform or platforms involved in
the network service configuration. Currently this step is coded
in JSON.
Service components: the SME lists symptoms and trigger
conditions based on the metrics. This step is coded in the
Service Component Language (SCL), a newly defined
domain-specific language described in more detail below.
Rules: the SME analyses and decomposes the service components
needed for a network service feature. Currently the logic for
this step is implemented in Python.
The Service Component Language (SCL), a domain-specific
language (DSL), has been developed to describe the
computation needed to evaluate the status of a service
component.
Overview of SCL:
A service component is specified by the following elements:
* a name and a list of arguments
* a list of metrics to collect
* a list of expressions to combine the metrics into a
single status value
To further describe and explain SCL, each of these elements is
demonstrated through examples. The complete specification of
the syntax is detailed below.
Global structure of a heuristic package file hierarchy:
Each file has the following structure:
* a single level 1 header (the name of the service component)
* a description of the service component
* a single level 2 header 'Arguments'
  * the arguments of the service component
* a single level 2 header 'Expressions'
  * a sequence of level 3 blocks (either 'Measure' or 'Compute')
    * 'Measure' blocks contain a list of metrics to monitor
    * 'Compute' blocks contain a list of expressions to compute
Name and arguments:
The name is the level 1 header of the file. Arguments are
defined in a list using '*' as an item:
# InterfaceHealthy
Checks whether a given interface on a given device is healthy.
## Arguments
* device device: Device supporting the interface to check.
* str interface: Name of the interface to check.
The above example defines the name and arguments
of InterfaceHealthy. An instance of this service component is
fully parameterized by a device (of type device) and an
interface (of type str).
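To show how such a file could be read mechanically, the sketch below extracts the name and the arguments of the InterfaceHealthy example with regular expressions. It is not the SCL compiler: the function name and the regular expressions are assumptions made here, covering only the header and the 'Arguments' section.

```python
import re

ID = r"[A-Za-z_][A-Za-z0-9_]*"   # letters, digits or '_', not starting with a digit
HEADER = re.compile(rf"^#\s*({ID})\s*$")
ARGUMENT = re.compile(rf"^\*\s+(?:(?P<type>{ID})\s+)?(?P<name>{ID})\s*:\s*(?P<desc>.+)$")

SOURCE = """\
# InterfaceHealthy
Checks whether a given interface on a given device is healthy.
## Arguments
* device device: Device supporting the interface to check.
* str interface: Name of the interface to check.
"""


def parse_name_and_arguments(text):
    """Return the component name and a list of (type, name, description) tuples."""
    name, arguments, section = None, [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            section = line[3:].strip()          # remember the current level 2 section
        elif name is None and (match := HEADER.match(line)):
            name = match.group(1)
        elif section == "Arguments" and (match := ARGUMENT.match(line)):
            arguments.append((match["type"], match["name"], match["desc"]))
    return name, arguments


name, arguments = parse_name_and_arguments(SOURCE)
print(name, arguments)
```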
Metrics: Each metric to collect is defined via a name and a
set of parameters. An example is
* str admin_status = interface.administrative_status(device=device, interface=interface)
_Whether the interface is currently enabled_
Expressions: Each expression is defined via a name,
optionally a symptom to raise, and a list of potential
expressions. An example reusing the previous metric is
* is_up: Whether the interface is currently enabled
broken if false: Interface is down
+ `admin_status == "UP"`
Here a single expression is used. If this expression evaluates
to false, a symptom is raised. However, depending on the
available metrics, we might want to use different expressions
to compute a given value. For instance, assume two devices:
* device 1 provides the "total_memory" and "free_memory" metrics
* device 2 provides the "total_memory" and "used_memory" metrics
In that case, if we want to check that at least 10% of the
memory is available, one can write:
* memory_healthy: At least 10% of the memory is available
+ `free_memory/total_memory > 0.1`
+ `Minus(total_memory,used_memory)/total_memory > 0.1`
Convention: A possible way to organize the expressions
section is to decompose the service component. For instance,
DeviceHealthy can be divided into cpu_healthy,
memory_healthy, storage_healthy and so on. For each
subexpression, include a "Measure" block with all the relevant
metrics (e.g. those relating to the CPU), followed by a
"Compute" block with all the expressions needed to assign a
value to that subexpression (e.g. cpu_healthy).
Finally, the last expression shall combine the expressions of
all the subparts into a single expression summarizing the value.
Operators: The language itself does not define any operators,
since operators need to be defined externally, by convention
in the expressions folder.
Metrics: In the proposed DSL, a metric is identified only by its
name and a set of parameters. To specify how a value is
actually retrieved for a given device with a given OS, an entry
has to be added to the relevant file of the metrics folder.
Spaces, tabs and new lines are ignored unless otherwise
specified.
The syntax of the language is compatible with Markdown, i.e.
if spaces and new lines are correctly ordered, the file will
render correctly in Markdown. However, it is also possible to
write syntactically correct files that do not render well in
Markdown.
Syntax: Here is a human-readable version of the syntax.
Non-terminals are in lower case, terminals are in upper case.
service_component := header arguments expressions
Header: The file starts with a header that defines the name
of the service component and a description of the service
component.
header := '#' ID SUB_DESCRIPTION
ID : any sequence of letters, numbers or '_' starting with either a letter or an
'_'
SUB_DESCRIPTION: anything that does not contain '#'
The ID terminal is used for every ID in the syntax. Spaces
in the description are kept, and everything up to the next '#'
character is considered part of the description. Thus
it is possible to use any Markdown in the description except
titles introduced by "#".
Arguments: The arguments start with a level-2 Markdown
title, followed by a list of at least one argument and optionally
some global parameters.
```
arguments := '## Arguments' (argument)+ (display_params)?
argument := '*' ID? ID ':' LINE_DESC
LINE_DESC: any string not containing a newline.
```
In the argument syntax, the first ID indicates (optionally) the
type and the second one is the actual name of the argument.
The LINE_DESC contains a description of the argument,
terminated by a new line.
The display parameters indicate how to render the service
component in a GUI.
display_params := 'display level' '=' ID
'web_label' '=' web_label
web_label := BACKQ3 TEXT ('{' ID '}' TEXT)* BACKQ3
BACKQ3 : 3 backquotes ('`')
TEXT: any string not containing '{'
The display level ID should be one of the levels defined
in the ServiceInstance class definition. The web label contains
an arbitrary string with some IDs enclosed in braces. The IDs
should be arguments declared before. The web label is
formatted using the Python 'format' function, which replaces
each argument ID in braces with the value of that argument.
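The substitution performed on the web label corresponds to Python's built-in `str.format` with keyword arguments. In the short example below, the label text and the argument values are made up for illustration.

```python
# Hypothetical web label and argument values; only str.format itself is standard Python.
web_label = "Interface {interface} on {device}"
arguments = {"device": "ce-1", "interface": "GigabitEthernet0/0"}

rendered = web_label.format(**arguments)
print(rendered)
```

An ID in braces that is not a declared argument makes `format` raise `KeyError`, which is why the label may only reference arguments declared before it.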
Expressions: The expressions start with a level 2 title
'Expressions' and contain a sequence of measurements and
computations.
expressions := '## Expressions' ( measurements |
computations )*
Measurements: A set of measurements is introduced by the
level 3 title 'Measure' followed by a comment (optional), and
then a list of measurements (i.e. metrics) to obtain. As
explained earlier in this document, the metrics have to be
defined in the metrics folder.
measurements := '### Measure' COMMENT measurement+
measurement := '*' ID? ID '=' METRIC_NAME '('
measurement_parameter (',' measurement_parameter)*
')' '_' M_DESCR '_'
measurement_parameter := '-' ID '=' ID
COMMENT : any string not containing '*'
METRIC_NAME: identifiers separated by dots
M_DESCR: any string not containing '_'
For each measurement there are two IDs: the first one, which is
optional, indicates the type of the metric, and the second one
indicates the name of the measurement. After the equals sign,
the definition of the metric instance to associate with the
measurement name is given. For instance:
interface.mtu(device=source_device, interface=source_interface),
where source_device and source_interface are arguments of the
service component.
The metric name should match an existing metric in the
metric engine. The parameter names should be existing
parameters of that metric. Finally, the description of the
measurement, which must not contain '_', is enclosed
between '_' characters.
Computations: A set of computations is introduced by the
level 3 title 'Compute' followed by a comment (optional), and
then a list of computations.
computations := '### Compute' COMMENT computation+
computation := '*' ID ':' LINE_DESC symptom?
expression_decl+
symptom := LEVEL CONDITION: LINE_DESC
LEVEL: 'broken' | 'degraded'
CONDITION: ('if false'|'if true')
expression_decl := '+' '`' expression '`'
A computation is defined by:
* a name
* a one-line description
* an (optional) symptom
* at least one expression declaration
The name is defined by the first ID in 'computation'. It should
not be used by a previous argument, measurement or
computation.
The symptom can only be added if the expression evaluates
to a boolean value. Level and condition indicate the level of
the symptom (degraded =~ warning and broken =~ error) and
when it is raised.
There can be several expression definitions for the same
expression name. If so, the first definition, in order, for which
all subexpressions (i.e. references to other expressions,
arguments or metrics) are available is taken.
This mechanism allows flexibility in the expressions, as
shown above.
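The selection of the first usable definition can be sketched in a few lines of Python. The data layout and the helper `evaluate_first_available` are assumptions made for illustration; the two alternatives mirror the memory_healthy example given earlier, and the metric values are invented.

```python
# Each alternative lists the metrics it needs, followed by the computation itself.
MEMORY_HEALTHY = [
    (("free_memory", "total_memory"),
     lambda m: m["free_memory"] / m["total_memory"] > 0.1),
    (("total_memory", "used_memory"),
     lambda m: (m["total_memory"] - m["used_memory"]) / m["total_memory"] > 0.1),
]


def evaluate_first_available(alternatives, metrics):
    """Evaluate the first alternative, in order, whose inputs are all available."""
    for required, compute in alternatives:
        if all(name in metrics for name in required):
            return compute(metrics)
    raise LookupError("no alternative has all of its inputs available")


device_1 = {"total_memory": 8000, "free_memory": 400}    # only 5% free
device_2 = {"total_memory": 8000, "used_memory": 2000}   # 75% free

print(evaluate_first_available(MEMORY_HEALTHY, device_1))
print(evaluate_first_available(MEMORY_HEALTHY, device_2))
```

The same service component therefore works on both devices without the author of the package needing to know in advance which metrics a given platform exposes.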
The label of the symptom can contain expressions enclosed
between backquotes '`'. In that case, the expression is replaced
by its value whenever the symptom is raised (the expressions
are always evaluated.) As a corollary, the compiler does not
allow the expressions used in the symptom labels to depend
on a metric that is not already a dependency of all expression
alternatives. For instance:
### Measure
* admin_status: Administrative status of the interface
[...]
* errors_count: Number of errors on the interface
[...]
### Compute
* interface_healthy: Whether the interface is healthy
broken if false -> Interface is not up status: `admin_status` or too much errors `errors_count`
+ `admin_status == "Status UP" and errors_count < 10`
+ `admin_status`
will not compile because the symptom label depends
on errors_count but the last alternative doesn't.
Expression: The expressions have the following syntax:
expression := sum (cmp sum)?
cmp := '==' | '<=' | '<' | '>=' | '>'
sum := factor ('+' factor)*
factor := atom (( '*' | '/') atom)*
atom := ID | INT | FLOAT | STRING | BOOL | '(' expression ')' | call
call := ID '(' expression (',' expression)* ')'
This syntax supports expressions using the arithmetic
operators '+', '*' and '/', the comparison operators in cmp,
identifiers, literals of floats, ints, booleans and strings, and
function calls.
Function calls are used for operators that do not have an infix
version. The first ID is the name of the operator, which is
checked against the classes defined in the expressions folder.
In particular, the number of arguments is checked. The
expressions and metrics can be declared in any order.
However, circular dependencies between expressions are not
allowed.
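Because declarations may appear in any order, an evaluation order has to be derived from the references between expressions, and a cycle has to be rejected. One possible way to do this, shown here as a sketch and not as the method used by the SCL compiler, is a topological sort using Python's standard `graphlib` module (Python 3.9 or later). The expression names are taken from the interface example in this paper; the function name is hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def evaluation_order(references):
    """Order names so that each one comes after everything it references.

    `references` maps an expression name to the set of names it uses.
    """
    try:
        return list(TopologicalSorter(references).static_order())
    except CycleError as error:
        # The second element of args holds the detected cycle.
        raise ValueError(f"circular dependency: {error.args[1]}") from error


references = {
    "interface_healthy": {"low_errors"},
    "low_errors": {"errors", "ok_packets"},
    "errors": {"in_errors", "out_errors"},
}
order = evaluation_order(references)
print(order)
```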
Now that the methodology of service decomposition has been
explained, along with the concept of heuristic packages and
SCL, we can proceed to the decomposition of the actual tunnel
service, with all its service components, as depicted in
Fig. 6. Each box represents the result of the evaluation of a
specific service component. Boxes coloured green
represent service components which have been determined to be
healthy based on the telemetry data, CLI or SNMP outputs
on which they depend. Grey boxes represent service
components whose state could not be conclusively determined,
since there is insufficient data to deem the service component
healthy or unhealthy.
Fig. 6. Graphical representation of the service component assurance tree
Example SCL code which implements the service component
assurance tree:
Is interface flapping.
flapping_if = Delta1min(NofChanges(last-change)) <= 1
Is interface reported and configured UP.
if_up = (enabled == True) * (admin-status == "UP") * (oper-status == "UP")
Total number of packets correctly received or sent.
ok_packets = in-unicast-pkts + in-broadcast-pkts + in-multicast-pkts + out-unicast-pkts + out-broadcast-pkts + out-multicast-pkts
Total number of errors (input and output).
errors = in-errors + out-errors
Whether the number of errors is low.
low_errors = errors <= 0.01 * ok_packets
Is there some traffic (0.5 -> low traffic, 1.0 -> normal traffic.)
some_traffic = ok_packets > 10 / 2 + 0.5
Is interface healthy.
interface_healthy = if_up * low_errors * some_traffic * flapping_if
The expressions and conditions above are represented
graphically in Fig. 6, where it can be observed that the
interface is considered healthy (health = 1.0, meaning 100%
healthy) if all of its service components are also healthy: the
interface is up, the number of errors is low, there is traffic
on the interface and the interface is not flapping. The same
procedure is then performed on each of the branches of the
assurance tree, whose logic is discussed below:
The interface is considered “UP” if both the administrative
enabled state = True and the operational state = True
Interface errors are considered low if both in-errors and out-
errors are 0. … and so on
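The interface health computation from the SCL listing can be traced with concrete numbers in Python. This is an interpretation under stated assumptions, not the reference evaluator: booleans are treated as 0.0 or 1.0 so that the factors can be multiplied, the number of state changes in the last minute is passed in directly instead of being derived by Delta1min and NofChanges, and some_traffic follows the reading given in the listing's comment (0.5 for low traffic, 1.0 for normal traffic). The counter values are invented.

```python
PACKET_COUNTERS = (
    "in-unicast-pkts", "in-broadcast-pkts", "in-multicast-pkts",
    "out-unicast-pkts", "out-broadcast-pkts", "out-multicast-pkts",
)


def interface_health(counters, enabled, admin_status, oper_status, changes_last_minute):
    """Combine the four branches of the assurance tree into one score in [0, 1]."""
    not_flapping = float(changes_last_minute <= 1)
    if_up = float(enabled and admin_status == "UP" and oper_status == "UP")
    ok_packets = sum(counters[name] for name in PACKET_COUNTERS)
    errors = counters["in-errors"] + counters["out-errors"]
    low_errors = float(errors <= 0.01 * ok_packets)
    # Assumed reading of the comment in the listing: 0.5 = low traffic, 1.0 = normal.
    some_traffic = 0.5 + 0.5 * float(ok_packets > 10)
    return if_up * low_errors * some_traffic * not_flapping


busy = {name: 1000 for name in PACKET_COUNTERS} | {"in-errors": 2, "out-errors": 3}
idle = {name: 1 for name in PACKET_COUNTERS} | {"in-errors": 0, "out-errors": 0}

print(interface_health(busy, True, "UP", "UP", 0))
print(interface_health(idle, True, "UP", "UP", 0))
print(interface_health(busy, True, "UP", "DOWN", 0))
```

Because the branches are multiplied, a single broken branch drives the overall score to 0, while low traffic alone only halves it.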
The experimental setup consists of simulated customer premises
routers (CE) as well as provider core routers (P) and provider
edge routers (PE). As shown in Fig. 7, the network with its
service models is configured using the orchestration network
architecture. In the example provided, the actual network
service intent is to establish communication between Client-1
and Client-3, which in turn means that communication needs to
be established by creating a tunnel service between the ce-1
and ce-3 network devices. Each of the network devices streams
telemetry data to the collector, a monitoring platform which
receives and processes all telemetry data.
Fig. 7. Experimental topology setup with telemetry data streamed from the
devices to the monitoring platform
The objective of the measurement was to determine how much data
is actually received via MDT under a usual telemetry export,
with typical data points for a router such as environmental
data and interface statistics. The result of this work is the
amount of data measured after analysing the incoming telemetry
and mapping it to service-aware MDT. All routers and all
incoming data points were taken into account.
5. Service Component Assurance Tree Example
6. Experimental Setup
7. Results
The experimental results demonstrate a reduction in the
amount of incoming MDT from the routers from 5.2 GB to 130
MB, while preserving the relevant information, namely whether
the service is running and operational according to predefined
KPIs. In short, instead of sending a large amount of measured
data, such as model-driven telemetry measurements, it is
possible to send relevant service-aware telemetry data which
represents the computed state of the network service.
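The size of the reported reduction can be checked with simple arithmetic. The calculation below assumes decimal units (1 GB = 1000 MB), which the paper does not specify; with binary units the percentage changes only slightly.

```python
raw_mb = 5200            # 5.2 GB of raw MDT, assuming 1 GB = 1000 MB
service_aware_mb = 130   # service-aware telemetry

reduction_factor = raw_mb / service_aware_mb
reduction_percent = 100 * (1 - service_aware_mb / raw_mb)

print(f"{reduction_factor:.0f}x smaller, {reduction_percent:.1f}% less data")
```

Under this assumption the service-aware export is 40 times smaller than the raw export, a reduction of about 97.5%.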
By decomposing the service configuration, constructing the
assurance component graph and calculating service health
using heuristics, the novel service assurance approach makes
it possible to reduce the amount of telemetry data exported
from the network and to export only service-intent relevant
information instead of raw device data. This in turn means
that it is possible to determine service health at the edge and
to contribute to service assurance in a more efficient manner
than traditional means such as telemetry compression or
establishing different channels to send the same amount of raw
telemetry data. We have presented the designed architecture,
which is capable of analysing the data streams in near real
time and performing the computations needed to establish the
health status of the network service.
Future research would involve applying advanced techniques
such as machine learning (ML) or artificial intelligence (AI)
to the raw data received from monitored devices in an attempt
to identify data clusters and dependencies between different
data sets. The objective of the ML/AI data analysis approach
would be either to augment human-defined heuristic packages
or to create machine-built heuristics.
[1] https://docs.openstack.org/tacker/latest/
[2] https://cloudify.co/
[3] https://www.onap.org/
[4] https://www.cisco.com/c/en/us/solutions/service-provider/solutions-
cloud-providers/network-services-orchestrator-solutions.html
[5] Anil Rao, “Reimagining service assurance for NFV, SDN and 5G”,
White paper, Analysys Mason, 2018.
[6] R. Mijumbi, J. Serrat, J. l. Gorricho, S. Latre, M. Charalambides, and
D. Lopez, “Management and Orchestration Challenges in Network
Functions Virtualization,” IEEE Communications Magazine, vol. 54,
no. 1, pp. 98–105, Jan 2016.
[7] A. J. Gonzalez, G. Nencioni, A. Kamisiski, B. E. Helvik, and P. E.
Heegaard, “Dependability of the NFV Orchestrator: State of the Art
and Research Challenges,” IEEE Communications Surveys Tutorials,
pp. 1–23, 2018.
[8] M. Pattaranantakul, R. He, Z. Zhang, A. Meddahi and P. Wang,
"Leveraging Network Functions Virtualization Orchestrators to
Achieve Software-Defined Access Control in the Clouds," in IEEE
Transactions on Dependable and Secure Computing, pp. 1-14, Nov.
2018.
[9] A. D’Alconzo, I. Drago, A. Morichetta, M. Mellia and P. Casas, "A
Survey on Big Data for Network Traffic Monitoring and Analysis," in
IEEE Transactions on Network and Service Management, vol. 16, no.
3, pp. 800-813, Sept. 2019.
[10] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F.
Estrada-Solano, and O. M. Caicedo. A Comprehensive Survey on
Machine Learning for Networking: Evolution, Applications and
Research Opportunities. J. Internet Serv. Appl., 9(16), 2018.
[11] Cisco Systems, Inc, “GitHub Network Telemetry Pipeline,” Cisco
Systems, Inc, 2017. [Online]. Available:
https://github.com/cisco/bigmuddy-network-telemetry-pipeline
[12] M. Jovanović, M. Čabarkapa, B. Claise, N. Nešković, M. Prokin, Đ.
Budimir, Model driven telemetry using Yang for next generation network
applications, 5th International Conference on Electrical, Electronic and
Computing Engineering (IcETRAN) 2018, pp. 1186 - 1189, Palić,
Serbia, June, 2018.
[13] B. Claise, J. Clarke, and J. Lindblad “Network Programmability with
YANG: The Structure of Network Automation with YANG,
NETCONF, RESTCONF, and gNMI”, Addison-Wesley Book, 1st
edition, 2019.
8. Conclusion
9. Future Research
References
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US