Hyperparameter Tuning for Address Validation using Optuna

MARIYA EVTIMOVA

UFR27,

Paris 1 Sorbonne-Pantheon, CRI,

90 Rue de Tolbiac, 75013 Paris,

FRANCE

Abstract: - Public institutions generally share personal information on their websites, which makes it possible to find such information quickly through internet searches. However, the personal information on the internet is not always accurate, and this can lead to misunderstandings and ambiguity concerning the accessible postal address information. That can be crucial if the information is used to locate the corresponding person or as a postal address for correspondence. Many websites contain personal information, but because web addresses change over time, the information is sometimes out of date or incorrect. An algorithm for validation and verification of personal addresses could be used to synchronize the personal information available on the internet. The paper presents hyperparameter tuning for address validation using the RoBERTa model of the Hugging Face Transformers library. It discusses the implementation of hyperparameter tuning for address validation and its evaluation to achieve high precision and accuracy.

Key-Words: - Hyperparameter Tuning, Hugging Face Transformers Library, Optuna, Machine Learning algorithm, Address validation, RoBERTa model, Postal address

1 Introduction

Digital information on the internet is growing exponentially, with numerous websites, particularly those affiliated with public institutions, disseminating personal data.

Nonetheless, it's important to note that the data

obtained from diverse internet sources is not always

correct, [1]. An increasing number of individuals are

turning to online resources to access pertinent

address-related information, as the internet's

unhindered accessibility facilitates swift

communication, [2]. However, this accessibility

carries the inherent risk of encountering outdated

information, potentially leading to

misinterpretations and severed connections with

intended recipients, [3]. Thus, the accurate

acquisition of address information stands as pivotal

for effective communication.

It's crucial to emphasize that the composition of

postal address entries displays greater diversity

compared to conventional descriptors, [4], [5].

Conversely, not all components of an address are

essential for practical use. For instance, the provided

address above notably excludes the "Postal code"

entry. Similarly, instances exist where suite details

are absent. A significant challenge in verifying

internet-sourced addresses pertains to the intricacies

of vocabulary usage. Language nuances can breed

semantic confusion due to the presence of synonyms

(words with identical meanings) and polysemy (a

single word with divergent meanings).

As an example, the term "Avenue" might be

abbreviated variably as "Av.," "av," or "Ave."

Further exemplifying polysemy, the street name

"rue de Tivoli" is present in both Marseilles and

Paris, illustrating this phenomenon, [6].
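The abbreviation variants mentioned above can be reduced to one canonical form before any comparison is made. The following is a minimal sketch of such a normalization step. The lookup table and the function name `normalize_street_type` are illustrative assumptions and are not part of the method described in this paper.

```python
# Hypothetical lookup table: lower-cased variant without trailing dot -> canonical form.
STREET_TYPE_VARIANTS = {
    "av": "avenue",
    "ave": "avenue",
    "avenue": "avenue",
    "bd": "boulevard",
    "boulevard": "boulevard",
}


def normalize_street_type(address: str) -> str:
    """Replace known street-type abbreviations with their canonical form."""
    normalized = []
    for token in address.split():
        # Compare case-insensitively and ignore a trailing dot or comma.
        key = token.rstrip(".,").lower()
        normalized.append(STREET_TYPE_VARIANTS.get(key, token))
    return " ".join(normalized)


variants = ["12 Av. de la Paix", "12 av de la Paix", "12 Ave. de la Paix"]
canonical = {normalize_street_type(v) for v in variants}
```

With this table, the three spellings in `variants` collapse into a single canonical string. Polysemy, such as the same street name in two cities, cannot be resolved this way and needs the surrounding components (postal code, city).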

It's pertinent to acknowledge that the order of address components differs among various institutions. Notably, certain establishments place the street number before the street name, while others place it after the street name.

Erroneous address databases, characterized by duplicate entries and inconsistent variants, make accurate data retrieval both arduous and unreliable. Name and address data have idiosyncratic attributes that make them particularly difficult to manage. This data domain is especially volatile because institutional affiliations, addresses, and names change over time. Moreover, data

input for names and addresses frequently exhibits

Received: June 2, 2022. Revised: August 29, 2023. Accepted: October 5, 2023. Available online: November 13, 2023.

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2024.12.10

Mariya Evtimova

E-ISSN: 2415-1521

105

Volume 12, 2024

cluttered tendencies, as front-end interfaces allow

free-form entry, often incorporating comments and

additional data without validation.

Further complicating matters is the subjective

nature inherent in the representation of names and

addresses. Individuals possess the liberty to express

identical entities in varied ways, despite referring to

the same entity. Unfortunately, a universally

accepted standard or framework that could

encapsulate name and address data while

simultaneously evaluating its quality remains

conspicuously absent. This challenge is

compounded by the intricate interplay of cultural

contexts within France, which significantly

influence the interpretation and management of

name and address data.

2 Related Work

2.1 Recent Related Research on the RoBERTa Language Model Applied to Address Verification

In the study conducted by, [7], a deep learning

methodology is introduced to enhance the quality of

global address data utilized for imported food safety

management. The proposed approach involves the

classification of user input addresses into

administrative divisions specific to the respective

country. By transforming the addresses into

standardized formats, the quality of the address data

is assessed and enhanced.

Within the context of research, [8], a knowledge distillation approach named Distill-AER has been put forth. This technique transfers knowledge extracted from a well-populated labeled dataset of standard addresses to the target task of fine-grained recognition of address entities in spoken dialogue. To facilitate this transfer, a labeled spoken dialogue dataset containing address entities is constructed through data augmentation.

The study, [9], introduces a two-stage address

validation methodology that incorporates

standardization and classification steps, both

leveraging the RoBERTa pre-trained language

model. The proposed approach is evaluated through

experiments conducted on real datasets, showcasing

its effectiveness and reliability.

2.2 Other Machine Learning Techniques

Applied for Address Validation

The study, [10], presents HyPASS, a hybrid approach that combines Software-Defined Networking (SDN) principles with host discovery and address validation techniques. The primary objective of HyPASS is to mitigate source spoofing attacks by enhancing network security measures.

The paper by, [11], introduces an automated

probabilistic method, based on a hidden Markov

model (HMM), that utilizes national address

guidelines and an extensive national address

database. This approach aims to process raw input

addresses by performing cleaning, standardization,

and verification tasks.

The study, [12], introduces DeepParse, a novel robust architecture specifically designed for postal address parsing.

This architecture implements address parsing and

also reflects Named Entity Recognition (NER)

problems. DeepParse treats input data at various

levels such as characters, trigram characters, and

words, to extract features and carry out address

validation. The model was trained using a

synthetically generated dataset and subsequently

tested with real addresses.

In the research carried out by, [13], an

application of one-dimensional transformation in

CNN (Convolutional Neural Networks) is utilized

for address parsing.

The results obtained from the evaluation

demonstrate a high accuracy for a labeled dataset

comprising nearly 20,000 samples. Notably, the

proposed network architecture possesses a scalable

nature, eliminating the need for any pre- or post-

processing stages.

The authors in, [14], introduce a tool designed

for the annotation of electronic health records. The

research focuses on training random forests to

identify patients who are experiencing

homelessness. To validate the efficacy of each

model, a 10-fold cross-validation technique is

employed.

3 Standards for Postal Address in

France

3.1 French Standard Overview

In France, postal addresses follow a specific format.

Here is the standard structure for a postal address in

France:

● Recipient's Name: Full name of the person or

organization to which the mail is addressed.

● Building Number and Street Name: The

building number comes before the street name.

For example, "12 Rue de la Paix" (12, Peace

Street).

● Postal Code and City: The five-digit postal code

comes before the name of the city. For example,

"75001 Paris" (75001 being the postal code for

the 1st arrondissement of Paris).

● Cedex (Optional): If the recipient's address

includes a Cedex (Courrier d'Entreprise à

Distribution Exceptionnelle) number, it should

be included on a separate line below the postal

code and city. Cedex is used for mail addressed to specific companies, organizations, or government agencies that have their own distribution system. For example, "Cedex 2" or "CEDEX 2".

● Country (Not required for domestic mail)

If the mail is being sent from outside of France, the country name should be written in uppercase letters on the last line of the address. For domestic mail within France, the country name is not necessary.

An example of a postal address in France:

Monsieur Jean Dupont

12 Rue de la Paix

75001 Paris

France

It's important to note that the exact format may

vary slightly depending on the region or specific

requirements of the local postal service, [15].

It is prudent to consistently cross-validate

prevailing norms and directives set forth by La

Poste, the national postal service of France.

Alternatively, one may opt to engage in dialogue

with the intended recipient to ascertain any

supplementary details that could potentially be

deemed requisite.
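The line layout described in this section (recipient, building number and street name, five-digit postal code and city, optional country) can be checked mechanically. The sketch below is one possible format check using regular expressions. The function name `parse_french_address` and the returned field names are assumptions made for illustration; the check covers only the basic layout and does not confirm that the address actually exists.

```python
import re

# Building number first, then the street name, e.g. "12 Rue de la Paix".
STREET_LINE = re.compile(r"^(?P<number>\d+)\s+(?P<street>.+)$")
# Five-digit postal code first, then the city, e.g. "75001 Paris".
POSTAL_LINE = re.compile(r"^(?P<postal_code>\d{5})\s+(?P<city>.+)$")


def parse_french_address(lines):
    """Return the address components as a dict, or None if the layout does not match."""
    if len(lines) < 3:
        return None
    recipient, street_line, postal_line = (line.strip() for line in lines[:3])
    street = STREET_LINE.match(street_line)
    postal = POSTAL_LINE.match(postal_line)
    if not recipient or street is None or postal is None:
        return None
    return {
        "recipient": recipient,
        "building_number": street.group("number"),
        "street_name": street.group("street"),
        "postal_code": postal.group("postal_code"),
        "city": postal.group("city"),
        # The country line is optional for domestic mail.
        "country": lines[3].strip() if len(lines) > 3 else None,
    }


parsed = parse_french_address(
    ["Monsieur Jean Dupont", "12 Rue de la Paix", "75001 Paris", "France"]
)
```

A purely syntactic check such as this one is only a first filter. Deciding whether a well-formed address is correct still requires a reference database or a trained model.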

3.2 Proposal for Address Model

Fig. 1 introduces an address denoted as $Adr = \{adr_1, \ldots, adr_n\}$, where each $adr_i$ is the $i$-th word of the address and $n$ is the length of the address in words. The primary objective of parsing $Adr$ is to assign a label $l$ to each word $adr_i$ from the corresponding set of address tags $T$:

$T = \{P, C, CO, SR, BN, CE, SN, PB\}$.

These tags are defined in the address model depicted in Figure 1, where the address tags represent composite attributes.

Fig. 1: Address model
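The parsing objective stated above, one label from T for each word of Adr, can be expressed as a small data structure with a consistency check. The sketch below treats the tag symbols as opaque strings, since their meanings are given only in Fig. 1. The function name `label_address` is an assumption made for illustration.

```python
from typing import List, Tuple

# Tag symbols copied from the set T in the text; their meanings are defined in Fig. 1.
ADDRESS_TAGS = frozenset({"P", "C", "CO", "SR", "BN", "CE", "SN", "PB"})


def label_address(words: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Pair each word adr_i with its label l, checking that the labeling is well-formed."""
    if len(words) != len(labels):
        raise ValueError("each word needs exactly one label")
    unknown = set(labels) - ADDRESS_TAGS
    if unknown:
        raise ValueError(f"labels outside T: {sorted(unknown)}")
    return list(zip(words, labels))
```

A token classification model produces the `labels` sequence. The check above only guarantees that its output has the right length and vocabulary.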

There are also several commercial paid

applications and services that could be used for the

verification of postal addresses, such as the La Poste API,

Google Maps Geocoding API, SmartyStreets,

Experian Data Quality, and Loqate.

4 Overview of the Tasks Involved in

Hyperparameter Tuning

1. Data Preparation:

Collection and preprocessing of the proposed datasets for address validation (the French BAN corpus and the French higher education dataset).

This step includes:

● cleaning and labeling data with both correct

and incorrect addresses

● splitting the data into training, validation,

and test sets.

2. Pre-trained Model Selection:

Hugging Face's RoBERTa model is chosen for the address validation task.

3. Tokenization:

Choosing an appropriate tokenization strategy for the address validation task.

4. Model Architecture:

Fine-tuning the RoBERTa model for address validation.

5. Hyperparameter Tuning:

Hyperparameters tuning can include modification of

the following parameters, [14]:

● Learning Rate: parameter essential for model convergence, typically searched between 1e-5 and 1e-3.

Techniques such as learning rate schedules

can be applied.

● Batch Size: adjustment of the batch size

concerning the computer hardware

limitations and the size of the data.

Normally, the smaller batch sizes

correspond to smaller learning rates.

● Number of Epochs: determination of the

appropriate number of epochs through

experimentation and avoiding overfitting

● Weight Decay: regularization parameter used to control overfitting.

● Warmup Steps: implementation of warm-up

steps for the learning rate scheduler.

● Gradient Accumulation: useful parameter

for limited GPU memory.

● Early Stopping: This can be used to avoid

overfitting.

● Loss Function: choosing or customization

of the loss function that is suitable for

address validation. It should reflect the

nature of your task, possibly considering

token-level or sequence-level metrics.

● Evaluation Metrics: definition of evaluation

metrics that are relevant for address

validation, such as accuracy, F1 score, or

other custom metrics.

6. Other machine learning techniques that

could be applied, [16], [17]:

● Regularization: application of dropout or

other regularization techniques to prevent

overfitting.

● Fine-tuning Strategy: experimentation with

different fine-tuning strategies. Techniques

like gradual unfreezing of layers or

differential learning rates can be used.

● Data Augmentation: augmentation of the

dataset with variations of addresses, to

improve model robustness.

● Cross-Validation: performance of k-fold

cross-validation to assess the model's

generalization and identify optimal

hyperparameters.

● Grid search or Random search: consider

using grid search or random search to

automate hyperparameter tuning.

● Monitoring and Logging: implementation of

a system to monitor and log the training

process and results, allowing tracking of the

performance of different hyperparameters.

● Hardware and Parallelism: depending on the

resources available, it is possible to explore

distributed training to speed up the tuning

process.

● Deployment and Inference:

deployment of the model for inference in a

production environment.
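Among the hyperparameters listed in this section, the interaction between the learning rate and the warm-up steps is easiest to see as a formula. The sketch below implements a linear warm-up followed by a linear decay to zero, a schedule commonly used when fine-tuning transformer models. The function name and parameter names are illustrative assumptions, and the paper does not state which schedule was used in the experiments.

```python
def linear_warmup_decay(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at a given optimizer step: linear ramp-up, then linear decay to zero."""
    if total_steps <= 0 or not 0 <= warmup_steps <= total_steps:
        raise ValueError("require total_steps > 0 and 0 <= warmup_steps <= total_steps")
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warm-up phase.
        return base_lr * step / warmup_steps
    # Decay from base_lr at the end of warm-up to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Sample the schedule at a few steps for a base learning rate of 1e-4.
schedule = [linear_warmup_decay(s, 1e-4, 100, 1000) for s in (0, 50, 100, 550, 1000)]
```

The learning rate starts at zero, reaches `base_lr` at the end of the warm-up phase, and returns to zero at the last step. Both `base_lr` and `warmup_steps` are therefore natural candidates for the search space of a tuning run.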

5 Description of the Algorithm for

Hyperparameter Tuning using

Optuna

Figure 2 describes the algorithm for hyperparameter

tuning that is proposed for address validation.

Fig. 2: Pseudo-code of the algorithm for

hyperparameter tuning for address validation

6 Experiments and Results

The last phase of the model defined in Figure 2 is

described in this section.

Parsing and classification were evaluated on two real-world datasets:

● The French BAN (Base Adresse Nationale) corpus, which currently contains 25 million structured addresses across France.

● The French higher education dataset, a collection of 3,683 structured addresses from a French higher education database.

Fig. 3: Evaluation of the datasets

The evaluation shown in Figure 3 indicates that the results obtained with Dataset 2 have high precision and F-measure.
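For reference, precision, recall, and the F-measure reported in the evaluation follow the standard definitions based on counts of true positives, false positives, and false negatives. The sketch below computes them. The function name is an assumption, and the counts in the usage line are illustrative and are not results from this paper.

```python
def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall, F1) computed from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative counts only: 8 correct labels, 2 spurious, 2 missed.
precision, recall, f1 = precision_recall_f1(8, 2, 2)
```

When precision and recall are equal, the F-measure takes the same value, which is the case for the illustrative counts above.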

Fig. 4: Comparison of the evaluation results when

applying fine-tuning and hyperparameter tuning.

Fig. 4 compares the results obtained with hyperparameter tuning against the results obtained with fine-tuning alone, as presented in the article, [18]. The results show better performance when applying hyperparameter tuning with Optuna.

7 Conclusion

The number of internet sites containing personal information and addresses is constantly growing, and this data needs to be synchronized and updated, [18]. The websites providing personal information are typically public institutions' websites. This information is generally used by everyone to find addresses (for writing letters or locating office coordinates).

These sites change their content frequently, and

the information needs to be updated daily to ensure

the online information is accurate, [19], [20]. This

has led to the development of an algorithm for

address verification using hyperparameter tuning

with Optuna, which can be used to align data from

various internet sources. The paper describes an algorithm for address verification using Optuna with the RoBERTa model, [21], [22], [23]. The proposed algorithm achieves a 98.5% accuracy rate and a relatively high F-measure in comparison to the other algorithms applied in address validation, [24], [25], [26].

References:

[1] Cai, Wentao, Shengrui Wang, and Qingshan

Jiang. "Address extraction: Extraction of

location-based information from the web",

Web Technologies Research and

Development-APWeb 2005: 7th Asia-Pacific

Web Conference, Shanghai, China, March 29-

April 1, 2005. Proceedings 7. Springer Berlin

Heidelberg, 2005.

[2] Fedushko, Solomia, and Yuriy Syerov.

"Design of registration and validation

algorithm of member’s personal data",

International Journal of Informatics and

Communication Technology 2.2, 2013, pp. 93-

98.

[3] Dakrory, Sara, et al. "Extracting geographic

addresses from social media using deep

recurrent neural networks", 2021 9th

International Japan-Africa Conference on

Electronics, Communications, and

Computations (JAC-ECC). IEEE, 2021.

[4] Beverly, Robert, et al. "Understanding the

efficacy of deployed internet source address

validation filtering", Proceedings of the 9th

ACM SIGCOMM conference on Internet

measurement, 2009.

[5] Nagabhushan, P., Shanmukhappa A. Angadi,

and Basavaraj S. Anami. "Symbolic data

structure for postal address representation and

address validation through symbolic

knowledge base", Pattern Recognition and

Machine Intelligence: First International

Conference, PReMI 2005, Kolkata, India,

December 20-22, 2005. Proceedings 1.

Springer Berlin Heidelberg, 2005.

[6] U.S. POSTAL SERVICE FACILITIES:

Improvements in Data Would Strengthen

Maintenance and Alignment of Access to

Retail Services, GAO Report, December

2007:i-61. Accessed August 29, 2023.

[7] Soeng, Saravit, et al. "Deep Learning Based

Improvement in Overseas Manufacturer

Address Quality Using Administrative District

Data", Applied Sciences 12.21, 2022, vol.

11129.

[8] Wang, Yitong, et al. "Distill-AER: Fine-

Grained Address Entity Recognition from

Spoken Dialogue via Knowledge Distillation",

Natural Language Processing and Chinese

Computing: 11th CCF International

Conference, NLPCC 2022, Guilin, China,

September 24–25, 2022, Proceedings, Part I.

Cham: Springer International Publishing,

2022.

[9] Guermazi, Yassine, Sana Sellami, and Omar

Boucelma. "A RoBERTa Based Approach for

Address Validation", New Trends in Database

and Information Systems: ADBIS 2022 Short

Papers, Doctoral Consortium and Workshops:

DOING, K-GALS, MADEISD, MegaData,

SWODCH, Turin, Italy, September 5–8, 2022,

Proceedings. Cham: Springer International

Publishing, 2022.

[10] Meena, Ramesh Chand, et al. "HyPASS:

Design of hybrid-SDN prevention of attacks

of source spoofing with host discovery and

address validation", Physical Communication

55, 2022, vol .101902.

[11] Christen, Peter, and Daniel Belacic.

"Automated probabilistic address

standardisation and verification.",

Australasian Data Mining Conference, 2005.

[12] Abid, Nosheen, Adnan ul Hasan, and Faisal

Shafait, "DeepParse: A trainable postal

address parser.", 2018 Digital Image

Computing: Techniques and Applications

(DICTA), IEEE, 2018.

[13] Delil, Selman, et al. "Parsing Address Texts

with Deep Learning Method", 2020 28th

Signal Processing and Communications

Applications Conference (SIU), IEEE, 2020.

[14] Erickson, Jennifer, Kenneth Abbott, and

Lucinda Susienka, "Automatic address

validation and health record review to identify

homeless Social Security disability

applicants.", Journal of Biomedical

Informatics, vol.82, 2018, pp. 41-46.

[15] YANG, Li; SHAMI, Abdallah, “On

hyperparameter optimization of machine

learning algorithms: Theory and practice.”,

Neurocomputing, 2020, vol.415, pp. 295-316.

[16] Akiba, Takuya, et al., "Optuna: A next-

generation hyperparameter optimization

framework", Proceedings of the 25th ACM

SIGKDD international conference on

knowledge discovery & data mining, 2019, p.

2623-2631.

[17] Andonie, Răzvan. Hyperparameter

optimization in learning systems. Journal of

Membrane Computing, 2019, 1.4, pp. 279-291

[18] Evtimova, M., “Validation algorithm for

aligning postal addresses available on the

Internet”, MACISE conference, 2023.

[19] Basu, Subhadip, et al. "A novel framework for

automatic sorting of postal documents with

multi-script address blocks." Pattern

Recognition 43.10, 2010, pp.3507-3521.

[20] Andonie, Răzvan. Hyperparameter

optimization in learning systems. Journal of

Membrane Computing, 2019, 1.4, pp. 279-

291.

[21] Lewis, Taylor, Joseph McMichael, and

Charlotte Looby, "Evaluating Substitution as

a Strategy for Handling US Postal Service

Drop Points in Self-Administered Address-

Based Sampling Frame Surveys."

Sociological Methodology 53.1, 2023, pp.

158-175.

[22] De, Shankkha, and Dipti Verma. "Deep

Convolutional Transfer Learning approach for

Bengali handwritten character recognition

from document image." Science and Culture,

2023.

[23] Wolf, Thomas, et al. "Huggingface's

transformers: State-of-the-art natural language

processing." arXiv preprint

arXiv:1910.03771, 2019.

[24] Jain, Shashank Mohan. "Tasks Using the

Hugging Face Library." Introduction to

Transformers for NLP: With the Hugging

Face Library and Models to Solve Problems.

Berkeley, CA: Apress, 2022, pp.69-136.

[25] Ushio, Asahi, and Jose Camacho-Collados.

"T-NER: an all-round python library for

transformer-based named entity recognition."

arXiv preprint arXiv:2209.12616, 2022.

[26] Kayed, Mohammed, Sara Dakrory, and

Abdelmaged Amin Ali. "Postal address

extraction from the web: a comprehensive

survey." Artificial Intelligence Review 55.2,

2022, pp.1085-1120.

Contribution of Individual Authors to the

Creation of a Scientific Article (Ghost-writing

Policy)

I am the only author and contributor of this research.

Sources of Funding for Research Presented in a

Scientific Article or Scientific Article Itself

No funding was received for conducting this study.

Conflict of Interest

The author has no conflicts of interest to declare.

Creative Commons Attribution License 4.0

(Attribution 4.0 International, CC BY 4.0)

This article is published under the terms of the

Creative Commons Attribution License 4.0

https://creativecommons.org/licenses/by/4.0/deed.en_US
