Eye Localisation using Cascaded U-Net for Autism Spectrum Disorder

1,2Ines Rahmany, 1Sabrine Brahmi, 2Nawres Khlifa
1Faculty of Sciences and Techniques, University of Kairouan, Kairouan, TUNISIA
2Université de Tunis El Manar, Laboratoire de Biophysique et Technologies Médicales, ISTMT, Tunis, TUNISIA

Abstract: Many computer vision applications, such as Computer Aided Diagnosis, require an accurate and efficient eye detector. In this work, we present an efficient approach for determining the position of the eyes in images of faces. First, a series of candidate regions is proposed; next, a set of cascaded U-nets is used to detect the eye regions; then edge detection methods are used to detect the eyelid and iris boundaries, thus helping to determine the gaze direction of a person with ASD. In trials on the GI4E, BioID, and Columbia Gaze datasets, our proposed approach achieved a detection precision better than the most recent methods. The proposed approach is robust to image alterations such as changes in external illumination, facial occlusion, and image modality.

Keywords: ASD, Eye Detection, Segmentation, Deep Learning, U-net.

Received: March 21, 2021. Revised: April 17, 2022. Accepted: July 10, 2022. Published: August 4, 2022.

1. Introduction
Eye detection has become a hot topic in computer vision and pattern recognition in recent years [1], [2], because the locations of the eyes are crucial for a variety of applications, such as psychology, facial expression recognition, medicine, and driver assistance [3].
In practice, however, eye detection is rather difficult. Because cameras are sensitive to fluctuations in lighting and shooting distance, the appearance of the eyes in facial images varies considerably. The face is sometimes partially occluded, which degrades existing eye detection methods that depend on facial model detection [4]. An eye detector also has to perform effectively across a variety of image modalities, such as visible and infrared images. Furthermore, the eye detection method has to be fast, since it is expected to run online in various scenarios. Although several approaches for detecting the eyes in facial images have been presented, finding the approach that performs best in terms of accuracy, reliability, and efficiency is hard. We therefore aim to create an efficient and robust eye detection method.
This work is part of a gaze tracker project for the diagnosis of Autism Spectrum Disorder (ASD). The remainder of the paper is organized as follows. Section 2 reviews state-of-the-art methods. Section 3 presents the proposed eye detection method, which includes candidate region generation and eye region identification and detection. Section 4 describes the dataset, the achieved results, the evaluation, and a discussion. Finally, the last section presents our conclusions.
2. Related Work
Image-based eye detection algorithms fall into three types: classical (traditional) methods, machine learning methods, and deep learning methods. The classical eye detection methods, which are based on the geometric characteristics of the eye, can be divided into two subcategories: geometric models and template matching. In the first subclass, Valenti and Gevers [5] designed a voting mechanism for the localization of the eye pupil based on the curvature of isophotes. Markuš et al. [6] suggested an approach based on an ensemble of randomized regression trees for eye pupil localization. To detect the pupils, Timm and Barth [7] proposed a method based on squared dot products of image gradients.
In the second subclass, an elliptic equation was fitted with the RANSAC technique [8] to locate the pupil center. Based on correlation filters, Araujo et al. [9] introduced an Inner Product Detector for eye localization. Although traditional eye detectors can sometimes prove efficient, they do not perform as well when the external lighting or the facial occlusion changes. Machine learning eye detectors rest on two essential components: feature extraction followed by a cascaded classifier. For quick classification, Chen and Liu [10] used Haar wavelet features and a support vector machine (SVM). To create an efficient eye detector, Sharma and Savakis [11] suggested a method based on lean histogram of oriented gradients (HOG) features combined with SVM classifiers. For eye center detection, Leo et al. [12], [13] opted for self-similarity information paired with shape analysis. D'Orazio et al. [14] introduced a geometrical parameter restriction and a neural classifier to identify eye regions. For simultaneous eye localization and eye state estimation, Gou et al. [15] developed a cascaded regression framework. Kim et al. [16] proposed using multi-scale iris shape features to generate eye candidate regions, which are then verified using a HOG descriptor and mean intensity features. Since deep learning methods have become popular [17], some researchers have employed Convolutional Neural Networks (CNNs) to train eye detectors; this constitutes the third category. A CNN-based pupil center detection approach is presented by Chinsatit and Saitoh [18]. In the study of Fuhl et al. [19], which employed two similar convolutional neural networks to
distinguish between coarse and fine pupil locations, the authors suggested computing the subregions from a downscaled input image. Amos et al. [20] trained a facial landmark detector with 68 feature points to define the face model, 12 of which describe the eye shape. Compared to traditional methods, deep learning-based methods have proved more resilient and accurate. However, efficiency remains a problem. Facial images are typically larger than 640 x 480 pixels, and only the chosen candidate regions are fed into the CNNs, so it is crucial to propose candidate regions quickly and constructively. The classification of the left and right eyes and the localization of the eye center are also important in some applications, such as eye tracking systems and illness detection. On the other hand, the majority of eye detectors in use today are unable to identify the eye regions, separate the left and right eyes, and locate the eye center in a single pass. As a result, our goal is to provide an innovative approach that addresses the problems of the existing methods. Figure 1 presents a taxonomy of eye detection methods.
Fig. 1: Taxonomy of eye detection methods.
3. The Proposed Methodology
Figure 2 presents the overall flowchart of the proposed four-phase method using cascaded U-nets. In the first phase, we swiftly produce a number of candidate eye regions from the complete facial image. In the second phase, a second U-net examines these candidate regions to identify the eye regions and the eyelids. In the third phase, a third U-net is used to pinpoint the iris. Edge detection methods are then used to detect the eyelid and iris boundaries, thus helping to determine the gaze direction of a person with ASD.
Fig. 2: The overall flowchart of the proposed method.
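For concreteness, the following Python sketch outlines the four phases, assuming three trained segmentation networks (unet1, unet2, unet3) with a Keras-style predict interface. All function and variable names are ours for illustration; input resizing and pre-processing are omitted.

import cv2
import numpy as np

def sobel_edges(mask):
    # Trace a boundary with a Sobel filter, as in phases 2 and 3.
    gx = cv2.Sobel(mask, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(mask, cv2.CV_32F, 0, 1, ksize=3)
    return cv2.magnitude(gx, gy)

def detect_eyes(face, unet1, unet2, unet3):
    # face: grayscale image, already scaled to the networks' input size.
    # Phase 1: propose candidate eye regions on the full facial image.
    roi = unet1.predict(face[None, ..., None])[0, ..., 0]
    ys, xs = np.where(roi > 0.5)
    eye_roi = face[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # Phase 2: segment the ocular area, then trace the eyelids.
    ocular = unet2.predict(eye_roi[None, ..., None])[0, ..., 0]
    eyelids = sobel_edges(ocular)

    # Phase 3: segment the iris inside the ocular area, then its boundary.
    ocular_area = eye_roi * (ocular > 0.5)
    iris = unet3.predict(ocular_area[None, ..., None])[0, ..., 0]
    iris_edges = sobel_edges(iris)

    # Phase 4: fuse the two segmentations to localize the eye regions.
    fused = np.logical_or(ocular > 0.5, iris > 0.5)
    return fused, eyelids, iris_edges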
The architecture of the U-net is shown in Figure 3. It consists of a contracting path on the left and an expansive path on the right. The contracting path follows the typical architecture of a convolutional network: the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU), and a 2x2 max pooling operation with stride 2 for downsampling. The number of feature channels is doubled at each downsampling step. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (up-convolution) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. Cropping is necessary because every convolution loses border pixels. At the final layer, a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. In total, the network has 23 convolutional layers.
Fig. 3: The U-net architecture.
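To make the architecture concrete, the following is a minimal two-level U-net sketch in Keras. We use 'same' padding (so no cropping is needed) and fewer levels than the 23-layer network described above; the depth and channel counts are illustrative.

import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3x3 convolutions, each followed by a ReLU.
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(128, 128, 1), n_classes=2):
    inputs = layers.Input(input_shape)

    # Contracting path: channels double at each downsampling step.
    c1 = conv_block(inputs, 64)
    p1 = layers.MaxPooling2D(2, strides=2)(c1)
    c2 = conv_block(p1, 128)

    # Expansive path: a 2x2 up-convolution halves the channels, then the
    # feature map is concatenated with the one from the contracting path.
    u1 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(c2)
    u1 = layers.concatenate([u1, c1])
    c3 = conv_block(u1, 64)

    # Final 1x1 convolution maps each feature vector to the class scores.
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(c3)
    return tf.keras.Model(inputs, outputs)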
4. Results and Evaluations
4.1. Dataset
In this work, we use a publicly available gaze dataset, the Columbia Gaze Data Set (CGDS) [21]. It contains 5,880 images of 56 people (32 males, 24 females) with various gaze directions and head poses, each image having a resolution of 5184 x 3456 pixels. For each subject, there are 5 head poses and 21 gaze directions per head pose. The subjects come from different ethnicities, and 21 of them wore glasses. A sample of the Columbia Gaze dataset is presented in Figure 4.
One picture was acquired for each individual for every combination of five horizontal head poses (0°, ±15°, ±30°), seven horizontal gaze directions (0°, ±5°, ±10°, ±15°), and three vertical gaze directions (0°, ±10°). Table I presents the attributes of the images in the dataset.
4.2. Eye Detection Results
The figures in this subsection display some qualitative results of our proposed cascaded U-net method.
Fig. 4: Sample of Columbia Gaze dataset [21].
TABLE I: CGDS dataset attributes.
Number of subjects: 56
Gaze targets per subject and head pose: 21
Fixed head poses: 5
Head pose calibration: Yes
Resolution (px): 5184 x 3456
Total images: 5,880
1) Phase 1 Results: Extraction of the Region of Interest: The inputs and outputs of the first U-net for the extraction of the region of interest are given in Figures 5 and 6.
Fig. 5: Input images of U-net 1.
2) Phase 2 Results: Eyelid Detection: We use the eye regions of interest provided by U-net 1 as inputs (Figure 7) to the second U-net, which segments and locates the ocular area (Figure 8). A Sobel filter is then applied to detect the eyelids (Figure 9).
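As an illustration, the eyelid boundary can be extracted from the segmentation output as follows; the probability-map input and the 0.5 threshold are our assumptions.

import cv2
import numpy as np

def boundary_from_mask(prob_map, thresh=0.5):
    # Binarize the U-net output, then keep the pixels where the Sobel
    # gradient of the mask is non-zero, i.e. the region boundary.
    mask = (prob_map > thresh).astype(np.float32)
    gx = cv2.Sobel(mask, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(mask, cv2.CV_32F, 0, 1, ksize=3)
    return (cv2.magnitude(gx, gy) > 0).astype(np.uint8) * 255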
3) Phase 3 Results: Iris Boundary Detection: We use the ocular area provided by U-net 2 as input (Figure 10) to the third U-net, which segments and locates the iris (Figure 11). A Sobel filter is then applied to detect the iris boundary (Figure 12).
4) Phase 4 Results: Eye Region Detection: In the last step, we fuse the outputs of U-net 2 and U-net 3, which yields the localization of the different eye regions (Figure 13), thus helping to determine the gaze direction.
We have presented the outputs of the cascaded U-nets for detecting and segmenting the ocular areas in order to locate the eyes. The results demonstrate that our approach can successfully detect eye positions. We used the normalized error to assess the eye classification and detection capabilities.
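For reference, the normalized error is commonly defined in the eye localization literature as the worst of the two per-eye localization errors divided by the inter-ocular distance; a minimal sketch, with illustrative coordinates:

import numpy as np

def normalized_error(pred_left, pred_right, gt_left, gt_right):
    # Worst per-eye Euclidean error, normalized by the distance
    # between the ground-truth eye centers.
    d_left = np.linalg.norm(np.subtract(pred_left, gt_left))
    d_right = np.linalg.norm(np.subtract(pred_right, gt_right))
    return max(d_left, d_right) / np.linalg.norm(np.subtract(gt_left, gt_right))

# An error e <= 0.25 roughly corresponds to half the eye width.
print(normalized_error((110, 100), (212, 101), (112, 100), (210, 100)))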
4.3. Evaluation
Our proposed method was tested in the Google Colab environment, a free cloud-based online environment for Jupyter notebooks that allows training deep learning and machine learning models on GPUs. The specifications of the runtime offered by Google Colab are presented in Table II.
TABLE II: Specifications of the runtime offered by Google Colab.
GPU: up to a Tesla K80 with 12 GB of GDDR5 VRAM
CPU: Intel Xeon processor, two cores @ 2.20 GHz
RAM: 13 GB
The outputs of the three stages of cascaded U-nets were evaluated individually on the Columbia Gaze dataset. For the U-nets' training and testing, we randomly split the dataset into two portions: 80% of the images from each class were used as the training set, and the remaining 20% as the test set.
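A minimal sketch of this random 80/20 split; the seed and the index bookkeeping are ours for illustration, and the per-class stratification described above is omitted for brevity.

import numpy as np

rng = np.random.default_rng(seed=42)        # seed is an assumption
indices = rng.permutation(5880)             # one index per CGDS image
n_train = int(0.8 * len(indices))           # 4,704 training images
train_idx, test_idx = indices[:n_train], indices[n_train:]  # 1,176 test images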
Currently, there are no analytical approaches for choosing the parameters of U-net networks. Therefore, to obtain optimal results, the best parameters must be found empirically. The model parameters are presented in Table III.
TABLE III: Model parameters.
Total images: 5,880
Training images: 4,704
Test images: 1,176
Batch size: 16
Epochs: 16
Optimizer: ADAM
Activation: softmax
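The configuration of Table III translates to Keras roughly as follows; the loss function and the validation split are our assumptions, since the paper does not state them, and build_unet refers to the sketch in Section 3.

model = build_unet()                              # sketch from Section 3
model.compile(optimizer="adam",                   # ADAM, as in Table III
              loss="categorical_crossentropy",    # assumption: loss not stated
              metrics=["accuracy"])
history = model.fit(x_train, y_train,             # prepared image/mask pairs
                    batch_size=16, epochs=16,     # as in Table III
                    validation_split=0.1)         # assumption: not stated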
Figure 14 shows the loss and accuracy curves over 16 epochs; both curves flatten out near the borders of the plot, and the accuracy reaches 98.61%. This shows that the proposed model performs well as an image segmentation model for eye detection.
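Continuing the training sketch above, curves like those in Figure 14 can be drawn from the Keras history object:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history["accuracy"], label="train accuracy")
ax2.plot(history.history["loss"], label="train loss")
ax1.set_xlabel("epoch"); ax1.legend()
ax2.set_xlabel("epoch"); ax2.legend()
plt.show()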
We also examined the average time of each phase. Table IV presents the overall performance of the proposed method.
TABLE IV: Results of the eye localization phases of the proposed cascaded U-net model.
Proposed method   Accuracy   Execution time
U-Net 1           100%       18 s
U-Net 2           95.19%     11 s
U-Net 3           97.61%     8 s
The comparison between the suggested method and state-of-the-art techniques on the BioID and Columbia Gaze datasets is shown in Table V.
Fig. 6: Phase 1 Results: Extraction of the region of interest.
Fig. 7: Input images of U-net 2.
Fig. 8: Ocular eye area.
Fig. 9: Phase 2 Results: Detection of the eyelids.
Fig. 10: Input images of U-net 3.
In comparison to the approaches of Valenti and Gevers [5] (86.1%), Araujo et al. [9] (88.3%), and Gou et al. [15] (91.2%), our suggested method performs better (97.61%). Additionally, our approach remains robust even when the test subject is wearing glasses, and it does not need clustering. The approaches of Valenti and Gevers [5] and Araujo et al. [9] are less reliable than ours.
5. Conclusions
In this paper, we developed an efficient cascaded U-net method for locating the eyes in facial images. The suggested approach is unaffected by whether images are captured in visible or infrared light, and it does not depend on a face model for locating the eyes. We evaluated our approach on more than 5,000 facial photos and found that our eye detector is useful and efficient, producing good segmentation results with noticeably high efficiency.
Fig. 11: Segmentation of the iris.
Fig. 12: Phase 3 Results: Iris boundary detection.
Fig. 13: Eye location results.
Fig. 14: Accuracy and loss curves of the proposed U-net model.
TABLE V: Comparison of the achieved results of the proposed approach with other state-of-the-art methods.
Technique            Databases                                Accuracy
George et al. [22]   BioID: 1521 images; GI4E: 1380 images    94.74%
Yiu et al. [23]      BioID: 1521 images                       96%
Valenti et al. [5]   BioID: 1521 images                       86.1%
Araujo et al. [9]    BioID: 1521 images                       88.3%
Gou et al. [15]      BioID: 1521 images                       91.2%
Proposed method      BioID: 1521 images                       97.61%
This study is part of the gaze tracker project for the diagnosis of Autism Spectrum Disorder.
References
[1] H. Fu, Y. Wei, F. Camastra, P. Arico, and H. Sheng, "Advances in eye tracking technology: theory, algorithms, and applications," Computational Intelligence and Neuroscience, vol. 2016, 2016.
[2] L. Zhang, Y. Cao, F. Yang, and Q. Zhao, "Machine learning and visual computing," 2017.
[3] A. H. Mosa, M. Ali, and K. Kyamakya, "A computerized method to diagnose strabismus based on a novel method for pupil segmentation," 2013.
[4] L. Birgit and M. Brodsky, "Pediatric ophthalmology, neuro-ophthalmology, genetics: Strabismus - new concepts in pathophysiology, diagnosis, and treatment," 2010.
[5] R. Valenti and T. Gevers, "Accurate eye center location through invariant isocentric patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1785–1798, 2011.
[6] N. Markuš, M. Frljak, I. S. Pandžić, J. Ahlberg, and R. Forchheimer, "Eye pupil localization with an ensemble of randomized trees," Pattern Recognition, vol. 47, no. 2, pp. 578–587, 2014.
[7] F. Timm and E. Barth, "Accurate eye centre localisation by means of gradients," VISAPP, vol. 11, pp. 125–130, 2011.
[8] L. Świrski, A. Bulling, and N. Dodgson, "Robust real-time pupil tracking in highly off-axis images," in Proceedings of the Symposium on Eye Tracking Research and Applications, 2012, pp. 173–176.
[9] G. M. Araujo, F. M. Ribeiro, E. A. Silva, and S. K. Goldenstein, "Fast eye localization without a face model using inner product detectors," in 2014 IEEE International Conference on Image Processing (ICIP). IEEE, 2014, pp. 1366–1370.
[10] S. Chen and C. Liu, "Eye detection using discriminatory Haar features and a new efficient SVM," Image and Vision Computing, vol. 33, pp. 68–77, 2015.
[11] R. Sharma and A. Savakis, "Lean histogram of oriented gradients features for effective eye detection," Journal of Electronic Imaging, vol. 24, no. 6, p. 063007, 2015.
[12] M. Leo, D. Cazzato, T. De Marco, and C. Distante, "Unsupervised approach for the accurate localization of the pupils in near-frontal facial images," Journal of Electronic Imaging, vol. 22, no. 3, p. 033033, 2013.
[13] M. Leo, D. Cazzato, T. De Marco, and C. Distante, "Unsupervised eye pupil localization through differential geometry and local self-similarity matching," PLoS One, vol. 9, no. 8, p. e102829, 2014.
[14] T. D'Orazio, M. Leo, and A. Distante, "Eye detection in face images for a driver vigilance system," in IEEE Intelligent Vehicles Symposium, 2004. IEEE, 2004, pp. 95–98.
[15] C. Gou, Y. Wu, K. Wang, K. Wang, F.-Y. Wang, and Q. Ji, "A joint cascaded framework for simultaneous eye detection and eye state estimation," Pattern Recognition, vol. 67, pp. 23–31, 2017.
[16] H. Kim, J. Jo, K.-A. Toh, and J. Kim, "Eye detection in a facial image under pose variation based on multi-scale iris shape feature," Image and Vision Computing, vol. 57, pp. 147–164, 2017.
[17] D. Xie, L. Zhang, and L. Bai, "Deep learning in visual computing and signal processing," Applied Computational Intelligence and Soft Computing, vol. 2017, 2017.
[18] W. Chinsatit and T. Saitoh, "CNN-based pupil center detection for wearable gaze estimation system," Applied Computational Intelligence and Soft Computing, vol. 2017, 2017.
[19] W. Fuhl, T. Santini, G. Kasneci, and E. Kasneci, "PupilNet: Convolutional neural networks for robust pupil detection," arXiv preprint arXiv:1601.04902, 2016.
[20] B. Amos, B. Ludwiczuk, M. Satyanarayanan et al., "OpenFace: A general-purpose face recognition library with mobile applications," CMU School of Computer Science, vol. 6, no. 2, 2016.
[21] B. Smith, Q. Yin, S. Feiner, and S. Nayar, "Gaze locking: Passive eye contact detection for human-object interaction," in ACM Symposium on User Interface Software and Technology (UIST), Oct. 2013, pp. 271–280.
[22] A. George and A. Routray, "Fast and accurate algorithm for eye localisation for gaze tracking in low-resolution images," IET Computer Vision, vol. 10, no. 7, pp. 660–669, 2016.
[23] Y.-H. Yiu, M. Aboulatta, T. Raiser, L. Ophey, V. L. Flanagin, P. zu Eulenburg, and S.-A. Ahmadi, "DeepVOG: Open-source pupil segmentation and gaze estimation in neuroscience using deep learning," Journal of Neuroscience Methods, vol. 324, p. 108307, 2019.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the Creative
Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US