Development of Annulus-Object Random Bin Picking System based on
Rapid Establishment of RGB-D Images
BO-RUI ZHU1, JIN-SIANG SHAW1, SHIH-HAO LEE2
1Institute of Mechatronic Engineering,
National Taipei University of Technology,
Taipei,
TAIWAN
2Institute of Manufacturing Technology
National Taipei University of Technology
Taipei,
TAIWAN
Abstract: - With the development of the automation industry, robotic arms and vision applications are no longer
limited to the fixed actions of the past. Production lines increasingly require the recognition and grasping of objects in
complex environments, with an emphasis on quick setup and stability. In this paper, a rapidly constructed eye-hand
system for robotic arm grasping is introduced, which enables fast and efficient object manipulation, particularly for
stacked objects. Initially, images were captured using a camera to generate extensive datasets from a limited number of
images. Objects were subsequently segmented and categorized using deep learning networks for object detection and
instance segmentation. Three-dimensional position information was obtained from an RGB-D camera. Finally, object
poses were determined based on plane normal vectors, and gripping positions were manually marked. This reduced the
time required for grab-point identification, model training, and pose localization. Based on the experimental results, the
grasping procedure proposed in this paper is suitable for various object-grasping scenarios. It achieved picking success
rates of 96% for unstacked annular objects and 90.86% for randomly binned annular objects. In the final experiment,
after depth information filtering, a success rate of 95.1% was attained for random bin picking of annular objects.
Key-Words: - Robot Manipulator, Random Bin Picking, Deep Learning, Instance segmentation, Grasping Strategies,
Collaborative Robotics.
Received: May 21, 2023. Revised: December 22, 2023. Accepted: January 16, 2023. Published: February 28, 2024.
1 Introduction
Object grasping is an important research topic in the field of robotics. Random bin picking (RBP) is an important
research direction for robotic arm grasping and is widely needed in industry. Although this action is simple for
humans, it is challenging for a robotic arm. Owing to the random stacking of objects, many occlusions exist in the box,
and problems such as collisions between the robotic arm, the objects, and the box increase the difficulty of gripping.
However, with recent advances in machine vision and artificial intelligence, this type of research has become
increasingly reliable for practical applications, and many companies have actively invested in its development.
Among the traditional methods, triangulation laser measurement was employed to generate two distance images
under a dual-laser system and to synthesize an unobstructed distance image from them, using each beam of light
projected by the laser transmitter onto the surface of the object. The geometric relationship between the measured
points was used to obtain the three-dimensional (3D) point cloud of the object surface. This point cloud information
was combined with the sample consensus (SAC) and iterative closest point (ICP) algorithms to transform the object
point cloud, which was then matched with a computer-aided design (CAD) model of the object to assess its pose and
determine the grip points, [1]. Other researchers utilized a CAD model and a depth image from an RGB-D camera,
converting them into a scene point cloud, [2]. A growing number of recent algorithms use
deep learning convolutional neural networks (CNNs) for sampling and evaluation, such as the approach proposed by
Pinto and Gupta, in which candidate grasps are evaluated to obtain the optimal clamping position and posture, [3].
Various contact mechanics models, including those introduced in previous studies, are also available for evaluation,
[4]. Similarly, a 3D model was used to reconstruct a 3D scan from multiple views, and an object model was simulated
to obtain a synthetic dataset; the object was then segmented with a Mask R-CNN to obtain its point cloud, from which
the algorithm calculated the gripping position, [5]. All of the above methods aim at correct identification, clamping
stability, and a high clamping success rate. However, they require the creation of a large number of images and
labeled data; therefore, they are relatively time-consuming and costly. In certain applications, setting up the clamping
process quickly is desirable.
In this paper, an object-grabbing program that can be built quickly is introduced. It was designed to grab
annulus-shaped objects, which are commonly found in factories; rolls of tape were used here to test the reusability of
the procedure for subsequent objects. First, the clamping position of the object was preset, with the center point of the
object defined as the clamping position. The desired object was cropped from the RGB image using deep learning,
and the 3D position of the object was obtained through algorithms and camera depth information. Subsequently, the
posture and position of the object relative to the camera were derived from the 3D information of the object, and the
information obtained by the camera was converted into the information required by the robotic arm, such that the
robotic arm obtains the grasping position of the object and the grasping posture of the arm and can then grasp the
object.
We accelerated the data collection and data labeling time for deep learning and did not use methods such as the
generative grasp convolutional neural network (GGCNN) for grasp visualization, [6], or the hybrid deep architecture
of visual and tactile sensing proposed in [7], where a reference rectangle method is used to identify graspable regions
of an image. Other methods used the powerful learning ability of large convolutional networks to predict the global
grasp from a complete image of an object, [8], and ResNet50 has been used for feature extraction, [9]. Although the
accuracy of ResNet50 is good, its speed is low. We chose the lighter Inception model architecture and used only depth
information to compute the grasp of objects. This resulted in a fast and low-cost grasping system.
2 System Description
The system, hardware, and software architectures are
introduced in this section.
2.1 System Architecture
Figure 1 shows the architecture of the robotic arm system. It is divided into the following parts: preprocessing,
hardware calibration, image recognition, information integration, and communication. First, we set the object to be
grasped and its grasping point; in this study, the gripper jaws are opened outward at the center point of the tape to grip
it from the inside. The hardware was then calibrated, including image correction for the RGB-D camera and hand-eye
calibration between the robotic arm and the camera. Next, deep learning was applied, including fast data collection
and annotation. Finally, the RGB-D camera was used to complete object recognition.
Subsequently, we used a calculation method designed to integrate the RGB-D 3D information obtained from image
recognition and to determine the grasping posture of the robotic arm. The entire grasping process was then completed
through communication between the PC and the robotic arm.
Fig. 1: System architecture
2.2 Hardware Architecture
The main hardware used in this study included a robotic arm (KUKA iiwa 7 R800), an RGB-D camera (Kinect V2),
and a controller (TOYO CHS2-S40), as shown in Figure 2. The robotic arm has seven degrees of freedom and exhibits
high flexibility, fast movement, and high efficiency. The gripper and its controller were mounted on the flange of the
robot arm using a self-designed fixture. The robotic arm and RGB-D camera were arranged in an eye-to-hand
architecture.
Fig. 2: Robotic grasping system: (1) robotic arm, (2) RGB-D camera, (3) controller, (4) gripper, (5) grasped object
2.3 Software Architecture
In this study, multiple hardware and software techniques were combined. Depending on the technical requirements,
selecting an appropriate development platform and communication method is important. Python 3 was used for deep
learning, image recognition, and data processing, and Java was used for the robotic arm program. The RGB-D camera
and computer were connected via USB 3.0, whereas the robot arm and gripper used EtherCAT, an easy-to-configure
automation bus, as the communication protocol.
To enable system communication, integrating the communication between the controller, sensor, and actuator
software is important. In this study, the system consisted of a personal computer (client) and a robot (server). The
results of deep learning and image recognition were converted by calculation on the personal computer and sent to
the robotic arm acting as the server. Figure 3 shows the communication architecture of the system.
Fig. 3: System communication
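To make the client-server exchange concrete, the following Python sketch shows one possible client-side routine that sends a computed grasp pose to the robot server over a TCP socket. The IP address, port, and comma-separated message format are illustrative assumptions, not the actual protocol of the robot controller.

```python
import socket

# Hypothetical address and port of the robot (server); placeholders only.
ROBOT_IP = "192.168.1.10"
ROBOT_PORT = 30000

def send_grasp_pose(x, y, z, a, b, c):
    """Send a computed grasp pose to the robot and wait for an acknowledgment."""
    # Encode the pose as a simple comma-separated string (assumed message format).
    message = f"{x:.2f},{y:.2f},{z:.2f},{a:.2f},{b:.2f},{c:.2f}\n"
    with socket.create_connection((ROBOT_IP, ROBOT_PORT), timeout=5.0) as sock:
        sock.sendall(message.encode("ascii"))
        # The robot replies when the motion is finished (assumed behavior).
        reply = sock.recv(1024).decode("ascii").strip()
    return reply

if __name__ == "__main__":
    print(send_grasp_pose(450.0, -120.0, 85.0, 0.0, 3.14, 0.0))
```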
3 Methodology
The research methodology is described in this section in
four parts: system integration, deep learning, grasping
position calculation, and robotic grasping control.
3.1 System Integration
3.1.1 Image Correction
Since RGB-D cameras suffer from image distortion, we referred to the correction method proposed in [10] and then
applied the calibration method proposed by Zhang at the International Conference on Computer Vision (ICCV). This
method uses a planar checkerboard for camera calibration, [11]; it overcomes the need for the high-precision 3D
targets required by photogrammetric calibration and avoids the poor robustness of self-calibration methods. In this
study, a self-printed checkerboard was adopted for calibration for convenience, and 20 checkerboard photographs
were captured from different directions, as shown in Figure 4.
Fig. 4: Photographs of the checkerboard captured from
different directions
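As a minimal sketch of this checkerboard calibration step, the OpenCV routine below follows Zhang's method, [11]: corners are detected in each photograph, and the intrinsic matrix and distortion coefficients are estimated. The board dimensions, square size, and file naming are assumptions for illustration.

```python
import glob
import cv2
import numpy as np

# Assumed checkerboard geometry: 9 x 6 inner corners, 25 mm squares.
PATTERN = (9, 6)
SQUARE_MM = 25.0

# 3D coordinates of the board corners in the board's own coordinate frame.
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points = [], []
for path in glob.glob("checkerboard_*.png"):  # the captured photographs (assumed names)
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Estimate the intrinsic matrix and lens distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)

# Undistort subsequent images with the recovered parameters.
undistorted = cv2.undistort(cv2.imread("checkerboard_01.png"), K, dist)
```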
3.1.2 RGB-D Image Matching
The color and depth lenses of the Kinect V2 are not located at the same position, and the color image resolution is
1920 × 1080 pixels, whereas the depth image resolution is 512 × 424 pixels. Therefore, matching the color and depth
images was necessary. The same point in world coordinates was observed using both the color and depth lenses, as
shown in Figure 5. The calibration results are presented in Figure 6.
Fig. 5: Schematic of the same point observed by the color and depth lenses
Fig. 6: (1) Color image, (2) Depth image, (3) Image matching result
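Conceptually, the matching maps each depth pixel to the color image by back-projecting it with the depth-camera intrinsics, transforming it with the extrinsic rotation and translation between the two lenses, and re-projecting it with the color-camera intrinsics. The sketch below illustrates this for a single pixel; all intrinsic and extrinsic values are placeholders rather than the actual Kinect V2 parameters.

```python
import numpy as np

# Placeholder intrinsics for the depth and color cameras, and the extrinsic
# transform from the depth frame to the color frame (rotation and translation).
K_depth = np.array([[365.0, 0.0, 256.0], [0.0, 365.0, 212.0], [0.0, 0.0, 1.0]])
K_color = np.array([[1060.0, 0.0, 960.0], [0.0, 1060.0, 540.0], [0.0, 0.0, 1.0]])
R_d2c = np.eye(3)                       # rotation between the two lenses
t_d2c = np.array([52.0, 0.0, 0.0])      # translation in mm

def depth_pixel_to_color_pixel(u, v, depth_mm):
    """Map one depth-image pixel (u, v) with depth in mm to color-image coordinates."""
    # Back-project the depth pixel into a 3D point in the depth-camera frame.
    x = (u - K_depth[0, 2]) * depth_mm / K_depth[0, 0]
    y = (v - K_depth[1, 2]) * depth_mm / K_depth[1, 1]
    P_depth = np.array([x, y, depth_mm])
    # Transform the point into the color-camera frame.
    P_color = R_d2c @ P_depth + t_d2c
    # Project the 3D point onto the color image plane.
    uv = K_color @ (P_color / P_color[2])
    return uv[0], uv[1]

print(depth_pixel_to_color_pixel(256, 212, 800.0))
```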
3.1.3 Hand-eye Correction of Robotic Arm and
Camera
In robotic arm vision, one of the key issues is hand-eye calibration, which is divided into eye-to-hand and
eye-in-hand configurations according to where the camera is fixed. In this study, an eye-to-hand system architecture
was adopted; that is, the camera was installed outside the robotic arm, and the bases of the camera and the robotic arm
were fixed in position. We considered a quick solution to the classic hand-eye calibration problem, which is
commonly formulated as

AX = XB,    (1)

where A and B denote the relative motions of the robot end-effector and the camera, respectively, and X is the
unknown transformation to be solved. Many researchers, [12], [13], [14], [15], have studied the solution of Formula
(1), and various solutions exist. Since the Open Source Computer Vision Library (OpenCV) includes related
open-source implementations that are convenient for quick use and calculation, we used the solution proposed in [12]
to obtain the relative position of the robotic arm and camera.
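OpenCV exposes the Tsai-Lenz solver of [12] through cv2.calibrateHandEye. The function is documented for the eye-in-hand case; one common way to reuse it for the eye-to-hand setup adopted here is to feed it base-to-gripper transforms instead, as in the hedged sketch below. The pose lists are placeholders to be filled with recorded robot flange poses and the corresponding checkerboard poses.

```python
import cv2

def eye_to_hand_calibration(R_gripper2base, t_gripper2base, R_target2cam, t_target2cam):
    """Solve AX = XB (Formula 1) for an eye-to-hand setup using OpenCV's Tsai-Lenz solver.

    R_gripper2base, t_gripper2base : robot flange poses w.r.t. the robot base (numpy arrays)
    R_target2cam,  t_target2cam    : checkerboard poses w.r.t. the camera (e.g. from solvePnP)
    """
    # cv2.calibrateHandEye is formulated for eye-in-hand; for eye-to-hand,
    # pass base-to-gripper transforms (the inverted robot poses) instead.
    R_base2gripper = [R.T for R in R_gripper2base]
    t_base2gripper = [-R.T @ t for R, t in zip(R_gripper2base, t_gripper2base)]

    R_cam2base, t_cam2base = cv2.calibrateHandEye(
        R_base2gripper, t_base2gripper,
        R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI)   # closed-form solution of [12]
    return R_cam2base, t_cam2base
```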
3.2 Deep Learning
3.2.1 Data Collection
Owing to the need for subsequent deep learning, collecting object image data for training deep learning networks to
identify objects is necessary. Here, a Kinect V2 was used to capture photographs, which served as the dataset. We
randomly placed a large number of power-supply tape rolls on the workbench, as shown in Figure 7(1), to create a
randomly stacked environment. To label the data quickly and meet the requirements of this study, after each data
image was obtained, several objects were removed to obtain a new data image for the follow-up data, as shown in
Figure 7(2); Figures 7(1) and 7(2) compare the scene before and after the objects were removed. Although this
method merely removes objects from a human point of view, it provides a new data image to the system. This
procedure was applied 11 times in this study with different random stacking environments, progressively reducing
the number of objects in each stacking environment, to rapidly generate a total of 122 images as the deep learning
network training dataset.
The 122 collected images were still slightly insufficient as a dataset for deep learning networks. To train image
recognition with good accuracy and generalization, besides the structure of the model, the most important factor is
the size of the dataset: the larger and more complete the data, and the more complex the structure, the better the
training results. Generally, the collection of image data is the most time-consuming step, and with limited data, the
method used in this study is the simplest and most widely used one: data augmentation. Data augmentation can
improve the effectiveness of deep-learning object detection, [16]. We randomly augmented the original images with
noise processing, flipping at different angles, and brightness adjustment, as shown in Figure 8. After augmentation,
976 images were added, yielding a total of 1098 images, including the originals, which were used for subsequent
deep learning.
Fig. 7: (1) Stacked environment data graph, (2) New
data graph with objects removed
Fig. 8: Image data enhancement
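A minimal sketch of the augmentation step is given below, assuming OpenCV and NumPy; it applies random flips, brightness changes, and additive noise to an image. Note that geometric transformations such as flips would also require the corresponding annotation frames to be transformed; the file path and parameter ranges are illustrative.

```python
import random
import cv2
import numpy as np

def augment(image):
    """Return a randomly augmented copy of an RGB training image."""
    out = image.copy()
    # Random horizontal/vertical flip (the matching annotations must be flipped too).
    if random.random() < 0.5:
        out = cv2.flip(out, random.choice([-1, 0, 1]))
    # Random brightness adjustment.
    beta = random.uniform(-40, 40)
    out = cv2.convertScaleAbs(out, alpha=1.0, beta=beta)
    # Additive Gaussian noise.
    if random.random() < 0.5:
        noise = np.random.normal(0, 8, out.shape).astype(np.int16)
        out = np.clip(out.astype(np.int16) + noise, 0, 255).astype(np.uint8)
    return out

# Example: generate eight augmented variants of one captured image (assumed path).
img = cv2.imread("dataset/tape_001.png")
variants = [augment(img) for _ in range(8)]
```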
3.2.2 Image Data Annotation
Before deep learning, apart from data collection, image annotation is the most time-consuming task. Humans can
easily perceive the location and type of objects; however, before deep learning can be applied, the objects must first
be outlined and the machine must be told the object types. This process is called data annotation. To expedite object
labeling, the data collection method described above can be exploited: because each new image differs from the
previous one only by the removed objects, the annotation frames of the previous image can be copied onto the next
image, the frames of the removed objects deleted, and the remaining frames applied directly to the new data image,
expediting its labeling.
To improve the gripping success rate of the robotic arm and avoid gripping collisions, objects were annotated as
recognizable only when their occluded area was less than 20%. This setting improves the object recognition rate and
significantly reduces both the error rate and the number of annotations required per image, as shown in Figure 9.
Fig. 9: Schematic of image annotation
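Because consecutive images differ only by the removed objects, the annotations can be propagated programmatically. The sketch below assumes a COCO-style JSON annotation file, one common format for Mask R-CNN training data; the field names, image ids, and file paths are hypothetical.

```python
import json

def propagate_annotations(coco_path, src_image_id, dst_image_id, removed_ids, out_path):
    """Copy the annotations of one image to the next, dropping the removed objects."""
    with open(coco_path) as f:
        coco = json.load(f)

    copied = []
    for ann in coco["annotations"]:
        if ann["image_id"] == src_image_id and ann["id"] not in removed_ids:
            new_ann = dict(ann)
            new_ann["image_id"] = dst_image_id
            # Give the copied annotation a fresh, unused id.
            new_ann["id"] = max(a["id"] for a in coco["annotations"]) + len(copied) + 1
            copied.append(new_ann)

    coco["annotations"].extend(copied)
    with open(out_path, "w") as f:
        json.dump(coco, f)

# Example: image 12 was derived from image 11 by removing the objects
# whose annotation ids are 101 and 102 (hypothetical ids).
propagate_annotations("annotations.json", 11, 12, {101, 102}, "annotations_updated.json")
```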
3.2.3 Object Recognition
For deep learning object recognition, we used the Mask R-CNN proposed in [17]. Mask R-CNN is a very flexible
framework to which different branches can be added to achieve different tasks, such as target detection, target
classification, instance segmentation, semantic segmentation, and human body pose recognition. We used the COCO
pre-trained model, [18], to reduce the training time and a lightweight CNN backbone, Inception, which captures the
salient features of an image well, [19]. The power-supply tape, with its simple features, was the object trained in this
study. Given the characteristics of Inception, the large features of the training images dominate, and the weights
associated with the relatively few smaller features are negligible; therefore, in theory, the training speed of Inception
is faster than that of ResNet.
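The paper does not list its training or inference code; as one hedged illustration, a COCO-pre-trained Mask R-CNN with an Inception backbone can be run through OpenCV's dnn module using the TensorFlow "mask_rcnn_inception_v2_coco" export, as sketched below. The file names are assumptions, and a model fine-tuned on the tape dataset would be substituted in practice.

```python
import cv2
import numpy as np

# Frozen TensorFlow graph and its OpenCV text config (assumed file names).
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "mask_rcnn_inception_v2_coco.pbtxt")

image = cv2.imread("scene.png")
h, w = image.shape[:2]
net.setInput(cv2.dnn.blobFromImage(image, swapRB=True, crop=False))

# 'detection_out_final' holds boxes/classes/scores; 'detection_masks' holds mask logits.
boxes, masks = net.forward(["detection_out_final", "detection_masks"])

for i in range(boxes.shape[2]):
    score = boxes[0, 0, i, 2]
    if score < 0.8:          # confidence threshold (tunable)
        continue
    class_id = int(boxes[0, 0, i, 1])
    x1, y1, x2, y2 = (boxes[0, 0, i, 3:7] * [w, h, w, h]).astype(int)
    # Resize the low-resolution mask to the box size and binarize it.
    mask = cv2.resize(masks[i, class_id], (x2 - x1, y2 - y1)) > 0.5
    print(f"instance {i}: class {class_id}, score {score:.2f}, box ({x1},{y1})-({x2},{y2})")
```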
3.3 Grasping Position Calculation
3.3.1 Object Grab Position Evaluation
To establish a fast gripping process, a method of self-defined gripping points was adopted in this study. This method
differs from that proposed in [20], where a five-dimensional grasping box was used as the grasping representation
together with a two-level deep network operating on two-dimensional images. As shown in Figure 10, a shallow
CNN first determines all possible grasping rectangles and retains those with higher scores; a deep CNN then selects
the highest-scoring rectangle among the remaining candidates. After the grasping rectangle is obtained, the normal
direction of the point cloud at the center of the rectangle is used as the approach vector of the manipulator. The
detection accuracy reached 75%, and the processing time of each image was approximately 13 s, [20]. The processing
speed of this detection algorithm is therefore relatively low, as the computation is extremely large and
time-consuming.
Therefore, in this study, the center point of the object was defined as the gripping point, and opening the gripper jaws
inside it was used as the gripping method. At this point in the box, the object can be gripped under most postures;
collisions between the gripper and the object can thus be effectively avoided during gripping, and the success rate of
gripping can be improved. This allowed clamping in a stacked environment. This method saved the time required for
labeling grasp rectangles and modeling the gripping system, as only RGB-D information was used for the calculation.
This reduced the equipment requirements for image processing, and the computing time was also lower than that of
methods relying on extensive point cloud computation.
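A hedged sketch of this self-defined grasp point is given below: the centroid of the instance mask is taken as the gripping point, and its 3D camera-frame position is read from the registered depth image using the pinhole model. For an annulus, the centroid falls inside the hole, which suits the expand-from-inside gripping described above; the fallback strategy for invalid depth readings is an assumption.

```python
import numpy as np

def grasp_point_from_mask(mask, depth_mm, K):
    """Return the (X, Y, Z) camera-frame position of the mask centroid.

    mask     : boolean instance mask aligned with the depth image
    depth_mm : registered depth image in millimetres
    K        : 3x3 camera intrinsic matrix
    """
    ys, xs = np.nonzero(mask)
    u, v = int(xs.mean()), int(ys.mean())          # centre point of the annulus
    z = float(depth_mm[v, u])
    if z <= 0:                                     # invalid depth reading
        # Fall back to the median valid depth over the mask (assumed strategy).
        valid = depth_mm[mask]
        z = float(np.median(valid[valid > 0]))
    # Back-project the pixel to a 3D point using the pinhole model.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])
```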
Fig. 10: The common grab detection process
3.3.2 Grasping Pose Evaluation of Objects
The pose of the power tape in the box must be evaluated, as it determines the gripping strategy and the gripping
success rate. Scattered, unstacked objects usually have one of two attitudes: lying flat or standing, as shown in
Figure 11. Grabbing objects in the flat state poses no problem. However, since an eye-to-hand system was used in
this study, the camera could not move with the robotic arm; the system therefore failed to identify the central
grasping position of a standing object, and the four sides of the box could interfere with the robotic arm during
grasping. This is called an interference problem. Thus, the robotic arm could not grasp a standing object at the
position corresponding to its center point. Therefore, if the object was standing or its center point was
unrecognizable, the object could not be grasped directly; instead, the robotic arm pushed the object down to a flat
state, after which it was re-identified and gripped.
When the objects in a box are stacked, they are usually inclined at a certain angle, as shown in Figure 12. The attitude
angle of the object with respect to the camera was converted into the attitude of the object with respect to the robot
arm through the hand-eye calibration. Finally, the grasping angle had to be designed such that the four sides of the
box do not interfere with the grasp; the robotic arm was then expected to grab the object at the same attitude angle.
In addition to calculating the gripping pose, depth information was used to determine the object pose. First, the object
segmented by the Mask R-CNN instance mask was divided into four regions, as shown in Figure 13. Then, we
extracted seven 3D spatial information points from each of the four regions and selected one representative 3D point
from three of the regions.
We denote the three selected points as P1 = (x_1, y_1, z_1), P2 = (x_2, y_2, z_2), and P3 = (x_3, y_3, z_3) and set the
plane normal vector as \vec{n} = (A, B, C). Next, the angle of the normal vector was calculated using the angle
formula between two vectors, Formula (2); substituting the three-point coordinates yields the coefficients A, B, and C
in Formula (3), which define the plane equation and its normal vector in Formula (4) and allow the angle to be
calculated, as shown in Figure 14. Then, the center point of the object was set as the origin of the plane angle, and the
azimuth quadrant and the angle within that quadrant were determined from the object information and the calculated
normal vector angle, as shown in Figure 15. Finally, the obtained object position and attitude were converted into the
corresponding pose of the robot arm through calculation and sent to the robot arm.
\cos\theta = \frac{\vec{n}_1 \cdot \vec{n}_2}{|\vec{n}_1|\,|\vec{n}_2|}    (2)

A = (y_2 - y_1)(z_3 - z_1) - (z_2 - z_1)(y_3 - y_1)
B = (z_2 - z_1)(x_3 - x_1) - (x_2 - x_1)(z_3 - z_1)    (3)
C = (x_2 - x_1)(y_3 - y_1) - (y_2 - y_1)(x_3 - x_1)

A(x - x_1) + B(y - y_1) + C(z - z_1) = 0    (4)
Fig. 11: Flat and standing object indication
Fig. 12: Grab posture when objects are stacked
Fig. 13: Four regions of the instance cutting object
Fig. 14: Schematic of normal vector angle calculation
Fig. 15: Object Plane Quadrant Angle
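A compact NumPy sketch of Formulas (2)-(4) is given below: the cross product of two in-plane vectors yields the normal (A, B, C), and the tilt of the plane is obtained from the angle between this normal and a reference vector. Taking the camera's optical axis (z-axis) as the reference vector and reporting the angle in degrees are assumptions of this sketch.

```python
import numpy as np

def plane_pose(p1, p2, p3):
    """Normal vector and tilt angle of the plane through three 3D points (Formulas 2-4)."""
    p1, p2, p3 = map(np.asarray, (p1, p2, p3))
    # Formula (3): cross product of two in-plane vectors gives (A, B, C).
    n = np.cross(p2 - p1, p3 - p1)
    # Formula (2): angle between the plane normal and the camera's optical axis.
    z_axis = np.array([0.0, 0.0, 1.0])
    cos_theta = np.dot(n, z_axis) / (np.linalg.norm(n) * np.linalg.norm(z_axis))
    tilt_deg = np.degrees(np.arccos(np.clip(abs(cos_theta), -1.0, 1.0)))
    return n, tilt_deg

# Example with three hypothetical points (mm) sampled from the mask regions.
normal, tilt = plane_pose((10, 0, 800), (-10, 5, 802), (0, -12, 805))
print(normal, tilt)
```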
3.4 Robotic Grasping Control
The robotic arm was set to move from the object placement area to the object grasping area through six movements.
To simplify the entire process, the robotic arm read the grasping posture calculated by the PC, moved above the
object through three pre-set positions, and finally performed the grasping action.
4 Results and Discussion
The experiments verified the recognition results and training time of deep learning using Inception.
The objects in the box were stacked, and the robotic arm grabbed them until all the objects were gripped or their
images could no longer be recognized. When the algorithm incorrectly judged that the calculated position of the
object was not within the gripping area, the gripping system was stopped. A grasping failure occurred when the pose
calculated by the algorithm from the depth information differed from the required pose; failure to grasp means that
the object could not be successfully grasped or could not be stably placed in the target area after grasping. In the
unrecognized state, image recognition suffered from algorithmic errors that prevented the correct position from being
assessed.
Finally, we improved the experimental results using filtered depth information, and consequently, the grasping rate
increased from 90.86% to 95.1%.
4.1 Results of Deep Learning
In this study, Inception was used as the backbone for the Mask R-CNN. The total number of data images was 1098,
of which 997 and 101 were used as the training and test sets, respectively. The first training result was obtained after
approximately 380,000 iterations. At the most commonly used 50% IoU threshold, the accuracy reached 0.962 at
approximately 30,000 iterations and 0.985 at 45,000 iterations. At the 75% IoU threshold, the accuracy reached 0.955
at approximately 40,000 iterations and 0.983 at approximately 50,000 iterations, as shown in Figure 16. In terms of
speed, recording every 1,000 iterations required approximately 17 s, so 500,000 iterations of recording were
completed in approximately 2.5 hours plus the storage and operation time. Together with the preceding data
collection, the entire process could be completed in approximately 4-5 hours, which can be considered very fast. The
recognition results are presented in Figure 17.
Fig. 16: Deep Learning Results (Average Precision)
Fig. 17: Mask R-CNN recognition results
4.2 Stacked Gripping Results
In the stacking state, 350 gripping experiments were performed, and the recognition rate of deep learning object
recognition in the stacking state was found to be good. However, in several instances, the object recognition was
incomplete, which resulted in errors in the gripping position calculations. Although the attitude evaluation was
successful 314 times, positional deviations relative to a flat placement were still observed, and 36 attitude evaluation
errors occurred. In total, 33 grasping failures occurred, covering cases with both correct and incorrect attitude
evaluation as well as misplacement errors caused by incomplete object recognition. Finally, the number of successful
grips, including those with evaluation errors, totaled 317. The identification and gripping rates are summarized in
Table 1.
Table 1. Object stacked grab rate
  Object recognition success rate                    100%
  Object pose evaluation correct                     89.71%
  Object pose evaluation error                       10.29%
  Total number of gripping attempts                  350
  Pose evaluation correct, gripping successful       87.99%
  Pose evaluation correct, gripping failed           1.72%
  Pose evaluation error, gripping successful         2.87%
  Pose evaluation error, gripping failed             7.42%
  Total gripping success rate                        90.86%
4.3 Result Improvement
In this section, ways to improve the experimental results are discussed. In most cases, successful gripping was
accomplished with a correctly evaluated posture, and in several cases, gripping succeeded even though the posture
was incorrect. The majority of gripping failures occurred owing to incorrect attitude judgment. This can be attributed
to the use of depth image data for calculating the position and attitude and to the insufficient accuracy of the Kinect
V2 depth image, which resulted in the misjudgment of attitude. In addition, the small number of remaining grasping
failures can be attributed to incomplete instance segmentation: even if the grasping posture is calculated correctly,
the gripper cannot move within the grasping tolerance around the center of the object and therefore cannot grasp the
object successfully.
The key data used in attitude evaluation are the depth information; however, the object attitude evaluation error was
as high as 10%. In an experimental test, an object was placed in the gripping area to test the stability of the depth
information at four points on the object, and the depth value at each of the four points was read 100 times. The
results are shown in Figure 18. Even for the same point, the depth values read at different times differed
significantly. The algorithm uses three of the four points for the normal vector and attitude calculation; according to
the results, the error was approximately ±5 mm, which explains the incorrect attitude evaluations.
To stabilize the depth information described above and achieve a higher gripping rate, we modified the original
depth information extraction method: the camera now samples the same point several times before the information
is extracted, and a low-pass filtering step is applied to extract the depth value with the highest weight ratio from the
repeated readings. The results are shown in Figure 19; the depth information error was significantly reduced.
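A hedged sketch of the improved depth acquisition is shown below: the same pixel is sampled repeatedly, invalid readings are discarded, and the depth value with the highest weight is kept. The paper describes this as a low-pass filtering step that extracts the component with the highest weight ratio; the version below approximates it with a histogram mode, and the sample count and bin width are assumptions.

```python
import numpy as np

def filtered_depth(read_depth_mm, u, v, samples=30, bin_mm=2.0):
    """Return a stabilized depth value at pixel (u, v).

    read_depth_mm : callable returning a fresh depth frame in millimetres
    samples       : number of repeated readings (assumed value)
    bin_mm        : histogram bin width used to pick the dominant depth
    """
    readings = np.array([read_depth_mm()[v, u] for _ in range(samples)], dtype=float)
    readings = readings[readings > 0]              # drop invalid (zero) readings
    if readings.size == 0:
        return None
    # Histogram the readings and keep the centre of the most populated bin,
    # i.e. the depth value with the highest weight ratio.
    bins = np.arange(readings.min(), readings.max() + 2 * bin_mm, bin_mm)
    hist, edges = np.histogram(readings, bins=bins)
    k = int(np.argmax(hist))
    return 0.5 * (edges[k] + edges[k + 1])
```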
Finally, using the improved depth information, we performed a gripping test in the stacking environment. A total of
102 grip tests were conducted, with 96 successful attitude evaluations and six attitude evaluation errors; five gripping
failures occurred, covering cases with both correct and incorrect attitude evaluation as well as misplacement errors
caused by incomplete object recognition. Including the cases with evaluation errors, 97 grips were successful. These
results were compared with 109 gripping tests using the original depth information acquisition method; both the
stability and the gripping rate improved to a certain extent. The identification and grip rates are listed in Table 2.
Fig. 18: Depth information before filtering
Fig. 19: Depth information after filtering
Table 2. Comparison of the gripping rate before and after filtering
                                                  Before filtering    After filtering
  Object recognition success rate                 100%                100%
  Object pose evaluation correct                  89.91%              94.12%
  Object pose evaluation error                    10.09%              5.88%
  Total number of gripping attempts               109                 102
  Pose evaluation correct, gripping successful    88.07%              93.14%
  Pose evaluation correct, gripping failed        1.84%               0.98%
  Pose evaluation error, gripping successful      3.67%               1.96%
  Pose evaluation error, gripping failed          6.41%               3.92%
  Total gripping success rate                     91.74%              95.1%
5 Conclusions
In this study, robotic arms, computer vision, and deep learning were integrated to develop a grasping system that is
simpler and faster to implement than time-consuming and intricate stacked-object grasping systems. The focus was
on mass-produced objects. In the experimental phase, we opted for simpler objects for the grasping tests and
employed a manual method to set the gripping points. This approach facilitated swift data collection and labeling and
expedited the preliminary steps of deep learning. For the training model, we selected Inception: although ResNet
outperforms Inception in terms of recognition accuracy, Inception was chosen owing to its faster model training. The
final experimental results were consistent with our expectations.
Furthermore, in contrast to other studies, our approach
does not rely on point clouds or CAD modeling. Instead,
we utilized color and a limited amount of depth
information to establish the posture of an object in a 3D
space. This enabled the robotic arm to grasp objects based
on the posture direction. In experiments using this
grasping method, we achieved success rates exceeding
95% for both unstacked and stacked objects.
Consequently, our study offers advantages in terms of the
speed at which the entire grasping system can be
established and the overall success rate.
Although this study introduced a grasping system that can be quickly deployed and attains a commendable success
rate, the system can be improved further in the following directions:
- Diversifying the types of objects that can be identified. In this study, we focused on a single object type, which
limits the applicability of the system. Our goal is to enable the robotic arm to handle various objects randomly
stacked within the grasping area without compromising the success rate.
- Improving the depth point accuracy by employing a more stable RGB-D camera, thus enhancing the attitude
evaluation success rate and grip efficiency.
- Addressing challenges such as collisions by incorporating a force sensor into the robotic arm. This sensor will
allow the system to detect errors during the grasping process by relying on force feedback to determine whether the
gripper has contacted the intended object. By analyzing the force data, the robotic arm can assess whether a
successful grip has been achieved or whether it should return to the grasping waiting area for renewed image
recognition.
References:
[1] H.Y. Kuo, H.R. Su, S.H. Lai, and C.C. Wu,
“3D Object Detection and Pose Estimation
from Depth Image for Robotic Bin Picking,”
Proc. of IEEE Int'l Conf. on Automation
Science and Engineering, pp. 1264-1269, 2014.
[2] C. Wu, S. Jiang, and K. Song, "CAD-based
pose estimation for random bin-picking of
multiple objects using an RGB-D camera,"
2015 15th International Conference on
Control, Automation, and Systems (ICCAS),
2015, pp. 1645-1649.
[3] L. Pinto and A. Gupta, "Supersizing self-
supervision: Learning to grasp from 50K tries
and 700 robot hours," 2016 IEEE International
Conference on Robotics and Automation
(ICRA), 2016, pp. 3406-3413.
[4] J. Mahler, J. Liang, S. Niyaz, M. Laskey, R.
Doan, X. Liu, J. Aparicio, and K. Goldberg.
2017. Dex-Net 2.0: Deep Learning to Plan
Robust Grasps with Synthetic Point Clouds and
Analytic Grasp Metrics,
https://doi.org/10.48550/arXiv.1703.09312.
[5] H. Wang, H. Situ, and C. Zhuang, "6D Pose
Estimation for Bin-Picking based on Improved
Mask R-CNN and DenseFusion," 2021 26th
IEEE International Conference on Emerging
Technologies and Factory Automation (ETFA),
2021, pp. 1-7.
[6] D. Morrison, P. Corke, and J. Leitner. Closing
the loop for robotic grasping: a real-time,
generative grasp synthesis approach. 2018,
arXiv: 1804.05172,
https://doi.org/10.48550/arXiv.1804.05172.
[7] D. Guo, F. Sun, H. Liu, T. Kong, B. Fang, and
N. Xi, “A hybrid deep architecture for robotic
grasp detection,” IEEE International
Conference on Robotics and Automation
(ICRA). Singapore: IEEE, 2017: 1609-1614.
[8] J. Redmon and A. Angelova. Real-time grasp
detection using convolutional neural networks.
2014, arXiv: 1412.3128,
https://doi.org/10.48550/arXiv.1412.3128.
[9] S. Kumra and C. Kanan. Robotic grasp
detection using deep convolutional neural
networks. 2016, arXiv: 1611.08036,
https://doi.org/10.48550/arXiv.1611.08036.
[10] J. Jiao, L. Yuan, W. Tang, Z. Deng, and Q. Wu.
“A Post-Rectification Approach of Depth
Images of Kinect v2 for 3D Reconstruction of
Indoor Scenes,” ISPRS International Journal of
Geo-Information, 2017; 6(11):349.
[11] Z. Zhang, “Flexible camera calibration by
viewing a plane from unknown orientations,”
Proceedings of the Seventh IEEE International
Conference on Computer Vision, 1999, pp. 666-
673.
[12] R. Y. Tsai and R. K. Lenz, "A new technique
for fully autonomous and efficient 3D robotics
hand/eye calibration," IEEE Transactions on
Robotics and Automation, vol. 5, no. 3, pp.
345-358, 1989.
[13] F. C. Park and B. J. Martin, "Robot sensor
calibration: solving AX=XB on the Euclidean
group," in IEEE Transactions on Robotics and
Automation, vol. 10, no. 5, pp. 717-721, Oct.
1994.
[14] R. Horaud and F. Dornaika. Hand-eye
Calibration. The International Journal of
Robotics Research, SAGE Publications, 1995,
14 (3), pp.195–210.
[15] N. Andreff, R. Horaud, and B. Espiau, "On-line
hand-eye calibration," Second International
Conference on 3-D Digital Imaging and
Modeling (Cat. No. PR00062), 1999, pp. 430-436.
[16] H.Y. Kuo, H.R. Su, S.H. Lai, and C.C. Wu,
“3D Object Detection and Pose Estimation
from Depth Image for Robotic Bin Picking,”
Proc. of IEEE Int'l Conf. on Automation
Science and Engineering, pp.1264-1269, 2014.
[17] K. He, G. Gkioxari, P. Dollár, and R. Girshick,
"Mask R-CNN," 2017 IEEE International
Conference on Computer Vision (ICCV), 2017,
pp. 2980-2988.
[18] Cocodataset. COCO, [Online].
https://cocodataset.org/#home (Accessed Date:
February 27, 2024).
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S.
Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich, "Going deeper with
convolutions," Proc. IEEE Comput. Soc. Conf.
Comput. Vis. Pattern Recognit. (CVPR), 2015,
pp. 1-9.
[20] I. Lenz, H. Lee, and A. Saxena, “Deep learning
for detecting robotic grasps,” International
Journal of Robotics Research, 2015, 34(4/5):
705-724.
Contribution of Individual Authors to the Creation
of a Scientific Article (Ghostwriting Policy)
- S.H. Lee carried out the experiment and analyzed the
results.
- B.R. Zhu wrote the manuscript. J.S. Shaw conceived
the experiment and reviewed the manuscript.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
This study was supported by a grant from the National
Science and Technology Council, Taiwan (NSTC 112-
2218-E-027-006).
Conflict of Interest
The authors declare no conflict of interest.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_
US