Highlighting Current Issues in API Usage Mining to Enhance

Software Reusability

MUSA IBRAHIM M. ISHAG

Chungbuk National University

College of Electrical and Computer Engineering

Database/Bio informatics Laboratory

Chungda-ro 1, Seowon-Gu, Cheongju

SOUTH KOREA

HYUN WOO PARK

Chungbuk National University

College of Electrical and Computer Engineering

Database/Bio informatics Laboratory

Chungda-ro 1, Seowon-Gu, Cheongju

SOUTH KOREA

DINGKUN LI

Chungbuk National University

College of Electrical and Computer Engineering

Database/Bio informatics Laboratory

Chungda-ro 1, Seowon-Gu, Cheongju

SOUTH KOREA

KEUN HO RYU

Chungbuk National University

College of Electrical and Computer Engineering

Database/Bio informatics Laboratory

Chungda-ro 1, Seowon-Gu, Cheongju

SOUTH KOREA

Abstract: The sheer amount of open source codes made available in code repositories and code search engines

along with the rapidly increasing releases of Application Programming Interfaces (APIs) have made code devel-

opment process easier for programmers. However, learning how to use the elements of an API properly is both

challenging and requires learning curve. Mining the available client and test codes can help programmers to iden-

tify the best practices in using these APIs. In this paper, we investigate the API usage mining to identify open

issues for the researchers. In particular, we make a theoretical comparison of the API usage pattern mining and

highlight unresolved issues along with proper suggestions to address them.

Key–Words: API usage patterns, Mining software engineering data, association rules,frequent patterns

1 Introduction

Application Programming Interfaces (APIs) are

facilitating source code reusability for programmers.

Recently, the number of APIs made available to the

programmers has drastically increased in different do-

mains generating a huge number of reusable code el-

ements. Motivated by this sheer amount of data, re-

searchers have devised methods for mining software

engineering data [1] .

A major current focus of applying data mining to

software engineering data, is mining API usage pat-

terns [9]. Researchers basically apply data mining to

extract patterns that can serve both in code reusability

[10] and to detect violations [14], [13].

A pattern can be considered as a violating pattern

in regard to multiple factors such as a sequence of

code elements that if followed can cause huge energy

consumption in the device that implements it. In con-

trast a reusable pattern is the best practice that is usu-

ally demanded by programmers.

In order to ﬁnd reusable patterns the most used and

applied data mining techniques are association rule

mining and clustering. A recent survey was reported

in [8], where the authors have empirically evaluated

the efﬁciency of applying itemset mining and sequen-

tial pattern mining to the problem of mining call-

usage patterns. More recently, Shaheen and Azhar [9]

have reviewed source code mining techniques where

they have categorized the techniques into three gen-

eral categories; namely, programming rules, copy-

paste detection, and API usage.

In this paper, the most used techniques of API us-

age pattern mining are investigated in order to help

researchers progress in this direction. In essence,

the paper describes the general framework of mining

API usage patterns, evaluates the techniques used, and

highlights the current issues along with viable sugges-

tions for addressing them. Therefore the key contribu-

tions of this paper can be summarized in the follow-

ing:

•Theoretical comparison and evaluation of the

API usage pattern mining techniques.

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2022.10.4

Musa Ibrahim M. Ishag,

Hyun Woo Park, Dingkun Li, Keun Ho Ryu

E-ISSN: 2415-1521

Volume 10, 2022

•Highlight issues in the existing systems and sug-

gest possible enhancements.

The following section deﬁnes the problem of

API usage mining categorizing it into three classes.

Namely, frequent itemset based, frequent sequence

based, and graph based techniques. Afterwards, a the-

oretical comparison is given along with a highlight

about the current issues and possible suggestions. Fi-

nally, a summary concludes this article.

2 API Usage Pattern Mining

A good practice in software development is to pro-

duce reusable source codes. This goal is achieved at

its best by means of APIs. However, the most chal-

lenging task facing software developers is the learn-

ing curve required in order to accomplish their coding

tasks using APIs. Many reasons combined attribute

to this challenge including lack of proper documen-

tation, shortages in publicly available client codes-

codes that use elements of the API, in addition to the

rapid increase in the newly published APIs.

The smart move by researchers to utilize data min-

ing as a mean for discovering these reusable patterns

[1],[15] has resulted in a number of available tools for

software engineers. In order to better understand these

tools a deﬁnition to the problem is given bellow.

2.1 Problem Deﬁnition

Generally, the problem of API usage pattern min-

ing can be deﬁned as the process of ﬁnding the proper

usage sequence or order of a group of reusable code

elements within an API. In this sense, the frequent

pattern mining [16]-a step in association rule analy-

sis [29], and data clustering are inevitable.

To illustrate the process of frequent usages of API,

consider the hypothetical code example in Figure 2.

The local methods of the client code are the trans-

actions. Whereas, the code elements represented as

methods calls can be considered as items in a tradi-

tional market basket analysis. The goal will be to ﬁnd

method calls that frequently occur together in order to

form implication rules. From the example the follow-

ing taxonomies are demanded:

Support count of a code element is the number

of times it occurs in code elements of a client or test

codes. (i.e transactions).

API-usage pattern (call-usage pattern) is an im-

plication relation.

Support of an API-usage rule is the support of its

elements.

Conﬁdence of an API-usage rule is the relative

occurrence of the code elements contained in the rule.

For the rule to be signiﬁcant, it must both be frequent

by satisfying a user provided minimum support and

conﬁdent by satisfying a minimum conﬁdence thresh-

old.

Figure 1 describes the general framework for API

usage mining where the data sets are constituted from

collections of client codes available on the web which

can be gathered by querying code search engines, and

test codes that might be found on API documentation

ﬁles. An example of popular search engines is pro-

vided in table 1.

Based on the way these data sets are preprocessed,

three paths of mining techniques can be distinguished.

Namely; frequent itemset based, frequent sequence

based and frequent graph based methods.

2.2 Frequent Itemset based Methods

An algorithm that follows this approach formulates

the problem by applying a typical analogy of the tradi-

tional frequent itemset mining to mine the frequent us-

age patterns. In essence the source code data set must

be preprocessed and converted into transactions con-

taining items. That is local methods represent trans-

actions and API methods called within these methods

represent items. Afterwards, traditional itemset min-

ing algorithms (Apriori [29], FP-growth [30]) can be

applied to discover the patterns and formulate rules.

Figure 2.b how a source code data set converted into

a traditional market basket transaction data.

2.3 Frequent Sequence based Methods

In this case, algorithms following this approach

preprocess the source code data set and convert it

into sequences of API method calls where the co-

occurrences and the order of calls matter. Thereafter,

the task becomes ﬁnding frequently occurring ordered

sequences of method calls. Therefore, traditional fre-

quent sequence mining algorithms can possibly be ap-

plied. This process is shown in ﬁgure 2.c.

2.4 Frequent Graph based Methods

Methods following this approach model the API

method call sequences as directed acyclic graphs.

Therefore, in the preprocessing step the source code

data will have to be converted into call sequence

graphs or subgraphs. The task will then be looking

for frequent graphs in order to formulate rules. Figure

2.d illustrates this conversion.

The discovered patterns and association rules are

usually incorporated into Integrated Development En-

vironments (IDEs) in order to help in code sugges-

tions. Although this is not fully realized currently, it

will lead to a new generation of intelligent IDEs which

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2022.10.4

Musa Ibrahim M. Ishag,

Hyun Woo Park, Dingkun Li, Keun Ho Ryu

E-ISSN: 2415-1521

Volume 10, 2022

Table 1: Popular Code Search Engines.

Figure 1: A general Framework for Mining API-

Usage Patterns.

will be capable of providing extra functionalities

to the software developers using them as it is illus-

trated in ﬁgure 1.

3 Comparing The Methods

In this paper, we have considered the most popu-

larly cited works and tools available in the literature.

These include CodeWeb[2], Strathcona [3], Prospec-

tor [4], XSnippet [5], MAPO [6], and ParseWeb[7]

which have previously been studied and compared

from different prospective by Shaheen and Azhar [9].

We have additionally considered most recent

studies like UsETeC [10], and eXoaDoc [11],[12].

The algorithms are compared according the following

three dimensions:

•Data Sources

Based on the data set used in the mining pro-

cess, the current tools fall into two basic cate-

gories. Those using client codes from the in-

ternet through issuing queries to code search en-

gines as explained in table 1, and others that uti-

lize the test examples from the associated docu-

mentations. From the comparison shown in ta-

ble 2, CodeWeb, Strathcona, Prospector, XSnip-

pet, MAPO, and ParseWeb use client codes from

the web. Whereas, only UsETeC and eXoaDoc

exploit the test examples. In essence, eXoaDoc

helps in adding proper test code examples to the

API documentation.

•Patterns

The patterns extracted can also fall in the three

general categories explained in ﬁgure 2. Among

the studies considered in this paper, MAPO,

ParseWeb, and UsETeC search for sequential

patterns. Prospector and XSnippet ﬁnd graph

patterns. Whereas, only CodeWeb is searching

for frequent items.

•Algorithmic Approach

Some of the algorithms consider the data sets are

stored in external storage devices. Therefore, a

scan is performed to read the data from the disc

to memory. Whereas, others consider data struc-

tures like trees to compress and store the entire

data set in memory and perform the mining. All

algorithms except MAPO need to read the data

from external discs.

4 Open Issues

Based on the above comparisons we can distin-

guish the following as issues and directions for re-

searchers to investigate. In essence we categorize

them into four main classes. Namely, the data sets,

scalability, algorithms and tools.

•Data sources although some data sets are avail-

able for researchers which include source code

repositories and search engines, still the problem

of getting representative data needs to be stud-

ied. In particular, new direction is emerging that

tries to enrich the documentations of the newly

release APIs with test examples. UsETeC [10]

and eXoaDoc [11, 12] are leading this direction.

•Scalability the algorithms and tools developed

so far need to be re-engineered in order to scale to

the increasing release of new software and APIs.

A possible suggestion here is to exploit the ca-

pabilities of BIG DATA[26] tools and technolo-

gies. Therefore, new scalable algorithms can be

based on Hadoop[27] and MapReduce[28] pro-

gramming model.

•Algorithms The way the current algorithms are

developed is solely based on the assumption of

key-value representation of the code elements.

That is the algorithms consider the existence or

absence of an item. However, in real world ex-

ample of software development, occurrences of

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2022.10.4

Musa Ibrahim M. Ishag,

Hyun Woo Park, Dingkun Li, Keun Ho Ryu

E-ISSN: 2415-1521

Volume 10, 2022

Figure 2: Hypothetical Source Code Dataset and its possible representations

Table 2: Theoretical Comparison of Popular Methods.

code elements might have different signiﬁcant.

For example, one can distinguish system calls

from regular method calls. Therefore, sophis-

ticated algorithmic approaches that reﬂect these

asymmetric relationships are needed. A possible

solution is to adapt more advanced pattern min-

ing approaches. Where weighted frequent pat-

terns and utility based mining might inspire re-

searchers to tackle this issue.

•Need for tools and intelligent IDEs As ex-

plained in ﬁgure 2, a possible utilization of

the discovered patterns is to integrate them into

IDEs. This might lead to new generations of in-

telligent IDEs that might suggest not only a code

completion but also would suggest examples.

Another direction is the lack of dedicated web

and cloud services that might help providing best

practices as a service for clients (i.e software devel-

opers). A possible use case of such services would be

the ability for software developers to

submit their source codes for investigation

before the actual release. Thus, it can be considered a

great contribution towards intelligent software testing.

Addressing the above mentioned issues will

result in an advanced practice in software engineering

both in the reusability, and security and testing.

5 Conclusion

The development of reusable software is one of the

corner stones of software development. Towards this

direction, a plethora of open source codes are avail-

able and being circulated online. APIs are the core of

this reusable software. However, to reduce the learn-

ing curve spent by a programmer in learning how to

use the code elements of these APIs, researchers have

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2022.10.4

Musa Ibrahim M. Ishag,

Hyun Woo Park, Dingkun Li, Keun Ho Ryu

E-ISSN: 2415-1521

Volume 10, 2022

adopted data mining as a solution. In this paper, a

theoretical evaluation and comparison was conducted

comparing the most popular and current tools avail-

able. The comparison considered three main dimen-

sions. Namely the source of the data set used the type

of API-usage patterns discovered, and the algorithmic

approach followed by these tools.

Concurrently, the paper has distinguished three is-

sues to further the API-usage mining research. In par-

ticular, the issues of data sources, scalability, and al-

gorithms and the need for tools and intelligent IDEs

are still open for research and contributions. To this

end, the paper has given possible suggestions. c.

Acknowledgements: This research was supported by

the MSIP(Ministry of Science, ICT and Future Plan-

ning), Korea, under the ITRC(Information Technol-

ogy Research Center) support program (IITP-2015-

H8501-15-1013) supervised by the IITP(Institute

for Information & communication Technology Pro-

motion), and by Basic Science Research Program

through the National Research Foundation of Korea

(NRF) funded by the Ministry of Science, ICT & Fu-

ture Planning (No.2013R1A2A2A01068923).

References:

[1] A. Hassan, and T. Xie, Mining software

engineering data, in Proceedings of the 32nd

ACM/IEEE International Conference on Soft-

ware Engineering-Volume 2, 2010, pp. 503-504.

[2] A. Michail, Data mining library reuse patterns

using generalized association rules, in Proceed-

ings of 22nd International Conference on Soft-

ware Engineering (ICSE’00), Limerick, Ireland,

2000, pp 167-176.

[3] R. Holmes, and G. C. Murphy, Using structural

context to recommend source code examples, in

Proceedings of the 27th international conference

on Software engineering, 2005, pp. 117-125. .

[4] D. Mandelin, L. Xu, R. Bodk et al., Jun-

gloid mining: helping to navigate the API jun-

gle, ACM SIGPLAN Notices, vol. 40, no. 6, pp.

48-61, 2005.

[5] N. Sahavechaphan, and K. Claypool, XSnip-

pet: mining for sample code, ACM SIGPLAN

Notices.

[6] T. Xie, and J. Pei, MAPO: Mining API usages

from open source repositories, in Proceedings of

the 2006.

[7] S. Thummalapenta, and T. Xie, Parseweb: a

programmer assistant for reusing open source

code on the web, in Proceedings of the twenty-

second IEEE/ACM international conference on

Automated software engineering, 2007, pp. 204-

213. international workshop on Mining software

repositories, 2006, pp. 54-57. vol. 41, no. 10, pp.

413-430, 200.

[8] Kagdi, Huzefa, Michael L. Collard, and

Jonathan I. Maletic. Comparing approaches

to mining source code for call-usage patterns,

Mining Software Repositories, 2007. ICSE

Workshops MSR’07. Fourth International Work-

shop on. IEEE, 2007.

[9] Khatoon, Shaheen, Azhar Mahmood, and

Guohui Li. An evaluation of source code mining

techniques, Fuzzy Systems and Knowledge Dis-

covery (FSKD), 2011 Eighth International Con-

ference on. Vol. 3. IEEE, 2011.

[10] Zhu, Zixiao, et al. Mining api usage examples

from test code, Software Maintenance and Evo-

lution (ICSME), 2014 IEEE International Con-

ference on. IEEE, 2014.

[11] Kim, J., Lee, S., Hwang, S. W., and Kim, S.

Adding examples into java documents, In Proc.

of ASE09. pp. 540-544.

[12] Kim, J., Lee, S., Hwang, S. W., and Kim, S.

Enriching Documents with Examples: A Corpus

Mining Approach, ACM Transactions on Infor-

mation Systems (TOIS), 31(1) (2013), pp.

[13] Linares-Vsquez, Mario, et al. Mining energy-

greedy API usage patterns in Android apps: an

empirical study, Proceedings of the 11th Work-

ing Conference on Mining Software Reposito-

ries. ACM, 2014.

[14] Aafer, Yousra, Wenliang Du, and Heng

Yin, DroidAPIMiner: Mining API-level features

for robust malware detection in android, Secu-

rity and Privacy in Communication Networks.

Springer International Publishing, 2013. 86-103.

[15] Mendez, Diego, Benoit Baudry, and Martin

Monperrus, Analysis and Exploitation of Natu-

ral Software Diversity: The Case of API Usages,

Diss. Inria, 2014.

[16] Han, Jiawei, Micheline Kamber, and Jian Pei.

Data mining, southeast asia edition: Concepts

and techniques. Morgan kaufmann, 2006.

[17] Search Code, https://searchcode.com/

[18] Black Duck Open Hub,

http://code.openhub.net/

[19] Codase Site, http://www.codase.com/

[20] Google Code, https://code.google.com

[21] Krugle, http://www.krugle.com/

[22] F1 Source Code,

http://www.f1sourcecode.com/

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2022.10.4

Musa Ibrahim M. Ishag,

Hyun Woo Park, Dingkun Li, Keun Ho Ryu

E-ISSN: 2415-1521

Volume 10, 2022

[23] Nerdy Data, http://nerdydata.com/

[24] Symbol Hund, http://www.symbolhound.com/

[25] Mean Path, https://meanpath.com/

[26] Manyika, James, et al., Big data: The next fron-

tier for innovation, competition, and productiv-

ity, 2011.

[27] White, Tom. Hadoop: the deﬁnitive guide: the

deﬁnitive guide. O’Reilly Media, Inc., 2009.

[28] Dean, Jeffrey, and Sanjay Ghemawat, MapRe-

duce: simpliﬁed data processing on large clus-

ters, Communications of the ACM 51.1 (2008):

107-113.

[29] Agrawal Rakesh, and Ramakrishman Srikant.

Fast Algorithms for Mining Association Rules in

Large Databases, In VLDB, 1994.

[30] Jiawei Han, Jian Pei, and Yiwen Yin. Mining

frequent patterns without candidate generation,

In SIGMOD, 2000.

Creative Commons Attribution License 4.0

(Attribution 4.0 International, CC BY 4.0)

This article is published under the terms of the Creative

Commons Attribution License 4.0

https://creativecommons.org/licenses/by/4.0/deed.en_US

WSEAS TRANSACTIONS on COMPUTER RESEARCH

DOI: 10.37394/232018.2022.10.4

Musa Ibrahim M. Ishag,

Hyun Woo Park, Dingkun Li, Keun Ho Ryu

E-ISSN: 2415-1521

Volume 10, 2022