Genetic Algorithm Based Feature Selection in High Dimensional Text Dataset Classification

Ferhat Özgür Çatak

WSEAS Transactions on Information Science and Applications

Print ISSN: 1790-0832, E-ISSN: 2224-3402

Volume 12, 2015

Abstract: The bag-of-words language model, built on the vector space model, is commonly used to represent the documents in a corpus. This representation, however, requires a high-dimensional input feature space that contains irrelevant and redundant features. Reducing the input space to a non-redundant feature subset improves the generalization of a classifier. In this paper, we present a genetic algorithm for feature selection that reduces the modeling complexity and training time of the classification algorithms used in text classification. We developed a new objective function based on the model's F1 score and the size of the feature subset, and we use the genetic algorithm as a meta-heuristic optimizer to improve the F1 score of the classifier hypothesis. The method has three steps: (i) we define the new objective function to maximize; (ii) we choose candidate features for the classification algorithm; and (iii) we train support vector machine (SVM), maximum entropy (MaxEnt), and stochastic gradient descent (SGD) classifiers to build classification models on publicly available datasets.
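
The abstract names the ingredients of the objective (the model's F1 score and the feature subset size) but does not give its exact form. The sketch below is therefore only an illustration under stated assumptions: the weighted-sum form, the weight `alpha`, and the names `fitness`, `f1_of`, and `genetic_select` are hypothetical and not taken from the paper. The classifier is abstracted as a callback `f1_of(mask)` that returns the F1 score obtained with the selected features; in practice this would wrap training and validating an SVM, MaxEnt, or SGD model.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 score: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fitness(mask, f1_of, alpha=0.9):
    """Assumed objective: reward a high F1 score and a small feature subset.

    mask is a list of 0/1 flags, one per feature. alpha is a hypothetical
    trade-off weight; the paper's actual objective may differ.
    """
    n_selected = sum(mask)
    if n_selected == 0:
        return 0.0  # an empty subset cannot train a classifier
    return alpha * f1_of(mask) + (1 - alpha) * (1 - n_selected / len(mask))


def genetic_select(n_features, f1_of, pop_size=20, generations=30,
                   p_mut=0.05, alpha=0.9, seed=0):
    """Minimal genetic algorithm over feature masks (needs n_features >= 2)."""
    rng = random.Random(seed)

    def score(mask):
        return fitness(mask, f1_of, alpha)

    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=score)
    for _ in range(generations):
        new_pop = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(new_pop) < pop_size:
            # Tournament selection of two parents.
            a = max(rng.sample(pop, 3), key=score)
            b = max(rng.sample(pop, 3), key=score)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop, key=score)
    return best


def toy_f1(mask):
    """Stand-in evaluator: only features 0 and 1 are informative."""
    return 0.5 * (mask[0] + mask[1])


if __name__ == "__main__":
    selected = genetic_select(8, toy_f1)
    print(selected, fitness(selected, toy_f1))
```

Because the size term penalizes every extra feature, a mask that keeps only the informative features scores higher than one that keeps all of them, which is the behavior the abstract's objective is meant to encourage.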

Keywords: Feature selection, support vector machines, logistic regression, stochastic gradient descent, document classification

Pages: 290-296

WSEAS Transactions on Information Science and Applications, ISSN / E-ISSN: 1790-0832 / 2224-3402, Volume 12, 2015, Art. #28
