WSEAS Transactions on Information Science and Applications
Print ISSN: 1790-0832, E-ISSN: 2224-3402
Volume 12, 2015
Genetic Algorithm Based Feature Selection in High Dimensional Text Dataset Classification
Author:
Abstract: The bag-of-words model, built on the vector space model, is commonly used to represent documents in a corpus. However, this representation requires a high-dimensional input feature space that contains irrelevant and redundant features. Reducing the input space to non-redundant features improves the generalization of a classifier. In this paper, we present a genetic algorithm for feature selection that aims to reduce the modeling complexity and training time of the classification algorithms used in text classification. We use the genetic algorithm as a meta-heuristic optimizer to improve the F1 score of the classifier hypothesis. The approach has three steps: (i) we develop a new objective function to maximize, based on the model's F1 score and the feature subset size; (ii) we select candidate features for the classification algorithm; and (iii) we train support vector machine (SVM), maximum entropy (MaxEnt) and stochastic gradient descent (SGD) classifiers to build classification models on publicly available datasets.
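The abstract does not state the exact form of the objective function, so the following is only a minimal sketch of how an objective combining F1 score and feature subset size might drive a genetic algorithm over binary feature masks. The weighted-sum form, the weight `alpha`, the function names, and the GA settings (truncation selection, one-point crossover, bit-flip mutation, elitism) are all assumptions for illustration, not the authors' method. The `toy_score` function is a hypothetical stand-in for training a classifier on the selected features and measuring its F1 score on held-out data.

```python
import random

def fitness(f1, n_selected, n_total, alpha=0.9):
    """Assumed objective: reward a high F1 score, penalize large subsets."""
    if n_selected == 0:
        return 0.0  # an empty feature subset cannot be used to classify
    return alpha * f1 + (1.0 - alpha) * (1.0 - n_selected / n_total)

def toy_score(mask, informative=frozenset({0, 1, 2, 3})):
    """Hypothetical proxy for a classifier's F1 score, in [0, 1]:
    the fraction of known-informative features that the mask keeps."""
    kept = sum(1 for i in informative if mask[i])
    return kept / len(informative)

def select_features(n_total, score_fn, generations=40, pop_size=30,
                    mutation_rate=0.02, alpha=0.9, seed=0):
    """Evolve binary masks (1 = keep feature) that maximize `fitness`."""
    rng = random.Random(seed)

    def evaluate(mask):
        return fitness(score_fn(mask), sum(mask), n_total, alpha)

    population = [[rng.randint(0, 1) for _ in range(n_total)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        parents = ranked[: pop_size // 2]   # truncation selection
        children = ranked[:2]               # elitism: carry over the best two
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_total)  # one-point crossover
            child = a[:cut] + b[cut:]
            # bit-flip mutation
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=evaluate)

best_mask = select_features(n_total=20, score_fn=toy_score)
```

In a real experiment, `score_fn` would train one of the classifiers named in the abstract on the columns selected by the mask and return its validation F1 score, which makes each fitness evaluation far more expensive than in this toy setting.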
Keywords: Feature selection, support vector machines, logistic regression, stochastic gradient descent, document classification
Pages: 290-296
WSEAS Transactions on Information Science and Applications, ISSN / E-ISSN: 1790-0832 / 2224-3402, Volume 12, 2015, Art. #28