Impact of Feature Selection Techniques in Text Classification: An Experimental Study

Authors:

S. Rahamat Basha, J. Keziya Rani, JJC Prasad Yadav, G. Ravi Kumar

DOI:

https://doi.org/10.26782/jmcms.spl.3/2019.09.00004

Keywords:

Stop word removal, stemming, feature weighting and selection, K-NN, Naïve Bayesian

Abstract

This work is an experimental study of how different feature selection techniques affect the accuracy of text classification. Text classification (document categorization) is a supervised learning task in Information Retrieval: a classifier is trained on labeled instances (documents with predefined category labels) and then automatically assigns categories to unseen test documents. Classification plays a central role in Information Retrieval and document management tasks. The text categorization pipeline consists of text pre-processing (cleaning, stop word removal, and stemming), feature extraction or feature selection, and finally classification. In this work, two machine learning classifiers, Naïve Bayes and K-Nearest Neighbor, are used for classification. The experimental results show that Naïve Bayes gives higher accuracy with most feature selection techniques, whereas K-Nearest Neighbor performs well only when the feature selection technique is either Information Gain (IG) or Mutual Information (MI). The experiments reported here used a self-made corpus for training and the Reuters-21578 corpus for testing.
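To make the pipeline concrete, the sketch below illustrates the steps the abstract describes (stop word removal, term weighting, feature selection, and classification with Naïve Bayes and K-NN) using scikit-learn. It is a minimal illustration under stated assumptions, not the authors' setup: the 20 Newsgroups corpus stands in for the self-made and Reuters-21578 corpora, mutual information is used as the feature-selection scorer, and the number of selected features (k=1000) and the K-NN neighbor count (5) are arbitrary choices.

# Minimal sketch of a text-classification pipeline: preprocessing,
# feature selection, and two classifiers. Not the paper's exact setup.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Hypothetical category subset; the paper trains on a self-made corpus
# and tests on Reuters-21578 instead.
categories = ["comp.graphics", "sci.med", "rec.autos", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=categories,
                           remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", categories=categories,
                          remove=("headers", "footers", "quotes"))

for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("K-NN", KNeighborsClassifier(n_neighbors=5))]:
    pipe = Pipeline([
        # Stop word removal plus tf-idf term weighting (stemming omitted for brevity)
        ("weight", TfidfVectorizer(stop_words="english")),
        # Keep the 1000 features with the highest mutual information with the class
        ("select", SelectKBest(mutual_info_classif, k=1000)),
        ("clf", clf),
    ])
    pipe.fit(train.data, train.target)
    print(name, "accuracy:", accuracy_score(test.target, pipe.predict(test.data)))

In much of the text classification literature, Information Gain is computed as the mutual information between a term's presence or absence and the class label, so the same SelectKBest pattern applies with a different score function.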
