A text feature selection method based on category-distribution divergence

Yonghe Lu, Wenqiu Liu, Xinyu He

Abstract


This paper addresses a shortcoming of traditional feature selection methods [such as document frequency (DF), Chi-square statistic (CHI), information gain (IG), mutual information (MI) and odds ratio (OR)]: they do not consider how features are distributed among different categories. The work aims to select features that accurately represent the themes of texts and thereby improve classification accuracy. We propose CDDFS, a text feature selection method based on category-distribution divergence, into which the degree of membership and the degree of non-membership are introduced. CDDFS acts as a filter that removes features with a low degree of membership and a high degree of non-membership. CDDFS is compared against five feature selection methods with three classifiers on the Sogou Lab Data corpus; experimental results show that it outperforms the other feature selection methods when using KNN and performs close to CHI when using the Rocchio algorithm and SVM at high dimensions. This research defines the representativeness and distinguishability of a feature for its category and for non-categories: a feature with high representativeness and good distinguishability is retained during feature selection.
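To make the filtering idea concrete, the sketch below illustrates the general criterion described in the abstract: retain terms with a high degree of membership in some category and a low degree of non-membership elsewhere. The scoring functions are simplified stand-ins (membership approximated by in-category document frequency, non-membership by out-of-category document frequency), not the paper's category-distribution divergence formulas, and all names are illustrative.

```python
from collections import defaultdict

def membership_filter(docs, labels, top_k=1000):
    """Illustrative feature filter: keep terms that appear often inside
    some category (high membership proxy) and rarely outside it
    (low non-membership proxy). Not the paper's exact CDDFS formulas."""
    categories = set(labels)
    df = defaultdict(lambda: defaultdict(int))  # term -> category -> doc frequency
    cat_sizes = defaultdict(int)                # category -> number of documents
    for tokens, cat in zip(docs, labels):
        cat_sizes[cat] += 1
        for t in set(tokens):
            df[t][cat] += 1

    total_docs = len(docs)
    scores = {}
    for t, per_cat in df.items():
        best = 0.0
        for c in categories:
            in_c = per_cat.get(c, 0) / cat_sizes[c]          # membership proxy
            out_docs = total_docs - cat_sizes[c]
            out_c = ((sum(per_cat.values()) - per_cat.get(c, 0)) / out_docs
                     if out_docs else 0.0)                   # non-membership proxy
            best = max(best, in_c - out_c)                   # gap for the best-matching category
        scores[t] = best

    # keep the terms that best separate some category from the rest
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Example: tokenised documents with category labels
docs = [["stock", "market", "rises"], ["match", "score", "goal"], ["stock", "fund", "index"]]
labels = ["finance", "sport", "finance"]
print(membership_filter(docs, labels, top_k=5))
```

The proxy scores above only mirror the filtering criterion (high membership, low non-membership); in the full method these degrees are derived from the divergence of a feature's distribution across categories.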


DOI: https://doi.org/10.5430/air.v4n2p143

