Conference Proceedings


Gareth J.F. Jones
Debasis Ganguly
Piyush Arora



Nearest neighbour based transformation functions for text classification: a case study with StackOverflow (2016)

Abstract

The significant growth in the number of questions posted to question answering forums has led to increasing interest in text categorization methods for classifying newly posted questions as good (suitable for the forum) or bad (unsuitable). Standard text categorization approaches, e.g. multinomial Naive Bayes, are likely to be unsuitable for this task for two reasons: i) questions are relatively short and therefore lack sufficient informative content; and ii) there is considerable vocabulary overlap between the two classes. To make classification more robust, we propose to use the neighbourhood of existing questions that are similar to the newly asked question. Instead of learning the classification boundary from the questions alone, we use this neighbourhood to transform each question vector into a different vector in the feature space. We explore two neighbourhood functions, one defined over the discrete term space and the other over the continuous real-valued vector space obtained from document embeddings. Experiments conducted on StackOverflow data show that this neighbourhood transformation can improve classification accuracy by up to about 8% compared with using unigram textual features alone.
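The idea of transforming a question vector through its neighbourhood can be illustrated with a minimal sketch. The abstract does not specify the transformation function, so everything below is an assumption made for illustration: the function name `neighbourhood_transform`, the use of cosine similarity, the choice of `k` neighbours, and the blending weight `alpha` that mixes the original vector with the mean of its neighbours are all hypothetical and are not taken from the paper.

```python
import numpy as np

def neighbourhood_transform(X, query, k=2, alpha=0.5):
    """Blend `query` with the mean of its k most cosine-similar rows of X.

    Hypothetical illustration only; the paper's actual transformation
    functions may differ.
    """
    X = np.asarray(X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity between the query and every existing question vector.
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(query)
    sims = X @ query / np.where(norms == 0, 1.0, norms)
    # Indices of the k most similar existing questions.
    top = np.argsort(-sims)[:k]
    # Mean vector of the neighbourhood.
    centroid = X[top].mean(axis=0)
    # Assumed transformation: convex combination of query and neighbourhood mean.
    return alpha * query + (1 - alpha) * centroid, top

# Toy term-count vectors for three existing questions (made-up data).
existing = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]
transformed, neighbours = neighbourhood_transform(existing, [1, 0, 0], k=2)
```

In this toy example the two nearest neighbours are the first two rows, and the transformed vector is (1, 0.25, 0): the query picks up weight on the second term from its neighbour. The same function applies unchanged whether the rows are term-space vectors or dense document embeddings; a classifier would then be trained on the transformed vectors rather than the original ones.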
