James Mutinda - Authorea

Sentiment analysis of social media posts and texts can provide information and knowledge that is applicable in social settings, business intelligence, evaluation of citizens’ opinions in governance and mood triggered devices in Internet of Things. Feature extraction and selection is a key determinant of accuracy and computational cost of machine learning models for such analysis. Most feature extraction and selection techniques utilize bag of words such as N-grams and frequency-based algorithms especially Term Frequency-Inverse document frequency (TF-IDF). However, these approaches suffer shortcomings such as; they do not consider relationships between words, they ignore words’ characteristics and they suffer high feature dimensionality. In this paper we propose and evaluate an approach that utilizes a fixed hybrid N-gram window for feature extraction and Minimum Redundancy Maximum Relevance feature selection for sentence level sentiment analysis. The approach improves the existing feature extraction techniques specifically the N-gram by generating a tri-gram vector from words, Part of speech tags and word semantic orientation. The N-gram vector is extracted by employing a static 3-gram window identified by a lexicon where a sentiment word appears in a sentence. A blend of the words, POS tags and the sentiment orientations of the 3N-gram are used to build the feature vector. The optimal features from the vector are then selected using Minimum Redundancy Maximum Relevance (MR2) algorithm. Experiments were carried out with a publicly available yelp tweets dataset to evaluate the performance of four supervised machine learning classifiers (Naïve Bayes, K-Nearest Neighbor, Decision Tree and Support Vector Machines) when augmented with the proposed model. The results showed that the proposed model had the highest accuracy (86.85%), recall (86.85%) and precision (86.96%).