Introduction to datasets
Five common public datasets are used in the experiments (Table 1). 20NG is a corpus of 20 categories, with 11,314 documents in the training set and 7,532 in the test set. R52 and R8 are two subsets of the Reuters dataset: R8 has 8 categories, split into 5,485 training documents and 2,189 test documents, and R52 has 52 categories, split into 6,532 training documents and 2,568 test documents. Ohsumed is a database from the medical sciences that contains 23 categories; 7,400 of its documents are selected here, 3,357 for the training set and 4,043 for the test set. MR is a movie-review dataset for binary sentiment classification in which each review is a single sentence; the corpus contains 5,331 positive and 5,331 negative reviews.
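As a quick consistency check on the figures above, the split sizes can be recorded and summed in a few lines of Python. This is only an illustrative sketch: the names `SPLITS`, `total_documents`, and `held_out_fraction` are hypothetical helpers, not part of any experiment code, and MR is recorded by class counts only because its train/test split is not stated in the text.

```python
# Split sizes copied from the dataset description; nothing here is measured.
SPLITS = {
    "20NG": {"classes": 20, "train": 11314, "test": 7532},
    "R8": {"classes": 8, "train": 5485, "test": 2189},
    "R52": {"classes": 52, "train": 6532, "test": 2568},
    "Ohsumed": {"classes": 23, "train": 3357, "test": 4043},
}

# MR: only the per-class counts are given, so no split is assumed.
MR_POSITIVE, MR_NEGATIVE = 5331, 5331


def total_documents(name):
    """Return train + test size for a dataset listed in SPLITS."""
    split = SPLITS[name]
    return split["train"] + split["test"]


def held_out_fraction(name):
    """Return the share of documents that fall in the test set."""
    return SPLITS[name]["test"] / total_documents(name)


if __name__ == "__main__":
    for name in SPLITS:
        print(f"{name}: total={total_documents(name)}, "
              f"test share={held_out_fraction(name):.3f}")
    print(f"MR: total={MR_POSITIVE + MR_NEGATIVE}")
```

The Ohsumed total computed this way matches the 7,400 selected documents stated in the text, and the sketch also makes visible that Ohsumed is the only one of these four datasets whose test set is larger than its training set.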
Table 1. Dataset information