Introduction to datasets
Experiments use five common public datasets (Table 1). 20NG is a corpus
of 20 categories, with 11,314 documents in the training set and 7,532
documents in the test set. R8 and R52 are two subsets of the Reuters
dataset: R8 has 8 categories and is split into 5,485 training documents
and 2,189 test documents, while R52 has 52 categories and is split into
6,532 training documents and 2,568 test documents. Ohsumed is a database
from the medical sciences that contains 23 categories; 7,400 documents
are selected from it, of which 3,357 are in the training set and 4,043
are in the test set. MR is a dataset of movie reviews for binary
sentiment classification, where each review contains only one sentence.
The corpus contains 5,331 positive and 5,331 negative reviews.
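
As a quick sanity check on the figures quoted above, the following minimal
sketch tabulates the stated split sizes and derives corpus totals and
training fractions. The names SPLITS, summarize, and MR_TOTAL are
illustrative and are not part of the original work. Only the counts given
in the text are used; no train/test split is stated for MR, so only its
total is computed.

```python
# Split sizes exactly as stated in the text (names are hypothetical).
SPLITS = {
    "20NG": {"classes": 20, "train": 11314, "test": 7532},
    "R8": {"classes": 8, "train": 5485, "test": 2189},
    "R52": {"classes": 52, "train": 6532, "test": 2568},
    "Ohsumed": {"classes": 23, "train": 3357, "test": 4043},
}


def summarize(splits):
    """Return total size and training fraction for each dataset."""
    rows = {}
    for name, s in splits.items():
        total = s["train"] + s["test"]
        rows[name] = {
            "total": total,
            "train_fraction": s["train"] / total,
        }
    return rows


SUMMARY = summarize(SPLITS)

# MR: the text gives only class counts, not a train/test split.
MR_TOTAL = 5331 + 5331

if __name__ == "__main__":
    for name, row in SUMMARY.items():
        print(f"{name}: total={row['total']}, "
              f"train fraction={row['train_fraction']:.3f}")
    print(f"MR: total={MR_TOTAL}")
```

The derived Ohsumed total matches the 7,400 selected documents stated in
the text, and Ohsumed is the only one of the four split datasets whose
test set is larger than its training set.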
Table 1. Dataset information