Dexter

OpenML dataset with id 4136

No author found.

Full work available at URL: https://api.openml.org/data/v1/download/1681111/Dexter.sparse_arff

Upload date: 8 November 2015

Dataset Characteristics

Number of classes: 2
Number of features: 20,001 (numeric: 20,000; symbolic: 1; binary in total: 1)
Number of instances: 600
Number of instances with missing values: 0
Number of missing values: 0

Description

DEXTER is a text classification problem in a bag-of-words representation. It is a two-class classification problem with sparse continuous input variables. The dataset is one of the five datasets of the NIPS 2003 feature selection challenge.
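The download URL above points to a sparse ARFF file, so each example lists only its non-zero attributes. The following is a minimal sketch of reading one such data row, assuming the usual sparse ARFF layout of "{index value, index value, ...}" with zero-based indices and unquoted numeric values. The function name parse_sparse_row is hypothetical, and the sample row is made up for illustration rather than taken from the file.

```python
def parse_sparse_row(line):
    """Parse one sparse-ARFF data row into {attribute_index: value}.

    Assumes the row looks like "{3 12, 17 4}": a zero-based attribute
    index, whitespace, then an unquoted numeric value. Attributes that
    are not listed are implicitly zero and are left out of the result.
    """
    body = line.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("not a sparse ARFF data row: %r" % line)
    entries = {}
    inner = body[1:-1].strip()
    if inner:
        for pair in inner.split(","):
            index, value = pair.split()
            entries[int(index)] = float(value)
    return entries


# Made-up row: two word-stem counts, everything else zero.
example = parse_sparse_row("{1 4, 3 2.5}")
print(example)  # {1: 4.0, 3: 2.5}
```

Keeping the row as a dictionary rather than expanding it to 20,001 entries preserves the sparsity that the description mentions.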

Source:

a. Original owners: The original data set we used is a subset of the well-known Reuters text categorization benchmark. The data was originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system. It is hosted by the UCI KDD repository: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html. David D. Lewis is hosting valuable resources about this data (see http://www.daviddlewis.com/resources/testcollections/reuters21578/). We used the “corporate acquisition” text classification class pre-processed by Thorsten Joachims <thorsten '@' joachims.org>. The data is one of the examples of the software package SVM-Light; see http://svmlight.joachims.org/. The example can be downloaded from ftp://ftp-ai.cs.uni-dortmund.de/pub/Users/thorsten/svm_light/examples/example1.tar.gz.

b. Donor of database: This version of the database was prepared for the NIPS 2003 variable and feature selection benchmark by Isabelle Guyon, 955 Creston Road, Berkeley, CA 94708, USA (isabelle '@' clopinet.com).

Data Set Information:

The original data were formatted by Thorsten Joachims in the “bag-of-words” representation. There were 9947 features (of which 2562 are zero for all the examples) representing frequencies of occurrence of word stems in text. The task is to learn which Reuters articles are about 'corporate acquisitions'. We added a number of distractor features called 'probes' having no predictive power. The order of the features and the order of the patterns were randomized.
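The page does not say how the probes were generated, so the following is only an illustrative sketch of the general idea, under a loudly stated assumption: each probe is a copy of a randomly chosen real feature whose values are permuted across examples, which breaks any link to the label. The function name add_probes_and_shuffle and the toy data are hypothetical. The sketch then randomizes the order of the features and of the patterns, as the description states.

```python
import random


def add_probes_and_shuffle(rows, labels, n_probes, seed=0):
    """Append distractor columns, then shuffle features and patterns.

    ASSUMPTION (not from the dataset page): a probe is a real column
    with its values permuted across examples, so it carries no
    information about the label.
    Returns (rows, labels, is_probe), where is_probe[j] tells whether
    output column j is a probe.
    """
    rng = random.Random(seed)
    n_real = len(rows[0])

    # Build the probe columns from permuted copies of real columns.
    probe_columns = []
    for _ in range(n_probes):
        source = rng.randrange(n_real)
        column = [row[source] for row in rows]
        rng.shuffle(column)
        probe_columns.append(column)

    augmented = [
        list(row) + [column[i] for column in probe_columns]
        for i, row in enumerate(rows)
    ]

    # Randomize the feature order, remembering which columns are probes.
    feature_order = list(range(n_real + n_probes))
    rng.shuffle(feature_order)
    is_probe = [j >= n_real for j in feature_order]

    # Randomize the pattern order, keeping each label with its row.
    pattern_order = list(range(len(rows)))
    rng.shuffle(pattern_order)
    out_rows = [[augmented[i][j] for j in feature_order] for i in pattern_order]
    out_labels = [labels[i] for i in pattern_order]
    return out_rows, out_labels, is_probe


# Toy data: three examples, two real features, three probes.
toy_rows = [[1, 0], [0, 2], [3, 0]]
toy_labels = [1, -1, 1]
new_rows, new_labels, is_probe = add_probes_and_shuffle(toy_rows, toy_labels, 3)
print(len(new_rows[0]), sum(is_probe))  # 5 3
```

A feature selection method that works well should rank the columns flagged in is_probe below the real features, which is what makes probes useful as a benchmark device.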

DEXTER -- Positive ex. -- Negative ex. -- Total
Training set -- 150 -- 150 -- 300
Validation set -- 150 -- 150 -- 300
Test set -- 1000 -- 1000 -- 2000
All -- 1300 -- 1300 -- 2600

Number of variables/features/attributes: Real: 9947; Probes: 10053; Total: 20000
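The counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers from this page; the variable names are illustrative. Note that training plus validation gives 600 examples, which matches the instance count listed under Dataset Characteristics. Reading that as "the OpenML file holds the training and validation sets only" is an inference, not something the page states.

```python
# Counts copied from the description above.
real_features = 9947
always_zero_features = 2562
probe_features = 10053

# (positive, negative) examples per split.
splits = {
    "training": (150, 150),
    "validation": (150, 150),
    "test": (1000, 1000),
}

total_features = real_features + probe_features
nonzero_real_features = real_features - always_zero_features
split_totals = {name: pos + neg for name, (pos, neg) in splits.items()}
all_examples = sum(split_totals.values())
train_plus_validation = split_totals["training"] + split_totals["validation"]

print(total_features)         # 20000
print(nonzero_real_features)  # 7385
print(all_examples)           # 2600
print(train_plus_validation)  # 600
```

The 20,001 features reported under Dataset Characteristics would then correspond to these 20,000 numeric inputs plus the single symbolic (binary) attribute.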
