Microsoft

OpenML dataset with id 45579

No author found.

Full work available at URL: https://api.openml.org/data/v1/download/22116700/Microsoft.arff

Upload date: 5 July 2023

Dataset Characteristics

Number of classes: 5
Number of features: 137 (numeric: 136, symbolic: 1, binary: 0)
Number of instances: 1,200,192
Number of instances with missing values: 0
Number of missing values: 0

Description

Microsoft Learning to Rank Datasets

Dataset Descriptions

The datasets are machine learning data, in which queries and urls are represented by IDs. The datasets consist of feature vectors extracted from query-url pairs along with relevance judgment labels:

(1) The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing). Each judgment takes one of 5 values, from 0 (irrelevant) to 4 (perfectly relevant).

(2) The features were extracted by the dataset's creators and are those widely used in the research community.

In the data files, each row corresponds to a query-url pair. The first column is the relevance label of the pair, the second column is the query id, and the remaining columns are features. The larger the relevance label, the more relevant the query-url pair. Each query-url pair is represented by a 136-dimensional feature vector.

Below are two rows from MSLR-WEB10K dataset:

==================================

0 qid:1 1:3 2:0 3:2 4:2 ... 135:0 136:0

2 qid:1 1:3 2:3 3:0 4:0 ... 135:0 136:0

==================================
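The row layout described above (label, then `qid:<id>`, then `index:value` pairs) can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_letor_line` is hypothetical, and the sample row below keeps only the feature pairs that are visible in the example, since the elided middle of the row is not reproduced on this page.

```python
from typing import Dict, Tuple


def parse_letor_line(line: str) -> Tuple[int, str, Dict[int, float]]:
    """Parse one row of the form '<label> qid:<id> <index>:<value> ...'.

    Hypothetical helper for illustration; not part of any official tooling.
    """
    tokens = line.split()
    label = int(tokens[0])  # relevance label, 0 (irrelevant) to 4 (perfectly relevant)

    key, _, qid = tokens[1].partition(":")
    if key != "qid":
        raise ValueError(f"expected 'qid:<id>' in second column, got {tokens[1]!r}")

    features: Dict[int, float] = {}
    for token in tokens[2:]:
        index, _, value = token.partition(":")
        features[int(index)] = float(value)  # feature indices are 1-based in the file
    return label, qid, features


# Shortened version of the second example row (elided features omitted).
row = "2 qid:1 1:3 2:3 3:0 4:0 135:0 136:0"
label, qid, features = parse_letor_line(row)
print(label, qid, features[2])  # prints: 2 1 3.0
```

Keeping the query id alongside each row matters for ranking work, because rows are only comparable to other rows that share the same query id.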

Dataset Partition

We have partitioned each dataset into five parts with about the same number of queries, denoted as S1, S2, S3, S4, and S5, for five-fold cross-validation. In each fold, we propose using three parts for training, one part for validation, and the remaining part for testing (see the following table). The training set is used to learn ranking models. The validation set is used to tune the hyperparameters of the learning algorithms, such as the number of iterations in RankBoost and the combination coefficient in the objective function of Ranking SVM. The test set is used to evaluate the performance of the learned ranking models.

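| Folds | Training Set | Validation Set | Test Set |
|-------|--------------|----------------|----------|
| Fold1 | {S1,S2,S3}   | S4             | S5       |
| Fold2 | {S2,S3,S4}   | S5             | S1       |
| Fold3 | {S3,S4,S5}   | S1             | S2       |
| Fold4 | {S4,S5,S1}   | S2             | S3       |
| Fold5 | {S5,S1,S2}   | S3             | S4       |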
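The fold table follows a simple rotation: fold k starts at part Sk, takes three consecutive parts for training, the next for validation, and the last for test, wrapping around after S5. A minimal sketch of that rotation is shown here; the function name `fold_assignment` and the dictionary keys are assumptions made for illustration only.

```python
from typing import Dict, List

PARTS = ["S1", "S2", "S3", "S4", "S5"]


def fold_assignment(fold: int) -> Dict[str, List[str]]:
    """Return the train/validation/test parts for a 1-based fold number."""
    if not 1 <= fold <= len(PARTS):
        raise ValueError(f"fold must be between 1 and {len(PARTS)}, got {fold}")
    start = fold - 1
    rotated = PARTS[start:] + PARTS[:start]  # rotate so that S<fold> comes first
    return {
        "train": rotated[:3],        # three consecutive parts
        "validation": rotated[3:4],  # the next part
        "test": rotated[4:],         # the remaining part
    }


for fold in range(1, 6):
    print(f"Fold{fold}", fold_assignment(fold))
```

Because the split is defined over parts that hold whole queries, all rows for a given query stay on the same side of the train/validation/test boundary.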

Reference

You can cite this dataset as below.

```
@article{DBLP:journals/corr/QinL13,
  author    = {Tao Qin and
               Tie{-}Yan Liu},
  title     = {Introducing {LETOR} 4.0 Datasets},
  journal   = {CoRR},
  volume    = {abs/1306.2597},
  year      = {2013},
  url       = {http://arxiv.org/abs/1306.2597},
  timestamp = {Mon, 01 Jul 2013 20:31:25 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/corr/QinL13},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}
```

Note:

This is a learning-to-rank dataset and should not be used for standard classification tasks. It is coded this way only to enable reproducing the work "Tabular data: Deep learning is not all you need" by Ravid Shwartz-Ziv and Amitai Armon.
This dataset concatenates the training, validation, and test sets from Fold1.
This is the 10K version (MSLR-WEB10K).
The uploader shortened the word "variance" in the feature names to "var" to comply with OpenML's maximum feature name length.
