Hierarchical Text Classification corpora - MaRDI portal


Hierarchical Text Classification corpora

From MaRDI portal



DOI: 10.5281/zenodo.7319519
Zenodo: 7319519
MaRDI QID: Q6718360

Dataset published at Zenodo repository.

Author name not available

Publication date: 14 December 2022

Copyright license: No records found.



A set of 3 datasets for Hierarchical Text Classification (HTC), with samples divided into training and testing splits. The label hierarchies in all datasets have depth 2.

The Amazon5x5 dataset contains 500,000 user reviews tagged with the reviewed product's categories. There are 5 product categories with 100,000 examples each, and each category has 5 sub-categories.

The Bugs dataset contains 30,050 bugs of the Linux kernel, each labeled with exactly two categories identifying the affected component.

The Web Of Science dataset contains 46,960 abstracts of scientific papers, each labeled with the article's domain (see original repo for more details).

Datasets are published in JSONL format, where each line is a JSON-formatted string, as in the example below.

{ "text": article text, "labels": [label1, label2, ...] }

The hierarchical structure of labels in each dataset is documented in this repository.

These datasets have been presented in this paper:
"Hierarchical Text Classification and its Foundations: a Review of Current Research" - DOI: 10.3390/electronics13071199

Some of these datasets have also been used in:
"Ticket Automation: an Insight into Current Research with Applications to Multi-level Classification Scenarios" - DOI: 10.1016/j.eswa.2023.119984
"A multi-level approach for hierarchical Ticket Classification", accepted at WNUT 2022

These datasets are partially derived from previous work, namely:
[Amazon] J. Ni, J. Li, J. McAuley, "Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects", EMNLP 2019, doi: 10.18653/v1/D19-1018
[WOS] K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber and L. E. Barnes, "HDLTex: Hierarchical Deep Learning for Text Classification," 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 364-371, doi: 10.1109/ICMLA.2017.0-134
[Linux Bugs] V. Lyubinets, T. Boiko and D. Nicholas, "Automated Labeling of Bugs and Tickets Using Attention-Based Mechanisms in Recurrent Neural Networks," 2018 IEEE Second International Conference on Data Stream Mining & Processing (DSMP), 2018, pp. 271-275, doi: 10.1109/DSMP.2018.8478511
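The JSONL layout described in the dataset summary (one JSON object per line, with a "text" field and a "labels" list) can be read line by line. The following is a minimal sketch in Python, not code shipped with the datasets: the helper names `read_jsonl` and `label_counts` are hypothetical, and the two sample records, including their label names, are invented for illustration and do not come from the corpora.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line; each line is a standalone JSON object."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def label_counts(records: Iterable[dict]) -> Counter:
    """Count how often each label occurs across all records."""
    counts = Counter()
    for record in records:
        counts.update(record["labels"])
    return counts


# Two made-up records in the documented {"text", "labels"} shape.
# The texts and label names are illustrative assumptions, not real data.
sample = [
    '{"text": "Battery died after a week.", "labels": ["Electronics", "Headphones"]}',
    '{"text": "Great fit, very warm.", "labels": ["Clothing", "Jackets"]}',
]

records = list(read_jsonl(sample))
counts = label_counts(records)
```

Because an open text file is itself an iterable of lines, the same helper should work on a downloaded split, for example `with open(path, encoding="utf-8") as f: records = list(read_jsonl(f))`, where `path` is whatever file name the Zenodo archive actually uses.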





