1-million-Reddit-comments-from-40-subreddits - MaRDI portal
Dataset:6036602



OpenML ID: 43504, MaRDI QID: Q6036602

OpenML dataset with id 43504

Author name not available

Full work available at URL: https://api.openml.org/data/v1/download/22102329/1-million-Reddit-comments-from-40-subreddits.arff

Upload date: 23 March 2022



Dataset Characteristics

Number of features: 4 (numeric: 2, symbolic: 0, binary: 0)
Number of instances: 1,000,000
Number of instances with missing values: 1
Number of missing values: 1

Content

This data is an extract from a bigger Reddit dataset (all Reddit comments from May 2019, 157 GB of data uncompressed) that contains both more comments and more associated information (timestamps, author, flairs, etc.). For ease of use, I picked the first 25,000 comments for each of the 40 most frequented subreddits (May 2019); this way, if anyone wants to use the subreddit as categorical data, the volumes are balanced. I also excluded removed comments, comments whose author was deleted, and comments deemed too short (fewer than 4 tokens), and changed the format (JSON to CSV). This is primarily an NLP dataset, but in addition to the comments I added the 3 features I deemed the most important; I also aimed for feature type variety. The information kept here is:

subreddit (categorical): on which subreddit the comment was posted
body (str): comment content
controversiality (binary): a Reddit aggregated metric
score (scalar): upvotes minus downvotes
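The extraction steps described under Content (keep the first comments per subreddit, drop removed comments, deleted authors and short comments, then convert JSON to CSV) could be sketched roughly as follows. This is not the author's actual script. The input is assumed to be one JSON object per line, the field name `author`, the markers `[removed]` and `[deleted]`, and whitespace tokenization are all assumptions, and the set of subreddits is passed in by hand because the source does not say how the 40 most frequented ones were determined.

```python
import csv
import io
import json
from collections import Counter

# The four columns kept in the published dataset.
FIELDS = ["subreddit", "body", "controversiality", "score"]


def extract_comments(lines, subreddits, per_subreddit=25_000, min_tokens=4):
    """Yield the first `per_subreddit` qualifying comments of each subreddit."""
    kept = Counter()
    for line in lines:
        record = json.loads(line)
        sub = record.get("subreddit")
        if sub not in subreddits or kept[sub] >= per_subreddit:
            continue
        body = record.get("body", "")
        # Assumed markers for removed comments and deleted authors.
        if body in ("[removed]", "[deleted]") or record.get("author") == "[deleted]":
            continue
        # Assumed tokenization: split on whitespace.
        if len(body.split()) < min_tokens:
            continue
        kept[sub] += 1
        yield {name: record.get(name) for name in FIELDS}


def write_csv(rows, out):
    """Write the extracted rows as CSV with a header line."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def _comment(sub, author, body, controversiality=0, score=1):
    # Hypothetical helper that builds one made-up input line for the demo.
    return json.dumps({"subreddit": sub, "author": author, "body": body,
                       "controversiality": controversiality, "score": score})


demo_lines = [
    _comment("aww", "u1", "what a very cute dog", score=12),
    _comment("aww", "u2", "so cute"),                            # too short
    _comment("aww", "[deleted]", "this author has been deleted"),
    _comment("news", "u3", "[removed]"),
    _comment("news", "u4", "this is not surprising at all", 1, -3),
    _comment("other", "u5", "not one of the selected subreddits"),
    _comment("aww", "u6", "look at those tiny little paws"),     # over the cap
]

demo_rows = list(extract_comments(demo_lines, {"aww", "news"}, per_subreddit=1))
demo_out = io.StringIO()
write_csv(demo_rows, demo_out)
print(demo_out.getvalue())
```

With `per_subreddit=25_000` and a set of 40 subreddits, the same loop would produce at most 1,000,000 rows, which matches the instance count reported above.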

Acknowledgements

The data is but a small extract of what is being collected by pushshift.io on a monthly basis. You can easily find the full information if you want to work with more features and more data.

What can I do with that?

Have fun! The variety of feature types should allow you to gain a few interesting insights or build some simple models.

Note

If you think the license (CC0: Public Domain) should be different, contact me.
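As a possible first step for that kind of exploration, the sketch below reads rows with the four features, converts the two numeric ones, and counts comments per subreddit to check the balance. It assumes a CSV export whose header uses the feature names listed above, and it maps empty cells or `?` (the ARFF marker for a missing value) to `None`, since the page reports one missing value without saying in which column. The sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def to_int(value):
    """Convert a numeric cell to int; empty or '?' cells become None."""
    if value in (None, "", "?"):
        return None
    return int(float(value))


def parse_row(row):
    """Give each of the four features a Python type."""
    return {
        "subreddit": row["subreddit"],
        "body": row["body"],
        "controversiality": to_int(row["controversiality"]),
        "score": to_int(row["score"]),
    }


def subreddit_counts(rows):
    """Count comments per subreddit (25,000 each is expected in the full data)."""
    return Counter(r["subreddit"] for r in rows)


# Made-up sample standing in for the real file.
sample_csv = io.StringIO(
    "subreddit,body,controversiality,score\n"
    'aww,"what a very cute dog",0,12\n'
    'news,"this is not surprising, sadly",1,-3\n'
    "aww,look at those tiny paws,0,\n"
)

typed_rows = [parse_row(r) for r in csv.DictReader(sample_csv)]
print(subreddit_counts(typed_rows))
```

To run this on the real data, `sample_csv` would be replaced by an open file handle on a CSV export of the dataset.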





