Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" - MaRDI portal

Dataset of the paper: "How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study" (Q6701237)

From MaRDI portal





Dataset published at Zenodo repository.

    Statements

This replication package contains datasets and scripts related to the paper "*How do Hugging Face Models Document Datasets, Bias, and Licenses? An Empirical Study*".

## Root directory

- `statistics.r`: R script used to compute the correlation between usage and downloads, and the RQ1/RQ2 inter-rater agreements
- `modelsInfo.zip`: zip file containing all the downloaded model cards (in JSON format)
- `script`: directory containing all the scripts used to collect and process data. For further details, see the README file inside the script directory.

## Dataset

- `Dataset/Dataset_HF-models-list.csv`: list of HF models analyzed
- `Dataset/Dataset_github-prj-list.txt`: list of GitHub projects using the *transformers* library
- `Dataset/Dataset_github-Prj_model-Used.csv`: usage pairs (project, model)
- `Dataset/Dataset_prj-num-models-reused.csv`: number of models used by each GitHub project
- `Dataset/Dataset_model-download_num-prj_correlation.csv`: for each model used by GitHub projects, the name, the task, the number of reusing projects, and the number of downloads

## RQ1

- `RQ1/RQ1_dataset-list.txt`: list of HF datasets
- `RQ1/RQ1_datasetSample.csv`: sample set of models used for the manual analysis of datasets
- `RQ1/RQ1_analyzeDatasetTags.py`: Python script that analyzes model tags for the presence of datasets. It requires unzipping `modelsInfo.zip` into a directory with the same name (`modelsInfo`) at the root of the replication package folder. It writes its output to stdout; redirect that output to a file so that it can be analyzed by the `RQ1/RQ1_countDataset.py` script.
- `RQ1/RQ1_countDataset.py`: given the output of `RQ1/RQ1_analyzeDatasetTags.py` (passed as an argument), produces, for each model, a list of Booleans indicating whether (i) the model only declares HF datasets, (ii) the model only declares external datasets, (iii) the model declares both, and (iv) the model is part of the sample for the manual analysis
- `RQ1/RQ1_datasetTags.csv`: output of `RQ1/RQ1_analyzeDatasetTags.py`
- `RQ1/RQ1_dataset_usage_count.csv`: output of `RQ1/RQ1_countDataset.py`

## RQ2

- `RQ2/tableBias.pdf`: table detailing the number of occurrences of different types of bias by model task
- `RQ2/RQ2_bias_classification_sheet.csv`: results of the manual labeling
- `RQ2/RQ2_isBiased.csv`: file used to compute the inter-rater agreement on whether or not a model documents bias
- `RQ2/RQ2_biasAgrLabels.csv`: file used to compute the inter-rater agreement on bias categories
- `RQ2/RQ2_final_bias_categories_with_levels.csv`: for each model in the sample, this file lists (i) the bias leaf category, (ii) the first-level category, and (iii) the intermediate category

## RQ3

- `RQ3/RQ3_LicenseValidation.csv`: manual validation of a sample of licenses
- `RQ3/RQ3_{NETWORK-RESTRICTIVE|RESTRICTIVE|WEAK-RESTRICTIVE|PERMISSIVE}-license-list.txt`: lists of licenses with different levels of permissiveness
- `RQ3/RQ3_prjs_license.csv`: for each project linked to models, indicates, among other fields, the license tag and name
- `RQ3/RQ3_models_license.csv`: for each model, indicates, among other pieces of information, whether the model has a license and, if so, what kind of license
- `RQ3/RQ3_model-prj-license_contingency_table.csv`: usage contingency table between projects' licenses (columns) and models' licenses (rows)
- `RQ3/RQ3_models_prjs_licenses_with_type.csv`: project-model pairs, with their respective licenses and permissiveness levels

## scripts

Contains the scripts used to mine Hugging Face and GitHub. Details are in the enclosed README.
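
The four per-model Booleans that `RQ1/RQ1_countDataset.py` is described as producing (items (i) to (iv) in the RQ1 list) can be illustrated with a minimal Python sketch. This is not the script shipped in the package: the function name `classify_model`, the `DatasetFlags` type, the input shapes, and the example identifiers are all hypothetical, chosen only to show the logic under the assumption that a dataset counts as "external" when it is not in the list of HF datasets.

```python
from typing import Iterable, NamedTuple, Set


class DatasetFlags(NamedTuple):
    """Hypothetical container for the four Booleans described under RQ1."""
    only_hf: bool        # (i) the model only declares HF datasets
    only_external: bool  # (ii) the model only declares external datasets
    both: bool           # (iii) the model declares both kinds
    in_sample: bool      # (iv) the model is in the manual-analysis sample


def classify_model(
    model_id: str,
    declared: Iterable[str],
    hf_datasets: Set[str],
    sample_models: Set[str],
) -> DatasetFlags:
    """Derive the four flags for one model.

    Assumption: a declared dataset is "external" if it does not appear
    in the list of HF datasets (cf. RQ1/RQ1_dataset-list.txt).
    """
    declared_set = set(declared)
    on_hf = declared_set & hf_datasets
    external = declared_set - hf_datasets
    return DatasetFlags(
        only_hf=bool(on_hf) and not external,
        only_external=bool(external) and not on_hf,
        both=bool(on_hf) and bool(external),
        in_sample=model_id in sample_models,
    )


# Made-up inputs, for illustration only.
hf_datasets = {"squad", "glue"}
sample_models = {"org/model-a"}
flags = classify_model(
    "org/model-a", ["squad", "my-private-corpus"], hf_datasets, sample_models
)
```

With these made-up inputs, the model declares one HF dataset and one external dataset, so only `both` and `in_sample` are true. A model that declares no datasets at all gets `False` for the first three flags.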
    31 October 2023