Within-project-Defect-Prediction-for-Ansible - MaRDI portal

Dataset:6036460



OpenML ID: 43357
MaRDI QID: Q6036460

OpenML dataset with id 43357

No author found.

Full work available at URL: https://api.openml.org/data/v1/download/22102182/Within-project-Defect-Prediction-for-Ansible.arff

Upload date: 23 March 2022
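
The file linked above is in ARFF format. As a rough illustration of how such a file could be read without third-party libraries, the sketch below parses a dense ARFF file into attribute names and rows, turning `?` (the ARFF missing-value marker) into `None`. It is a minimal reader under stated assumptions: it ignores attribute types, sparse rows, and escaped quotes, and the function name `read_simple_arff` and the attribute names in the sample are invented for illustration, not taken from the dataset.

```python
import csv
from typing import Iterable, List, Optional, Tuple


def read_simple_arff(
    lines: Iterable[str],
) -> Tuple[List[str], List[List[Optional[str]]]]:
    """Parse a dense ARFF file into (attribute_names, rows).

    Cells equal to '?' are returned as None. Values are kept as strings;
    attribute types are not interpreted (assumption: caller converts them).
    """
    names: List[str] = []
    data_lines: List[str] = []
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            data_lines.append(line)
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            rest = line[len("@attribute"):].strip()
            if rest[0] in "'\"":  # quoted attribute name
                quote = rest[0]
                name = rest[1:rest.index(quote, 1)]
            else:
                name = rest.split(None, 1)[0]
            names.append(name)
        elif lower.startswith("@data"):
            in_data = True
    rows: List[List[Optional[str]]] = []
    for record in csv.reader(data_lines, quotechar="'", skipinitialspace=True):
        rows.append([None if c.strip() == "?" else c.strip() for c in record])
    return names, rows


# Tiny in-memory sample; the attribute names are hypothetical.
sample = [
    "% toy example",
    "@relation demo",
    "@attribute lines_code numeric",
    "@attribute 'num tasks' numeric",
    "@data",
    "10,2",
    "?,3",
]
names, rows = read_simple_arff(sample)
```

For the real file, the same function could be applied to an open file handle, for example `read_simple_arff(open(path, encoding="utf-8"))`, where `path` is wherever the download was saved.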



Dataset Characteristics

Number of classes: 0
Number of features: 113 (numeric: 109, symbolic: 0, binary: 0)
Number of instances: 227,272
Number of instances with missing values: 13,078
Number of missing values: 757,758
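
Taken together, the figures above mean that roughly 5.8% of instances (13,078 of 227,272) have at least one missing value, with about 58 missing values per affected instance on average (757,758 / 13,078). The following sketch shows one way such counts could be recomputed from rows in which missing cells are represented as `None`; the function name `missing_value_summary` and the toy rows are assumptions for illustration only.

```python
from typing import Iterable, Optional, Sequence, Tuple


def missing_value_summary(
    rows: Iterable[Sequence[Optional[object]]],
) -> Tuple[int, int, int]:
    """Return (instances, instances_with_missing, missing_values).

    A cell counts as missing when it is None (assumed encoding).
    """
    n_instances = 0
    n_with_missing = 0
    n_missing = 0
    for row in rows:
        n_instances += 1
        missing_here = sum(1 for cell in row if cell is None)
        if missing_here:
            n_with_missing += 1
        n_missing += missing_here
    return n_instances, n_with_missing, n_missing


# Toy rows, not taken from the dataset.
toy_rows = [
    [1.0, 2.0, 3.0],
    [None, 2.0, None],
    [1.0, None, 3.0],
]
summary = missing_value_summary(toy_rows)
```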

Context

Infrastructure-as-code (IaC) is the DevOps strategy of managing and provisioning infrastructure through machine-readable definition files and the automation around them, rather than through physical hardware configuration or interactive configuration tools. Although IaC is an increasingly widely adopted practice, little is known about how best to maintain, speedily evolve, and continuously improve the code behind it in a measurable fashion. At the same time, source code measurements are often computed and analyzed to evaluate different quality aspects of the software being developed. Infrastructure-as-code is simply "code" and, as such, is as prone to defects as code written in any other programming language. This dataset targets the YAML-based Ansible language in order to devise machine-learning-based defect prediction approaches for IaC.

Content

The dataset contains metrics extracted from 86 open-source GitHub repositories based on the Ansible language that satisfied the following criteria:

The repository has at least one push event to its master branch in the last six months;
The repository has at least 2 releases;
At least 11 of the files in the repository are IaC scripts;
The repository has at least 2 core contributors;
The repository has evidence of continuous integration practice, such as the presence of a .travis.yaml file;
The repository has a comments ratio of at least 0.2;
The repository has a commit frequency of at least 2 per month on average;
The repository has an issue frequency of at least 0.023 events per month on average;
The repository has evidence of a license, such as the presence of a LICENSE.md file;
The repository has at least 190 source lines of code.
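
The selection criteria above amount to a conjunction of threshold checks. The sketch below expresses them as a single predicate over per-repository statistics. The `RepoStats` class, its field names, and the example values are hypothetical; only the thresholds come from the list above, and the IaC-file criterion is encoded as a file count exactly as worded there. How each statistic is measured (for example, who counts as a core contributor) is not specified here and is left to the caller.

```python
from dataclasses import dataclass, replace


@dataclass
class RepoStats:
    """Hypothetical container for the statistics the criteria refer to."""
    pushed_to_master_last_6_months: bool
    releases: int
    iac_files: int
    core_contributors: int
    has_ci_evidence: bool        # e.g. a .travis.yaml file is present
    comments_ratio: float
    commits_per_month: float
    issue_events_per_month: float
    has_license_evidence: bool   # e.g. a LICENSE.md file is present
    source_lines_of_code: int


def meets_selection_criteria(r: RepoStats) -> bool:
    """True only if the repository satisfies all ten listed criteria."""
    return (
        r.pushed_to_master_last_6_months
        and r.releases >= 2
        and r.iac_files >= 11
        and r.core_contributors >= 2
        and r.has_ci_evidence
        and r.comments_ratio >= 0.2
        and r.commits_per_month >= 2
        and r.issue_events_per_month >= 0.023
        and r.has_license_evidence
        and r.source_lines_of_code >= 190
    )


# Made-up repository that passes every check.
example = RepoStats(
    pushed_to_master_last_6_months=True,
    releases=5,
    iac_files=40,
    core_contributors=3,
    has_ci_evidence=True,
    comments_ratio=0.3,
    commits_per_month=12.0,
    issue_events_per_month=1.5,
    has_license_evidence=True,
    source_lines_of_code=2500,
)
print(meets_selection_criteria(example))
# Failing a single criterion is enough to exclude the repository.
print(meets_selection_criteria(replace(example, releases=1)))
```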

Metrics are grouped into three categories:

IaC-oriented: metrics of structural properties derived from the source code of infrastructure scripts;
Delta: metrics that capture the amount of change in a file between two successive releases, collected for each IaC-oriented metric;
Process: metrics that capture aspects of the development process rather than aspects of the code itself. Description of the process metrics in this dataset can be found here.
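
To make the Delta category concrete, the sketch below derives one delta value per IaC-oriented metric from the measurements of the same file in two successive releases. It assumes the change is a plain difference (current minus previous) and uses a `delta_` prefix for the derived names; both choices, along with the metric names in the example, are illustrative assumptions rather than the dataset's documented definition.

```python
from typing import Dict, Mapping


def delta_metrics(
    previous: Mapping[str, float],
    current: Mapping[str, float],
) -> Dict[str, float]:
    """Per-metric change of one file between two successive releases.

    Only metrics present in both releases are compared (assumption).
    """
    return {
        f"delta_{name}": current[name] - previous[name]
        for name in current
        if name in previous
    }


# Hypothetical IaC-oriented metrics for one file in two releases.
release_1 = {"lines_code": 120, "num_tasks": 8}
release_2 = {"lines_code": 150, "num_tasks": 7}
deltas = delta_metrics(release_1, release_2)
```

A positive value means the metric grew between the two releases, and a negative value means it shrank.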

Acknowledgements

Thanks to the open-source community.

Inspiration

Which properties of the source code and of the development process are good predictors of defects in Infrastructure-as-Code scripts?