Comprehensive-database-of-Minerals
OpenML dataset with id 43356
No author found.
Full work available at URL: https://api.openml.org/data/v1/download/22102181/Comprehensive-database-of-Minerals.arff
Upload date: 23 March 2022
Dataset Characteristics
Number of features: 140 (numeric: 139, symbolic: 0 and in total binary: 0 )
Number of instances: 3,112
Number of instances with missing values: 0
Number of missing values: 0
This dataset is a collection of 3,112 minerals with their chemical compositions, crystal structures, and physical and optical properties. The properties included in this database are crystal structure, Mohs hardness, refractive index, optical axes, optical dispersion, molar volume, molar mass, specific gravity, and calculated density.
Introduction
The term dielectric is applied to a class of materials - usually solids - that are poor conductors of
electricity. Dielectrics are of significant technological and industrial importance, being essential
functional components of almost all electronic devices. For most of these applications, they are
required to be mechanically tough and thermally robust. The defining physical attribute of a
dielectric is electric polarizability which is the tendency for charges to be non-uniformly
distributed across a chemical bond. Most dielectrics contain dipoles due to their ionic bonds or
covalent bonds with strong ionic nature. At a macroscopic scale, this implies that an external
electric field can interact with these charges and result in various optical and electric phenomena.
Optically, dielectrics can be transparent, opaque, or vitreous. They can also be isotropic,
biaxial, or fully anisotropic. The luster of gem minerals such as emerald, sapphire, and ruby is due to
their high refractive index which causes white light to be split into its components. The presence
of two refractive indices in a material can result in an incident beam being split into two rays
that interfere with each other. This common phenomenon is called birefringence. These effects
are made use of in many commercially important applications such as transparent conductive
oxides, liquid crystal displays, medical diagnostics, stress sensing, light modulation, etc.
As an example, transparent conducting oxides (TCO) are derived from dielectrics by doping oxides
with impurity atoms. TCOs do not absorb light in the visible spectrum rendering them transparent
and are also conductors of charge. The most important application of TCOs is as the top
electrode of solar cells where they allow light to fall on a semiconducting layer while capturing
the released hole/electron to generate current. Airplane windshields have a thin coating of a
TCO material on them that is used to generate heat by passing a current. This is necessary to
keep the glass defrosted, giving the pilot the visibility needed to navigate. TCOs are also used
as substrates in electronics, flexible displays, high definition TVs, and the screens of mobile smart
devices.
The figure of merit for optical phenomena is the refractive index, which is defined as the ratio
of the speed of light in vacuum to the speed of light in the medium.
Provenance of Data
The list of minerals with individual pages in Wikipedia is given at:
https://en.wikipedia.org/wiki/List_of_minerals. The get method of the requests library is used
to retrieve this page and the content is parsed using BeautifulSoup, a Python library
for parsing HTML and XML content. The URLs for all the minerals given on this page are
extracted using their href attribute and are stored in a dictionary, along with the mineral name.
Each of the webpages has textual information on the mineral (origin, etymology, variety, history
etc.), images (cleavages, and other data) as well as an Infobox on the right that tabulates some
common mineral properties such as category, formula, Strunz classification, crystal structure,
unit cell, Mohs hardness, color, cleavage, fracture, luster, diaphaneity, specific gravity, optical
properties, and refractive index. The soup object for the page is retrieved and the table element
with class name infobox is extracted. The specified row heading and row data are then read into
a dictionary which is wrapped in a class object. A class method writes this data into a CSV file while
another method writes the text from the webpage into a text file.
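As a rough illustration of the infobox step described above, the sketch below reads row headings and row data from a table with class infobox into a dictionary. It is not the code used to build the dataset. It uses Python's standard html.parser module in place of BeautifulSoup so that it runs without third-party packages, and it parses a hand-written sample string instead of a page fetched with requests. The names InfoboxParser, parse_infobox, and SAMPLE_HTML are made up for this example.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect {row heading: row data} from tables whose class includes 'infobox'."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._depth = 0      # > 0 while inside an infobox table
        self._cell = None    # 'th' or 'td' while inside a cell
        self._heading = []   # text fragments of the current row heading
        self._data = []      # text fragments of the current row data

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table":
            # Enter an infobox, or track a table nested inside one.
            if self._depth or "infobox" in classes:
                self._depth += 1
        elif self._depth and tag == "tr":
            self._heading, self._data = [], []
        elif self._depth and tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if not self._depth:
            return
        if tag == "table":
            self._depth -= 1
        elif tag in ("th", "td"):
            self._cell = None
        elif tag == "tr":
            # Join fragments and collapse runs of whitespace.
            heading = " ".join("".join(self._heading).split())
            data = " ".join("".join(self._data).split())
            if heading and data:
                self.rows[heading] = data

    def handle_data(self, data):
        if self._cell == "th":
            self._heading.append(data)
        elif self._cell == "td":
            self._data.append(data)


def parse_infobox(html):
    """Return the infobox rows of an HTML string as a dictionary."""
    parser = InfoboxParser()
    parser.feed(html)
    return parser.rows


# Hand-written sample markup, not a copy of any real page.
SAMPLE_HTML = """
<table class="infobox">
  <tr><th>Category</th><td>Silicate mineral</td></tr>
  <tr><th>Formula</th><td>SiO<sub>2</sub></td></tr>
  <tr><th>Mohs scale hardness</th><td>7</td></tr>
</table>
<table><tr><th>Ignored</th><td>x</td></tr></table>
"""

print(parse_infobox(SAMPLE_HTML))
```

Rows are kept only when both a heading and a data cell are present, so image and caption rows are skipped. Rows inside nested tables are not treated specially in this sketch.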
The American Mineralogist Crystal Structure Database at
http://rruff.geo.arizona.edu/AMS/amcsd.php lists over 4000 minerals with their CIF files.
The names and URLs of all these minerals are found at http://rruff.geo.arizona.edu/AMS. From
there, each mineral name and its corresponding URL are extracted using the approach outlined
above. Each page gives the crystallographic information of the mineral. The unit cell edge
lengths (a, b, c) and angles (alpha, beta, gamma) are given at the top, followed by a list of
all atoms and their x, y, z positions. The header is extracted and stored in a pandas dataframe
while the atomic species and their positions are saved into a separate CSV file. This is repeated
for all 4000 minerals. Before inclusion in the machine learning stage of this study, each of these
CIF files is read and parsed into a vector in which each cell corresponds to an element of the periodic table
and the cell value is the number of atoms of that element in the formula. This is detailed further in
the data processing part of the project.
Compared to other properties, dispersion values of minerals have been hard to find. Dispersion values
for 60 minerals were found at: http://gemologyproject.com/wiki.
The chemical formula, molar mass, molar volume, and calculated density are available for all minerals. The availability of other properties varies.
Chemical Formula
The chemical formula has been parsed so that the number of atoms of each element is tabulated separately. For example, the mineral Quartz has the formula 'SiO2', so the corresponding entry for the column 'Silicon' is 1 and the entry for 'Oxygen' is 2. The entries for all the other elements are 0.
In this way, the chemical formula for each mineral is converted into a vector where each column corresponds to an element in the periodic table and the value corresponds to the number of atoms of the element in a formula unit of the mineral.
In addition to the pure elements, ionic species such as carbonate, phosphate, nitrate, cyanide, hydrated water, etc. are also counted separately.
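The conversion of a formula into a vector of element counts can be sketched as follows. This is a minimal illustration, not the parser used for the dataset. It handles only flat formulas such as 'SiO2', with no parentheses, hydration dots, or separately counted ionic groups. The column list uses element symbols and is a small illustrative subset, whereas the dataset columns use element names. The names element_counts, to_vector, and COLUMNS are hypothetical.

```python
import re
from collections import Counter

# An element symbol (capital letter plus optional lowercase letter)
# followed by an optional integer count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Count the atoms of each element in a flat formula such as 'SiO2'."""
    counts = Counter()
    for symbol, number in TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return dict(counts)


def to_vector(formula, columns):
    """Return one count per column, with 0 for elements absent from the formula."""
    counts = element_counts(formula)
    return [counts.get(symbol, 0) for symbol in columns]


# Illustrative subset of columns; the real table has one per element.
COLUMNS = ["H", "C", "O", "Al", "Si", "Ca"]

print(element_counts("SiO2"))       # {'Si': 1, 'O': 2}
print(to_vector("SiO2", COLUMNS))   # [0, 0, 2, 0, 1, 0]
```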
Molar Mass
The molar mass of the mineral is calculated by adding together the atomic masses of all the atoms in one formula unit of the mineral.
Molar mass = Sum over elements (number of atoms of the element * atomic mass of the element)
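As a worked instance of this sum, the sketch below computes the molar mass from a dictionary of element counts. The function name molar_mass and the table ATOMIC_MASS are made up for this example, and the table lists only a few elements with standard atomic masses rounded to three decimals.

```python
# Standard atomic masses in g/mol, rounded; only a few elements are listed.
ATOMIC_MASS = {
    "H": 1.008,
    "C": 12.011,
    "O": 15.999,
    "Si": 28.085,
    "Ca": 40.078,
}


def molar_mass(counts):
    """Sum (number of atoms * atomic mass) over the elements in one formula unit."""
    return sum(n * ATOMIC_MASS[symbol] for symbol, n in counts.items())


quartz = {"Si": 1, "O": 2}             # SiO2
print(round(molar_mass(quartz), 3))    # 60.083
```

For quartz this is 28.085 + 2 * 15.999 = 60.083 g/mol.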
Molar Volume
The molar volume of the mineral is calculated by adding together the volumes of all the atoms in one formula unit of the mineral.
Molar volume = Sum over elements (number of atoms of the element * volume of each atom of the element)
Refractive Index
The refractive index (RI) of the mineral is defined as the ratio of the speed of light in free space to the speed of light in the mineral.
This is a function of the frequency of light: the RI for blue light is not the same as the RI for red light in the same mineral. This variation is measured by the 'dispersion'.
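A small numerical sketch of these two quantities is given below. It assumes that dispersion is expressed as the difference between the RI at a shorter (blue) wavelength and the RI at a longer (red) wavelength. The source does not state which reference wavelengths are used, and the numbers in the example are illustrative rather than measured values for any mineral. The function names are hypothetical.

```python
C_VACUUM = 299_792_458.0  # speed of light in free space, m/s


def refractive_index(speed_in_medium):
    """RI = speed of light in free space / speed of light in the medium."""
    return C_VACUUM / speed_in_medium


def dispersion(n_blue, n_red):
    """Assumed definition: RI at the shorter wavelength minus RI at the longer one."""
    return n_blue - n_red


# Illustrative values only: light slowed to two thirds of its free-space speed.
print(round(refractive_index(C_VACUUM / 1.5), 3))   # 1.5
print(round(dispersion(1.55, 1.54), 3))             # 0.01
```

Because light travels more slowly in the mineral than in free space, the RI defined this way is always greater than 1.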
Mohs Hardness
Mohs hardness is a qualitative measure of the hardness of a mineral that is frequently used by
geologists. Diamond (the hardest mineral) is given the highest value of 10 and talc (the softest
mineral) is given the value of 1. A mineral that can scratch a second mineral has a higher Mohs
hardness. In this way, all the minerals can be ranked on a relative scale of hardness. It is not exactly clear what physical parameter is represented by the Mohs hardness. Several absolute measures such as toughness and yield strength are known from the mechanics of materials, but none of them seems to correspond exactly to Mohs hardness. Even so, it remains a very intuitive way to understand this physical property of a material.