The information geometry of UMAP (Q6619723)
From MaRDI portal
scientific article; zbMATH DE number 7927156
| Language | Label | Description | Also known as |
|---|---|---|---|
| English | The information geometry of UMAP | scientific article; zbMATH DE number 7927156 | |
Statements
The information geometry of UMAP (English)
0 references
16 October 2024
0 references
From the author's abstract: ``In this note we highlight some connections of UMAP to the basic principles of Information Geometry. Originally, UMAP was derived from Category Theory observations. However, we posit that it also has a natural geometric interpretation''.

Information geometry is a well-known interdisciplinary field that applies the techniques of differential geometry to study probability theory and statistics. It started in 1945 with a paper by \textit{C. R. Rao} [Bull. Calcutta Math. Soc. 37, 81--91 (1945; Zbl 0063.06420)]. It studies statistical manifolds, which are Riemannian manifolds whose points correspond to probability distributions. There is even a recent book on the field by \textit{S.-i. Amari} [Information geometry and its applications. Tokyo: Springer (2016; Zbl 1350.94001)]. As for ``uniform manifold approximation and projection'' (UMAP), there is much less literature: only six papers are indexed in zbMATH Open, including the present one, and only one of them has a review. The foundational paper [``UMAP: uniform manifold approximation and projection'', J. Open Source Softw. 3, No. 29, 861 (2018; \url{doi:10.21105/joss.00861})] by \textit{L. McInnes} et al. is not even indexed. Nor was the survey article to which the authors refer for clarification about UMAP found by the reviewer.

From the introduction: ``The general setting of UMAP is that we are given a dataset \(X = \{x_i\}\) that is sampled from a compact oriented Riemannian manifold \((M,g)\) with boundary in \(\mathbb{R}^m\). Let \(d:M\times M\rightarrow \mathbb{R}_{\geq 0}\) be the geodesic path metric on \(M\). UMAP seeks to embed \(X\) into a lower-dimensional space \(\mathbb{R}^n\), with \(n\ll m\), as a set \(Y = \{y_i\}\subset \mathbb{R}^n\) such that the higher-dimensional proximity between points is preserved in \(Y\) and, moreover, visually revealed if \(n = 2\) or \(3\). Once an embedding \(Y \subset \mathbb{R}^n\) is obtained, one may consider \(n\) to be the actual dimension (or close to the actual dimension) of \(X \subset \mathbb{R}^m\). If \(X\) approximates \(M\) in any reasonable sense, we may also think of UMAP embedding \(M\) in \(\mathbb{R}^n\) instead of \(\mathbb{R}^m\), with \(n\ll m\).''

The authors consider the \(kNN\)-graph built on the points of \(X\); for this graph, the conformal rescaling, the high-dimensional probabilities, the low-dimensional probabilities and the cross-entropy minimisation ``provide a very basic description of what UMAP does''. Regarding the implementation of the algorithm, the authors mention a paper by \textit{S. Damrich} and \textit{F. A. Hamprecht} [Adv. Neural Inf. Process. Syst. 34, 5798--5809 (2021)], based on stochastic gradient descent. But the authors write: ``In our study, however, we concentrate only on theoretical aspects and possible generalisations of underlying ideas''.
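The steps just listed can be written down concretely. Below is a minimal Python sketch of the high-dimensional part, following the construction of \textit{L. McInnes} et al. [loc. cit.]: the \(kNN\)-graph, the per-point rescaling by \(\rho_i\) and \(\sigma_i\) (the conformal step discussed in section (1) below), and the fuzzy edge weights \(w_{ij} = \exp(-\max(d_{ij}-\rho_i,0)/\sigma_i)\). It is an illustration only, not the umap-learn reference implementation; all names are illustrative.

```python
import numpy as np

def high_dim_weights(X, k=5):
    """Sketch of UMAP's high-dimensional fuzzy edge weights.
    For each x_i: take its k nearest neighbours, set rho_i to the distance
    of the closest one, and choose sigma_i by binary search so that the
    rescaled weights sum to log2(k).  This per-point choice of sigma_i is
    the conformal rescaling step singled out in section (1) below."""
    n = X.shape[0]
    # pairwise Euclidean distances stand in for the geodesic metric d
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]       # k nearest neighbours of x_i
        rho = D[i, nbrs[0]]                    # distance to the closest one
        lo, hi, target = 1e-10, 1e4, np.log2(k)
        for _ in range(64):                    # binary search for sigma_i
            sigma = 0.5 * (lo + hi)
            s = np.exp(-np.maximum(D[i, nbrs] - rho, 0.0) / sigma).sum()
            lo, hi = (sigma, hi) if s < target else (lo, sigma)
        W[i, nbrs] = np.exp(-np.maximum(D[i, nbrs] - rho, 0.0) / sigma)
    # symmetrise by the fuzzy union w_ij + w_ji - w_ij * w_ji
    return W + W.T - W * W.T

# Toy usage on a random point cloud in R^10.
rng = np.random.default_rng(0)
P = high_dim_weights(rng.normal(size=(30, 10)), k=5)
print(P.shape, float(P.max()))
```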
The second part of the paper, concerning UMAP, has five sections.

(1) Uniformity assumption. To understand the algorithm, it is assumed that ``the data \(X\) is uniformly distributed on a Riemannian manifold \(M\) with metric tensor \(g\) having volume form \(\omega\); then, away from the boundary \(\partial M\), any ball \(B_i\) of fixed volume centred at \(X_i\) should contain the same number of points of \(X\) regardless of where on \(M\) it is centred. Given finite data and small enough local neighbourhoods, this approximation may be accurate enough even for data samples near \(\partial M\). Conversely, the ball \(B_i\) centred at \(X_i\) that contains exactly \(k\) nearest neighbours of \(X_i\) should have approximately the same volume regardless of the choice of \(X_i \in X\). In essence, by creating a custom distance in the neighbourhood of each \(X_i\) we can ensure the validity of the assumption of uniform distribution on the manifold by rescaling each ball \(B_i\), though the challenge remains to patch the different \(B_i\)'s together. If the balls \(B_i\) do not have major overlaps, we may assume that the rescaling is close to conformal, i.e. that it preserves angles, as inside each ball a simple homothety is performed. This is what is done in UMAP while the higher-dimensional edge weights are computed: we would like to single out this step because of its importance in the understanding and functioning of the entire algorithm''.

For this part, the authors refer to the paper by \textit{L. McInnes} et al. [loc. cit.]. In Riemannian geometry there exists a volume-preserving diffeomorphism \(\phi:M\rightarrow M\) such that \(\phi(M)\) admits a volume form with constant density. Replacing \(M\) with \(\phi(M)\), with \(\phi^{*}g\) as metric tensor and \(\phi^{*}\omega\) as volume form, allows one to perform uniform sampling. But the diffeomorphism \(\phi\) does not have to be conformal. Moreover, even if the points of \(X\) approximate \(M\) (e.g., in the Hausdorff distance), it is not certain that their distribution is uniform within each ball \(B_i\) of \(M\), let alone on \(M\) itself. ``The failure of the uniformity assumption will not preclude UMAP from constructing an embedding''.

(2) High-dimensional probabilities. ``The high-dimensional probabilities serve the purpose of recovering the weighted nearest neighbour graph as an approximation of the dataset's local geometry''. This section has two subsections: 2.2.1 Probabilistic \(kNN\)-graphs and 2.2.2 Probability kernels. ``The original `fuzzy' membership function used to define high-dimensional probabilities in UMAP seems to be just one of many possible probability kernels that might be used depending on the circumstances''. In this part the authors use the Iris, MNIST, HDBSCAN and GitHub datasets.

(3) Low-dimensional probabilities. This section concerns the use of Student's \(t\)-distribution for low-dimensional target spaces of ambient dimension \(n\leq 3\).

(4) On the equivalence of cross-entropy and KL-divergence. ``The Kullback--Leibler divergence is widely used in information geometry as a distance function which is, however, not symmetric and thus cannot represent a metric. Another measure of the difference of one distribution from another is their cross-entropy, which is widely used in practical tasks''. (A small numerical illustration of this equivalence is sketched after the review text below.)

(5) Future research: Vietoris--Rips complexes.

In the conclusion the authors write: ``We posit that Information Geometry can be the right framework for understanding UMAP''.
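On sections (3) and (4) above: the low-dimensional similarities in UMAP belong to the Student-\(t\) family, \(q_{ij} = (1 + a\,\|y_i - y_j\|^{2b})^{-1}\), and the objective is the fuzzy cross-entropy, which differs from the edge-wise KL-divergence only by the entropy of the fixed high-dimensional weights. The following Python sketch illustrates these formulas; the names are illustrative and this is not the umap-learn API (there, \(a\) and \(b\) are fitted from the min_dist parameter).

```python
import numpy as np

def low_dim_probs(Y, a=1.0, b=1.0):
    """Low-dimensional similarities q_ij = 1 / (1 + a * ||y_i - y_j||^(2b)).
    With a = b = 1 this is the Cauchy kernel, i.e. Student's t-distribution
    with one degree of freedom (as in t-SNE)."""
    D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances
    return 1.0 / (1.0 + a * D2 ** b)

def fuzzy_cross_entropy(P, Q, eps=1e-12):
    """Edge-wise binary cross-entropy
        C(P, Q) = sum_ij [ -p_ij log q_ij - (1 - p_ij) log(1 - q_ij) ].
    Since C(P, Q) = KL(P || Q) + H(P) edge-wise, and the entropy H(P) of the
    fixed high-dimensional weights does not depend on the embedding Y,
    minimising C over Y is equivalent to minimising the KL-divergence."""
    Q = np.clip(Q, eps, 1.0 - eps)
    return -(P * np.log(Q) + (1.0 - P) * np.log(1.0 - Q)).sum()

# Toy usage: random symmetric edge weights P and a random 2-D embedding Y.
rng = np.random.default_rng(1)
P = np.triu(rng.uniform(size=(30, 30)), 1)
P = P + P.T                                   # symmetric, zero diagonal
Y = rng.normal(size=(30, 2))
Q = low_dim_probs(Y)
np.fill_diagonal(Q, 0.0)                      # drop self-edges
print(fuzzy_cross_entropy(P, Q))
```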
0 references
information geometry
0 references
dimensionality reduction
0 references
clustering
0 references
UMAP
0 references
cross-entropy
0 references
Kullback-Leibler divergence
0 references