Mathematical Background
Topological and geometrical data analysis are fundamentally built on ideas and results of algebraic topology. Along with the persistent homology toolbox come rigorous mathematical theorems from general and algebraic topology, justifying the constructions undertaken. On this page, a brief introduction into their statements is given.
As is motivated below, there are three fundamental results that make the classical theory of persistent homology sensible. Along with giving these we introduce the relevant mathematical objects, mainly following along the routes of Carlsson 2009[1].
Introduction
The TDA procedure, abstractly
Data of interest is always given by means of point clouds, that is, finite subsets [math] X\subset \mathbb{R}^n [/math]. In order to use associate topological notions to [math]X[/math] it is necessary to first translate the data into topological spaces. This step usually requires the notion of a distance, that is, a metric on [math] \mathbb{R}^n [/math] which is tailored to the system from which the data arises. This way, geometry enters the stage: The distance of points in [math]\mathbb{R}^n[/math] is a notion that is usually not inherent to the data. Additionally, on different length scales different information may be encoded, which is why a priori there exists no distinguished value of a distance. Instead, unifying feature that persist over a broad range of length scales may be dominant, while others may be subdominant. The algebraic theory gives relations between different values of the parameter.
To this end, the applications of topological ideas to data analysis can almost always be divided into the following two steps $$ \text{Data} \stackrel{geometry}{\longrightarrow} \text{simplicial complex} \stackrel{algebraic\;topology}{\longrightarrow} \text{invariants inherent to the data set} $$
The need for theorems
Let's assume that our initial point cloud data [math]D=\{p_i \in \mathbb{R}^n\}[/math] arises as a finite subset of a paracompact topological space [math]X[/math] which is embedded in [math]\mathbb{R}^n[/math]. In topological data analysis we want to determine the characteristic properties of the topological space [math]X[/math] from the point cloud [math]D[/math].
For any value of [math]r[/math] the open balls centered at the points are a good open cover [math]\mathcal{U}_r = \{B_r(p_i)\}[/math] of the compact space [math]Y_r=\bigcup B_r(p_i)[/math]. Recall that the Čech complex [math]\check{\mathcal C}(D)[/math] is just the filtered complex of nerves [math]\check{\mathcal C}_r(D) = \{ N(\mathcal U_r), r\in [0,\infty) \}[/math], so by the nerve theorem there exists a homotopy equivalence [math]\check{\mathcal C}_r(D) \simeq Y_r[/math] for each [math]r[/math].
For the moment assume that there exists a value [math]r[/math], such that [math]X\cap\mathcal U_r[/math] is a good open cover of [math]X[/math]. Then at this value of [math]r[/math] the Čech complex is actually homotopy equivalent to [math]X[/math] due to the nerve theorem. This motivates the standard approach to topological data analysis, in which we assume that - generically - we are in a situation with the above assumptions holding.
However, it is not clear at which value of [math]r[/math] this situation occurs, which means that we need to consider the radius-dependent, 1-parameter family of Čech complex of [math]D[/math] as a whole. Then, the topological invariants of this filtered complex will contain characteristic properties of [math]X[/math] - in addition encoded in their radius-dependence. The homology groups of a 1-parameter family of filtered simplicial complexes actually have the structure of a persistence module. The structure theorem now tells us that we have a complete classification of these persistence modules by means of the birth and death values of homology classes, providing us with a useful and efficient structural description of multi-scale topological information.
Unfortunately, the situation described so far is not as we can expect to hold in typical data analysis scenarios. The hypotheses of [math]X[/math] being embedded and there being a radius [math]r[/math] that admits a good open cover of the embedding of [math]X[/math] by balls with this radius are problematic. Furthermore, measurements are always endowed with uncertainties, such that the points of [math]D[/math] do not necessarily lie on the image of [math]X[/math] in [math]\mathbb{R}^n[/math].
This is where the stability theorem enters, which makes precise the following statement: a small variation of the metric results in only a small variation of the associated persistence module. Although this result does not fully invalidate the objections of the previous paragraph, it is crucial for inferring topological information of [math]X[/math] from [math]D[/math].
The nerve theorem
Often it is desirable to construct simplicial complexes which compute the homology of an underlying space [math]X[/math] or are strongly related to it. To rigorously guarantee for this one may look for a homotopy equivalence from [math]X[/math] to the simplicial complex. Numerous simplicial complexes exist that may be associated to [math]X[/math]. A particularly simple and in our setting useful variant will be the nerve of a covering of a topological space.
Given the nerve theorem below, which provides to us such a homotopy equivalence, the need for methods to generate coverings remains. This directly leads to the construction of different complexes such as the Čech complex or the Alpha complex. This way, the nerve theorem is a cornerstone in describing the homotopy type of a topological space via a finite point cloud sampled from it and associated simplicial complexes.
The nerve of a covering
Let [math]X[/math] be a topological space, and let [math]\mathcal U=\{U_i\}[/math] be any covering of [math]X[/math]. The nerve of [math]\mathcal U[/math], denoted by [math]N\mathcal U[/math], is defined as the abstract simplicial complex with vertex set [math]A[/math], and where a family [math]\{i_0,\dots, i_k\}[/math] spans a [math]k[/math]-simplex if and only if [math]U_{i_0}\cap \dots \cap U_{i_k}\neq \emptyset[/math].
The nerve theorem
Let [math]X[/math] be a paracompact space and [math]\mathcal U=\{U_i\}[/math] a good open cover, that is every nonempty intersection of finitely many sets in [math]\mathcal U[/math] is contractible.
Then the nerve [math]N\mathcal U[/math] is homotopy equivalent to [math]X[/math].
The structure theorem
Given a family of simplicial complexes,
Persistence modules
Structure Theorem
Let [math]D[/math] be a finite subset of [math]X[/math]. Then the persistence module of homology classes of the Cech complex [math]\check{\mathcal C}(D)[/math] with coefficients in a field [math]k[/math] is isomorphic to $$ H_k(\mathcal C,k) \simeq \bigoplus_{i=1}^{r} I_{[b_i,d_i]} $$ where [math]I_{[b_i,d_i]}[/math] are the indecomposable persistence modules $$ I_{[b_i,d_i]} = \left\{ \begin{matrix} k & t\in[b_i,d_i] \\ 0 & \text{else } \end{matrix} \right. $$ and [math]\{[b_i,d_i], 1\leq i \leq r\}[/math] is a finite collection of intervals with [math]r\leq|D|[/math].
The stability theorem
Stability Theorem
Consider a point cloud [math]D\subset \mathbb{R}^n[/math]. Let [math]d[/math] and [math]d^\prime[/math] be two metrics on [math]\mathbb{R}^n[/math] and [math] \check{\mathcal C}(D;d), \check{\mathcal C}(D;d^\prime) [/math] the associated Chech complexes. Then their is a [math]c \in \mathbb{R}[/math], such that [citation needed] $$ d_I \left( \check{\mathcal C}(D;d), \check{\mathcal C}(D;d^\prime) \right)\leq c \lVert d - d^\prime \rVert_{D} $$ where [math] d_I (\cdot,\cdot ) [/math] is the interleaving distance of persistence modules and [math] \lVert d - d^\prime\rVert_{D} = \sum_{i,j} \lvert d(p_i,p_j)-d^\prime(p_i,p_j) \rvert [/math] is the variation of distance in [math]D[/math] with respect to [math]d, d^\prime[/math].
References
- ↑ CARLSSON, Gunnar. Topology and data. Bulletin of the American Mathematical Society, 2009, 46. Jg., Nr. 2, S. 255-308. doi:10.1090/S0273-0979-09-01249-X