Functional Diffusion Maps
An extension of the nonlinear dimensionality reduction method Diffusion Maps to functional data.
State of the art: mathematical framework
In this chapter, we first provide an introduction to functional data and its representation. Then, we discuss the state of the art regarding the dimensionality reduction technique functional diffusion maps by first introducing the classical method of diffusion maps and then extending the framework to functional data.
Representing functional data
Let \(\mathcal{X} = \{x_i(t)\}_{i=1}^{N}\) be a set of \(N\) functional observations, where \(x_i(t)\) is the observation of an independent functional variable \(X_i\). Each functional observation is a function of the following form, with \(p, q \ge 1\),
\[\begin{aligned}
\label{sa:fd:functional_observation}
x_i: \mathcal{T} \subset {\mathbb{R}}^p \to {\mathbb{R}}^q.
\end{aligned}\] In theory, each observation takes values in an
infinite number of points, but given the limited capacity of data
gathering techniques, we may assume that the \(N\) functional observations have been
measured in a grid of points \({\textbf{t}}= (t_1, \cdots, t_M) \in
{\mathcal{T}}^M\). Note that the dimension of the space \({\mathcal{T}}\) is the number of
coordinates of each grid point. Thus, a sufficiently large \(M\) is desirable
in order to faithfully capture the examined phenomenon. For the
same reason, the grid points, \({\textbf{t}}\), shall be
adequately distributed in order to precisely portray areas of the domain
that are considered more important or where the observed functional
variable is expected to have a more complex behavior. This effort may
require specific knowledge of the problem at hand. In this context, the
information that characterizes the functional observation \(x_i(t)\) is precisely
\((x_i(t_1), \cdots, x_i(t_M))\).
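As a minimal illustration of this discretized representation (a sketch with hypothetical toy data; in scikit-fda the same two ingredients can be used to build an FDataGrid object), the dataset is simply an \(N \times M\) matrix together with the common grid:

import numpy as np

# Hypothetical example: N = 3 noisy sine curves observed on a common grid
# of M = 50 points in [0, 1]. Each row of data_matrix stores one observation
# (x_i(t_1), ..., x_i(t_M)).
grid_points = np.linspace(0, 1, 50)                      # t_1, ..., t_M
data_matrix = np.array([
    np.sin(2 * np.pi * grid_points) + 0.05 * np.random.randn(50)
    for _ in range(3)
])                                                        # shape (N, M)

# In scikit-fda this discretized representation corresponds, as far as we use
# it here, to an FDataGrid built from the same two ingredients, e.g.
# fd = skfda.FDataGrid(data_matrix=data_matrix, grid_points=grid_points)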
As a way of exemplifying the representation of functional data using a grid, we may take a close look into the Phoneme dataset, represented in Figure ?. In particular, we have taken \(N=500\) functional observations representing the distribution of the different frequencies in a set of signals. Each observation is categorized into one of five phonemes and is defined over the same grid of 50 points.

The phonemes are transcribed as follows: /sh/ as in she, /dcl/ as in dark, /iy/ as the vowel in she, /aa/ as the vowel in dark, and /ao/ as the first vowel in water.
Alternatively, another way of representing functional data is through a basis representation. In this context, we define a basis of functions, \(\beta = \{\phi_j(t)\}_{j = 1}^{\infty}\), each of the form \(\phi_j: \mathcal{T} \subset {\mathbb{R}}^p \to {\mathbb{R}}^q\), with \(p, q \ge 1\), and represent the \(i\)-th functional observation as
\[\begin{aligned} x_i(t) = \sum_{j = 1}^{\infty} c_{ij} \phi_j(t). \end{aligned}\] When choosing a basis, it is important to pay attention to the nature of the data, so that the resulting representation captures the relevant features of the observations. For example, if the data is periodic, a Fourier basis is a natural choice; otherwise, a B-spline basis is a common choice .
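As a brief, hypothetical sketch of this idea (plain NumPy, using a small truncated Fourier-type basis and least squares to obtain the coefficients \(c_{ij}\) of a single discretized observation):

import numpy as np

# Fit the coefficients of a truncated basis expansion by least squares.
# The basis and the observation below are hypothetical choices for illustration.
t = np.linspace(-np.pi, np.pi, 200)
basis = np.column_stack([
    np.ones_like(t),                 # phi_1(t) = 1
    np.sin(t), np.cos(t),            # phi_2, phi_3
    np.sin(2 * t), np.cos(2 * t),    # phi_4, phi_5
])                                   # design matrix, shape (M, n_basis)

x = np.abs(t)                        # observation x_i evaluated on the grid

coefs, *_ = np.linalg.lstsq(basis, x, rcond=None)   # c_i1, ..., c_i5
x_hat = basis @ coefs                # reconstruction sum_j c_ij phi_j(t)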
Diffusion Maps
Functional diffusion maps extends to functional data the method of diffusion maps, a nonlinear dimensionality reduction method for multivariate data introduced by Coifman and Lafon in 2006. Most of the mathematical foundation of diffusion maps still holds for its functional counterpart, so we first study the underpinnings of the original method before extending it to functional data. This section follows Appendix A and Section 2 of reference .
Let \(\mathcal{X} = \{x_i\}_{i=1}^{N}\) be a set of \(N\) multivariate observations. Let us further assume that \(\mathcal{X}\) lies on a manifold \(\mathcal{M} \subset {\mathbb{R}}^D\). The goal is to derive a representation of the dataset \(\mathcal{X}\) in a low-dimensional space \({\mathbb{R}}^{d}\), where \(d \ll D\), the driving idea being that the data naturally belongs to this lower-dimensional space and was initially represented using an unnaturally large number of dimensions . This new representation maintains certain properties of interest for our study: points that were considered close in the initial representation space, under an unknown and possibly more complex metric, are mapped close to each other in the Euclidean distance .
The first step of the method is to build a graph with a node for each data point and a weight between each pair of nodes whose value reflects how similar these points are. This notion of similarity between observations is given by a kernel, a function that takes two \(D\)-dimensional points as input and outputs a real number indicating the relationship between them. Mathematically, the kernel \(k\) is defined as follows and meets the following restrictions:
\[\begin{aligned} \begin{split} &k: {\mathbb{R}}^D \times {\mathbb{R}}^D \to {\mathbb{R}}\\ &\text{$k$ is symmetric:} \; k(x, y) = k(y, x) \qquad \forall \, x, y \in {\mathbb{R}}^D \\ &\text{$k$ is positivity preserving:} \; k(x, y) \ge 0 \qquad \forall \, x, y \in {\mathbb{R}}^D \end{split} \end{aligned}\]
The kernel defines the similarity in the space of the data and constitutes the initial knowledge that is given to the problem. Consequently, the choice of the kernel has a direct impact on which features will have a greater influence in the later stages of the method. In this sense, only high similarity values carry meaningful information about the structure of the dataset . Having defined the kernel, we can build a matrix \(K = (k_{ij})_{1 \le i,j \le N} \in {\mathbb{R}}^{N \times N}\), where \(N\) is the number of observations and the components are \(k_{ij} = k(x_i, x_j)\), i.e., the similarity between the points \(x_i\) and \(x_j\). Note that \(K\) is symmetric as a consequence of the symmetry of the kernel \(k\). This matrix is now used to define the weighted graph \(G = (\mathcal{X}, K)\).
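A minimal sketch of this step, assuming a Gaussian (RBF) kernel and a toy multivariate dataset; the length scale value is a hypothetical choice:

import numpy as np

# Build the symmetric, positivity-preserving Gaussian (RBF) kernel matrix K
# for a multivariate dataset X of shape (N, D).
def gaussian_kernel_matrix(X, l=1.0):
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * l ** 2))

X = np.random.rand(100, 3)                       # toy data: N = 100 points in R^3
K = gaussian_kernel_matrix(X, l=0.5)
assert np.allclose(K, K.T) and np.all(K >= 0)    # symmetry and positivity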
The following step in the method is to define a random walk over the graph \(G\). To this end, it is necessary to conduct a row-stochastic normalization of the matrix \(K\), meaning that every row of \(K\) must add up to one. Hence, the elements of row \(i\) can be interpreted as the probabilities of jumping from datum \(x_i\) to each of the other data points. As part of the normalization process, a correction is made to reduce the influence of the sampling density of the data on the method. In other words, for every given pair of points we take into account the similarities between these two points and the rest of the dataset. Thus, for \(\alpha \in [0, 1]\) we define the matrix \(K^{(\alpha)} = (k^{(\alpha)}_{ij})_{1 \le i,j \le N} \in {\mathbb{R}}^{N \times N}\) as follows
\[\begin{aligned} k^{(\alpha)}_{ij} = \frac{k_{ij}}{d_i^{\alpha}d_j^{\alpha}} \quad \text{, where} \quad d_i = \sum_{j=1}^N k_{ij}. \end{aligned}\] Once this correction has been performed, it remains to normalize the rows of \(K^{(\alpha)}\) so that the elements can be interpreted as a probability. In this regard, we define the transition matrix of a Markov process \(P = (p_{ij})_{1 \le i,j \le N} \in {\mathbb{R}}^{N \times N}\) as
\[\begin{aligned} p_{ij} = \frac{k^{(\alpha)}_{ij}}{d_i^{(\alpha)}} \quad \text{, where} \quad d^{(\alpha)}_i = \sum_{j=1}^N k^{(\alpha)}_{ij}. \end{aligned}\]
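The two normalization steps can be sketched directly from these formulas (plain NumPy, assuming a precomputed kernel matrix \(K\) such as the one built above):

import numpy as np

# Alpha-correction of K followed by row normalization, yielding the
# row-stochastic transition matrix P of the Markov process.
def transition_matrix(K, alpha=0.5):
    d = K.sum(axis=1)                               # vertex degrees d_i
    K_alpha = K / np.outer(d ** alpha, d ** alpha)  # k_ij / (d_i^a d_j^a)
    d_alpha = K_alpha.sum(axis=1)                   # d_i^(alpha)
    return K_alpha / d_alpha[:, None]               # p_ij = k_ij^(a) / d_i^(a)

# P = transition_matrix(K, alpha=0.5); every row of P sums to one.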
A diffusion distance \(D_T(x_i, x_j)\) is defined for each path length \(T \in \mathbb{N}\) in the graph. The notion of proximity given by this distance is, by definition, a notion of how strongly the points are connected in the graph. Hence, a small diffusion distance between points \(x_i\) and \(x_j\) may be interpreted as a high probability of transitioning from \(x_i\) to \(x_j\) in the random walk. However, this distance is expensive to compute directly, so we resort to the eigenanalysis of the transition matrix \(P\). Along this line, we compute the eigenvalues \(\{\lambda_l\}_{l \ge 0}\) and the corresponding right eigenvectors \(\{\psi_l\}_{l \ge 0}\) of \(P\), such that \(1 = \lambda_0 > |\lambda_1| \ge |\lambda_2| \ge \cdots\) and \(P\psi_l = \lambda_l \psi_l\). The diffusion distance defined earlier can be approximated using the first \(L\) eigenvalues and right eigenvectors of \(P\),
\[\begin{aligned} D_T (x_i, x_j) \approx \left( \sum_{l=1}^L \lambda_l^{2T} \left(\psi_l (x_i) - \psi_l (x_j) \right)^2 \right)^{\frac{1}{2}}. \end{aligned}\] Ultimately, the diffusion maps are defined as follows. For \(i \in \{1, \cdots, N\}\,\),
\[\begin{aligned} \Psi_{T} (x_i) = \begin{pmatrix} \lambda_1^{T} \psi_1 (x_i) \\ \lambda_2^{T} \psi_2 (x_i) \\ \vdots \\ \lambda_L^{T} \psi_L (x_i) \\ \end{pmatrix}. \end{aligned}\] The driving idea in this definition is that the diffusion distance of the original space can be approximated by the Euclidean distance in the space \({\mathbb{R}}^L\) through the diffusion maps \(\Psi_{T}\). This statement is supported by the next result.
Proposition 1.1. The diffusion map \(\Psi_T\) embeds the data into the Euclidean space \({\mathbb{R}}^L\) in a way that the Euclidean distance is equal to the diffusion distance up to a relative accuracy \(\delta > 0\),
\[\begin{aligned} \|\Psi_T(x) - \Psi_T(y)\| \approx D_T(x, y). \end{aligned}\]
For a proof we refer the reader to Appendix A and Section 2.4 of reference or, alternatively, Section 3 of reference .
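In practice, the embedding can be obtained from \(P\) through a standard eigendecomposition. The following sketch (plain NumPy, not the library implementation) assumes a row-stochastic transition matrix \(P\) computed as above:

import numpy as np

def diffusion_map_embedding(P, L=2, T=1):
    # Right eigendecomposition of the transition matrix: P psi_l = lambda_l psi_l.
    eigvals, eigvecs = np.linalg.eig(P)
    # Order by decreasing absolute value, so that lambda_0 = 1 comes first.
    order = np.argsort(-np.abs(eigvals))
    eigvals = eigvals[order].real
    eigvecs = eigvecs[:, order].real
    # Drop the trivial pair (lambda_0, psi_0 constant) and keep the next L
    # components; row i is (lambda_1^T psi_1(x_i), ..., lambda_L^T psi_L(x_i)).
    return eigvecs[:, 1:L + 1] * (eigvals[1:L + 1] ** T)

# embedding = diffusion_map_embedding(P, L=2, T=1)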
The complete procedure for computing the diffusion maps of a multivariate dataset is summarized in the following algorithm.
Diffusion Maps
Construct the weighted graph \((\mathcal{X}, K)\), where \[K = \{k_{ij}\}_{1 \le i, j \le N} \quad \text{with} \quad k_{ij} = k(x_i, x_j).\]
Construct the vector of vertex densities, \(d^\alpha = (d_1^\alpha, \cdots, d_N^\alpha)\), where \[d_i^\alpha = \Bigl(\sum_{j=1}^{N} k_{ij}\Bigr)^\alpha.\]
Construct the normalized graph \((\mathcal{X}, K^{(\alpha)})\), where \[K^{(\alpha)} = \{ k_{ij}^{(\alpha)} \}_{1 \le i, j \le N} \quad \text{and} \quad k_{ij}^{(\alpha)} = \frac{k_{ij}}{d_i^{\alpha} d_j^{\alpha}}.\]
Construct the vector of vertex densities for the normalized graph, \(d^{(\alpha)}\), where \[d_i^{(\alpha)} = \sum_{j = 1}^{N} k_{ij}^{(\alpha)}.\]
Define the transition matrix \[P = (p_{ij})_{1 \le i,j \le N}, \quad \text{with} \quad p_{ij} = \frac{k_{ij}^{(\alpha)}}{d_i^{(\alpha)}}.\]
Obtain the eigenvalues \(\{\lambda_l\}_{l \ge 0}\) and the right eigenvectors \(\{\psi_l\}_{l \ge 0}\) of \(P\), ordered by decreasing absolute value of the eigenvalues.
Calculate the diffusion maps for each \(i \in \{1, \cdots, N\}\): \[\Psi_{T}(x_i) = \begin{pmatrix} \lambda_1^{T} \psi_1(x_i) \\ \lambda_2^{T} \psi_2(x_i) \\ \vdots \\ \lambda_L^{T} \psi_L(x_i) \end{pmatrix}.\]
Functional Diffusion Maps
The aim of this section is to extend the mathematical framework of diffusion maps to functional data. To this end, it suffices to define a kernel that takes functional data as input instead of multivariate data. That is, the kernel \(k\) will now be defined as
\[\begin{aligned} k: L^2([a,b]) \times L^2([a,b]) \to \mathbb{R}. \end{aligned}\] The generalization of the kernel to functional data is done by taking a norm in a function space and defining the kernel based on this norm. For example, in the commonly used RBF kernel, defined as
\[RBF(f, g) = \exp\left(-\frac{\|f - g\|_{L^2}^2}{2 \, l^2}\right),\] the norm \(\|\cdot\|_{L^2}\) maps the functional inputs \(f\) and \(g\) to a scalar, thus handling the functional nature of the kernel's arguments. This shows that the notion of similarity given by the kernel is essentially inherited from the chosen functional norm. After evaluating the kernel over the functional data, we obtain a weighted graph in the same way as in the multivariate case. From this point on, the rest of the process is identical to the classical diffusion maps method. In the following algorithm we present the steps necessary to obtain the diffusion maps for a given functional dataset.
Functional Diffusion Maps
Construct the weighted graph \((\mathcal{X}, K)\), where \[K = \{k_{ij}\}_{1 \le i, j \le N}, \quad \text{with } k_{ij} = k(x_i(t), x_j(t)).\]
Construct the vector of vertex densities, \[d^\alpha = (d_1^\alpha, \cdots, d_N^\alpha), \quad \text{where } d_i^\alpha = \Bigl(\sum_{j=1}^{N} k_{ij}\Bigr)^\alpha.\]
Construct the normalized graph \((\mathcal{X}, K^{(\alpha)})\), where \[K^{(\alpha)} = \{ k_{ij}^{(\alpha)} \}_{1 \le i, j \le N}, \quad \text{with } k_{ij}^{(\alpha)} = \frac{k_{ij}}{d_i^\alpha d_j^\alpha}.\]
Construct the vector of vertex densities for the normalized graph, \[d^{(\alpha)} = \bigl(d_1^{(\alpha)}, \cdots, d_N^{(\alpha)}\bigr), \quad \text{where } d_i^{(\alpha)} = \sum_{j=1}^{N} k_{ij}^{(\alpha)}.\]
Define the transition matrix \[P = (p_{ij})_{1 \le i,j \le N}, \quad \text{with } p_{ij} = \frac{k_{ij}^{(\alpha)}}{d_i^{(\alpha)}}.\]
Obtain the eigenvalues \(\{\lambda_l\}_{l \ge 0}\) and the right eigenvectors \(\{\psi_l\}_{l \ge 0}\) of \(P\), ordered by decreasing absolute value of the eigenvalues.
Calculate the diffusion maps for each \(i \in \{1, \cdots, N\}\): \[\Psi_{T}(x_i) = \begin{pmatrix} \lambda_1^{T} \psi_1(x_i) \\ \lambda_2^{T} \psi_2(x_i) \\ \vdots \\ \lambda_L^{T} \psi_L(x_i) \end{pmatrix}.\]
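As a minimal sketch of the kernel step of this algorithm for discretized functional data (plain NumPy, assuming all curves are observed on a common grid and stored row-wise in a matrix), the \(L^2\) distance can be approximated with the trapezoidal rule:

import numpy as np

# Functional RBF kernel matrix for discretized curves: approximate the squared
# L^2 distances ||x_i - x_j||^2 by the trapezoidal rule on the common grid.
# The length scale l is a hypothetical choice.
def functional_rbf_matrix(data_matrix, grid, l=1.0):
    diffs = data_matrix[:, None, :] - data_matrix[None, :, :]   # (N, N, M)
    sq_l2 = np.trapz(diffs ** 2, grid, axis=-1)                 # (N, N)
    return np.exp(-sq_l2 / (2 * l ** 2))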
The Nyström method: FDM for out-of-sample data
The aim of this section is to present the extension of the FDM method to out-of-sample data via the Nyström method. We now refer to the set used to compute the diffusion maps in the first place, via the FDM algorithm, as the training set, and denote it \(\mathcal{X}_{\text{train}}\). Our objective is to obtain the diffusion coordinates of a new datum, \(\hat{x}_{i}\), which does not belong to this training set, without having to recompute the whole method over \(\mathcal{X}_{\text{train}} \cup \{\hat{x}_{i}\}\). In other words, we want to predict the diffusion coordinates of data that do not belong to the set used to train the method in the first place. The mathematical idea behind this prediction is to take a linear combination of the diffusion coordinates from \(\mathcal{X}_{\text{train}}\), where the similarity between \(\hat{x}_{i}\) and the observations in \(\mathcal{X}_{\text{train}}\) serves as the weight in this linear combination. In this sense, instead of defining the diffusion maps as \(\Psi_{T} (x_i) = ( \lambda_1^{T} \psi_1 (x_i), \; \lambda_2^{T} \psi_2 (x_i), \; \cdots, \; \lambda_L^{T} \psi_L (x_i) )^t\), we will define them as \(\hat{\Psi}_{T} ({\hat{x}}_i) = ( \lambda_1^{T} \hat{\psi}_1 ({\hat{x}}_i), \; \lambda_2^{T} \hat{\psi}_2 ({\hat{x}}_i), \; \cdots, \; \lambda_L^{T} \hat{\psi}_L ({\hat{x}}_i) )^t,\) where \(\hat{\psi}_j ({\hat{x}}_i) = \frac{1}{\lambda_j} \sum_{k=1}^{N} \psi_j(x_k) \, p({\hat{x}}_i, x_k),\) with \(p({\hat{x}}_i, x_k) = p_{ik}\). Note that the definition of \(\hat{\psi}_j ({\hat{x}}_i)\) is precisely a linear combination of the \(\psi_j(x_k)\) with the \(p_{ik}\) as weights. In the following algorithm we present the steps necessary to predict the diffusion coordinates of out-of-sample data via the Nyström method.
Nyström method for FDM
Nyström method for predicting the diffusion coordinates of out-of-sample data.
Construct the weighted graph (\(\mathcal{X}_{\text{out}}\), \(K_{\text{out}}\)), where \(K_{\text{out}} = (k_{ij})_{1 \le i \le M, 1 \le j \le N}\), and \(k_{ij} = k({\hat{x}}_i(t), x_j(t))\).
Construct the vector of vertex densities \(h^\alpha = (h_1^\alpha, \cdots, h_M^\alpha)\), where \(h_i^\alpha = {(\sum_{j=1}^{N} k_{ij})} ^ \alpha\).
Construct the normalized graph \(( \mathcal{X}_{\text{out}}, K^{(\alpha)}_\text{out} )\), where \(K_{\text{out}}^{(\alpha)} = {(k_{ij}^{(\alpha)})}_{1 \le i \le M, 1 \le j \le N}\) and \(k_{ij}^{(\alpha)} = \frac{k_{ij}}{h_i^{\alpha} d_j^{\alpha}}\).
Define the transition matrix \(P = \left( p_{ij} \right)_{1 \le i \le M, 1 \le j \le N}\), where \(p_{ij} = \frac{k_{ij}^{(\alpha)}}{d_i^{(\alpha)}}\), and \(d_i^{(\alpha)} = \sum_{j = 1}^{N} k_{ij}^{(\alpha)}\).
Predict the diffusion coordinates for \(i \in \{1, \cdots, M\}\,\): \[\hat{\Psi}_{T} ({\hat{x}}_i) = \begin{pmatrix} \lambda_1^{T} \hat{\psi}_1 ({\hat{x}}_i) \\ \lambda_2^{T} \hat{\psi}_2 ({\hat{x}}_i) \\ \vdots \\ \lambda_L^{T} \hat{\psi}_L ({\hat{x}}_i) \\ \end{pmatrix},\] where \[\hat{\psi}_j ({\hat{x}}_i) = \frac{1}{\lambda_j} \sum_{k=1}^{N} \psi_j(x_k) \, p({\hat{x}}_i, x_k), \quad p({\hat{x}}_i, x_k) = p_{ik}.\]
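A minimal sketch of the prediction step, assuming the training eigenvalues, right eigenvectors and the out-of-sample transition matrix (of shape \(M \times N\)) have already been computed as in the algorithms above, with the trivial pair \((\lambda_0, \psi_0)\) stored first:

import numpy as np

# Nystrom prediction of the diffusion coordinates of out-of-sample data.
# P_out has shape (M, N), lam has shape (N,), psi has shape (N, N) with
# psi[:, l] the right eigenvector associated with lam[l].
def nystrom_embedding(P_out, lam, psi, L=2, T=1):
    # psi_hat[i, l] = (1 / lam_l) * sum_k psi[k, l] * P_out[i, k]
    psi_hat = P_out @ psi[:, 1:L + 1] / lam[1:L + 1]
    # Row i is (lam_1^T psi_hat_1(x_hat_i), ..., lam_L^T psi_hat_L(x_hat_i)).
    return psi_hat * (lam[1:L + 1] ** T)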
Examples of use of FDM
Introduction
In this chapter we will present a series of examples to illustrate the use of functional diffusion maps for both synthetic and real-world data. The procedure to create a synthetic dataset is as follows. Let \(\mathcal{X} = \{x_i\}_{i=1}^{N}\) be a set of \(N\) multivariate observations. Let us further assume that \(\mathcal{X}\) lies on a manifold \(\mathcal{M} \subset {\mathbb{R}}^D\), but naturally belongs to a manifold in \({\mathbb{R}}^d\) with \(d \ll D\). To create a synthetic functional dataset, we will take a basis of functions \(\beta = \{{\phi}_k(x)\}_{k=1}^D\), where \({\phi}_k: {\mathbb{R}}\to {\mathbb{R}}\), and convert each point in \(\mathcal{X}\) into its representation in the basis \(\beta\): if \(x_i = (x_i^{(1)}, \cdots, x_i^{(D)})\) is a point in \(\mathcal{X}\), the corresponding functional observation is \({(x_i^{(1)}, \cdots, x_i^{(D)})}_{\beta} = \sum_{k=1}^D x_i^{(k)} {\phi}_k (x)\). Here, \(\mathcal{X}\) will be chosen to lie on a nonlinear manifold, since one of the advantages this method provides, as opposed to other classical dimensionality reduction methods, is the ability to perform well in nonlinear settings . In the real-world case we work with the phoneme dataset, presented in Chapter 1.
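As a sketch of this construction for the moons example discussed next (using scikit-learn's make_moons generator and the basis employed later for that dataset):

import numpy as np
from sklearn.datasets import make_moons

# Each 2-dimensional moons point (x^(1), x^(2)) becomes the curve
# x^(1) * phi_1(t) + x^(2) * phi_2(t), here with beta = {sin(4t), t^2 + 2t - 2}.
X, labels = make_moons(n_samples=100, noise=0.05, random_state=0)   # shape (100, 2)

t = np.linspace(-np.pi, np.pi, 100)
phi = np.vstack([np.sin(4 * t), t ** 2 + 2 * t - 2])   # basis evaluated on the grid, shape (2, M)
data_matrix = X @ phi                                   # row i: discretized functional observation x_i(t)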
Simple use of FDM: moons dataset
In this section we will present a simple use scenario of functional diffusion maps, corresponding to the use case UC-01 (dimensionality reduction for visualization, thus with an embedding dimension of 2 or 3). More specifically, we want to obtain the diffusion coordinates of a functional dataset to evaluate whether the method is able to understand the underlying geometry of the data, that is, the relations originally present in the data. To this end we will employ the moons dataset, which represents the trajectories of two moons at different points in time over the space \({\mathbb{R}}^2\). In order to obtain the functional equivalent of the dataset we will use the basis \(\beta = \{\sin(4x), \; x^2 + 2x - 2\}\), where \(x \in [-\pi, \pi]\). In Figure ? we can see, on the left, the trajectories of the two moons, distinguished by the colors blue and orange and, on the right, the transformation of each point in the trajectory through the basis \(\beta\).
Once the functional data is available, we can set the parameters of the model. An optimization process has been conducted for the parameter of the RBF kernel, after which we conclude that a length scale of 0.25 is a good value for the method to capture the local geometry of the dataset. In later sections we will discuss the process of fine-tuning the parameters of the models according to the purpose that we pursue. Moreover, we want to visualize the diffusion coordinates in two dimensions, so we set the number of components to 2. Finally, we set \(\alpha = 0\) and only one step in the random walk. The following code presents the steps needed to obtain the diffusion coordinates (once we have already created the object representing the functional dataset). Note the simplicity provided by the representation of the method as a scikit-learn transformer.
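# Assumed setup: FDM and Gaussian are the implementations accompanying this
# work (exact import paths may vary), and fd_moons is the functional moons
# dataset constructed above.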
fdm = FDM(n_components=2, kernel=Gaussian(length_scale=0.25), alpha=0, n_steps=1)
embedding = fdm.fit_transform(fd_moons)
As we can observe in Figure ?, the method has correctly interpreted the topological nature of the data by successfully separating the trajectories associated with both moons. In this sense, the method has mapped points of the same trajectory close together and has distorted the distance between points of different trajectories.
Parameter tuning: spirals dataset
Now that a basic use of the dimensionality reduction method has been presented, it is time to elaborate on how to select the parameters of the model. For this task we will be using a synthetic dataset created from two spirals that resemble, in a simplified and physically non-rigorous way, the trajectories of two colliding bodies, as shown in Figure ?. The objective is to determine which functional observations belong to which body. In this sense, if we carry out the thought experiment of separating the trajectories while ignoring the colors, we may notice that this effort is not as straightforward as in the moons example, where the trajectories were easily distinguished.
We now define two grids of possible values for the parameter alpha and for the length scale \(l\) of the kernel:
\(G_{\alpha} = \{0,\; 0.33,\; 0.66,\; 1\}\) for alpha,
\(G_{l} = \{1,\; 3,\; 4.5,\; 7,\; 10,\; 11,\; 20\}\) for length scale.
The aim of this example is to obtain the tuple \((\alpha, l) \in G_{\alpha} \times G_{l}\) that, through visualization, we argue works best for the task of separating both paths. In Figure ? we can observe a grid of images with the diffusion coordinates for the different parameter values. The first thing to notice is that the length scale parameter exerts a greater influence on the resulting embedding than the parameter alpha.
In this sense, the figures of any given row are more similar to each other than those of any given column. Moreover, we can see that the optimal choice for the length scale parameter of the kernel is 4.5, because it visually presents the clearest separation between the spirals. Regarding alpha, given that it is not decisive here, we choose a value of 0 because, in theory, it is equivalent to skipping the density-correction step of the normalization. In other words, if we obtain a very similar result without performing this step, it means that, for this specific dataset and selection of parameters, this effort is not really needed.
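A sketch of how such a grid of embeddings can be generated, assuming the FDM and Gaussian objects from the moons example and a functional spirals dataset, here called fd_spirals (a hypothetical name):

import itertools
import matplotlib.pyplot as plt

# Visual grid search over (alpha, length scale): one scatter plot of the
# two-dimensional diffusion coordinates per parameter combination.
alphas = [0, 0.33, 0.66, 1]
length_scales = [1, 3, 4.5, 7, 10, 11, 20]

fig, axes = plt.subplots(len(length_scales), len(alphas), figsize=(10, 16))
for (i, l), (j, a) in itertools.product(enumerate(length_scales), enumerate(alphas)):
    emb = FDM(n_components=2, kernel=Gaussian(length_scale=l),
              alpha=a, n_steps=1).fit_transform(fd_spirals)
    axes[i, j].scatter(emb[:, 0], emb[:, 1], s=5)
    axes[i, j].set_title(f"l = {l}, alpha = {a}", fontsize=8)
fig.tight_layout()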
In order to further understand the behavior of the method as the length scale parameter changes, let us analyze the form of the RBF kernel function, since this is precisely the starting information available to the method. In particular, the kernel is a measure of similarity between points: the larger the value of the kernel between \(x\) and \(x'\), the more similar these observations are considered to be. In Figure ? we can see the RBF kernel for length scale values of 1, 4.5 and 20; it is noticeable how the kernel widens as the length scale increases. For two spiral points that are close, their functional equivalents will also be close in the \(L^2\) norm, meaning that the value of the RBF kernel will generally be close to one. Nevertheless, if the length scale parameter is too small, even contiguous points may be considered distant and, consequently, not similar. With this information in mind, let us now revisit the first row of Figure ?, where all points are displayed in an apparently random manner since no points are considered similar. On the contrary, when the parameter is too large (\(l = 20\), last row), the kernel becomes too wide and all points are considered similar to each other. Consequently, the method interprets points from one arm of the spiral as being close to points in the next arm, even though these belong to different trajectories.
To further illustrate this idea, in Figure ? we can observe the matrices of the weighted graphs, which store the kernel evaluations over the data. In the case \(l=1\) the matrix is close to the identity, since a given point is not considered to be close to any other. In the case \(l=4.5\) the matrix presents non-zero values in the minor diagonals close to the main diagonal: the kernel is wide enough that each point is considered similar to nearby points of the same trajectory, but not so wide that points in contiguous arms of the spiral are also considered similar. In fact, this is precisely what happens in the case \(l=20\). Here there are four sets of minor diagonals which contain non-zero values. Since the first 100 points on each axis correspond to the first trajectory and the other 100 to the second one, the stripes that appear in the top-right and bottom-left quadrants (if we divide the matrix into four equal squares) represent contiguous arms of the spiral being considered similar, even though they belong to different trajectories.

Figure panels (left to right): weighted graph matrices for \(l = 1\), \(l = 4.5\) and \(l = 20\).
Visualizing diffusion coordinates in three dimensions: Swiss roll dataset
The aim of this section is to elaborate further on the role of the parameter n_components. In particular, we now create a synthetic dataset from a manifold in a three-dimensional space, with the aim of visualizing the diffusion coordinates in three dimensions. Moreover, we will not be using a labeled dataset, and our aim will not be to separate two manifolds but rather to see if the method is able to interpret a more complex structure. In Figure ? we present the dataset that we will be working with in this section. The creation of the functional dataset is done through the basis \(\beta = \{\sin(4x), \cos(8x), \sin(12x)\}\).
We now perform parameter tuning over the length scale and fix alpha to 1, even though another value of alpha could have been chosen. In particular, we choose the length scale over the set \(G'_{l} = \{1,\; 1.25,\; 1.5,\; 3,\; 3.75,\; 4\}\). These values have been chosen to show the greatest diversity in the resulting diffusion coordinates, with the aim of enriching the analysis presented here.
In Figure ? we present the diffusion coordinates for the length scale values in \(G'_{l}\). We first notice that for a value of \(l=3\) the manifold has been unrolled. Note that in the roll there are pairs of points whose Euclidean distance is small but whose shortest path contained in the manifold is significantly larger, since it must complete an entire loop. In this sense, the process has taken into account the shortest-path distance rather than the Euclidean one, so one may argue that the topological nature of the data has been respected. For these reasons the value \(l=3\) constitutes a good choice. Nevertheless, one may argue that \(l=1.5\) and \(l=3.9\) are also good choices for this same reason. In fact, any value in the interval \([1.5, \; 3.9]\) allows for this unrolling of the manifold. In this regard, the choice of the optimal parameters is relative to the problem at hand, and in this case there is a range of reasonable values rather than a single one. For values less than \(1.5\) the kernel is too narrow to capture the similarity between points in the same leaf of the Swiss roll, similarly to what occurred in the spirals dataset. The most noticeable feature is the sudden scattering of the points when changing the length scale from \(3.9\) to \(4\). This sudden scattering occurs because around \(l=4\) there is a turning point where the kernel widens enough that points in contiguous leaves of the Swiss roll start to be considered close. This process can be seen in Figure ?, where we take a more refined look at the effect of the length scale parameter in the interval \([3.987, 4]\). As the length scale parameter goes to \(4\), the diffusion coordinates scatter, until we recover diffusion coordinates which are displayed similarly to the original dataset.

Real-world application: phoneme dataset
In Chapter 1, the work was initially motivated by presenting a dataset containing information about five phonemes. In this section, two experiments are performed to test the functional diffusion maps method on this dataset. In the first experiment, the aim is to classify these functional observations into their corresponding phoneme labels, as part, for example, of a speech recognition project. In the second experiment, the aim is to visually compare the predicted diffusion coordinates of an out-of-sample test set, obtained via the Nyström method, with the real values. We now define \(\mathcal{D} = (\mathcal{X}, y)\), where \(\mathcal{X}\) is the functional dataset and \(y\) contains the classification of the functional observations into phonemes.
Experiment 1: Testing the accuracy of FDM for classification in relation to the size of the training set.
In this first experiment, we will use the nearest neighbors classification method together with functional diffusion maps. We refer to this combination as the pipeline from now on. Our objective is to answer the following question: how much training information does this pipeline need to properly classify a test set? The values of n_steps and n_components are fixed to 1 and 2, respectively. The scheme is as follows, with a code sketch of the tuning step shown after the list:
Define a test set \(\mathcal{D}_{\text{test}}\) and a training set \(\mathcal{D}_{\text{train}}\) with \(|\mathcal{D}_{\text{test}}| = 100\) and \(|\mathcal{D}_{\text{train}}| = 400\).
Define \(N\) training subsets \(\mathcal{D}_{\text{train}}^i\) with \(i \in \{1, \cdots, N\}\) such that \(\mathcal{D}_{\text{train}}^1 \subset \cdots \subset \mathcal{D}_{\text{train}}^N\). Here, \(\mathcal{D}_{\text{train}}^N = \mathcal{D}_{\text{train}}\) and \(|\mathcal{D}_{\text{train}}^i| = 20i\).
Define a parameter grid to perform the optimization of the pipeline. The optimization will be performed over parameters alpha, the kernel type, the length scale parameter of the kernel and the number of neighbors of the nearest neighbors classifier. In particular,
\(G_{\alpha} = \{0,\; 0.25,\; 0.5,\; 0.75,\; 1.0\}\).
\(G_{kernel} = \{RBF\;kernel, Laplacian\;kernel\}\) and each kernel using length scale values \(G_{l} = \{0.5, 1, 2, 3, 4, 6, 10\}\).
\(G_{neighbors} = \{3, 5, 7, 11, 19\}\).
For \(i \in \{1, \cdots, N\}\), carry out the use case UC-03 (parameter tuning for classification with cross-validation via a scikit-learn pipeline consisting of FDM and a classification method) with \(\mathcal{D}_{\text{train}}^i\) as training set and \(\mathcal{D}_{\text{test}}\) as test set.
Perform parameter tuning in the pipeline (FDM + KNN) using \(\mathcal{D}_{\text{train}}^i\).
Train the pipeline over the training subset \(\mathcal{D}_{\text{train}}^i\) using the parameters that have been deemed optimal.
Compute the accuracy (percentage of correctly classified data) of the pipeline when predicting the labels of \(\mathcal{D}_{\text{train}}^i\) and \(\mathcal{D}_{\text{test}}\).
Generate a plot exhibiting the accuracies for the different sizes of the training subsets.
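A sketch of the tuning step for one training subset, assuming FDM is the scikit-learn-compatible transformer used throughout this work and that Gaussian and Laplacian denote the corresponding kernel objects (names and parameter spellings are assumptions based on the code shown earlier):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Pipeline: functional diffusion maps followed by a nearest neighbors classifier.
pipe = Pipeline([("fdm", FDM(n_components=2, n_steps=1)),
                 ("knn", KNeighborsClassifier())])

param_grid = {
    "fdm__alpha": [0, 0.25, 0.5, 0.75, 1.0],
    "fdm__kernel": ([Gaussian(length_scale=l) for l in (0.5, 1, 2, 3, 4, 6, 10)]
                    + [Laplacian(length_scale=l) for l in (0.5, 1, 2, 3, 4, 6, 10)]),
    "knn__n_neighbors": [3, 5, 7, 11, 19],
}

search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(fd_train_i, y_train_i); accuracy = search.score(fd_test, y_test)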
Performing the aforementioned steps yields the graph in Figure ?, where the accuracies for predicting both the test and training subsets are represented. Two horizontal lines, corresponding to the average of both curves for \(N > 200\), are also plotted. Firstly, we observe that the accuracy grows rapidly and then stagnates at a value higher than \(80\%\) for the test set. The fact that it rapidly reaches the \(80\%\) mark means that in some circumstances it may be reasonable to use a smaller training subset with the aim of reducing computational costs. Secondly, we also observe that the prediction over \(\mathcal{D}_{\text{train}}^i\) is not \(100\%\) accurate, even though we have used the same set for training. This makes sense, since the Nyström method uses a weighted average of the diffusion coordinates of the training set to obtain the prediction for a given value. The weight that each diffusion coordinate acquires during the prediction depends on the value of the kernel between the predicted functional observation and the observations of the training dataset. Hence, when we try to predict a point belonging to the training subset, even though the diffusion coordinate with the most weight is the one corresponding to this same observation (since the kernel is maximal there), there is noise coming from the other observations, since the value of the kernel is not zero for these. This also helps avoid overfitting and reinforces the generalization capacity of the method.
Experiment 2: Visualizing the prediction of the diffusion coordinates
To illustrate the functioning of functional diffusion maps in the pipeline, as well as the capabilities that the Nyström method provides when predicting diffusion coordinates, another experiment has been assembled. This time, we start from the same training set as before, \(\mathcal{D}_{\text{train}}\), and define six training subsets of sizes \(20, 40, 120, 160, 280\) and \(400\), similarly to how we did in the previous experiment. In the images the sizes of the training subsets are referred to as \(N\). The main difference with Experiment 1 is that the same parameter values will be used for all sizes of the training subset. Specifically, we fix n_components = 2, n_steps = 1, alpha = 0 and a length scale of 4 for visual interpretability.
In Figure ? a set of images is provided showing a comparison between the true (crosses) and the predicted (dots) diffusion coordinates for the test set, for different values of \(N\). This is precisely an example of the use case UC-02 (dimensionality reduction for visualization of out-of-sample functional data via the Nyström method). The so-called true diffusion coordinates constitute the ground truth of this experiment and are obtained by training the method over \(\mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}}\) and then keeping only the diffusion coordinates corresponding to \(\mathcal{D}_{\text{test}}\). For values \(N=20\) and \(40\), the predicted diffusion coordinates do not match the real embeddings. Even so, they are quite close to the true embeddings, especially if we take into account the very small size of the training subset. For values \(N = 120\) and \(160\), the diffusion coordinates are practically the same, so adding \(40\) points had little effect. For \(N = 280\), the diffusion coordinates for phonemes /aa/ and /ao/ already match the true values. Finally, for \(N=400\), even if the embeddings need not be equal, because the true embedding was obtained by training over \(\mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{test}}\) and the prediction by training over \(\mathcal{D}_{\text{train}}\), in practice they are almost indistinguishable.

About this work
This work is, in essence, a reproducibility study of the work [1] and belongs to a greater effort focused on implementing FDM in the Python library for functional data scikit-fda. Consider citing [1]. Contributions are welcome; feel free to open an issue with your improvements, suggestions, or errors found here.
Bibliography
- Barroso, María, Alaíz, Carlos, Fernández Pascual, Ángela, & Torrecilla, José. (2023). Functional Diffusion Maps. Statistics and Computing, 34(22).
- Kokoszka, P., & Reimherr, M. (2017). Introduction to Functional Data Analysis. Chapman and Hall/CRC. DOI
- Wang, Jane-Ling, Chiou, Jeng-Min, & Müller, Hans-Georg. (2016). Functional Data Analysis. Annual Review of Statistics and Its Application, 3, 257-295.
- Fernández, Ángela, González, Ana M., Díaz, Julia, & Dorronsoro, José R. (2015). Diffusion Maps for Dimensionality Reduction and Visualization of Meteorological Data. Neurocomputing, 163, 25-37. DOI
- Coifman, Ronald R., & Lafon, Stéphane. (2006). Diffusion Maps. Applied and Computational Harmonic Analysis, 21(1), 5-30. URL
- Cayton, Lawrence. (2005). Algorithms for Manifold Learning.
- Grupo Aprendizaje Automático, UAM. (2024). scikit-fda: Functional Data Analysis Python Package. GitHub. URL
- Ramos-Carreño, Carlos, Torrecilla, José Luis, Carbajo-Berrocal, Miguel, Marcos, Pablo, & Suárez, Alberto. (2023). scikit-fda: A Python Package for Functional Data Analysis. Journal of Open Source Software.
- Rabin, N., & Coifman, R.R. (2012). Heterogeneous Datasets Representation and Learning using Diffusion Maps and Laplacian Pyramids. Proceedings of the SIAM International Conference on Data Mining (SDM), 189-199.