Ljubiša Stanković, Danilo P. Mandić
Graph convolutional neural networks (GCNNs) are becoming a model of choice for learning on irregular domains. However, due to the black-box nature of NNs, their underlying principles are rarely examined in depth.
We revisit the operation of GCNNs from first principles and show that their key component—the convolutional layer—effectively performs matched filtering of its inputs with a set of templates (filters and kernels) of interest. This serves as a vehicle to establish a compact matched filtering perspective of the whole convolution–activation–pooling chain, which allows for a theoretically well-founded and physically meaningful insight into the overall operation of GCNNs. This is shown to help mitigate their interpretability and explainability issues, together with providing intuition for further developments, their applications, and educational purposes. This “Lecture Notes” column is supported by online supplementary material (available at https://doi.org/10.1109/MSP.2022.3207304), which provides more detail on several aspects of GCNN operation and training.
The success of deep learning and CNNs has also highlighted that NN-based analysis of signals and images of large sizes poses a considerable challenge, as every input sample/pixel needs to be associated with one neuron at the input layer. A meaningful NN-based analysis requires at least one hidden layer, yet, even for the simplest fully connected (FC) hidden layer, the number of weights increases exponentially with data volume—the so-called curse of dimensionality. This dimensionality bottleneck may be partially mitigated by leveraging the smooth nature of most physical data, whereby the neighboring signal samples (or image pixels) typically exhibit a degree of similarity. This allows us to describe signals/images in terms of their local characteristic features (patterns) and exploit such local information throughout the processing chain. In this way, a learning task boils down to a much more computationally efficient search for features (patterns) in data, in contrast to standard computationally hungry brute-force approaches. In the following, unless otherwise stated, we use the term signal for both signals and images.
An important advantage of operating in the feature space is that it resolves the problem of the position change of patterns in signals. Namely, if a certain signal feature is moved from its original temporal or spatial position, e.g., due to translation, a standard sample/pixel-based approach would assume that it is presented with a completely different set of samples/pixels; in contrast, a feature-based approach will look for this specific pattern of interest anywhere in the signal. This underpins the operation of CNNs, which effectively perform a search for features in the analyzed signal; this is achieved in such a way that these features are invariant to changes in their positions [1], [2]. In this way, each “feature window” used in the convolution operation within CNNs (the convolution filter or convolution kernel) becomes, through training, the best match to a specific feature within the analyzed signal. Such feature matching is performed over the whole signal, akin to a mathematical lens in search of some specific forms [3].
Irregular domains are conveniently modeled as graphs [4], [5], and, for learning on such domains, it is natural to consider graphs in conjunction with NNs—the graph NN (GNN) paradigm. Such an approach benefits from the universal approximation property of NNs, pattern matching inherent to CNNs, and the ability of graphs to capture local relations and implicit complex couplings in data. Research on GNNs emerged almost two decades ago [6], [7], with more recent developments centered around GCNNs [8], [9]. These have largely focused on learning aspects of CNNs [10], [11] and assume stationarity (via shift invariance of convolution) and compositionality (via downsampling or pooling) [5] but without considering the wider context underpinning the operation of GNNs—a subject of this “Lecture Notes” column.
The aim of this “Lecture Notes” column is to revisit the operation of GCNNs to reveal the basic principles underpinning their functionality and, in this way, help demystify their operation for a generally knowledgeable reader and for educational purposes. Upon establishing that the operation of convolutional layers in standard CNNs rests upon the classical signal processing concept of matched filtering, we introduce an analogous graph matched filtering framework for understanding the convolution–activation–pooling chain within GCNNs, which maintains intuition and physical interpretability through the whole GCNN operation. It is our hope that such a perspective will further help the seamless migration of concepts and ideas from signal processing into the more empirical area of NNs as well as serve as a platform for extensions and application opportunities in our data-hungry world. For enhanced intuition, physical insight into the operation of GNNs is provided through a step-by-step numerical example, which also visualizes information propagation through GCNNs, thus illuminating all stages of GCNN operation and learning.
This “Lecture Notes” column assumes a basic knowledge of linear algebra and digital signal processing, and it is supported by online supplementary material (available at https://doi.org/10.1109/MSP.2022.3207304), which contains more details of the key underlying concepts, together with a step-by-step elaboration of back-propagation training of GCNNs.
The use of a convolution filter/kernel underpins the operation of CNNs, yet the key question of how to justify the use of convolution to detect features in a signal remains largely unanswered—a subject of this “Lecture Notes” column. To address this issue, recall that matched filtering is a widely used technique for the detection of the presence of a known template (feature) $w$ in an observed unknown noisy signal $x$, and it is achieved by cross-correlating the signal $x$ with the template $w$, that is, \[ y(n) = \sum\limits_{m} x(n+m)\,w(m). \tag{1} \]
The maximum of this cross correlation $y(n)$ will occur at the time instance when the template $w(n)$ is present in the observed signal $x(n)$, that is, when the pattern in $x(n)$ matches the feature (pattern) $w(n)$. This technique also operates in the presence of noise, with the matched filter aiming to maximize the signal-to-noise ratio. However, the implementation of the cross correlation in (1) is awkward for streaming data, where it is customary to use a standard digital filter, given by \[ y(n) = \sum\limits_{m} x(n-m)\,w(m) = x(n) * w(n). \tag{2} \]
This filter performs the convolution between $x(n)$ and $w(n)$, denoted by $y(n) = x(n) * w(n)$. A comparison between the cross correlation in (1) and the convolution in (2) immediately suggests a way to implement the matched filter through convolution, namely, by time reversing the template, $w(m) \rightarrow w(-m)$, in the convolution sum in (2), to arrive at \begin{align*} y(n) &= x(n) * w(-n) \\ &= \sum\limits_{m} x(n-m)\,w(-m) \\ &= \sum\limits_{m} x(n+m)\,w(m) = x(n) *_{c} w(n). \tag{3} \end{align*}
The symbol $*_{c}$ denotes the convolution with the time-reversed feature/template vector $w(-n)$, which also serves as the impulse response of this filter. Figure 1 illustrates the operation of a matched filter, which searches for a bipolar square-wave template in the observed signal and is implemented through convolution and thresholding.
Figure 1. The operation of a matched filter, which confirms the presence and position of a known square-wave template (thin blue line) $w$ in the observed signal $x$ (thick black line): (a) the matched filter (cross correlation) ${x}{(}{n}{)}\star{w}{(}{-}{n}{)}$ and (b) convolution ${x}{(}{n}{)}\star{w}{(}{n}{).}$ The matched filter essentially performs the cross correlation between a known template of interest $w(n)$ and the unknown input ${x}{(}{n}{)}{.}$ In practical applications, this cross correlation is implemented through a convolution between $x(n)$ and a time-reversed template ${w}{(}{-}{n}{),}$ as in (a).
The convolution operation in CNNs is applied after one of the signals is time reversed; that is, $x(n) * w(-n) = x(n) *_{c} w(n)$, which is implicitly indicated in CNN implementations in the literature, for example, $x * \mathrm{rot180}(w)$ or $\mathrm{conv}(x, \mathrm{reverse}(w))$. Therefore, the convolutional layer in CNNs performs precisely the matched filtering operation described in (3) and Figure 1, yet NNs equipped with this pattern-matching operation are referred to as convolutional, not correlational, NNs.
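For illustration, the following NumPy sketch (with a hypothetical bipolar square-wave template and hypothetical signal values, not those of Figure 1) confirms that the cross correlation in (1) and the convolution with the time-reversed template in (3) coincide, and that their common maximum marks the template position.

```python
import numpy as np

rng = np.random.default_rng(0)

w = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])  # bipolar square-wave template w(n)
x = rng.normal(scale=0.3, size=64)               # noisy observation x(n)
x[20:20 + len(w)] += w                           # embed the template at n = 20

# Cross correlation, as in (1): y(n) = sum_m x(n + m) w(m)
y_corr = np.correlate(x, w, mode="valid")

# The same result via convolution with the time-reversed template, as in (3)
y_conv = np.convolve(x, w[::-1], mode="valid")

print(np.allclose(y_corr, y_conv))   # True: conv(x, reverse(w)) acts as the matched filter
print(np.argmax(y_corr))             # peak at (or very near) n = 20
```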
Convolution-based feature detection is independent of the feature position within the considered signal $x(n)$, since $y(n) = x(n) * w(-n)$ is calculated by sliding the window (filter/kernel) $w(-n)$ of length $M$, which multiplies the corresponding segments of the signal $x(n)$, for all $n$.
Consider the problem of determining the best match between a received signal x(n) and one of the template waveforms ${w}_{k}(n)$ from a predefined set of K such template waveforms (the dictionary). For template waveforms with equal normalized energies, the maximum of the cross correlation between x(n) and the elements of $\left\{{{w}_{k}{(}{n}{),}{k} = {1},{2},\ldots,{K}}\right\}$ will indicate the best match between the signal x(n) and the features ${w}_{k}(n).$ In light of Remark 1, this can be performed by passing the received waveform through a bank of digital filters, each having as the impulse response one of the time-reversed template waveforms from the dictionary, ${w}_{k}{(}{-}{n}{)}$, that is, through the convolution \[{y}_{k}{(}{n}{)} = {x}{(}{n}{)}\,{*}\,{w}_{k}{(}{-}{n}{)} = {x}{(}{n}{)}\,{*}_{c}{w}_{k}{(}{n}{),}\,{k} = {1},\ldots,{K}{.} \tag{4} \]
Finally, the decision on which feature $w_k(n)$ is contained in the input signal is based on a simple thresholding operation: \[ k = \arg\max\limits_{k}\{A_k,\; k = 1,2,\ldots,K\} \tag{5} \] where $A_k = \max_{n}\{x(n) *_{c} w_k(n)\}$.
The decision in (5) on whether the input x(n) contains a feature ${w}_{k}(n)$ is based on the maximum value of the output of the bank of matched filters and will not be affected if the negative values in the output in (4) are not considered, that is, upon the application of the mapping \begin{align*}{o}_{k}{(}{n}{)} & = {\text{ReLU}}{\{}{y}_{k}{(}{n}{)\}} \\ & = {\text{ReLU}}{\{}{x}{(}{n}{)}\,{*}_{c}{w}_{k}{(}{n}{)\},} \\ &\qquad\qquad{k} = {1},\ldots,{K} \tag{6} \end{align*} where ReLU denotes the rectified linear unit, a common nonlinear activation function in CNNs, defined by \[{\text{ReLU}}{(}{x}{)} = \max{\{}{0},{x}{\}}{.} \tag{7} \]
The decision in (5) is also not affected by scaling the amplitudes of all kernels ${w}_{k}(n)$ by the same positive factor, even at each iteration. The same reasoning presented for the ReLU applies to the unipolar sigmoid (tanh and logistic) and unipolar binary nonlinearities.
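The filter-bank decision rule in (4)–(6) may be sketched as follows; the two equal-energy templates and the noise level are hypothetical and serve only to illustrate the decision (5).

```python
import numpy as np

def detect_feature(x, templates):
    """Index of the template w_k best matched to x, via (4)-(6) and decision (5)."""
    scores = []
    for w in templates:
        y = np.convolve(x, w[::-1], mode="valid")  # x(n) *_c w_k(n), as in (4)
        o = np.maximum(0.0, y)                     # ReLU, as in (6) and (7)
        scores.append(o.max())                     # A_k = max_n o_k(n)
    return int(np.argmax(scores))                  # decision rule (5)

rng = np.random.default_rng(1)
templates = [np.array([1.0, 1.0, -1.0, -1.0]),     # equal-energy templates
             np.array([1.0, -1.0, 1.0, -1.0])]
x = rng.normal(scale=0.2, size=32)
x[10:14] += templates[1]                           # hide the second template in noise
print(detect_feature(x, templates))                # expected: 1
```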
The maximum in (6) is calculated over all available samples of the signal x(n), which is computationally inefficient. This operation will not be compromised if the output domain is split into subdomains of P time instants since, after a local maximum is detected, we are not interested in the neighboring values of the matched filter output that correspond to the same feature. The presence of local features is then examined within these subdomains, which underpins the so-called max-pooling operation in CNNs and their efficiency over standard NNs. For more detail, see Supplement B (available at https://doi.org/10.1109/MSP.2022.3207304).
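A minimal sketch of such max-pooling over subdomains of $P$ output samples, assuming (hypothetically) that the output length is a multiple of $P$, is given below.

```python
import numpy as np

def max_pool(y, P):
    """Keep one maximum per subdomain of P consecutive output samples."""
    return y.reshape(-1, P).max(axis=1)

y = np.array([0.1, 0.9, 0.0, 0.3, 0.2, 0.8, 0.4, 0.0])
print(max_pool(y, P=2))   # [0.9 0.3 0.8 0.4]
```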
We have shown that the convolutional layer within a CNN, which consists of a set of convolutional filters, effectively operates as a bank of matched filters. Then, hierarchical features of varying degrees of complexity can be catered for by employing several convolutional layers in a CNN. In addition, to learn various aspects of the feature space, different convolution filters can be applied at different convolutional layers, each aiming to “zoom into” a different feature within the analyzed signal. Such flexibility makes convolutional networks suitable for robust and efficient analysis and the classification of signals and images.
In a special case when all convolution filter waveforms are symmetric, that is, ${w}_{k}{(}{n}{)} = {w}_{k}{(}{-}{n}{),}$ the convolution filter and the matched filter produce the same result: ${x}{(}{n}{)}\,{*}\,{w}_{k}{(}{n}{)} = {x}{(}{n}{)}\,{*}\,{w}_{k}{(}{-}{n}{)} = {x}{(}{n}{)}\,{*}_{c}{w}_{k}{(}{n}{).}$
Consider a graph signal that is observed at N vertices of a graph and is given by \[{x} = {[}{x}{(}{1}{),}{x}{(}{2}{),}\ldots,{x}{(}{N}{)]}^{T}{.} \tag{8} \]
To extend the matched filtering perspective of CNNs to GCNNs, we first revisit the notion of the system on a graph. More detail can be found in Supplement A (available at https://doi.org/10.1109/MSP.2022.3207304).
The standard finite impulse response system in the discrete time domain, with the coefficients $h_0, h_1, \ldots, h_{M-1}$, is given by \begin{align*} y(n) &= h_0 x(n) + h_1 x(n-1) + \cdots \\ &\quad + h_{M-1} x(n-M+1). \end{align*}
In a direct analogy, a system on a graph is defined using a graph shift operator $S$ to give [12], [13] \[ y = h_0 x + h_1 S x + \cdots + h_{M-1} S^{M-1} x. \tag{9} \]
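A minimal vertex-domain sketch of (9), assuming a small hypothetical graph and the normalized weight matrix $D^{-1/2}WD^{-1/2}$ as the shift operator $S$, could read as follows.

```python
import numpy as np

def graph_filter(S, x, h):
    """Apply y = h_0 x + h_1 S x + ... + h_{M-1} S^{M-1} x, as in (9)."""
    y = np.zeros_like(x, dtype=float)
    z = np.asarray(x, dtype=float)
    for h_m in h:
        y += h_m * z
        z = S @ z                      # one more application of the graph shift
    return y

# Toy usage on a hypothetical 4-vertex graph
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))        # D^{-1/2} W D^{-1/2}
x = np.array([1.0, 0.0, 0.0, 0.0])
print(graph_filter(S, x, h=[0.5, 0.3, 0.2]))
```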
This concept is central to our perspective on GCNNs and may assume different forms depending on the choice of the graph shift operator $\text{S}$ in (9). Commonly used first-order systems on a graph include the following:
The common choice of $w_k(0) = w_k(1)$ forces all learned kernels to perform low-pass filtering [see the supplementary material (available at https://doi.org/10.1109/MSP.2022.3207304)], leading to the problem of oversmoothing in the GCNN. Observe that this problem is readily rectified by simply not imposing the (unnecessary) constraint $w_k(0) = w_k(1)$, as in the matched-filtering interpretation in (11) and (12). Such an intuitive and elegant solution has been overlooked in many attempts to resolve the oversmoothing problem in GCNNs.
The spectral description of a system on a graph is obtained from the eigendecomposition of the graph shift operator ${S} = {U}\Lambda{U}^{{-}{1}},$ where U is the transformation matrix with the eigenvectors of S as its columns, and $\Lambda$ is a diagonal matrix of the eigenvalues of S. Upon premultiplying the vertex domain relation in (9) by the inverse transformation ${U}^{{-}{1}}$, we obtain \[{Y} = {h}_{0}{X} + {h}_{1}\Lambda{X} + \cdots + {h}_{{M}{-}{1}}{\Lambda}^{{M}{-}{1}}{X} \tag{15} \] where ${X} = {U}^{{-}{1}}{x}$ and ${Y} = {U}^{{-}{1}}{y}$ are the graph Fourier transforms (GFTs) of the graph signals x and y, with the corresponding inverse GFTs (IGFT) ${x} = {UX}$ and ${y} = {UY}$. Notice that the transfer function of this system is a diagonal matrix ${H}{(}\Lambda{),}$ defined by \begin{align*}{Y} & = {H}{(}\Lambda{)}{X} \\ & = {(}{h}_{0} + {h}_{1}\Lambda + \cdots + {h}_{{M}{-}{1}}{\Lambda}^{{M}{-}{1}}{)}{X}{.} \tag{16} \end{align*}
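Under the assumption of a symmetric shift operator (so that $U^{-1} = U^T$), the spectral relations (15) and (16) may be sketched as follows.

```python
import numpy as np

def spectral_filter(S, x, h):
    """Spectral form of the graph filter, (15)-(16), assuming a symmetric S."""
    lam, U = np.linalg.eigh(S)                     # S = U diag(lam) U^T
    X = U.T @ x                                    # GFT: X = U^{-1} x
    H = np.polynomial.polynomial.polyval(lam, h)   # H(lam) = h_0 + h_1 lam + ...
    Y = H * X                                      # Y = H(Lambda) X, as in (16)
    return U @ Y                                   # IGFT: y = U Y

# For a symmetric S this coincides with the vertex-domain sketch of (9) above:
# np.allclose(spectral_filter(S, x, h), graph_filter(S, x, h)) -> True
```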
The graph convolution operator follows directly from the elementwise form of the transfer function of the system on a graph in (16), given by $Y(k) = H(\lambda_k)X(k)$. Physically, graph convolution is the IGFT of $Y(k)$—that is, $y(n) = x(n) \star h(n) = \operatorname{IGFT}\{X(k)H(\lambda_k)\}$—which follows from classical analysis on regular domains [16].
To distinguish between classical convolution and graph convolution, we use slightly different symbols: * and $\star$, respectively. To distinguish between the GFT of a signal X(k) and the transfer function calculated based on the eigenvalues $H({\lambda}_{k})$, we use different notations (positions of index k) for these two functions. This topic is studied in detail in [13] and [17] as well as in Supplement A (available at https://doi.org/10.1109/MSP.2022.3207304).
Consider a graph signal $x(n)$ and assume that it is processed by a system on a graph whose transfer function is $G(\lambda_k)$, with $g(n) = \operatorname{IGFT}\{G(\lambda_k)\}$. The output of this system is, then, \begin{align*} y(n) &= x(n) \star g(n) = \operatorname{IGFT}\{X(k)G(\lambda_k)\} \\ &= \sum\limits_{k=1}^{N} X(k)G(\lambda_k)u_k(n) \end{align*} with $u_k(n)$ as the elements of the $k$th column (the $k$th eigenvector) of the eigenvector matrix $U$.
A (graph-)matched filter design problem consists of finding the transfer function $G(\lambda_k)$ of the system that maximizes the power of the output signal $y(n)$, for an input signal $x(n)$, at a vertex $n = n_0$. Then, the system output at $n_0$, $y(n_0)$, is \begin{align*} \left|y(n_0)\right|^2 &= \left|\sum\limits_{k=1}^{N} X(k)G(\lambda_k)u_k(n_0)\right|^2 \\ &\leq \sum\limits_{k=1}^{N}\left|X(k)\right|^2 \sum\limits_{k=1}^{N}\left|G(\lambda_k)u_k(n_0)\right|^2. \end{align*}
According to this Schwarz inequality, the maximum is achieved when the equality holds, which, in general, yields the graph matched filter (GMF) in the form [with $(\cdot)^{*}$ as the complex conjugate] given by \[ G(\lambda_k)u_k(n_0) = X^{*}(k) \tag{17} \] up to a possible scaling constant. In our context, we are not interested in the minimum value, which occurs for a negative scaling constant; that is, $G(\lambda_k)u_k(n_0) = -X^{*}(k)$. From (17), the maximum value of the output is given by \[ y(n_0) = \sum\limits_{k=1}^{N}\left|X(k)\right|^2 = E_x \] where $E_x$ is the input signal energy.
In classical analysis, where the GFT is complex-valued and corresponds to the standard discrete Fourier transform (DFT), it is well known that \begin{align*} G(\lambda_k) &= X^{*}(k)/u_k(n_0) \\ &= X^{*}(k)e^{-j2\pi n_0 k/N}/\sqrt{N}. \end{align*}
For $n_0 = 0$, the matched filter impulse response becomes \[ g(n) = \operatorname{IDFT}\{G(\lambda_k)\} = x^{*}(-n). \]
However, in the case of general graphs and graph signals, the vertex domain form of (17) is much more complicated due to the fact that $G({\lambda}_{k})$ is calculated based on the eigenvalues, while X(k) is the GFT of the signal [see Supplement A (available at https://doi.org/10.1109/MSP.2022.3207304)].
We now proceed to analyze more general forms of GMFs, starting with a special case when the input signal can be considered as a result of a diffusion process on a graph.
The intuition and implementation of the GMF can be significantly simplified if the considered graph signal $x(n)$ is a result of an $(M-1)$-step diffusion process, with the unit delta pulse at an arbitrary vertex $n_0$ as the initial signal $x_0$ in the diffusion.
Consider a graph signal arising from an $(M-1)$-step diffusion starting from a delta pulse $x_0$ at a vertex $n_0$; that is, $x_0(n) = \delta(n - n_0)$. By repeatedly applying the graph shift operator $(M-1)$ times, the resulting graph signal $x$ is given by \begin{align*} x &= a_0 x_0 + a_1 L_N x_0 + \cdots \\ &\quad + a_{M-1} L_N^{M-1} x_0 \tag{18} \end{align*} where $a_{\nu}, \nu = 0,1,\ldots,M-1$, are the diffusion constants, while the GFT of $x_0$ is defined as \[ X_0(k) = \sum\limits_{n=1}^{N} x_0(n)\,u_k(n) = u_k(n_0). \]
The GFT of the graph signal x follows from (18) and (15) as \[{X} = \left({{a}_{0} + {a}_{1}{\Lambda}_{N} + \cdots + {a}_{{M}{-}{1}}{\Lambda}_{N}^{{M}{-}{1}}}\right){X}_{0} \tag{19} \] or, in an elementwise form, \begin{align*}{X}{(}{k}{)} & = \left({{a}_{0} + {a}_{1}{\lambda}_{k} + \cdots + {a}_{{M}{-}{1}}{\lambda}_{k}^{{M}{-}{1}}}\right) \\ & \times{u}_{k}{(}{n}_{0}{).} \tag{20} \end{align*}
The spectral domain solution of (17) follows immediately as \[ G(\lambda_k) = a_0 + a_1\lambda_k + \cdots + a_{M-1}\lambda_k^{M-1} \tag{21} \] bearing in mind that $X(k)$ is real-valued for a symmetric shift operator like $L_N$ in (18). Although the vertex domain form of $g(n)$ in the graph convolution $y(n) = x(n) \star g(n)$ may be quite complex, the vertex domain implementation of such a matched filter is rather simple and follows from $Y(k) = G(\lambda_k)X(k)$ to yield \[ y = (a_0 + a_1 L_N + \cdots + a_{M-1}L_N^{M-1})\,x. \tag{22} \]
For ${x} = {x}_{0}$, we obtain the relation between the matched filter output and the initial delta pulse signal in the diffusion system.
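The following sketch, on a hypothetical five-vertex graph with hypothetical diffusion constants, numerically confirms that the matched filter output in (22), evaluated at the diffusion origin $n_0$, equals the feature energy $E_x$.

```python
import numpy as np

# Hypothetical 5-vertex graph and diffusion constants, for illustration only
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
d = W.sum(axis=1)
W_N = W / np.sqrt(np.outer(d, d))          # D^{-1/2} W D^{-1/2}
L_N = np.eye(5) - W_N                      # normalized graph Laplacian

a = [1.0, -0.8, 0.3]                       # diffusion constants a_0, a_1, a_2
n0 = 2                                     # diffusion origin (0-indexed vertex)
x0 = np.zeros(5); x0[n0] = 1.0             # delta pulse at vertex n_0

T = a[0] * np.eye(5) + a[1] * L_N + a[2] * (L_N @ L_N)
x = T @ x0                                 # diffused feature, as in (18)
y = T @ x                                  # GMF output, as in (22)

print(np.isclose(y[n0], x @ x))            # True: y(n_0) equals the energy E_x
```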
The vertex domain form of GMF represents the IGFT of the graph filter transfer function and is given by \[{g}{(}{n}{)} = \mathop{\sum}\limits_{{k} = {1}}^{N}{G}{(}{\lambda}_{k}{)}{u}_{k}{(}{n}{).} \tag{23} \]
To arrive at the vector-matrix form of this GMF, denote \[{g}_{0}{(}{n}{)} = \mathop{\sum}\limits_{{k} = {1}}^{N}{{u}_{k}}{(}{n}{).} \tag{24} \]
With $g_0$ as a vector whose elements are the sums of all eigenvector elements $u_k(n)$ at a given vertex $n$, we have \[ g_0 = U\mathbf{1}, \qquad \mathbf{1} = [1,1,\ldots,1]^T. \]
In general, the elements $g_0(n)$ are nonzero for all $n$, making $g_0$ quite different from $x_0(n)$ in (18). The matrix form of the vertex-domain matched filter then becomes \begin{align*} g &= U G(\Lambda)\mathbf{1} \\ &= U\left(a_0 + a_1\Lambda_N + \cdots + a_{M-1}\Lambda_N^{M-1}\right)\mathbf{1} \\ &= a_0 g_0 + a_1 L_N g_0 + \cdots + a_{M-1}L_N^{M-1}g_0. \tag{25} \end{align*}
Unlike in the standard time domain, where $x(n) = g(-n)$, the signal $x$ in (18) and the matched filter response $g$ in (25) differ significantly, despite having similar forms in the spectral domain [given by (17)]; this is due to the different forms of $x_0$ and $g_0$, defined, respectively, by $x_0(n) = \delta(n-n_0)$ and (24).
In classical analysis, with the adjacency matrix of a directed circular graph serving as the shift operator [13], we immediately arrive at the impulse response of the classical matched filter \begin{align*} g_0(n) &= \sum\limits_{k=1}^{N} u_k(n) = \sum\limits_{k=1}^{N} e^{j2\pi nk/N}/\sqrt{N} = \delta(n)\sqrt{N}, \\ \lambda_k &= e^{-j2\pi k/N}/\sqrt{N}, \quad\text{and} \\ g(n) &= a_0\delta(n) + a_1\delta(n-1) + \cdots + a_{M-1}\delta(n-M+1). \end{align*}
The calculation of the responses of the GMF is much more involved and will be elucidated through an intuitive example. We consider a simple undirected eight-vertex graph with unit edge weights, shown in Figure 2(a), and a unit pulse graph signal, shown in Figure 2(b). This graph is used as an irregular signal domain for the considered analysis [the exact values of the elements of the weight matrix $W$ for this graph are given by relation (10) in Supplement B (available at https://doi.org/10.1109/MSP.2022.3207304)].
Figure 2. An example of matched filtering on an undirected unweighted graph, with two input signals ${x}_{1}$ and ${x}_{2}$ (containing different features) observed in two different scenarios. (a) The considered graph topology. (b) The unit pulse graph signal ${x}_{0}{(}{n}{)} = {\delta}{(}{n}{-}{3}{)}$ at a vertex ${n} = {3}{.}$ (c) The input signal with the first feature ${x}_{1}$ is obtained by shifting the unit pulse ${x}_{0}{(}{n}{)} = {\delta}{(}{n}{-}{3}{)}$ from the vertex ${n} = {3}$ to its one-neighborhood, with the graph shift weight of three; that is, ${x}_{1} = {x}_{0} + {3}{W}_{N}{x}_{0},$ with ${W}_{N}$ given in (26). (d) The second input signal (second feature) is obtained by shifting the unit pulse ${x}_{0}{(}{n}{)} = {\delta}{(}{n}{-}{4}{)}$ to its one-neighborhood vertices, with the graph shift weight of ${-}{2}{.}{5}{;}$ that is, ${x}_{2} = {x}_{0}{-}{2}{.}{5}{W}_{N}{x}_{0}{.}$ The graph signals from (c) and (d) are, respectively, given on a linear vertex-index axis in (e) and (f). The impulse responses of the corresponding matched filters, ${g}_{1} = {4}{g}_{0}{-}{3}{L}_{N}{g}_{0}$ and ${g}_{2} = {-}{1.5\text{g}}_{0} + {2.5\text{L}}_{N}{g}_{0},$ are shown, respectively, in (g) and (h). The outputs of the GMFs, with the impulse response ${g}_{1}{(}{n}{)}$ corresponding to the feature ${x}_{1}{(}{n}{)}$ and ${g}_{2}{(}{n}{)}$ corresponding to ${x}_{2}{(}{n}{),}$ are given, respectively, in (i) and (j). Observe from (i) that, as desired, the matched filter ${g}_{1}$ correctly detected the presence of feature ${x}_{1}$ (black line), with the maximum output at ${n} = {3},$ while the matched filter ${g}_{2}$ failed to detect feature ${x}_{1},$ as it was not designed for this purpose (green line). In (j), an analogous scenario is depicted for the detection of feature ${x}_{2},$ with the matched filter ${g}_{2}$ performing a correct detection with the maximum output at ${n} = {4}{.}$ The maximum values of the outputs of the two matched filters (the energies of the corresponding input features) are marked by larger symbols and a dotted red line, while their respective values are denoted by ${E}_{{x}_{1}}$ and ${E}_{{x}_{2}}{.}$
To illustrate the operation of the matched filter on a graph, consider two signals (features) on the graph from Figure 2(a), which are created through diffusion, by graph shifting a delta pulse $x_0(n) = \delta(n - n_0)$ from a vertex $n_0$ to its one-neighborhood, as in the first-order system on a graph in (11).
Both features are generated based on the same $L_N = \mathbf{I} - W_N$, with $W_N$ given in (26). \begin{align*} W_N = \left[\begin{array}{cccccccc} 0 & 0.26 & \tfrac{1}{3} & 0 & 0 & 0 & 0 & \tfrac{1}{3} \\ 0.26 & 0 & 0.26 & 0.22 & \tfrac{1}{5} & 0 & 0 & 0.26 \\ \tfrac{1}{3} & 0.26 & 0 & 0.29 & 0 & 0 & 0 & 0 \\ 0 & 0.22 & 0.29 & 0 & 0.22 & 0.29 & 0 & 0 \\ 0 & \tfrac{1}{5} & 0 & 0.22 & 0 & 0.26 & 0.32 & 0.26 \\ 0 & 0 & 0 & 0.29 & 0.26 & 0 & 0.41 & 0 \\ 0 & 0 & 0 & 0 & 0.32 & 0.41 & 0 & 0 \\ \tfrac{1}{3} & 0.26 & 0 & 0 & 0.26 & 0 & 0 & 0 \end{array}\right]. \tag{26} \end{align*}
Therefore, the features ${x}_{1}$ and ${x}_{2}$ are generated by diffusion from a pulse located at different vertices and based on a positive weight +3 for ${x}_{1}$ and a negative weight –2.5 for ${x}_{2}$. Consequently, the shapes of ${x}_{1}$ and ${x}_{2}$ are different, as shown in Figure 2(c) and (d), with their vertex-index axis representations given in Figure 2(e) and (f).
Based on the general form of the vertex domain GMF in (25), the impulse responses of the GMFs ${g}_{1}$ and ${g}_{2}$ corresponding to the features ${x}_{1}$ and ${x}_{2}$ assume the form \[{g}_{1} = {4}{g}_{0}{-}{3}{L}_{N}{g}_{0}, \tag{27} \] \[{g}_{2} = {-}{1}{.}{5}{g}_{0} + {2}{.}{5}{L}_{N}{g}_{0} \tag{28} \] and are depicted in Figure 2(g) and (h), with ${g}_{0}{(}{n}{)} = {\Sigma}_{{k} = {1}}^{N}{u}_{k}{(}{n}{).}$
The respective matched filter responses (outputs) are next calculated using the vertex domain graph filters ${g}_{1}$ and ${g}_{2}$ to yield \begin{align*}{y}_{11} & = {4}{x}_{1}{-}{3}{L}_{N}{x}_{1}, \\ {y}_{12} & = {-}{1.5}{x}_{1} + {2.5}{L}_{N}{x}_{1}, \\ {y}_{21} & = {4}{x}_{2}{-}{3}{L}_{N}{x}_{2}, \\ {y}_{22} & = {-}{1.5}{x}_{2} + {2.5}{L}_{N}{x}_{2}{.} \tag{29} \end{align*}
For completeness, these results were verified through the corresponding spectral domain relations [see Supplement A (available at https://doi.org/10.1109/MSP.2022.3207304)]: \begin{align*} y_{ij}(n) &= x_i(n) \star g_j(n) \\ &= \operatorname{IGFT}\{\operatorname{GFT}\{x_i(n)\}\,\operatorname{GFT}\{g_j(n)\}\}. \tag{30} \end{align*}
While the vertex form in (29) and the spectral form in (30) produced identical results, the vertex domain relation is simpler for practical realization since it only requires local signal samples from the one-neighborhood of every vertex.
The outputs of the GMFs corresponding to the two features of interest are given in Figure 2(i) and (j), and, as desired, their maximum values are equal to the energies of the corresponding graph features ${x}_{1}$ and ${x}_{2}$ (the theoretical maximum output of a matched filter).
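The scenario of Figure 2 can be reproduced numerically; in the sketch below, the edge set of the eight-vertex graph is read off from the nonzero pattern of $W_N$ in (26) (an assumption on our part), and the article's vertices 1–8 correspond to Python indices 0–7.

```python
import numpy as np

# Edge list inferred from the nonzero pattern of W_N in (26)
edges = [(1, 2), (1, 3), (1, 8), (2, 3), (2, 4), (2, 5), (2, 8),
         (3, 4), (4, 5), (4, 6), (5, 6), (5, 7), (5, 8), (6, 7)]
N = 8
W = np.zeros((N, N))
for i, j in edges:
    W[i - 1, j - 1] = W[j - 1, i - 1] = 1.0   # unit edge weights
d = W.sum(axis=1)
W_N = W / np.sqrt(np.outer(d, d))             # normalized weight matrix, as in (26)
L_N = np.eye(N) - W_N

# Features from Figure 2(c) and (d): diffusion from vertices 3 and 4
x0_a = np.eye(N)[2]                           # delta pulse at vertex 3
x0_b = np.eye(N)[3]                           # delta pulse at vertex 4
x1 = x0_a + 3.0 * W_N @ x0_a                  # first feature
x2 = x0_b - 2.5 * W_N @ x0_b                  # second feature

# Matched filters (27) and (28), applied as vertex-domain filters (29)
y11 = 4.0 * x1 - 3.0 * L_N @ x1               # g_1 applied to x_1
y12 = -1.5 * x1 + 2.5 * L_N @ x1              # g_2 applied to x_1
y21 = 4.0 * x2 - 3.0 * L_N @ x2               # g_1 applied to x_2
y22 = -1.5 * x2 + 2.5 * L_N @ x2              # g_2 applied to x_2

print(np.argmax(y11) + 1, np.isclose(y11.max(), x1 @ x1))   # vertex 3, maximum = E_x1
print(np.argmax(y22) + 1, np.isclose(y22.max(), x2 @ x2))   # vertex 4, maximum = E_x2
```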
For enhanced intuition and a direct comparison with classical time-domain matched filters, a similar scenario is presented on an undirected and unweighted circular graph in Figure 3.
Figure 3. An undirected unweighted circular graph with two features, ${x}_{1} = {[}{0},{1},{1},{1},{0},{0},{0},{0}{]}{}^{T}$ and ${x}_{2} = {[}{0},{0},{-}{0}{.}{5},{1},{-}{0}{.}{5},{0},{0},{0}{]}{}^{T}{.}$ The feature ${x}_{1} = {x} + {2}{W}_{N}{x}$ is obtained by shifting the unit pulse ${x}{(}{n}{)} = {\delta}{(}{n}{-}{3}{)}$ to both the left and right, with the shift weight of two and ${W}_{N}$ as the adjacency (weight) matrix, which was normalized by the degree matrix whose diagonal elements are equal to two. The feature ${x}_{2} = {x}{-}{W}_{N}{x}$ is produced by shifting the unit pulse ${x}{(}{n}{)} = {\delta}{(}{n}{-}{4}{)}$ with the shift weight of ${-}{1}{.}$ Notice a direct correspondence with regularly sampled time-domain analysis in standard CNNs, as the GMF operating on a circular graph directly simplifies into a standard matched filter defined for the regular time-domain signal assumed by DFT.
We now proceed to shed new light on the key steps of the operation of GCNNs from a matched-filter perspective. For clarity, we assume that the weights of the convolution filters (forward propagation) are already initialized or calculated. In the following, we elaborate in detail upon the forward-propagation path in GCNNs, while the adaptive learning procedure, which includes back-propagation on graphs and graph coarsening, is covered in the supplementary materials (available at https://doi.org/10.1109/MSP.2022.3207304).
The graph input signal $x = [x(1),\ldots,x(N)]^T$ is observed at $N$ vertices of a graph. A common goal in GCNNs is to classify inputs into several nonoverlapping categories that correspond to specific features in data.
This operation employs a convolution filter of M elements that corresponds to a GMF in (25) and is also called the convolution kernel. If we are looking for K features in x, then K GMFs are required, with the elements of the kth GMF (graph convolutional layer) given by (see also the “Graph Filtering” section) \begin{align*}{w}_{k} & = {[}{w}_{k}{(}{0}{),}{w}_{k}{(}{1}{),}\ldots,{w}_{k}{(}{M}{-}{1}{)]}^{T}, \\ & \qquad\qquad{k} = {1},{2},\ldots,{K}{.}\end{align*}
A common choice in GCNNs is the first-order system in (10), with a kernel/filter with M = 2 coefficients. With the normalized weight matrix as a shift operator, the ${N}\times{1}$ output signals ${y}_{k}$ from the GMFs are \begin{align*}{y}_{k} & = {w}_{k}\star{x} = {w}_{k}{(}{0}{)}{x} \\ & + {w}_{k}{(}{1}{)}{\bf{D}}^{{-}{1}{/}{2}}{\bf{WD}}^{{-}{1}{/}{2}}{x} \\ & = {w}_{k}{(}{0}{)}{x} + {w}_{k}{(}{1}{)}{W}_{N}{x}{.} \tag{31} \end{align*}
Analogously, for M = 3, the matched filter channels within GCNNs employ a second-order system on a graph: \begin{align*}{y}_{k} & = {w}_{k}\star{x} = {w}_{k}{(}{0}{)}{x} \\ & + {w}_{k}{(}{1}{)}{W}_{N}{x} + {w}_{k}{(}{2}{)}{W}_{N}^{2}{x}{.} \tag{32} \end{align*}
With K convolution kernels in a graph convolutional layer, the total number of the output signal elements is ${K}\times{N}{.}$
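A compact sketch of such a graph convolutional layer, for an arbitrary kernel length $M$ and the normalized weight matrix $W_N$ as the shift operator, is given below; for $M = 2$ it reduces to (31) and for $M = 3$ to (32). The kernel weights here are placeholders for the values learned during training.

```python
import numpy as np

def graph_conv_layer(W_N, x, kernels):
    """K x N channel outputs y_k = sum_m w_k(m) W_N^m x, as in (31)-(32).

    kernels is a K x M array holding the weights w_k(m)."""
    K, M = kernels.shape
    shifted = [np.asarray(x, dtype=float)]
    for _ in range(M - 1):
        shifted.append(W_N @ shifted[-1])   # x, W_N x, W_N^2 x, ...
    return kernels @ np.stack(shifted)      # (K x M) times (M x N) -> K x N
```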
From (31), for the normalized graph Laplacian ${L}_{N}$ as the shift operator, the graph convolution filter becomes \[{y}_{k} = {w}_{k}\star{x} = {w}_{k}{(}{0}{)}{x} + {w}_{k}{(}{1}{)}{L}_{N}{x}{.} \tag{33} \]
The total number of convolution filter weights $w_k(m)$, $k = 1,2,\ldots,K$, $m = 0,1,\ldots,M-1$, in the graph convolutional layer is $M\times K$, with $M$ as the length of the graph convolution filter and $K$ as the number of convolution filters (features). Given that the size of the convolution filter is much smaller than the number of input signal samples, that is, $M \ll N$, GCNNs exhibit considerable computational advantages over FC GNNs, where the number of weights is $N\times K$.
The input–output relation of standard CNNs is a special case of that for GCNNs for circular undirected and unweighted graphs, as shown in Figure 3. This immediately follows from the elementwise form of the graph convolution filter in (31) [3]: \begin{align*}{y}_{k}{(}{n}{)} & = {w}_{k}{(}{0}{)}{x}{(}{n}{)} \\ & + \frac{1}{2}{w}_{k}{(}{1}{)}\left[{{x}{(}{n} + {1}{)} + {x}{(}{n}{-}{1}{)}}\right]\end{align*} where the factor 1/2 arises due to the degree matrix and can be absorbed into the weights ${w}_{k}{(}{\nu}{)}{.}$ This form is symmetric since the graph in Figure 3 is undirected. An asymmetric form may be obtained by using the adjacency matrix A of a directed unweighted circular graph instead of the normalized weight matrix ${W}_{N}$ to yield the standard CNN convolution \begin{align*}{y}_{k}{(}{n}{)} & = {w}_{k}{(}{0}{)}{x}{(}{n}{)} + {w}_{k}{(}{1}{)}{x}{(}{n} + {1}{)} \\ & + {w}_{k}{(}{2}{)}{x}{(}{n} + {2}{)}{.}\end{align*}
As in standard NN layers, a constant bias term ${b}_{k}$ may be added at every graph convolutional layer to yield \[{y}_{k} = {w}_{k}\star{x} + {b}_{k}\] at the cost of one more coefficient per convolution.
The convolution is a linear operation; however, signals and images largely exhibit a nonlinear nature. This is catered for by applying a nonlinearity at the output of convolutional layers. The most common nonlinear activation function in CNNs and GCNNs is ReLU, defined by \[{f}{(}{y}{)} = \max\left\{{{0},{y}}\right\}{.} \tag{34} \]
Within GCNNs, ReLU exhibits several advantages over sigmoidal-type activation functions: 1) it does not saturate for positive inputs, producing a nonzero gradient for large input values; 2) it is computationally cheap to implement; and 3) in practice, ReLU-based NNs converge faster during the learning process than those employing saturation-type nonlinearities (logistic and tanh). Moreover, the use of ReLU facilitates further computational advantages through “sparsification by deactivation,” as, from (34), neurons with negative output values are omitted. Recall, as shown in Remark 3, that ReLU can be naturally considered within our matched-filtering perspective of CNNs and GCNNs.
The output of the graph convolutional layer, after the activation function and with the bias term, now becomes \[{f}{(}{y}_{k}{)} = {f}{(}{w}_{k}\star{x} + {b}_{k}{)}{.}\]
Notice that a problem arises for many consecutive negative inputs to a neuron, whereby the corresponding zero-output of ReLU means that such a neuron will be left without its weight update (dying ReLU). This can be avoided through the leaky ReLU function, where the negative inputs to ReLU are scaled by small factors, for example, ${f}{(}{y}_{k}{(}{n}{))} = {0}{.}{01}{y}_{k}{(}{n}{),}$ for ${y}_{k}{(}{n}{)}{<}{0}{.}$
As stated in Remark 4, to further reduce computational complexity, the output signals at GCNN layers are typically downsampled—the so-called pooling operation: \[{o}_{k} = {F}{(}{f}{(}{w}_{k}\star{x} + {b}_{k}{))}{.} \tag{35} \]
The max-pooling in GCNNs considers only the maximum output of the feature filters, thus considerably reducing the number of parameters, and is closely related to graph downscaling [5]. One approach to graph downscaling, termed graph coarsening, is given in Supplement B (available at https://doi.org/10.1109/MSP.2022.3207304).
Assume that for every channel k, the number of remaining vertices (signal samples) after the ReLU and max-pooling operations is ${N'}{<}{N}{.}$ Then, the K output vectors, ${o}_{k}$ in (35), can be concatenated to form the overall output vector ${o}_{F},$ of which the elements are given by \[{o}_{F}{(}{m}{)} = {o}_{F}{((}{k}{-}{1}{)}{N'} + {n}{)} = {o}_{k}{(}{n}{)}\] with ${k} = {1},{2},\ldots,{K},$ ${n} = {1},{2},\ldots,{N'}{.}$ Therefore, with no pooling, the flattened output ${o}_{F}$ would be of size KN, while, with max-pooling with a factor of P, its size reduces to ${K}{N'} = {KN}{/}{P}{.}$
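The flattening step itself is a simple reshaping of the $K$ pooled channel outputs, as in this toy sketch with hypothetical values ($K = 2$, $N' = 3$).

```python
import numpy as np

o = np.array([[0.9, 0.3, 0.7],      # o_1(n), n = 1, ..., N'
              [0.8, 0.4, 0.1]])     # o_2(n)
o_F = o.reshape(-1)                 # o_F((k-1)N' + n) = o_k(n), length K N'
print(o_F)                          # [0.9 0.3 0.7 0.8 0.4 0.1]
```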
To identify quite complex patterns, like hierarchically composed features, the convolution step can be repeated one or more times before producing the overall output of the convolutional layer, prior to flattening. Each repeated convolution step may employ a different set of filter functions (features) and may or may not use the nonlinear activation and pooling steps—the so-called convolution–activation–pooling chain.
The flattened output of the convolutional layer is typically fed into an FC NN layer with, e.g., a standard multilayer structure, followed by the output layer. Figure 4 illustrates the operation of a GCNN with one graph convolutional layer, one FC layer, and two (SoftMax) neurons at the output layer.
Figure 4. The principle of operation of GCNNs, based on a GCNN with one graph convolutional layer, one FC layer, and two (SoftMax) neurons at the output layer: The network is presented with noisy versions of either feature1 or feature2 from Figure 2(c) and (d). Shown at the top is feature2.
This model is also used in Example 3, which provides an in-depth illustration of the operation of the forward path in GCNNs.
Without loss of generality, the operation of GCNNs is elucidated over a two-layer network with one graph convolutional layer and one FC layer, given in Figure 4. The input signal $\mathbf{x}$ with $N = 8$ samples represents a noisy version of either the first or the second feature from Figure 2, that is, \begin{align*} \mathbf{x} &= \text{feature}_1 + \boldsymbol{\varepsilon} \\ &= x_0 - \mathbf{D}^{-1/2}W\mathbf{D}^{-1/2}x_0 + \boldsymbol{\varepsilon} \tag{36} \end{align*} or \begin{align*} \mathbf{x} &= \text{feature}_2 + \boldsymbol{\varepsilon} \\ &= x_0 + \mathbf{D}^{-1/2}W\mathbf{D}^{-1/2}x_0 + \boldsymbol{\varepsilon} \tag{37} \end{align*} with $\boldsymbol{\varepsilon}$ as noise, while the graph signal $x_0$ was a pulse $x_0(n) = \delta(n - n_0)$ at a random vertex $n_0$. These features are similar to the graph signals in Figure 2(c) and (d), with the weights $W$ given in (26). The target signal was $t_1 = [1,0]^T$ for feature 1 and $t_2 = [0,1]^T$ for feature 2.
The task was to confirm the presence of either of these two features in a noisy input, using graph convolution filters in (31) of length $M = 2$, with $K = 2$ channels (one per feature) at the graph convolutional layer, followed by the ReLU nonlinearity in (7). The FC layer employed the SoftMax output, with target values that correspond to the two patterns in the target signal $t$. For more detail, see Supplement C (available at https://doi.org/10.1109/MSP.2022.3207304).
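A sketch of this forward path is given below; the convolution, bias, and FC weights are random placeholders standing in for the trained values, and $W_N$ is assumed to be the $8\times 8$ normalized weight matrix in (26) (e.g., built as in the earlier sketch reproducing Figure 2).

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, S = 8, 2, 2                            # vertices, conv kernels, output classes

w = rng.normal(size=(K, 2))                  # conv weights w_k(0), w_k(1) (placeholders)
b = rng.normal(size=K)                       # conv bias terms b_k (placeholders)
V = rng.normal(size=(S, K * N))              # FC-layer weights, 2 x 16 (placeholders)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gcnn_forward(W_N, x):
    """Forward path: graph convolution (31) with bias, ReLU (7), flatten, SoftMax."""
    y = np.stack([w[k, 0] * x + w[k, 1] * (W_N @ x) + b[k] for k in range(K)])
    o = np.maximum(0.0, y)                   # ReLU activation
    return softmax(V @ o.reshape(-1))        # output "probabilities" [P_1, P_2]

# Example input, as in (37): noisy feature_2 centred at a random vertex n_0,
# with W_N taken as the normalized weight matrix of the eight-vertex graph:
# n0 = rng.integers(N); x0 = np.eye(N)[n0]
# x = x0 + W_N @ x0 + 0.1 * rng.normal(size=N)
# print(gcnn_forward(W_N, x))
```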
Training was performed based on 200 random realizations of the input signal $\mathbf{x}$, which, for each realization, randomly assumed either feature 1 or feature 2 and a random central vertex $n_0$. This cycle of 200 realizations is called an epoch. The GCNN in Figure 4 was trained over five epochs; that is, the same set of 200 random realizations was presented five times, and no max-pooling was employed. Figure 5 illustrates the evolution of the GCNN output and the weights of the output FC layer along the training process.
Figure 5. The operation of the GCNN from Figure 4, with one graph convolutional layer, ${K} = {2}$ convolution kernels, one FC layer, and two neurons at the output SoftMax layer, to yield ${(}{NK}{)}\times{2} = {16}\times{2} = {32}$ weights. The task was to identify the two features from Example 1, also given in Figure 2(c) and (d). (a) The output probabilities of the GCNN for the task of identification of the presence of a correct feature (either feature1 or feature2) are denoted by a black “+” if the corresponding SoftMax output should be equal to one and by a green “${\cdot}$” if the SoftMax output should be zero. (b) The evolution of the 32 weights in the FC layer along the training process. (c) The test results over 100 random realizations of $\text{x},$ with “+” and “${\cdot}$” as described. Observe an almost perfect identification of the two patterns in $\text{x}.$
The network output represents the “probabilities” of the presence of the two features in the noisy input, denoted, respectively, by $P_1$ and $P_2$, which are given in Figure 5(a) and were obtained from the two SoftMax outputs of the GCNN from Figure 4. Therefore, when feature 1 is present in the noisy input $\mathbf{x}$, the target signal is $t_1 = [1,0]^T$, and the output of the SoftMax layer in an ideal case should be close to $P_1 = 1$ and $P_2 = 0$. Regarding the presence of feature 2 in $\mathbf{x}$, the corresponding target signal is $t_2 = [0,1]^T$, and the SoftMax outputs should ideally approach $P_1 = 0$ and $P_2 = 1$. The evolution of the SoftMax output during training is illustrated in Figure 5(a), where, for a consistent treatment of the overall accuracy, the SoftMax outputs that should equal one are marked by a black “+” and those that should equal zero by a green “·”.
In this way, a perfectly trained GCNN will have all black “+” marks at one and all green “${\cdot}$” marks at zero. Note that, for the considered scenario, ${P}_{1} + {P}_{2} = {1}$ always holds for the SoftMax outputs.
The graph back-propagation approach to the weight update in GCNNs is elaborated in detail in Supplement C (available at https://doi.org/10.1109/MSP.2022.3207304).
The output convergence is depicted in Figure 5(a). Observe the success of training, as the network outputs initially assumed indecisive values around 0.5, and, subsequently (over iterations with random pulse positions and feature choice), the black “+” gradually approached the correct value of one and the green “${\cdot}$” the correct value of zero.
The convergence of the $K\times N\times S = 32$ weights in the FC layer, depicted in Figure 5(b), reflects the overall convergence of the GCNN. An almost flat weight evolution after the initial 200 random realizations of $\mathbf{x}$ indicates successful training.
In the testing stage, the success of GCNN training was evaluated over 100 new random realizations of the noisy input $\text{x}.$ Figure 5(c) indicates the correct and highly reliable GCNN decisions, as, over all 100 new random inputs presented to the network, the black “+” and green “${\cdot}$” were at (or very close to) their correct positions (successful training).
All of the steps and calculations in the training process are illustrated in a step-by-step fashion in Supplement D (available at https://doi.org/10.1109/MSP.2022.3207304).
GCNNs have been developed as extensions of standard CNNs with the aim to cater for irregular domains represented by connected graphs. Despite facilitating the transition from NNs to GNNs, such an approach harbors intrinsic disadvantages—as much of the effort has revolved around the issues surrounding domain adaptation rather than resorting to first principles. To this end, we have revisited GCNNs starting from the notion of a system on a graph, which has served to establish a matched-filtering interpretation of the whole convolution–activation–pooling chain within GCNNs while inheriting the rigor and intuition from signal detection theory. Such an approach is shown to be quite general and yields both standard CNNs and FC NNs as special cases. It is our hope that, by revisiting the underpinning principles of GCNNs through the lens of matched filtering, we have helped shed new light onto the otherwise black-box approach to GCNNs, together with demystifying GCNNs to practitioners and for educational purposes. We also hope that the resulting well-motivated and physically meaningful interpretation at every step of the operation and adaptation of GCNNs will help establish a common language between the diverse communities working on data analytics on graphs.
This article has supplementary downloadable material available at https://doi.org/10.1109/MSP.2022.3207304, which contains more details of the key underlying concepts, together with a step-by-step elaboration of back-propagation training of GCNNs. Contact ljubisa@ucg.ac.me for further questions about this work.
Ljubiša Stanković (ljubisa@ac.me) is a professor at the University of Montenegro, Podgorica 81000, Montenegro. He was an Alexander von Humboldt fellow at Ruhr University Bochum, Germany, from 1997 to 1999, the rector of the University of Montenegro from 2003 to 2008, and a visiting academic at Imperial College London, U.K., from 2012 to 2013. He has published almost 200 journal papers and is a member of the National Academy of Science and Arts and of Academia Europaea. He won the Best Paper Award from EURASIP in 2017 and the IEEE Signal Processing Magazine Best Column Award for 2020. He is a Fellow of IEEE.
Danilo P. Mandic (d.mandic@imperial.ac.uk) received his Ph.D. degree in machine intelligence from Imperial College London in 1999. He is a professor in the Department of Electrical and Electronic Engineering, Imperial College London, SW7 2AZ London, U.K. He was the recipient of the Dennis Gabor Award for Outstanding Contributions in Neural Engineering by the International Neural Networks Society in 2019, the 2018 IEEE Signal Processing Magazine Best Paper Award, the ICASSP 2021 Outstanding Paper Award, and the President’s Award for Excellence in Postgraduate Supervision at Imperial College in 2014. He is a Fellow of IEEE.
[1] C.-C. J. Kuo, “Understanding convolutional neural networks with a mathematical model,” J. Vis. Commun. Image Represent., vol. 41, pp. 406–413, Nov. 2016, doi: 10.1016/j.jvcir.2016.11.003.
[2] S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D. J. Inman, “1D convolutional neural networks and applications: A survey,” Mech. Syst. Signal Process., vol. 151, Apr. 2021, Art. no. 107398, doi: 10.1016/j.ymssp.2020.107398.
[3] L. Stankovic and D. Mandic, “Convolutional neural networks demystified: A matched filtering perspective based tutorial,” 2021, arXiv:2108.11663.
[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond Euclidean data,” IEEE Signal Process. Mag., vol. 34, no. 4, pp. 18–42, Jul. 2017, doi: 10.1109/MSP.2017.2693418.
[5] L. Stanković et al., “Data analytics on graphs part III: Machine learning on graphs, from graph topology to applications,” Found. Trends® Mach. Learn., vol. 13, no. 4, pp. 332–530, Dec. 2020, doi: 10.1561/2200000078-3.
[6] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Trans. Neural Netw., vol. 20, no. 1, pp. 61–80, Jan. 2009, doi: 10.1109/TNN.2008.2005605.
[7] A. Micheli, “Neural network for graphs: A contextual constructive approach,” IEEE Trans. Neural Netw., vol. 20, no. 3, pp. 498–511, Mar. 2009, doi: 10.1109/TNN.2008.2010350.
[8] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in Proc. 33rd Int. Conf. Mach. Learn. (ICML), 2016, pp. 2014–2023.
[9] F. Gama, A. G. Marques, G. Leus, and A. Ribeiro, “Convolutional neural network architectures for signals supported on graphs,” IEEE Trans. Signal Process., vol. 67, no. 4, pp. 1034–1049, Feb. 2019, doi: 10.1109/TSP.2018.2887403.
[10] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, and M. Sun, “Graph neural networks: A review of methods and applications,” 2018, arXiv:1812.08434.
[11] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 1, pp. 4–24, Jan. 2021, doi: 10.1109/TNNLS.2020.2978386.
[12] L. Stankovic, D. P. Mandic, M. Dakovic, I. Kisil, E. Sejdic, and A. G. Constantinides, “Understanding the basis of graph signal processing via an intuitive example-driven approach [Lecture Notes],” IEEE Signal Process. Mag., vol. 36, no. 6, pp. 133–145, Nov. 2019, doi: 10.1109/MSP.2019.2929832.
[13] L. Stanković et al., “Data analytics on graphs part II: Signals on graphs,” Found. Trends® Mach. Learn., vol. 13, nos. 2–3, pp. 158–331, Dec. 2020, doi: 10.1561/2200000078-2.
[14] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proc. 2005 IEEE Int. Joint Conf. Neural Netw., vol. 2, pp. 729–734, doi: 10.1109/IJCNN.2005.1555942.
[15] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” 2016, arXiv:1609.02907.
[16] F. Gama, E. Isufi, G. Leus, and A. Ribeiro, “Graphs, convolutions, and neural networks: From graph filters to graph neural networks,” IEEE Signal Process. Mag., vol. 37, no. 6, pp. 128–138, Nov. 2020, doi: 10.1109/MSP.2020.3016143.
[17] A. Sandryhaila and J. M. Moura, “Discrete signal processing on graphs,” IEEE Trans. Signal Process., vol. 61, no. 7, pp. 1644–1656, Apr. 2013, doi: 10.1109/TSP.2013.2238935.
Digital Object Identifier 10.1109/MSP.2022.3207304