Radial Basis Function Network (RBF Network)

In the Wikipedia article on radial basis function networks, I didn't understand what was meant by the "center vector for neuron i", in other words the "center of the RBF unit", also called the prototype.


In RBF networks each neuron in the hidden layer applies a computation that is related to its "center vector".

First consider the set of input neurons as a vector $\mathbf{x}$, and note that each hidden layer neuron receives the complete input vector as its own input.

Second, each hidden layer neuron $i$ is parametrised through a vector (the center vector) $\mathbf{c_i}$ of the same dimension as $\mathbf{x}$.

The computation of each hidden layer neuron $i$ consists of:

  1. Evaluate the distance (according to a metric, which may or may not be Euclidean) between the input $\mathbf{x}$ and the center vector $\mathbf{c_i}$. When the input is equal to the center vector, the output for that neuron will be maximal (see 2).

  2. Evaluate a Gaussian function that decays with increasing distance between the vectors. That's the output of each hidden layer neuron. In the following equation, $\beta$ is a parameter that specifies the decay rate of the Gaussian (it plays the role of an inverse variance: the larger $\beta$, the narrower the bump).

\begin{equation} \rho_i(\mathbf{x}) = \exp\left[-\beta \, \Vert \mathbf{x} - \mathbf{c_i} \Vert^2\right] \end{equation}

Now combine the outputs of all hidden layer neurons linearly to obtain the total neural network output. For that, you need an additional set of parameters or weights $a_i$ (one $a_i$ for each hidden layer neuron; the $a_i$ values are scalars while each $\mathbf{c_i}$ is a vector).

The final equation, which yields the total network output (a scalar value), is:

\begin{equation} \varphi(\mathbf{x}) = \sum_{i=1}^N a_i \, \rho(\Vert\mathbf{x}-\mathbf{c}_i\Vert) \end{equation}

And that's it.
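For concreteness, here is a minimal NumPy sketch of the forward pass described above. It is only an illustration: the center vectors, the decay rate beta, and the weights a_i are made-up values, not parameters of any particular trained network.

    import numpy as np

    def rbf_network_output(x, centers, beta, a):
        """Compute varphi(x) = sum_i a_i * exp(-beta * ||x - c_i||^2)."""
        sq_dists = np.sum((centers - x) ** 2, axis=1)  # squared distance to every center
        rho = np.exp(-beta * sq_dists)                 # Gaussian activation of each hidden neuron
        return a @ rho                                 # linear combination of the activations

    # Example with 3 hidden neurons and 2-dimensional inputs (made-up numbers)
    centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
    a = np.array([0.5, -0.3, 0.8])
    print(rbf_network_output(np.array([1.0, 0.5]), centers, beta=1.0, a=a))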


Radial Basis Function Networks

A radial basis function (RBF) is a real-valued function that assigns a value to each input from its domain, and the value it produces depends only on the distance of the input from a fixed center, so, as a measure of distance, it cannot be negative.

Euclidean distance, the straight-line distance between two points in Euclidean space, is typically used.

Radial basis functions are used to approximate functions, much as neural networks act as function approximators. The sum

    f(x) = Σ_{i=1}^{N} w_i φ(||x − c_i||)

(a weighted combination of radial basis functions centered at the points c_i, the same form as the equation in the previous section) represents a radial basis function network. The radial basis functions act as activation functions.

The approximant f(x) is differentiable with respect to the weights w_i, which are learned using iterative update methods common among neural networks.



Radial Basis Function Network (RBFN) Tutorial

A Radial Basis Function Network (RBFN) is a particular type of neural network. In this article, I’ll be describing its use as a non-linear classifier.

Generally, when people talk about neural networks or “Artificial Neural Networks” they are referring to the Multilayer Perceptron (MLP). Each neuron in an MLP takes the weighted sum of its input values. That is, each input value is multiplied by a coefficient, and the results are all summed together. A single MLP neuron is a simple linear classifier, but complex non-linear classifiers can be built by combining these neurons into a network.

To me, the RBFN approach is more intuitive than the MLP. An RBFN performs classification by measuring the input’s similarity to examples from the training set. Each RBFN neuron stores a “prototype”, which is just one of the examples from the training set. When we want to classify a new input, each neuron computes the Euclidean distance between the input and its prototype. Roughly speaking, if the input more closely resembles the class A prototypes than the class B prototypes, it is classified as class A.

RBF Network Architecture

The above illustration shows the typical architecture of an RBF Network. It consists of an input vector, a layer of RBF neurons, and an output layer with one node per category or class of data.

The Input Vector

The input vector is the n-dimensional vector that you are trying to classify. The entire input vector is shown to each of the RBF neurons.

The RBF Neurons

Each RBF neuron stores a “prototype” vector which is just one of the vectors from the training set. Each RBF neuron compares the input vector to its prototype, and outputs a value between 0 and 1 which is a measure of similarity. If the input is equal to the prototype, then the output of that RBF neuron will be 1. As the distance between the input and prototype grows, the response falls off exponentially towards 0. The shape of the RBF neuron’s response is a bell curve, as illustrated in the network architecture diagram.

The neuron’s response value is also called its “activation” value.

The prototype vector is also often called the neuron’s “center”, since it’s the value at the center of the bell curve.

The Output Nodes

The output of the network consists of a set of nodes, one per category that we are trying to classify. Each output node computes a sort of score for the associated category. Typically, a classification decision is made by assigning the input to the category with the highest score.

The score is computed by taking a weighted sum of the activation values from every RBF neuron. By weighted sum we mean that an output node associates a weight value with each of the RBF neurons, and multiplies the neuron’s activation by this weight before adding it to the total response.

Because each output node is computing the score for a different category, every output node has its own set of weights. The output node will typically give a positive weight to the RBF neurons that belong to its category, and a negative weight to the others.
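As a small sketch of this scoring step (the variable names are my own, not from the article), the per-category decision could look like:

    import numpy as np

    def classify(activations, output_weights):
        """activations: vector of RBF neuron outputs for one input.
        output_weights: one row of weights per category (learned during training)."""
        scores = output_weights @ activations  # weighted sum of activations per category
        return int(np.argmax(scores)), scores  # choose the category with the highest score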

RBF Neuron Activation Function

Each RBF neuron computes a measure of the similarity between the input and its prototype vector (taken from the training set). Input vectors which are more similar to the prototype return a result closer to 1. There are different possible choices of similarity functions, but the most popular is based on the Gaussian. The equation for a Gaussian with a one-dimensional input is

    f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

where x is the input, mu is the mean, and sigma is the standard deviation. This produces the familiar bell curve centered at the mean mu (in the article's plot, the mean is 5 and sigma is 1).

The RBF neuron activation function is slightly different, and is typically written as

    phi(x) = exp(-beta * ||x - mu||^2)

In the Gaussian distribution, mu refers to the mean of the distribution. Here, mu is the prototype vector, which sits at the center of the bell curve.

For the activation function, phi, we aren’t directly interested in the value of the standard deviation, sigma, so we make a couple simplifying modifications.

The first change is that we’ve removed the outer coefficient, 1 / (sigma * sqrt(2 * pi)). This term normally controls the height of the Gaussian. Here, though, it is redundant with the weights applied by the output nodes. During training, the output nodes will learn the correct coefficient or “weight” to apply to the neuron’s response.

The second change is that we’ve replaced the inner coefficient, 1 / (2 * sigma^2), with a single parameter ‘beta’. This beta coefficient controls the width of the bell curve. Again, in this context, we don’t care about the value of sigma, we just care that there’s some coefficient which is controlling the width of the bell curve. So we simplify the equation by replacing the term with a single variable.
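A one-line NumPy version of this activation (a sketch, with x and mu as arrays and beta a positive scalar) makes the simplification concrete:

    import numpy as np

    def rbf_activation(x, mu, beta):
        """Gaussian RBF activation: exp(-beta * ||x - mu||^2)."""
        diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
        return np.exp(-beta * np.dot(diff, diff))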

RBF Neuron activation for different values of beta

There is also a slight change in notation here when we apply the equation to n-dimensional vectors. The double bar notation in the activation equation indicates that we are taking the Euclidean distance between x and mu, and squaring the result. For the 1-dimensional Gaussian, this simplifies to just (x - mu)^2.

It’s important to note that the underlying metric here for evaluating the similarity between an input vector and a prototype is the Euclidean distance between the two vectors.

Also, each RBF neuron will produce its largest response when the input is equal to the prototype vector. This lets us treat the activation as a measure of similarity and sum the results from all of the RBF neurons.

As we move out from the prototype vector, the response falls off exponentially. Recall from the RBFN architecture illustration that the output node for each category takes the weighted sum of every RBF neuron in the network–in other words, every neuron in the network will have some influence over the classification decision. The exponential fall off of the activation function, however, means that the neurons whose prototypes are far from the input vector will actually contribute very little to the result.

If you are interested in gaining a deeper understanding of how the Gaussian equation produces this bell curve shape, check out my post on the Gaussian Kernel.

Example Dataset

Before going into the details on training an RBFN, let’s look at a fully trained example.

In the below dataset, we have two dimensional data points which belong to one of two classes, indicated by the blue x’s and red circles. I’ve trained an RBF Network with 20 RBF neurons on this data set. The prototypes selected are marked by black asterisks.

We can also visualize the category 1 (red circle) score over the input space. We could do this with a 3D mesh, or a contour plot like the one below. The contour plot is like a topographical map.

The areas where the category 1 score is highest are colored dark red, and the areas where the score is lowest are dark blue. The values range from -0.2 to 1.38.

I’ve included the positions of the prototypes again as black asterisks. You can see how the hills in the output values are centered around these prototypes.

It’s also interesting to look at the weights used by output nodes to remove some of the mystery.

For the category 1 output node, all of the weights for the category 2 RBF neurons are negative (in this example: -0.79934, -1.26054, -0.68206, -0.68042, -0.65370, -0.63270, -0.65949, -0.83266, -0.82232, -0.64140), and all of the weights for the category 1 RBF neurons are positive (0.78968, 0.64239, 0.61945, 0.44939, 0.83147, 0.61682, 0.49100, 0.57227, 0.68786, 0.84207).

Finally, we can plot an approximation of the decision boundary (the line where the category 1 and category 2 scores are equal).

To plot the decision boundary, I’ve computed the scores over a finite grid. As a result, the decision boundary is jagged. I believe the true decision boundary would be smoother.

Training The RBFN

The training process for an RBFN consists of selecting three sets of parameters: the prototypes (mu) and beta coefficient for each of the RBF neurons, and the matrix of output weights between the RBF neurons and the output nodes.

There are many possible approaches to selecting the prototypes and their variances. The following paper provides an overview of common approaches to training RBFNs. I read through it to familiarize myself with some of the details of RBF training, and chose specific approaches from it that made the most sense to me.

It seems like there’s pretty much no “wrong” way to select the prototypes for the RBF neurons. In fact, two possible approaches are to create an RBF neuron for every training example, or to just randomly select k prototypes from the training data. The reason the requirements are so loose is that, given enough RBF neurons, an RBFN can define any arbitrarily complex decision boundary. In other words, you can always improve its accuracy by using more RBF neurons.

What it really comes down to is a question of efficiency–more RBF neurons means more compute time, so it’s ideal if we can achieve good accuracy using as few RBF neurons as possible.

One of the approaches for making an intelligent selection of prototypes is to perform k-Means clustering on your training set and to use the cluster centers as the prototypes. I won’t describe k-Means clustering in detail here, but it’s a fairly straightforward algorithm that you can find good tutorials for.

When applying k-means, we first want to separate the training examples by category–we don’t want the clusters to include data points from multiple classes.

Here again is the example data set with the selected prototypes. I ran k-means clustering with a k of 10 twice, once for the first class, and again for the second class, giving me a total of 20 clusters. Again, the cluster centers are marked with a black asterisk ‘*’.

I’ve been claiming that the prototypes are just examples from the training set–here you can see that’s not technically true. The cluster centers are computed as the average of all of the points in the cluster.

How many clusters to pick per class has to be determined “heuristically”. Higher values of k mean more prototypes, which enables a more complex decision boundary but also means more computations to evaluate the network.
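A short sketch of this per-class prototype selection, using scikit-learn's KMeans (the k of 10 per class follows the example above; X and y are assumed to be NumPy arrays of training inputs and labels, and the function name is my own):

    import numpy as np
    from sklearn.cluster import KMeans

    def select_prototypes(X, y, k_per_class=10):
        """Run k-means separately within each class and stack the cluster centers."""
        prototypes = []
        for label in np.unique(y):
            km = KMeans(n_clusters=k_per_class, n_init=10).fit(X[y == label])
            prototypes.append(km.cluster_centers_)
        return np.vstack(prototypes)  # one prototype (center) per RBF neuron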

Selecting Beta Values

If you use k-means clustering to select your prototypes, then one simple method for specifying the beta coefficients is to set sigma equal to the average distance between all points in the cluster and the cluster center:

    sigma = (1 / m) * Σ_{i=1}^{m} ||x_i - mu||

Here, mu is the cluster centroid, m is the number of training samples belonging to this cluster, and x_i is the ith training sample in the cluster.

Once we have the sigma value for the cluster, we compute beta as beta = 1 / (2 * sigma^2).
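Putting the two steps together in a short sketch (cluster_points is assumed to be the array of training samples assigned to one cluster):

    import numpy as np

    def beta_for_cluster(cluster_points, centroid):
        """sigma = average distance from the cluster's points to its centroid;
        beta = 1 / (2 * sigma^2)."""
        sigma = np.mean(np.linalg.norm(cluster_points - centroid, axis=1))
        return 1.0 / (2.0 * sigma ** 2)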

Output Weights

The final set of parameters to train are the output weights. These can be trained using gradient descent (also known as least mean squares).

First, for every data point in your training set, compute the activation values of the RBF neurons. These activation values become the training inputs to gradient descent.

The linear equation needs a bias term, so we always add a fixed value of ‘1’ to the beginning of the vector of activation values.

Gradient descent must be run separately for each output node (that is, for each class in your data set).

For the output labels, use the value ‘1’ for samples that belong to the same category as the output node, and ‘0’ for all other samples. For example, if our data set has three classes, and we’re learning the weights for output node 3, then all category 3 examples should be labeled as ‘1’ and all category 1 and 2 examples should be labeled as 0.
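Below is a sketch of this output-weight step. It uses a closed-form least-squares solve instead of the iterative gradient descent described above (both minimize the same squared error); the bias column of 1s and the 0/1 labels follow the text, while the array names are my own.

    import numpy as np

    def train_output_weights(activations, targets):
        """activations: (num_samples, num_rbf_neurons) matrix of RBF activations.
        targets: (num_samples,) array of 0/1 labels for ONE output node."""
        A = np.hstack([np.ones((activations.shape[0], 1)), activations])  # prepend bias input of 1
        w, *_ = np.linalg.lstsq(A, targets, rcond=None)                   # least-squares weights
        return w

    # Repeat once per output node, each time with that node's own 0/1 target vector.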

RBFN as a Neural Network

So far, I’ve avoided using some of the typical neural network nomenclature to describe RBFNs. Since most papers do use neural network terminology when talking about RBFNs, I thought I’d provide some explanation on that here. Below is another version of the RBFN architecture diagram.

Here the RBFN is viewed as a “3-layer network” where the input vector is the first layer, the second “hidden” layer is the RBF neurons, and the third layer is the output layer containing linear combination neurons.

One bit of terminology that really had me confused for a while is that the prototype vectors used by the RBFN neurons are sometimes referred to as the “input weights”. I generally think of weights as being coefficients, meaning that the weights will be multiplied against an input value. Here, though, we’re computing the distance between the input vector and the “input weights” (the prototype vector).


References

Poggio, T. & Girosi, F. (1989), 'A Theory of Networks for Approximation and Learning' (A.I. Memo No. 1140, C.B.I.P. Paper No. 31), Technical report, MIT Artificial Intelligence Laboratory.

Vogt, M. (1992), 'Implementierung und Anwendung von Generalized Radial Basis Functions in einem Simulator neuronaler Netze', Master's thesis, IPVR, University of Stuttgart. (in German)

Zell, A. et al. (1998), 'SNNS Stuttgart Neural Network Simulator User Manual, Version 4.2', IPVR, University of Stuttgart and WSI, University of Tübingen. http://www.ra.cs.uni-tuebingen.de/SNNS/welcome.html

Zell, A. (1994), Simulation Neuronaler Netze, Addison-Wesley. (in German)


Radial Basis Functions, RBF Kernels, & RBF Networks Explained Simply

Here is a set of one-dimensional data: your task is to find a way to perfectly separate the data into two classes with one line.

At first glance, this may appear to be an impossible task, but it is only so if we restrict ourselves to one dimension.

Let’s introduce a wavy function f(x) and map each value of x to its corresponding output. Conveniently, this makes all the blue points higher and the red points lower at just the right locations. We can then draw a horizontal line that cleanly divides the classes into two parts.

This solution seems very sneaky, but we can actually generalize it with the help of radial basis functions (RBFs). Although they have many specialized use cases, an RBF is simply a function whose value depends only on the distance from a center point. Methods that use RBFs share a learning paradigm that differs from the standard machine learning fare, which is what makes them so powerful.

For example, the bell curve is an example of an RBF, since its value depends only on the number of standard deviations from the mean. Formally, we may define an RBF as a function that can be written as f(x) = f(||x||), i.e., one that depends on its argument only through a distance.

Note that the double pipes (informally, in this use case) represent the idea of ‘distance’, regardless of the dimension of x. For example,

  • this would be absolute value in one dimension: f(-3) = f(3) . The distance to the origin (0) is 3 regardless of the sign.
  • this would be Euclidean distance in two dimensions: f([-3,4]) = f([3,-4]) . The distance to the origin (0, 0) is 5 units regardless of the specific point’s location.

This is the ‘radius’ aspect of the ‘radial basis function’. One can say that radial basis functions are symmetrical around the origin.
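A two-line check of this symmetry (using a toy Gaussian RBF; the specific function is my own choice):

    import numpy as np

    def rbf(x):
        """A radial function: its value depends only on the distance ||x|| to the origin."""
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return np.exp(-np.linalg.norm(x) ** 2)

    print(np.isclose(rbf(-3.0), rbf(3.0)))                 # |-3| and |3| are both 3
    print(np.isclose(rbf([-3.0, 4.0]), rbf([3.0, -4.0])))  # both points are 5 units from the origin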

The task mentioned above — magically separating points with one line — is known as the radial basis function kernel, with applications in the powerful Support Vector Machine (SVM) algorithm. The purpose of a ‘kernel trick’ is to project the original points into some new dimensionality such that it becomes easier to separate through simple linear methods.

Take a simpler example of the task with three points.

Let’s draw a normal distribution (or another arbitrary RBF function) centered at each of the points.

Then, we can flip all the radial basis functions for data points of one class.

If we add up the values of all the radial basis functions at each point x, we get an intermediate ‘global’ function that looks something like this:

We’ve attained our wavy global function (let’s call it g(x) )! It works with all sorts of data layouts, because of the nature of the RBF function.

Our RBF function of choice — the normal distribution — is dense in one central area and less so in all other places. Hence, it has a lot of sway in deciding the value of g(x) when values of x are near its location, with diminishing power as the distance increases. This property makes RBF functions powerful.

When we map every original point at location x to the point (x, g(x)) in two-dimensional space, the data can always be reliably separated by a horizontal line, provided it is not too noisy, because the overlapping RBF functions track the local density of each class.
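The following toy sketch builds such a g(x) for three 1-D points by summing signed Gaussian bumps, one per training point (the data values are made up):

    import numpy as np

    points = np.array([-2.0, 0.0, 2.0])   # toy 1-D inputs
    labels = np.array([1, -1, 1])         # +1 for one class, -1 for the other

    def g(x, width=1.0):
        """Signed sum of Gaussian bumps, one centered at each training point."""
        return float(np.sum(labels * np.exp(-((x - points) ** 2) / (2 * width ** 2))))

    # Lifting each point x to (x, g(x)) lets the horizontal line g = 0 separate the classes
    for p, y in zip(points, labels):
        print(p, y, round(g(p), 3))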

In fact, linear combinations of radial basis functions (adding them and scaling them) can be used to approximate almost any function well.

Radial Basis Networks take this idea to heart by incorporating ‘radial basis neurons’ in a simple two-layer network.

The input vector is the n-dimensional input on which a classification or regression task (only one output neuron) is performed. A copy of the input vector is sent to each of the following radial basis neurons.

Each RBF neuron stores a ‘central’ vector: this is simply one unique vector from the training set. The input vector is compared to the central vector, and the difference is plugged into an RBF function. For example, if the central and input vectors were the same, the difference would be zero; the (unnormalized) Gaussian at 0 is 1, so the output of the neuron would be 1.

Hence, the ‘central’ vector is the vector at the center of RBF function, since it is the input that yields the peak output.

Likewise, if the central and input vectors are different, the output of the neuron decays exponentially towards zero. The RBF neuron, then, can be thought of as a nonlinear measure of similarity between the input and central vectors. Because the neuron is radial — radius-based — the difference vector’s magnitude, not direction, matters.

Lastly, the outputs of the RBF nodes are weighted and summed through a simple connection to the output layer. Output nodes give large weight values to RBF neurons that have specific importance to a category, and smaller weights to neurons whose outputs matter less.

Why does the radial basis network take a ‘similarity’ approach to modelling? Take the following example two-dimensional dataset, where the central vectors of twenty RBF nodes are represented with a ‘+’.

Then, look at a contour map of the prediction space for the trained RBF network: around almost every central vector (or group of central vectors) is a peak or a valley. The feature space of the network is ‘defined’ by these vectors, just like how the global function g(x) discussed in RBF kernels is formed by radial basis functions centered at each data point.

Because it is impractical to form one RBF node for every single item in the training set like kernels do, radial basis networks choose central vectors to shape the network’s view of the landscape. These central vectors are usually found through some clustering algorithm like K-Means, or alternatively simply through random sampling.

The drawn feature boundary based on height looks like this:

The radial basis network fundamentally approaches the task of classification differently than standard neural networks because of the usage of a radial basis function, which can be thought of as measuring density. Standard neural networks seek to separate the data through linear manipulations of activation functions, whereas radial basis functions seek more to group the data through fundamentally ‘density’-based transformations.

Because of this, as well as its lightweight architecture and strong nonlinearity, it is a top contender with artificial neural networks.

Fundamentally, applications of radial basis functions rely on a concept called ‘radial basis function interpolation’, which is a topic of great interest in approximation theory, or the study of approximating functions efficiently.

As mentioned previously, RBFs are a mathematical embodiment of the idea that a point should have the most influence at that point and decaying influence for increasing distances from that point. Because of this, they can be manipulated in very simple ways to construct complex nonlinearities.
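For a concrete (if minimal) picture of exact RBF interpolation, the sketch below places one Gaussian bump at every sample point and solves a small linear system so the resulting function passes through all the samples; the data values are made up.

    import numpy as np

    x = np.array([0.0, 1.0, 2.5, 4.0])   # sample locations (made up)
    y = np.array([1.0, -0.5, 0.7, 0.2])  # sample values (made up)

    def phi(r, eps=1.0):
        """Gaussian radial basis function of the distance r."""
        return np.exp(-(eps * r) ** 2)

    Phi = phi(np.abs(x[:, None] - x[None, :]))  # interpolation matrix: Phi[i, j] = phi(|x_i - x_j|)
    w = np.linalg.solve(Phi, y)                 # one weight per sample point

    def s(t):
        """Interpolant s(t) = sum_j w_j * phi(|t - x_j|); it reproduces y at every x_i."""
        return float(phi(np.abs(t - x)) @ w)

    print([round(s(t), 6) for t in x])  # matches y up to round-off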


Radial Basis Function Network (RBF Network) - Biology

ERM (empirical risk minimization) is cool, but so far all classifiers are linear. What if there exists no linear decision boundary?

Question: Do you know a non-linear model from class?

  • k-NN:
    • Classification: $h(\mathbf{x}) = \textrm{sign}\left(\sum_{i=1}^k y_i\right)$ (a majority vote over the labels $y_i \in \{-1,+1\}$ of the $k$ nearest neighbors)
    • Regression: $h(\mathbf{x}) = \frac{1}{k}\sum_{i=1}^k y_i$

    What if we use all n training data points and a weighting scheme, such that data points further away contribute less to the prediction?

    Radial Basis Functions (RBF)

    Use an RBF (or kernel) to quantify the contribution with respect to the distance to the test point. Usually $\mathsf{k}(\mathbf{x}, \mathbf{x}_i) = g\left(\underbrace{\tfrac{\Vert\mathbf{x} - \mathbf{x}_i\Vert}{r}}_{=\,z}\right)$, where the scale parameter $r$ regulates the width of the kernel.

    • Gaussian kernel: $g(z) = e^{-\frac{1}{2}z^2}$
    • Window kernel: $g(z) = \begin{cases} 1 & \textrm{if } z \leq 1 \\ 0 & \textrm{if } z > 1 \end{cases}$. This model is also known as $\epsilon$-NN, where $\epsilon = r$.

    A Prediction Model: "Nadaraya-Watson" model or kernel regression

    Use a weighted sum of the $y$-values:

    $h(\mathbf{x}) = \frac{\sum_{i=1}^n a_i(\mathbf{x}) \cdot y_i}{\sum_{i=1}^n a_i(\mathbf{x})}, \textrm{ with } a_i(\mathbf{x}) = \mathsf{k}(\mathbf{x}_i, \mathbf{x})$

    $\Rightarrow$ non-parametric version (one bump at each training point $\mathbf{x}_i$) of the RBF network

    Illustration:

    $h(\mathbf{x}) = \sum_{i=1}^n w_i(\mathbf{x}) \cdot \mathsf{k}(\mathbf{x}_i, \mathbf{x}), \textrm{ with } w_i(\mathbf{x}) = \frac{y_i}{\sum_{j=1}^n \mathsf{k}(\mathbf{x}_j, \mathbf{x})}$

    Center a bump at every $\mathbf{x}_i$ with height $w_i(\mathbf{x})$, where the width is determined by $r$.

    Note: normalization constants (e.g. $(2\pi)^{-\frac{d}{2}}$ for the Gaussian kernel, or the corresponding constant for the window kernel) are not needed (unless you use the kernel for density estimation).
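    A minimal NumPy sketch of this Nadaraya-Watson estimator with the Gaussian kernel above (the variable names are my own):

        import numpy as np

        def nadaraya_watson(x, X_train, y_train, r=1.0):
            """h(x) = sum_i k(x_i, x) * y_i / sum_i k(x_i, x), with a Gaussian kernel of width r."""
            z = np.linalg.norm(X_train - x, axis=1) / r  # scaled distances to all training points
            a = np.exp(-0.5 * z ** 2)                    # kernel weights a_i(x)
            return float(np.dot(a, y_train) / np.sum(a))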

    Radial Basis Function Networks

    Note: $w_i(\mathbf{x})$ varies depending on $\mathbf{x}$ (the test point)

    Possible Simplification

    Fix the heights $w_i$ to be the same for all test points

    $\Rightarrow$ parametric version of the RBF network (fit the $w_i$ by minimizing the training error)

    Illustration:

    Big Surprise!

    $h(\mathbf{x}) = \mathbf{w}^{\top}\boldsymbol{\phi}(\mathbf{x}) \textrm{ with } \boldsymbol{\phi}(\mathbf{x}) = \begin{pmatrix} \mathsf{k}(\mathbf{x}_1, \mathbf{x}) \\ \vdots \\ \mathsf{k}(\mathbf{x}_n, \mathbf{x}) \end{pmatrix}$

    We have transformed the $d$-dimensional non-linear model to an $n$-dimensional linear model!

    Note: The non-linear transformation is defined by the kernel $\mathsf{k}$ and the training data points $\mathbf{x}_i$.
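    As a sketch, the feature map and the resulting linear model might be written as follows (my own variable names):

        import numpy as np

        def kernel_features(x, X_train, r=1.0):
            """Map x to the n-dimensional vector (k(x_1, x), ..., k(x_n, x))."""
            z = np.linalg.norm(X_train - x, axis=1) / r
            return np.exp(-0.5 * z ** 2)

        # h(x) = w . phi(x) is then an ordinary linear model in this n-dimensional space;
        # the weights w are fit by minimizing the training error.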

    Choosing $n$ parameters will lead to overfitting: with $n$ parameters and $n$ data points we can fit the training data exactly, but the data are noisy. For an RBF network we therefore choose only $k \ll n$ centers. For comparison, the kNN decision boundaries (shown for $k = 1$ and $k = 3$) are less smooth than those of the RBF network.


    If you read on to the Training section in your link, it explains what the centre vectors are:

    Reading the above, it seems to me that you have your set of samples, the x's, and from these you choose a number of centre vectors - one for each neuron in the hidden layer. The centre vectors, broadly speaking, are centres of clusters in your sample data.

    As the remarks say, you can use an unsupervised clustering algorithm, such as k-means, to find n cluster centres in your data, where n is the number of neurons in the hidden layer that you are dealing with. Different layers may have more or less neurons and so will have correspondingly more or less centre vectors.

    Then the RBF relates each individual sample, x, to each centre vector by some function of the Euclidean distance between them.


    Details

    RBF networks are feed-forward networks with one hidden layer. Their activation is not sigmoid (as in MLPs), but radially symmetric (often Gaussian). Thereby, information is represented locally in the network (in contrast to an MLP, where it is represented globally). Advantages of RBF networks in comparison to MLPs are mainly that the networks are more interpretable, training ought to be easier and faster, and the network only activates in areas of the feature space where it was actually trained, so it has the possibility to indicate that it "just doesn't know".

    Initialization of an RBF network can be difficult and require prior knowledge. Before use of this function, you might want to read pp 172-183 of the SNNS User Manual 4.2. The initialization is performed in the current implementation by a call to RBF_Weights_Kohonen(0,0,0,0,0) and a successive call to the given initFunc (usually RBF_Weights). If this initialization doesn't fit your needs, you should use the RSNNS low-level interface to implement your own; have a look at the demos/examples. Also, note that depending on whether linear or logistic output is chosen, the initialization parameters have to be different (normally c(0,1,...) for linear and c(-4,4,...) for logistic output).


    Radial Basis Function Network (RBF Network) - Biology

    The radial basis function (RBF) networks are inspired by biological neural systems, in which neurons are organized hierarchically in various pathways for signal processing and are tuned to respond selectively to different features/characteristics of the stimuli within their respective receptive fields. In general, neurons in higher layers have larger receptive fields and selectively respond to more global and complex patterns.

    • Neurons in the primary visual cortex (V1) receive visual input from the retina and selectively respond to different orientations of linear features
    • Neurons in the middle temporal (MT) area receive visual input from the V1 area and selectively respond to different motion directions
    • Neurons in the medial superior temporal area (MST) receive visual input from the MT area and selectively respond to different motion patterns (optic flow) such as rotation, expansion, contraction, and spiral motions.

    The tuning curves, the local response functions of these neurons, are typically Gaussian, i.e., the level of response is reduced when the stimulus becomes less similar to what the cell is most sensitive and responsive to (its preferred stimulus).

    These Gaussian-like functions can also be treated as a set of basis functions (not necessarily orthogonal, and possibly over-complete) that span the space of all input patterns. Based on such local features represented by the nodes of a lower layer, a node in a higher layer can be trained to selectively respond to certain patterns/objects (e.g., a ``grandmother cell'').

      The RBF network can be used in pattern classification, in which a given pattern vector $\mathbf{x}$ is classified into one of $C$ classes. The classification is typically supervised, i.e., the network is trained based on a set of training patterns $(\mathbf{x}_k, y_k)$, where $y_k$ indicates the class to which the kth pattern belongs.

    As seen in the examples above, an RBF network is typically composed of three layers: the input layer, composed of nodes that receive the input signal $\mathbf{x}$; the hidden layer, composed of nodes that simulate the neurons with selective tuning to different features in the input; and the output layer, composed of nodes simulating the neurons at some higher level that respond to features at a more global level, based on the output from the hidden layer representing different features at a local level. (This could be considered as a model for the visual signal processing in the pathway V1 → MT → MST described above.)

    Upon receiving an input pattern vector $\mathbf{x}$, the jth hidden node reaches the activation level

    $\phi_j(\mathbf{x}) = \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_j)^{\top}\boldsymbol{\Sigma}_j^{-1}(\mathbf{x}-\boldsymbol{\mu}_j)\right)$

    where $\boldsymbol{\mu}_j$ and $\boldsymbol{\Sigma}_j$ are respectively the mean vector and covariance matrix associated with the jth hidden node. In particular, if the covariance matrix is the special diagonal matrix $\boldsymbol{\Sigma}_j = \sigma_j^2 \mathbf{I}$, then the Gaussian function becomes isotropic and we have

    $\phi_j(\mathbf{x}) = \exp\left(-\frac{\Vert\mathbf{x}-\boldsymbol{\mu}_j\Vert^2}{2\sigma_j^2}\right)$

    We see that $\boldsymbol{\mu}_j$ represents the preferred feature (orientation, motion direction, frequency, etc.) of the jth neuron. When $\mathbf{x} = \boldsymbol{\mu}_j$, the response of the neuron is maximized due to the selectivity of the neuron.

    In the output layer, each node receives the outputs of all nodes in the hidden layer, and the output of the ith output node is the linear combination of the net activation:

    $y_i(\mathbf{x}) = \sum_{j=1}^{M} w_{ij}\,\phi_j(\mathbf{x})$

    Note that the computation at the hidden layer is non-linear but that at the output layer is linear, i.e., this is a hybrid training scheme.

    Through the training stage, the various parameters of an RBF network are obtained, including the centers $\boldsymbol{\mu}_j$ and widths $\sigma_j$ (or covariances $\boldsymbol{\Sigma}_j$) of the nodes of the hidden layer, as well as the weights $w_{ij}$ for the nodes of the output layer, each of which is fully connected to all hidden nodes.

      Training of the hidden layer

    • The centers can be chosen randomly from the input data set.
    • The centers can be obtained by unsupervised learning (SOM, k-means clustering) based on the training data.
    • The covariance matrix as well as the center can also be obtained by supervised learning.

    Once the parameters $\boldsymbol{\mu}_j$ and $\sigma_j$ are available, we can concentrate on finding the weights of the output layer, based on the given training data containing $K$ data points $(\mathbf{x}_k, y_k)$, i.e., we need to solve the equation system for the weights $w_{ij}$:

    $\sum_{j=1}^{M} w_{ij}\,\phi_j(\mathbf{x}_k) = y_{ik}, \quad k = 1, \dots, K$

    This equation system can also be expressed in matrix form:

    $\boldsymbol{\Phi}\mathbf{W} = \mathbf{Y}$

    where $\boldsymbol{\Phi}$ is the $K \times M$ matrix function of the input vectors with entries $\Phi_{kj} = \phi_j(\mathbf{x}_k)$, and the columns of $\mathbf{W}$ and $\mathbf{Y}$ collect the weights and targets of each output node.

    As the number of training data pairs $K$ is typically much greater than the number of hidden nodes $M$, the equation system above contains more equations than unknowns and in general has no exact solution. However, we can still find an optimal solution so that the actual output approximates the targets with a minimal mean squared error (MSE).

    To find the weights as the parameters of the model, the general linear least squares method can be used, based on the pseudo-inverse of the non-square matrix:

    $\mathbf{W} = \boldsymbol{\Phi}^{+}\mathbf{Y} = (\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\top}\mathbf{Y}$
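    A compact NumPy sketch of this pseudo-inverse solution (Phi is assumed to be the K x M activation matrix and Y the K x C matrix of targets):

        import numpy as np

        def output_weights(Phi, Y):
            """Least-squares output weights W minimizing ||Phi @ W - Y||^2."""
            return np.linalg.pinv(Phi) @ Y  # pseudo-inverse of the non-square matrix Phi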




    Radial Basis Function Network versus Regression Model in Manufacturing Processes Prediction

    One of the objectives of the manufacturing industry is to increase the efficiency of its processes using different methodologies, such as statistical modeling, for production control and decision-making. However, the classical tools sometimes have difficulty describing manufacturing processes. This paper is a comparative study between a multiple regression model and a Radial Basis Function Neural Network in terms of the statistical metrics R² and R²adj, applied to a permanent mold casting process and a TIG welding process. Results showed that in both cases the RBF network performed better than the regression model.

    Keywords: Radial basis function; Multiple regression; Process prediction

    Introduction

    Nowadays, manufacturing companies face increased difficulty in their process decision making, due to rapid changes in design methods and demand for quality products [1]. For that reason, there are different tools for modeling a process, such as statistical, mathematical and intelligent systems, but the question is: which of these tools depicts the process better?

    Multiple regression is a statistical model that analyzes how a set of predictor variables X is related to a single measured response y. Regression analysis answers questions about the dependence of a single response variable on one or more predictors, including prediction of future values, discovering which predictors are important, and estimating the impact of changing a predictor or a treatment on the value of the response [2]. The Radial Basis Function (RBF) neural network helps to explain the process outputs based on the inputs assigned to it. For example, to automate a manufacturing process, it is necessary to know the input-output relationship in both directions, and using a radial basis network it is possible to predict the results of a manufacturing process efficiently [3]. RBF networks are important in prediction because of their character as universal approximators [4] and for their good performance on the non-linearity common in processes [5]. In this paper, we propose a comparative study between a multiple regression model and a Radial Basis Function Neural Network in terms of the statistical metrics R², R²adj and R²PRESS, applied to a permanent mold casting process.

    Methods

    Radial basis function network

    They are called radial basis functions because the functions of the hidden layer serve as a basis set for the function to be approximated, and because they display radial symmetry, being only a function of the distance between the learned patterns and the input ones.

    A neural network with radial basis function consists of the following layers [6]:

    Input layer: it is formed by the source nodes (sensory units).

    Intermediate layer: it is a hidden layer of great dimension and in which the units (neurons) that form it are the base functions for the input data.

    Output layer: that has the responsibility in the network for the activation of patterns applied in the input layer.

    Radial basis functions are functions that reach a level close to the maximum of their range when the input pattern (Xn) is close to the center of the neuron. If the pattern moves away from the center, the value of the function tends to the minimum of its range. The training is feed-forward only. The output of the network is in general produced by a non-linear transformation, originating in the hidden layer through the radial function, followed by a linear one in the output layer through the continuous linear function. A variation of the radial basis models uses the standard deviation to activate the function G(*), working with exp(-d²/a), where a is the standard deviation for the hidden node.

    Genetic algorithm

    As mentioned above, the RBF output depends on the distance between the inputs and the network centers. There are many methods of clustering and optimization for determining the centers. One of them is the Genetic Algorithm (GA), which is a method of optimization based on the processes of biological evolution [7]. It is part of intelligent systems.

    The process consists of randomly selecting individuals from the current population; these individuals will be the parents of the next generation, which evolves toward an optimal solution. The GA works on three main rules [8]:

    Parent selection of the next generation.

    The combination of parents to form the next generation.

    Applying random changes of each parent for the children.

    The GA considers an evaluation (fitness) function to optimize; the objective is to maximize or minimize this fitness function. When applying the GA to determine the centroids of the RBF, the metric used is R², which is a global evaluation metric [9]; in this case the objective is to maximize this metric.

    Multiple regression

    A regression model which involves more than one regressor variable is called a multiple regression model. In general, a response Y is related to k regressors or predictors. The statistical model explaining the behavior of the dependent variable in terms of any number of independent variables is equation (1):

    Y = β0 + β1 X1 + β2 X2 + ... + βk Xk + ε     (1)

    The deviation of an observation Yi from its population mean E[Yi] is taken into account by adding a random error εi. There are k independent variables and k+1 parameters to be estimated. Usually, the estimation of the parameters βj is made by means of the Ordinary Least Squares (OLS) method. In matrix form, the OLS estimator is given by [9]:

    β̂ = (X'X)⁻¹ X'y

    Application

    Figure 1: Radial basis function structure.

    Figure 2: Target vs regression.

    The process consists of a permanent mold casting for manufacturing a piece for the electrical industry. There are three independent variables and eight runs. The data were obtained using a factorial design with 3 factors at 2 levels. The objective of the modeling is to find the variable that causes the most defects and minimize them. The model response is the total number of defects in the process. The observations are shown in Table 1. The comparison between the real responses and the regression model is shown in Figure 2, and the comparison between the Radial Basis Function and the real response is shown in Figure 3.

    Figure 3: Target vs RBF.

    Table 1: Model variables.

    Table 2: Comparison of association measurements.

    Table 3: Comparison of association measurements.

    It is possible to see graphically that the RBF network model has a better fit than the regression. After that, the measures of association were estimated to test the performance of the regression model and the RBF. Results are shown in Table 2. It is observed in Table 2 that the RBF network explains over 90% of the process variation in terms of the coefficients of determination. The method was also applied to the TIG welding process, using a Kuka KR-16 welding robot with a multi-process welding system (MIG and TIG), taking process data and identifying the important parameters: feed rate (IPM), input voltage (volts), wire speed (m/min), and the output, the fusion 2 of the weld.

    Figure 4 shows the response called Fusion 2, which measures the penetration of the vertical element to be attached to the horizontal element. This response is important for the material properties, since the strength of the joint depends on its control. Table 3 illustrates the results of the application and the comparison between the RBF and the regression model. Those metrics are quantities used to express the proportion of total variability in the response accounted for by the model, so they indicate the proportion of variability in y explained by the model. By these metrics, the RBF is better than the multiple regression model.

    Figure 4: TIG welding.

    Conclusion

    In order to improve manufacturing processes, it is very important to analyze the process to make better decisions. In this article it was shown that a tool based on intelligent systems generates a better fit than regression analysis in terms of the measures of association. These statistical metrics provide information about the strength of the relationships between the predictors and the dependent variable. With these results, it is possible to find the parameters that cause the most defects. However, for future work it is proposed to analyze each defect in a model for joint prediction, considering that the model design must then be multivariate: what will happen with the regression model and the RBF?

    References

    © 2018 Homero De Jesus De Leon Delgado. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and build upon your work non-commercially.



