Ever wondered how ISRO and NASA tracks the rockets they fly into outer space? Have you ever thought how a self driving car keeps track of its position and velocity? If you have, then you might be interesting in knowing that its all because of almighty mathematical filters. In this article, I will talk about Bayesian filters which are used extensively nowadays because they are simple and can be executed fast in real time.

## What is a filter?

Filters in mathematics and computer science world are the algorithms used to infer most likely state of a system in a noisy world. What are these state and system? An easy example of a system could be a robot or a self driving car moving in a known or unknown environment and the state at any given time instance could be its position coordinates or velocity or both. But, in real world the state of a system can have any number of variables not just two. Bayesian filters, as the name signifies, is a state estimation technique which recursively use Bayes’ theorem to estimate current state of the system based on measurement inputs. If you didn’t get it its fine. We will discuss about all this in detail.

## Bayes’ Theorem

Before moving further let us discuss briefly about beliefs and probabilities. There are mainly two ways to classify probabilities:

**Bayesian or Evidential**. The Bayesian probability of an event is a person’s degree of belief in that event based on evidences. It is the property of the person who assigns that probability. For example, if somebody asks two persons, what is the probability that India will win a cricket match against Pakistan? Then a Indian might reply 0.9, while a Pakistani might say 0.2. The numbers 0.9 and 0.2 represents the degree of belief of both the people and this can vary from person to person. Statisticians who follow this definition are known as**Bayesians**.**Physical Probability**. This probability is the actual physical probability of an event. For example, the probability of getting a head after tossing a coin. This probability is calculated by repeating the experiment a favourable number of time and reporting the frequency of that particular event. In the above example we can repeat the experiment by tossing the coin 1000 times and reporting the number of heads/1000. Statisticians who follow this definition are known as**Frequentists**.

Throughout this article the probability means Bayesian Probability.

Now let’s move forward and have a look at conditional probability and Bayes’ theorem which are the key concepts in Bayesian filters.

$$ P(A|B)=\frac{P(A \cap B)}{P(B)} \tag{1}\label{1}$$

Bayes’ theorem states that probability of an event \(A\) given that event \(B\) has already happened, also known as the conditional probability of \(A\) given \(B\) is given by joint probability of event \(A\) and \(B\) happening together divided by probability of event \(B\) happening. If you think about this theorem then you might realize that it is quite intuitive, so let’s prove it. First, let’s prove it the bayesian way and then we will prove it the frequentist way.

### Bayesian way:

Imagine we have a bag full of pebbles with a constraint that the total mass is one unit. Let’s call it as our universe or sample space. From the probability theory we know that an event is a subset of the sample space. So, let’s assume that we have two events \(A\) and \(B\) as shown in the image below.

Now let’s say that event \(B\) has already happened. So, what does it mean? It means that everything outside the subset \(B\) is now irrelevant and provides no information what so ever. This implies that we now get rid off pebbles that are not in subset \(B\) so that our new universe or sample space is shrunk to just the subset \(B\) as shown below.

Now you might notice that the only relevant part left of set \(A\) is \(A∩B\). Therefore, now if we want \(P(A|B)\) then it would be proportional to \(A∩B\). But, then you might ask why do we divide by \(P(B)\)? That’s because our new sample space i.e the set \(B\) does not have the total mass one, but by the definition of probability it has to be one. Hence, to make it one we re-normalize by dividing it by the total mass of the set \(B\) which is \(P(B)\).

### Frequentist way:

Lets say we have do an experiment where we toss a coin 5 times and we repeat this experiment multiple times. Our sample space might look something like this:

\(HHTTH,\)

\(HTHTH,\)

\(TTTTH,\)

\(HHHTH,\)

\(….\)

Let’s say that event \(B\) is two tails coming simultaneously and event \(A\) be three heads coming in the experiment. Now if we repeat this experiment 1000 times and say that event \(B\) has already happened then our new sample space would be restricted to only those outcomes which belongs to the event \(B\). Now, if we ask what is \(P(A)\) then from the classical definition of probability it is the long term frequency of event \(A\) among the outcomes of event \(B\). Hence, the \(P(A)\) once the event \(B\) has happened will be \(P(A∩B)/P(B)\).

Now that we have proved Bayes’ theorem lets rewrite it in a different form and talk about bayesian filters.

$$ P(A|B)=\frac{P(A \cap B)}{P(B)}\tag{2}\label{2} $$

$$ P(B|A)=\frac{P(A \cap B)}{P(A)} \implies P(A \cap B)=P(A)P(B|A) \tag{3}\label{3}$$

Substituting \ref{3} in \ref{2}

$$ P(A|B)=\frac{P(A)P(B|A)}{P(B)} \tag{4}\label{4}$$

Here, \(P(A|B)\) is called the **posterior belief** about event \(A\) given event \(B\) has already happened.

\(P(A)\) is called the **prior belief** about event \(A\).

\(P(B|A)\) is called the **likelihood** of event \(B\) given \(A\).

\(P(B)\) is the **total probability** of evidence \(B\).

Now equation \ref{4} can be re-written as,

$$ posterior=\frac{prior*likelihood}{evidence} \tag{5}\label{5}$$

## Bayesian Filters

Bayesian filters are very easy. You just have to recursively apply Bayes’ theorem every time you see an evidence. Here is the pseudo code:

We first start with a known prior distribution \(P(A)\) and the likelihood \(P(B|A)\) . Then we recursively apply Bayes’ theorem for every evidence we see. i.e.

$$ P(A|B) \propto P(A)P(B|A) \tag{6}\label{6} $$

and than normalize it by \(P(B)\) which can be calculated by marginalizing out \(A\) from \(P(A∩B)\) i.e. \(∑P(A)P(B|A)\).

To understand it better let’s take an example. Suppose you are a mathematician and you went to the woods with your dog, but unfortunately you lost him there. Now the only thing you care in this world is finding your dog whom you can’t see because the forest is too dense and dark. So you climb up a tree and starts to call you dog. And the dog can’t hear you either so he barks and that’s the only data you hear about him. But the problem is that the barks are noisy, may be because it echos in that forest. So you can’t infer the exact direction and location from where the bark came from. As a mathematician your mathematical neurons starts to fire and you decided to model this uncertainty in measurement with a Gaussian random variable. To code this in Matlab let’s assume that the forest a grid and the actual position of dog is [3, 5], but you can’t see it and you can only hear the barks generated from a Gaussian distribution centred at the dog’s actual location.

figure(1);clf; figure(2);clf; % Number of iterations for which we will listen to the dog. N = 100 %The hidden actual location of the dog that we are trying to infer. s=[3;5]; % The dog barks N times, and for each bark, you hear him at % some location. The error in estimation of where the dog is will be % defined as Gaussian around where the dog actually is with a standard % deviation of 2. n=2*randn(2,N); % Create vector of iterative barks with a standard deviation of 2. x=zeros(2,N); % x will be the initialized variable for where you think you heard your dog. % center the gaussian random barks around the point where the dog % actually is. and plot figure(1); % make the plot prettier h=plot(s(1),s(2),'r.'); % Plot where the dog actually is set(h,'markersize',40,' linewidth',3); % make pretty axis([0,10,0,10]); % make pretty hold off; hold on for i=1:N x(:,i)=s+n(:,i); plot(x(1,i),x(2,i),'k.','markersize',10); pause end;

Imagine the forest is like a grid as shown and you are at the origin. The red dot is the actual location of the dog which you are trying to infer from the barks you hear; which are the black dots.

Since you don’t know the location of your dog you start with a uniform prior distribution around the entire forest. That means that every location in the forest is equally likely to be the position of your dog. Now every time you hear a bark you try to incorporate that evidence into your prior belief using Bayes’ theorem and call it as the posterior belief (See equation \ref{5} and \ref{6}) and you do it recursively for 100 iterations. Here is the code.

%define our forest grid Sa=[2:0.05:4]; Sb=[4:0.05:6]; %no bias, no prior knowledge, uniform distribution L=length(Sa); Pr=ones(L,L); % Initialize all one --> uniform prior Po=ones(L,L); %duplicate for iterative update Pr=Pr/sum(sum(Pr)); % Turn the prior into a pmf by dividing by the sum. Po=Po/sum(sum(Po)); % Each value is now 1/(number of states), so it all sums to one. figure(1);clf;mesh(Po), axis([0 40 0 40 0 0.015]) %%iterative bayes [a,b]=find(Po==max(max(Po))); % Pull out the indices at which Po achieves its max to start. sest=[Sa(a);Sb(b)]; % The best estimate of the true state to start. figure(1); clf figure(2); clf subplot(211); plot(1,sest(1)); hold on; line([1,N],[s(1),s(1)]); % Draw a line at the location of the x dimension. subplot(212); plot(1,sest(2)); hold on; line([1,N],[s(2),s(2)]); % Draw a line at the location of the y dimension. K=[4,0;0,4]; % covariance matrix for making a 2-D gaussian for (n=2:length(x)); Pr=Po; %store the posterior to the prior. m=0*Pr; %likelihood % look at each location, assume that the given location is % where the dog is, and get the likelihood of the data x(:,n) assuming % 2-d gaussian noise for (i=1:length(Pr)) for (j=1:length(Pr)) me=[Sa(i);Sb(j)]; m(i,j) = 1/sqrt((2*pi)^2*det(K)) * exp(-(x(:,n)-me)'*inv(K)*(x(:,n)-me)/2); %Compute likelihood m(i,j) = m(i,j) * Pr(i,j); % Combine this likelihood with the prior end; end; Po=m/sum(sum(m)); %normalize this distribution to make it a proper probability distribution. figure(1);mesh(Po), axis([0 40 0 40 0 0.015]) %plot it figure(2); [a,b]=find(Po==max(max(Po))); % Get the peak value; it's most likely location of the Quail. sest=[Sa(a);Sb(b)]; %A store the coordinates of this location in the bushes subplot(211);plot(n,sest(1),'k.');axis([0 N 2 4 ]) subplot(212); plot(n,sest(2),'k.');axis([0 N 4 6 ]) pause end; subplot(211); hold off; subplot(212); hold off;

You can see the initial uniform prior distribution. At this point you don’t have any idea where the dog it. So we start hearing the barks and update our belief about the dog’s location. The image on the right shows the progress of the algorithm by plotting the most likely location of the dog after each iteration. The centre lines are the actual position of the dog and we will see that our algorithm will eventually converge towards these lines.

After 20 iterations, you can see that we have started getting Gaussian peaks instead of a uniform distribution. It means that we are now more sure about the dog’s location and you can see the algorithm converging towards the centre lines.

After 60 iterations:

After 100 iterations, you can see that our algorithm has almost converged to the actual location of the dog. And we have a nice peaky Gaussian representing a very strong belief. Now you can say with a very high probability that your dog is at location [3, 5]. Go pick him up and come back home

## Conclusion:

Bayesian filters are very easy to understand, code and can be explained without any ugly mathematics involved.