
Compressive Sensing for AI self-driving cars

 

Another reason compressive sensing can be important to AI self-driving cars is speed: the torrent of sensor data must be analyzed quickly, and compressive sensing can help with that. If the AI of the self-driving car is not informed quickly enough by the sensors and sensor fusion about the state of the car, it may not be able to make the needed decisions on a timely basis. Suppose the camera on the self-driving car captures an image of a dog in front of the vehicle, but the image takes a minute to analyze while the AI keeps the car speeding forward. By the time sensor fusion has completed, it may be too late to prevent the car from hitting the dog.


The AI community seems to be finally recognizing the immense benefits of compressive sensing for handling and featurizing very copious measurement sets. This is especially important for applications of AI that involve closing the loop in real time. Driverless cars are an obvious example, but there are numerous other possibilities: voice/expression/sentiment/face recognition at the edge (as opposed to in the cloud), control of real-world mission-critical machinery such as rockets, optimization of power grids, network traffic routing, remote guided surgery, and so on. All of these share the basic need to score, make decisions, and close control loops with minimum delay (a.k.a. high-bandwidth control).

But perhaps more subtle are the cases where compressive sensing helps to solve the problem of training itself, whether or not scoring has to be real time. Many AI applications stumble on the simple fact that the data is so “wide” that models end up with too many parameters, which leads to over-fitting. Currently data scientists solve this problem by careful “feature engineering”, i.e. choosing a small but informative set of functions of the base data and feeding that into the models. Compressive sensing is an extremely elegant alternative to this manual process. Essentially, it assures us that random-like functions (“random projections”) are as good as carefully chosen features in terms of ultimate performance. The power of random-like projections has been known for quite a while, but the rigorous theory of perfect reconstruction from random projections was fleshed out only in the 2000-2010 decade by Candès and Tao.
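To make the idea concrete, here is a minimal Python sketch (the dimensions and synthetic data are invented for illustration) of using a single random Gaussian projection matrix as a drop-in replacement for hand-engineered features:

```python
# A minimal sketch (dimensions and data are invented) of random projections as
# a drop-in replacement for hand-engineered features.
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_wide, n_proj = 500, 10_000, 200   # "wide" raw data, modest number of projections

# Synthetic sparse data: each sample has only a handful of non-zero raw measurements.
X = np.zeros((n_samples, n_wide))
for i in range(n_samples):
    idx = rng.choice(n_wide, size=20, replace=False)
    X[i, idx] = rng.normal(size=20)

# Random-like "features": one Gaussian projection matrix, no feature engineering.
P = rng.normal(scale=1.0 / np.sqrt(n_proj), size=(n_wide, n_proj))
Z = X @ P          # each column of Z is a random linear functional of the raw data

print(Z.shape)     # (500, 200): far fewer inputs for any downstream model to fit
```

Any downstream classifier can then be trained on Z instead of X, with far fewer parameters to fit and no hand-tuned feature pipeline.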

Actually, perfect reconstruction is an unnecessarily hard requirement from an AI perspective. All we really need is adequate retention of discriminative power (preservation of the mutual information between data and labels), which is much easier to satisfy. But I guess the proof that we can reconstruct the original data from compressive-sensing measurements has an undeniable convincing power that helps seal the deal!

Compressive sensing has been well known and widely used in communications engineering, for example in the guise of “fountain codes” for data compression and channel coding, and in spectral signal processing. The basic idea is very powerful and very versatile: if the random object you want to describe “adequately” has “sparsity”, which really means low entropy, then a small number of random-like functions can do the job, where the number of functions needed is of the order of the sparsity. The simplest example of this is the rather counter-intuitive idea that projection onto a randomly chosen subspace in linear signal detection is as good as projection onto the eigen-space (which is computationally hard in high-dimensional spaces)!
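As a rough numerical illustration of the sparsity claim (not from the original post: the dimensions are invented, and scikit-learn's Lasso stands in for a formal basis-pursuit solver), a k-sparse signal can be recovered from a number of random measurements far smaller than its ambient dimension:

```python
# Rough illustration of sparse recovery from random measurements.
# Dimensions are invented; Lasso stands in for a formal basis-pursuit solver.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, k, m = 1000, 10, 120            # ambient dimension, sparsity, number of measurements

x = np.zeros(n)                    # the k-sparse ("low entropy") signal
support = rng.choice(n, size=k, replace=False)
x[support] = rng.normal(size=k)

A = rng.normal(size=(m, n)) / np.sqrt(m)    # random-like measurement functions
y = A @ x                                   # m << n compressive measurements

x_hat = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50_000).fit(A, y).coef_
print(np.linalg.norm(x_hat - x) / np.linalg.norm(x))   # small relative reconstruction error
```

Shrinking m well below a small multiple of the sparsity makes the recovery fail, which is exactly the trade-off between sparsity and the number of random-like functions described above.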

Sparsity means that, of all the possibilities a high-dimensional signal could take on, only a few are actually probable. This is the way nature typically works, and it is what allows us to “learn and adapt” to nature. You cannot adapt to a phenomenon unless there is some kind of redundancy or repeatability in it. A completely random phenomenon is like white noise: you cannot learn from it, and you cannot adapt to it. Fortunately we do not live in that kind of hyper-random world. We live in a world of modest randomness!

 

Sufficient statistics and Data Processing theorem

I am making this short post today while waiting in the car for my better half to finish shopping. (I love my BlackBerry Bold for this!!) I want to talk a bit about sufficient statistics.

Traditionally this topic has been dealt with under “parameter estimation”, where we try to estimate a fixed deterministic quantity by looking at realizations of random variables drawn from a distribution that it parametrizes. But we can also think of sufficient statistics in a “Bayesian” setting, in terms of estimating one random variable from another.

So the question is the following. Suppose we are trying to estimate an RV X from observing an RV Y. Suppose I first map Y to Z via some (deterministic) function, so Z is also an RV. How can I be guaranteed that I do not lose any performance if I estimate X from Z instead of Y? If we indeed don’t lose anything, then Z is called a sufficient statistic of Y “for estimating X”. The “for” clause is important, because a sufficient statistic for estimating one RV may not be sufficient for something else. Information-theoretically, Z is a sufficient statistic of Y for X if I(X;Z) = I(X;Y). When Z is a sufficient statistic it also means that, conditioned on Z, X and Y are statistically independent.

Moreover, the Data Processing Theorem assures us that I(X;Z) can never be larger than I(X;Y)!!
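Here is a tiny numerical sanity check of that inequality, a sketch with a made-up joint distribution; the mutual_information helper and the merging function f are my own illustrative choices, not anything from the post:

```python
# Tiny sanity check of the Data Processing Theorem on a chain X -> Y -> Z,
# where Z = f(Y) is a deterministic processing step. The joint table and the
# merging function f are made up for illustration.
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits for a joint probability table p_xy[x, y]."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])))

# A made-up joint distribution over X in {0,1} and Y in {0,1,2}.
p_xy = np.array([[0.30, 0.10, 0.10],
                 [0.05, 0.15, 0.30]])

# Processing: Z = f(Y) merges the last two values of Y (a lossy step).
f = np.array([0, 1, 1])
p_xz = np.zeros((2, 2))
for y, z in enumerate(f):
    p_xz[:, z] += p_xy[:, y]

print(mutual_information(p_xy), mutual_information(p_xz))   # I(X;Y) >= I(X;Z)
```

The lossy merge of two values of Y strictly reduces I(X;Z); a one-to-one relabeling of Y would leave it unchanged, which is exactly the sufficient-statistic case.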

Again, I cannot stress the importance of this result enough. By processing your data Y, you can only lose information about X. If you manage to preserve all information then your processed results are sufficient statistics. So you have to be extremely careful when processing data. People glibly throw around terms like “data clean up”, “massaging” and “removing outliers” without realizing the dangers of these actions. The point is, we don’t want to make the data “look good to the eye”, rather we want it to tell us everything about X it can without any loss.

This is the reason why I find the current emphasis in the analytics community on “visualization” so disturbing. Visualization is for humans, meaning that you want the human to actually do the act of inference by presenting him the data in two dimensions. That reduces machine learning to a mere “reporting tool”, which is a travesty. The true power of machine learning is that it can make estimates and decisions from complex high-dimensional data directly, without involving the human in the loop. You can present the estimated answers in a nice graphic format, that’s fine. But don’t force the data itself through a pictorial bottleneck; you will kill it!

Finally, I will mention that the processed data Z is often termed “features” in the ML community. The hope in choosing a set of features is that they are easy to manipulate (low dimensional, with well-shaped distributions) but are still close to being sufficient. The holy grail would be to find the “minimal” (in some sense “smallest”) sufficient statistic or feature set for the problem at hand. I(X;Z) gives us a numerical measure of how close our feature set is to being sufficient.

Next time I will talk about the relation between compression and minimal statistics, and also introduce you to a cool new technique of “universal compression” called “compressed sensing”.

Enjoy your Victoria day/Memorial day holiday!

Anand Oka
Senior Researcher
System Optimization

Information content and optimal prediction

So finally we are ready to talk about the all important topic of “information content” and optimal prediction of hidden variables from observed quantities. Till now I have avoided math to keep things simple, but now we need precision, so we must perforce use mathematical notation.

So again, let’s stick with our ethnicity vs hair color example. (By the way, some of you have been wondering why I insist on using such a “racially sensitive” example. Partly because it is easy to get a grip on, and partly because I also want the reader to think about our many prejudices in day-to-day life, and whether any of them have rational justification.) Let the random variable X denote the ethnicity of the commuter and let the r.v. Y denote his/her hair color. Let p(x,y) denote the distribution of the random vector (X,Y). That is, p(x,y) is the probability that X=x and Y=y. Recall that p(x,y) is conveniently represented by a two-dimensional table.

Now we need to introduce a new concept called the marginal distribution of random variable X: It is denoted by p(x) and is simply the probability that X=x, without any consideration of Y. You can think of it as a vector or a table of one column. It is obtained by summing each row of the table p(x,y). Similarly we obtain p(y), the marginal distribution of Y, by summing the columns of the table p(x,y). Now we define a quantity called “Mutual Information between X and Y”, denoted by I(X;Y), as

I(X;Y) = Expectation[ log( p(x,y) / (p(x) p(y)) ) ] = Sum over all (x,y) of p(x,y) log( p(x,y) / (p(x) p(y)) )

(The units of this quantity are “bits”, if the logarithm is taken to base 2.)

What is this formula saying? It is saying the following: consider some value x of X and some value y of Y. Look at the probability of the pair (x,y), namely p(x,y). Construct another quantity p(x)p(y) by taking the product of the marginal of X at x and the marginal of Y at y. Take the ratio of these two quantities and take its logarithm. This will give you a real number. Now repeat this for all possible tuples (x,y) and take their expectation: that is, weight each one by its probability p(x,y) and then add them up. (Alternatively, you can simply draw (x,y) according to the distribution p(x,y) and then take the unweighted mean of the calculated quantities. This is called Monte Carlo evaluation, where we let nature do the weighting for us, simply because realizations with higher probability are drawn proportionally more frequently. With a large enough sample size this gives an increasingly accurate approximation of I(X;Y).)
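Both evaluation strategies fit in a few lines of code. The sketch below uses a made-up 3x3 joint table (think ethnicity vs hair color), computes the marginals by summing rows and columns, and compares the exact weighted sum against a Monte Carlo average:

```python
# Two ways to evaluate I(X;Y) on a made-up 3x3 joint table: the exact weighted
# sum over all (x,y), and a Monte Carlo average over draws from p(x,y).
import numpy as np

rng = np.random.default_rng(2)

p_xy = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.20, 0.05],
                 [0.05, 0.05, 0.30]])
px = p_xy.sum(axis=1, keepdims=True)     # marginal p(x): sum each row
py = p_xy.sum(axis=0, keepdims=True)     # marginal p(y): sum each column

log_ratio = np.log2(p_xy / (px @ py))    # log p(x,y) / (p(x) p(y)), in bits

exact = np.sum(p_xy * log_ratio)         # expectation as a weighted sum

# Monte Carlo: draw (x,y) pairs from p(x,y) and take an unweighted mean.
draws = rng.choice(p_xy.size, size=100_000, p=p_xy.ravel())
mc = log_ratio.ravel()[draws].mean()

print(exact, mc)   # the two numbers agree to within sampling error
```

The Monte Carlo route is what you would use in practice when you only have samples from p(x,y) rather than the table itself.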

What concept is this quantity capturing? It is capturing how different the true distribution of (X,Y), namely p(x,y), is from a fake distribution p(x)p(y) constructed by multiplying the marginals. It can be shown easily that I(X;Y) is always non-negative, and it takes the value zero if and only if p(x,y) = p(x)p(y) for all (x,y). But can you guess what this latter property is also called? Statistical independence! Thus it follows that only statistically independent variables X and Y can have zero mutual information. The importance of this simple property cannot be over-emphasized: the larger the mutual information I(X;Y), the more X tells us about Y, and Y tells us about X in equal measure. When I(X;Y)=0, they tell us nothing about each other, and obviously no “prediction” of one from the other is possible. Thus you might guess that estimating I(X;Y) is the primary project for anyone trying to do predictive analytics. Only if there is substantial I(X;Y) do we need to bother with particular prediction algorithms, and the software to code them.

An interesting related information-theoretic quantity is derived as a special case of mutual information. Suppose there is another random variable called X1 which is an exact replica or copy of X. That is, X1 takes exactly the same value as X in every realization of (X,Y,X1). The joint distribution p(x,x1) thus has a “diagonal” structure. If you care to do the math, you will find that the above equation specializes to

H(X) = I(X;X1) = Expectation[ log 1/p(x) ] = Sum over all x of p(x) log 1/p(x)

This quantity is called the self-information or entropy of X. It, too, is always non-negative. It takes the value zero if and only if the marginal distribution p(x) has all its probability mass devoted to a single value x. For example, the entropy of the “hair color” random variable would be zero if and only if all people in the world had only one color of hair, say black. Thus the entropy is zero when the random variable is deterministic and has no variability from realization to realization. On the other hand, entropy is maximized when all possible values of X have equal probability and hence are equally likely to occur. (This is called a “uniform” distribution.) The quantity H(X) is of primary importance when we wish to compress an information source such as a JPEG photo: it gives an achievable lower bound on the amount of compression possible on average. Thus you might guess that anyone trying to design a new compression algorithm had better have a good idea of the entropy of his source!
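For a quick numerical illustration (with a made-up p(x), reusing the mutual-information helper from the earlier sketch), the mutual information between X and an exact replica X1 indeed equals H(X), and H(X) ranges from zero for a deterministic variable to log2 of the number of values for a uniform one:

```python
# Numerical illustration with a made-up p(x): I(X;X1) for an exact replica X1
# equals H(X), and H(X) runs from 0 (deterministic) to log2(#values) (uniform).
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                # treat 0 * log(1/0) as 0
    return float(np.sum(p * np.log2(1.0 / p)))

def mutual_information(p_xy):
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])))

p_x = np.array([0.5, 0.3, 0.2])
p_xx1 = np.diag(p_x)                            # X1 copies X: a diagonal joint table
print(mutual_information(p_xx1), entropy_bits(p_x))   # the two numbers coincide

print(entropy_bits([1.0, 0.0, 0.0]))            # 0.0   : everyone has, say, black hair
print(entropy_bits([1/3, 1/3, 1/3]))            # ~1.585: three equally likely hair colors
```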

Similarly, mutual information I(X;Y) is at the heart of predictive analytics, and it gives an achievable lower bound on the error rate possible when estimating X from Y. I often see developers ignoring this simple fact, much to their chagrin. They naively believe that with increasing amounts of training data, or by using some wondrous classification algorithm, the prediction error rate will magically go to zero and they can achieve perfect predictions. This is simply not true! The predictability of X from Y is decided by nature via the distribution p(x,y). We can only hope not to do much worse in practice. I would say that it is good engineering to come within shouting distance of this limit and then stop, and look for other possible observables in order to improve performance. So, for example, if you are trying to recognize handwritten characters entered with a stylus, you can only do so much if you think of the observable as a “black and white image”. But if you instead think of the observable as a richer r.v. involving the time series of pen pressure and position, you can vastly improve recognition performance, because the mutual information is much higher. Mutual information thus gives us clear ideas of what to keep and what to throw away from our data.

This became a rather long post, so I will postpone the discussion about “optimal prediction” and “sufficient statistics”  till next time.

Cheers!