Category Archives: predictability

The 57-Year-Old Chart That Is Dividing the Fed – The New York Times

The United States economy is, after all, determined largely by the endlessly complicated interactions of 320 million people producing $17 trillion worth of stuff, which even relatively complex models can’t keep up with. There’s a difference, though, between economic forecasting and weather forecasting. People don’t have any short-term influence over the weather, but central banks and other economic policy makers do have influence over the short-term course of the economy.

Source: The 57-Year-Old Chart That Is Dividing the Fed – The New York Times


Nicely said. In other words (using control systems terminology), while both the economy and the weather are observable, neither is predictable, and while the economy is controllable, the weather is not (with today’s technology). One could also argue that both are unstable systems.

Observability, Predictability, Controllability and Stability are core concepts in the theory of control of dynamical systems. This field is some 80 years old, dating back to when the first rocket control systems were designed. It enjoyed the limelight during the space age and the Cold War. Then it became somewhat old-fashioned and passé, like power systems engineering, as researchers and students flocked to more lucrative areas like software engineering and computer science.

But dynamical systems theory has been enjoying a kind of renaissance (as is power engineering) as we rediscover the power and wide applicability of the theory, from weather to stock markets to data networks to consumer behavior. And, as I said years ago in the introduction to my blog, dynamical systems theory has deep connections to information theory and machine learning through the concept of statistical manifolds (“surfaces”) and the paradigm of viewing prediction and estimation algorithms as dynamical systems that crawl these manifolds via a prescribed dynamical law (“the algorithm”). So I would highly encourage budding students who want a deep understanding of machine learning and predictability to take the time to study dynamical control systems and differential geometry, in addition to the standard fare of information theory, algorithms, ML, probability and random processes.
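To give a flavor of the “algorithms as dynamical systems” viewpoint, here is a minimal sketch of my own (a generic illustration, not tied to any particular result): gradient descent is a discrete-time dynamical system whose state is the current estimate and whose dynamical law is the update rule; the estimate “crawls” the loss surface toward a fixed point.

```python
import numpy as np

def gradient_descent_dynamics(grad, x0, step=0.1, n_steps=100):
    """Treat the estimator as a dynamical system: state x_t, law x_{t+1} = x_t - step * grad(x_t)."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(n_steps):
        x = x - step * grad(x)      # the prescribed dynamical law ("the algorithm")
        trajectory.append(x.copy())
    return np.array(trajectory)

# Example: crawl the quadratic "surface" L(x) = ||x||^2 / 2, whose gradient is x.
traj = gradient_descent_dynamics(grad=lambda x: x, x0=[2.0, -1.0])
print(traj[0], traj[-1])            # the state flows toward the fixed point at the origin
```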

Gamblers, Scientists and the Mysterious Hot Hand – The New York Times

We evolved with this uncanny ability to find patterns. The difficulty lies in separating what really exists from what is only in our minds.

Source: Gamblers, Scientists and the Mysterious Hot Hand – The New York Times

Beautiful article! If you have been following my blog at all, you must read this. And then perhaps follow up with a refresher from one of my early posts on stereotypes!

Treadmill May Be Riskiest Machine, but Injuries From It Still Rare – NYTimes.com

Treadmill May Be Riskiest Machine, but Injuries From It Still Rare – NYTimes.com.

The tragic death of David Goldberg, apparently from a fall off a treadmill, has caused a flurry of comments and analysis by pundits about the risk of using treadmills and similar equipment, along with ridiculous comparisons to the risk of getting struck by lightning or getting into a car accident. The article does a decent job of explaining the issue in non-technical language, but still blunders in a couple of places. And the comments made by readers definitely betray a basic misunderstanding of conditional vs. unconditional probability.

So let’s try to put some sense into this.

A person can die of natural or unnatural causes. Natural causes are related to your health, and include genetic proclivities and lifestyle choices like diet and exercise. Natural causes of death include heart attack, stroke, cancer, slow degenerative disease, etc.: anything that happens due to a slow accumulation of multiple factors over a lifetime. (However, I would not include death by an infection like Ebola or HIV in this category, because it is more incidental and can be attributed to a specific event of exposure to a pathogen.) Let’s just exclude all deaths by natural causes for the time being and concentrate only on unnatural deaths.

Unnatural deaths can be due to one of multiple causes like murder, getting struck by lightning, falling off a treadmill, getting into a car accident, getting into a plane accident, an act of terrorism in a public place, or an Ebola infection. Let’s make a couple of reasonable assumptions to simplify the argument (these are not necessarily mathematical truths, but reasonable approximations that will not appreciably change the answers of our analysis while hugely reducing its complexity; good engineers and scientists are, by definition, adept at this type of simplification):

  1. these causes are disjoint (you cannot simultaneously get killed by murder and lightning)
  2. each of these causes is associated with one or more voluntary activities that you were doing. For example:
    • car accident – activities: travelling in a car, walking on a road, biking on a road
    • lightning – activities: being out doors in an open space, walking on a road
    • terrorism – activities: being in a public place, travelling in a plane
    • falling off a treadmill – activities: using a treadmill, texting while exercising
    • Ebola – activities: travelling in Africa, walking on a road

It is noteworthy that a single activity can lead to death by one of multiple unnatural causes. You can be walking on a road and get struck by lightning or get murdered (but usually not both simultaneously).

Also, you can be doing multiple activities at the same time: you may be in Africa, walking on a road, and in a public place, all at once.

We can now look at several types of probability distributions:

P(death due to cause C), where C is a natural cause or an unnatural cause.

This is simply the marginal distribution over all causes of death. The probability of dying due to lightning is far smaller than the probability of dying due to a heart attack. Fine. So what? This tells us nothing about how a person’s active choices affect his cause of death. So comparing probabilities across causes of death is essentially useless, especially because it is done over natural as well as unnatural causes, and natural causes are, in some sense, out of our control.

P(death due to unnatural cause U | death occurred due to an unnatural cause)

This is the marginal conditional probability of unnatural death over the unnatural causes. (It sums to one.) This is slightly more interesting: for example, an unnatural death due to lightning is surely far less likely than one due to a car accident. About 6000 people are killed by lightning each year worldwide, while about a million are killed in car accidents. Interestingly, about 2500 are killed in plane accidents. So the probability of getting killed in a plane accident is not that different from the probability of getting killed by lightning! But this is still not the complete story, because we are still not tying it to a person’s actions and choices.

P(death due to unnatural cause U | death was by an unnatural cause AND the person was doing activity A)

This is the conditional probability of getting killed by unnatural cause U when doing activity A. This gets more interesting because now the conditioning is on something the person chose to do. For example, suppose a person is walking on a road and dies an unnatural death. Then the cause of his death is most likely a car accident, then perhaps terrorism, then lightning, etc. These probabilities all need to sum to one, because the person did die.

Now we can add the final ingredient: what is the rate or probability at which people do various activities? Because activities can overlap, this is not simply a distribution over activities. Rather, we need to consider the set of all disjoint combinations of activities; let’s call them activity patterns: walking on a road in Africa, using a treadmill in Africa, driving a car in the USA, etc. This leads to a pretty huge alphabet of possibilities, but theoretically it is doable. And in terms of probability, only a few thousand activity patterns are likely to be dominant.

Anyway, now we come to the most important Bayesian formula:

The all-up probability of death by an unnatural cause decomposes as follows:

P(unnatural death) = Sum_U P(unnatural death by cause U)

= Sum_U Sum_A P(unnatural death by cause U | activity pattern A) x P(activity pattern A)

In this double sum, each term is the contribution from an activity pattern A leading to unnatural cause U. Note that each term includes the rate of the corresponding activity pattern. You can view this double sum as a sum over the elements of a matrix indexed by cause U and activity pattern A.
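To make the matrix view concrete, here is a minimal sketch with entirely made-up numbers (placeholders, not real statistics): rows index unnatural causes U, columns index activity patterns A, each entry holds the conditional probability P(unnatural death by U | A), and the all-up probability is the double sum of those entries weighted by the activity-pattern rates.

```python
import numpy as np

# Hypothetical, made-up numbers purely to illustrate the decomposition.
causes = ["car accident", "lightning", "treadmill fall"]
patterns = ["driving in the US", "walking on a road", "using a treadmill at home"]

# cond[i, j] = P(unnatural death by cause i | activity pattern j)
cond = np.array([
    [1e-5, 5e-6, 0.0],    # car accident
    [0.0,  1e-7, 0.0],    # lightning
    [0.0,  0.0,  1e-7],   # treadmill fall
])

# rate[j] = P(activity pattern j), i.e. how much exposure the pattern actually gets
rate = np.array([0.3, 0.5, 0.05])

contributions = cond * rate              # the matrix of per-(cause, pattern) contributions
p_unnatural_death = contributions.sum()  # the double sum over U and A

print(contributions)
print("P(unnatural death) =", p_unnatural_death)
```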

In my opinion, only these matrix entries are truly comparable. It may well be (I am just speculating here) that the entry corresponding to “dying due to U={accidental fall} when A={using a treadmill at home in the US}” is somewhat larger than “dying due to U={accidental fall} when A={using a treadmill at a gym in the US}”, perhaps due to better supervision at the gym. But it may also be that both of these entries are much smaller than “dying due to U={accidental fall} when A={riding a bicycle on a road in the US}”. So bikes may well be a much more significant cause of death than treadmills. (Reminder: this is all speculation, I don’t have the raw data. I am just outlining the way the analysis needs to be done.)

The moral of the story is that the intensities at which activity patterns occur must be taken into account when comparing risks across activity patterns. While in principle an activity pattern may seem like something you choose to do, in reality there is often no choice, because life demands that you do that activity often. For example, driving on a road is something most people must do on a regular basis. So while the conditional probability of getting killed if you are being launched into space by a rocket may be quite high (say 1%) compared to getting killed while driving down a road (say 0.001%), the sheer intensity/probability of driving is much higher than that of being launched into space, so the probability contribution of driving (the sum of entries in the matrix involving driving) is likely much higher than the contribution from space launches.

Both conditional and unconditional probabilities are useful, but an analyst must take care to condition on things that are truly controllable by a typical person (true choice) while averaging over things that are substantially outside his control.

Those of you who understand Bayesian probability calculation may feel I am belaboring an obvious point, but there is so much blatant misreporting around this that I feel that this point does need to be made repeatedly.


Yes, Your Time as a Parent Does Make a Difference – NYTimes.com

Trying to get a sense of the time you spend parenting from a single day’s diary is a bit like trying to measure your income from a single day. If yesterday was payday, you look rich, but if it’s not, you would be reported as dead broke. You get a clearer picture only by looking at your income — or your parenting time — over a more meaningful period.

via Yes, Your Time as a Parent Does Make a Difference – NYTimes.com.

The author makes a really good point. The sampling scheme used in the study the author refers to seems awful. A whole lot of noise is added by sampling just a couple of days, which can grossly dilute any evidence of statistical dependence. So the only noise reduction available comes from the sample size (the number of people), and that may be too small to give statistical significance.

Moreover, rather than reporting the results as an A/B test (treatment vs. control), the authors report a correlation and the predictive power of a simple linear regression model.

This is very naive. Firstly, correlation is a poor indicator of statistical dependence, and the naive “correlation” computed as a raw product moment is grossly dependent on the coding used for the variables. (Suppose the probability of rain on any day is 0.5, you never take an umbrella when it does not rain, and you take an umbrella with probability 0.5 on days it rains. Obviously taking the umbrella and the occurrence of rain are statistically dependent. But if you simply code rain={0=no, 1=yes} and take-umbrella={-1=no, +1=yes}, the raw product moment E[rain × umbrella] turns out to be exactly zero!) And even the properly centered Pearson correlation, which is invariant to such recodings, can be zero for variables that are strongly but nonlinearly dependent.
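A quick numerical check of the umbrella example (my own sketch, using the coding above): the raw product moment comes out essentially zero even though umbrella use clearly depends on the rain, while the coding-invariant Pearson correlation does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# rain coded {0 = no, 1 = yes}, P(rain) = 0.5
rain = rng.integers(0, 2, size=n)
# umbrella coded {-1 = no, +1 = yes}: never when dry, with prob 0.5 when it rains
umbrella = np.where((rain == 1) & (rng.random(n) < 0.5), 1, -1)

# Naive, uncentered "correlation" E[rain * umbrella] -- close to zero
print("raw product moment :", np.mean(rain * umbrella))
# Proper Pearson correlation (coding-invariant) -- clearly nonzero
print("Pearson correlation:", np.corrcoef(rain, umbrella)[0, 1])
```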

Secondly, a simple linear regression model is hardly the epitome of machine learning. Just because a linear regression model did not show good predictive power does not prove that parental time carries no predictive power about the child’s outcomes. The authors should have tried more sophisticated predictive techniques like kernel regression, deep learning, or even a brute-force MAP approach (with proper training and validation phases).
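Here is a minimal sketch of what I mean, on synthetic stand-in data (an invented nonlinear relationship, nothing to do with the actual study): if the true effect of time spent is nonlinear, a linear regression can report near-zero predictive power while a cross-validated kernel regression recovers a strong relationship.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: a non-monotone effect of "time spent" on "outcome".
time_spent = rng.uniform(0, 10, size=(500, 1))
outcome = -(time_spent[:, 0] - 5.0) ** 2 / 10.0 + 0.3 * rng.standard_normal(500)

# A linear model misses the relationship that a kernel method picks up.
linear_r2 = cross_val_score(LinearRegression(), time_spent, outcome, cv=5, scoring="r2")
kernel_r2 = cross_val_score(KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5),
                            time_spent, outcome, cv=5, scoring="r2")

print("linear regression R^2:", linear_r2.mean())
print("kernel regression R^2:", kernel_r2.mean())
```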

Lastly, the whole premise of the study seems flawed. The study measures the predictive power of time spent with the child. But that is hardly the full story when it comes to parental attention. How about other variables like “How many times do you attend parent-teacher conferences?”, “How many times did you help the child prepare a science project?”, “How many times do you take the child to a swimming/music/soccer class?”, or “How many times do you tell the child a bedtime story?” The quality of time, rather than the amount of time, is what matters, and this is given no attention whatsoever.

So if anyone tries to use this single study to belittle the importance of parental involvement, don’t respond with rhetorical arguments; use scientific arguments of the type I gave above to convince that person otherwise.

Parental involvement does matter! It matters a lot. It may matter more than money or social standing or peer culture. A dedicated parent is the best asset a child can have!


Tiger numbers could be a result of methodological mistake: Scientists – The Times of India

Tiger numbers could be a result of methodological mistake: Scientists – The Times of India.

Even wildlife conservation needs a solid understanding of statistics!

I am starting to think that a troubling number of folks, especially those in the media and government (who ought to know better, given their position of responsibility) have an ingrained misunderstanding of the concepts of randomness and noise.

Education systems around the world raise kids on a staple of deterministic “Newtonian” techniques. There is an input and there is a well-determined output. You solve equations and get fixed answers. You learn multiplication tables by heart. You memorize truth tables. Even in undergraduate studies, in advanced or engineering mathematics, you spend an inordinate amount of time solving differential equations and finding closed-form solutions. Very little time is spent on probability. The notion that the equation you are solving, or the input-output system you are modelling, has parameters, that those parameters are random variables, and that you will always have an error in estimating them, is given scant attention. So is the idea that finding precise solutions to systems of equations is moot when the coefficients of the equations are themselves not precisely known.

I remember an occasion in high school when we were asked to calculate the period of a simple pendulum: a ball hanging from a string tied to a hook at one end. We were given a micrometer to measure the diameter of the ball (accuracy of a hundredth of a mm!), a wooden foot-ruler to measure the length of the string (accuracy of half a cm!), and a hand-held stopwatch to measure the period of the pendulum. And we were required to “exactly” verify the oscillation law of the pendulum.

I remember the teacher frowning when our data points did not match up well with the law. Our measurement was obviously flawed, he had remarked.

None of the teachers bothered to point out that adding a precisely measured ball diameter to a vaguely measured string length would be pointless: the latter’s error would overpower the former. Nor did anyone question what the accuracy of the stopwatch-based measurement of the period would be, or whether small changes in length would even cause a detectable change in period. And what about wind or air resistance?
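A quick error-propagation sketch makes this quantitative (my own illustrative numbers: a roughly 1 m string read off a ruler to about ±0.5 cm, a 2 cm ball measured with a micrometer to ±0.01 mm, and a hand-held stopwatch limited by perhaps ±0.2 s of reaction time). Since T = 2π·sqrt(L/g), the relative error in the period is about ΔL/(2L), so the ruler’s error dwarfs the micrometer’s, and the resulting period shift sits far below what the stopwatch can resolve.

```python
import math

g = 9.81                                  # m/s^2
string_len, string_err = 1.00, 0.005      # ruler: about half a cm
ball_radius, radius_err = 0.010, 0.00001  # micrometer: about a hundredth of a mm
stopwatch_err = 0.2                       # s, human reaction time on a single period

L = string_len + ball_radius              # effective pendulum length
T = 2 * math.pi * math.sqrt(L / g)        # ideal small-angle period

# Propagate length errors: dT/T = dL / (2L)
dT_string = T * string_err / (2 * L)
dT_radius = T * radius_err / (2 * L)

print(f"T ~ {T:.3f} s")
print(f"period error from string length: {dT_string*1000:.1f} ms")
print(f"period error from ball radius:   {dT_radius*1000:.3f} ms")
print(f"stopwatch resolution:            {stopwatch_err*1000:.0f} ms")
```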

We learnt everything about the exact, pure-vacuum Newtonian law of the pendulum but nothing about the error in measuring that law in real-life situations. The fact that the law could be verified only up to a certain accuracy (with the given equipment) was never illuminated. This is exactly the kind of failing that produces the census error this story refers to.

There is noise everywhere in the world. We need to be very vigilant when drawing conclusions from any measurement!

Gene Linked to Obesity Hasn’t Always Been a Problem, Study Finds – NYTimes.com

People born before the early 1940s were not at additional risk of putting on weight if they had the risky variant of FTO. Only subjects born in later years had a greater risk. And the more recently they were born, the scientists found, the greater the gene’s effect

via Gene Linked to Obesity Hasn’t Always Been a Problem, Study Finds – NYTimes.com.

Couching this in terms of machine learning, we would say that there is significant mutual information between the disease (obesity in this case) and the genome as well as the environment of the individual carrying that genome. This means that an optimal predictor should make use of features of not only the genome but also the environment. Otherwise the variability of the environment will spoil the prediction accuracy of the predictor.
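As a toy illustration of that point (entirely synthetic data, invented variable names and effect sizes, nothing from the actual study): a predictor that includes a gene-by-environment interaction feature, here the risk variant crossed with birth cohort, picks up signal that a genome-only predictor partly averages away.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 5000

variant = rng.integers(0, 2, n)       # carries the risk variant or not
cohort = rng.uniform(0.0, 1.0, n)     # 0 = born early, 1 = born late (the "environment")

# Synthetic ground truth: the variant raises risk only in later cohorts.
p_obese = 1 / (1 + np.exp(-(-1.0 + 2.0 * variant * cohort)))
obese = rng.random(n) < p_obese

X_gene_only = variant.reshape(-1, 1)
X_gene_env = np.column_stack([variant, cohort, variant * cohort])  # add the interaction feature

for name, X in [("genome only", X_gene_only), ("genome + environment", X_gene_env)]:
    auc = cross_val_score(LogisticRegression(), X, obese, cv=5, scoring="roc_auc").mean()
    print(f"{name:22s} AUC ~ {auc:.3f}")
```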

This is actually a very significant theme for modern medicine, I believe. While people have paid lip service to the fact that health is determined by genes as well as the environment (the usual nature vs nurture debate), there has been very little research on actually quantifying this for different types of medical conditions. And there is the broader question about things like IQ, artistic achievement and so on and how they relate to upbringing.

This is actually a solvable big-data problem, provided we can gather environmental data at scale. And that is now becoming possible with the advent of the Internet of Things. We have the capacity to exhaustively monitor our environment: air quality, chemical and noise pollution, weather, cosmic rays, and what not. Plus, using wearable devices like Fitbit, we can monitor our biometrics in detail. Coupling these two streams of information with genomic information will give us the big-data corpus needed to quantitatively understand the nature-vs-nurture question and move the discussion beyond mere polemic. This will have an immense and real payoff in terms of saving and empowering people’s lives.


When a Health Plan Knows How You Shop – NYTimes.com

When a Health Plan Knows How You Shop – NYTimes.com.

Predictive analytics at work! Apparently if you are a mail order shopper you are more likely to use emergency services.

Be that as it may, we should guard against the cardinal sin of confusing correlation with causation. As the article itself points out, you may be homebound because of an infirmity or an inability to drive due to a medical condition, and that may also be the reason you are more likely to need emergency care. In such a case, mail order shopping would only be a symptom, not the cause, and stopping mail order shopping would not make you any healthier. On the other hand, doing a lot of internet shopping probably also involves a lot of sedentary time, and that can indeed cause health problems. So perhaps switching to mobile internet shopping while on the move may actually be a helpful change.

Disentangling causation from correlation can be a nightmare. And it needs more than statistics and machine learning. It needs some aspects of control theory, time-series modelling, and good old physics. ML just tries to build the model that gives the best precision/recall trade-off over a set of validation data. But to understand causation, we need models that are not only precise, but also understandable and plausible in terms of physical processes.

My favorite example is the phenomenon of inter-symbol interference (ISI) in wireless communications. We can fit all kinds of complicated models to the observed interference-corrupted signal, including kernel methods and even deep learning. But as it happens, the physics of the situation tells us that ISI is a relatively simple linear phenomenon involving convolution of the transmitted signal with a “channel response” that has several distinct “multipath” components. Armed with this knowledge, we can easily identify the model using a tap-delay-line structure and simple least-squares methods. More importantly, the learned model corresponds to actual physical reality: each “tap” in the model maps to a real physical reflector of the radio waves. The strength, phase and delay of a tap can tell us something about the physical object that is reflecting the radio waves. (See my previous blog.)
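A minimal sketch of that idea (a toy example of my own, with made-up channel taps, not the model from the linked post): treat the received signal as the known transmitted symbols convolved with a short FIR tap-delay-line channel plus noise, and recover the taps by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multipath channel: each tap would correspond to one physical reflection path.
true_taps = np.array([1.0, 0.0, 0.45, 0.0, -0.2])   # hypothetical channel response
n_taps = len(true_taps)

# Known transmitted training symbols (+/-1) and the noisy received signal.
tx = rng.choice([-1.0, 1.0], size=500)
rx = np.convolve(tx, true_taps)[: len(tx)] + 0.05 * rng.standard_normal(len(tx))

# Build the regression matrix: column k is the transmitted sequence delayed by k samples.
X = np.column_stack([np.concatenate([np.zeros(k), tx[: len(tx) - k]]) for k in range(n_taps)])

# Least-squares estimate of the tap-delay-line channel.
est_taps, *_ = np.linalg.lstsq(X, rx, rcond=None)

print("true taps:     ", true_taps)
print("estimated taps:", np.round(est_taps, 3))
```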

The moral of the story is that while probability and ML are important, we should not lose sight of more classical disciplines like control theory and mechanics if we want to go beyond prediction and actually find the causes of interesting phenomena!