Category Archives: agility

IBM’s Design-Centered Strategy to Set Free the Squares – The New York Times

Look at problems first through the prism of users’ needs, research those needs with real people and then build prototype products quickly.

Source: IBM’s Design-Centered Strategy to Set Free the Squares – The New York Times

Lovely article, a must-read for aspiring program managers.

And the emphasis on agility, of course, puts in-product telemetry front and center:

“The silver bullet, you might say, is speed, this idea of speed.”

The fallacy of perfection

In this post I would like to round out my series on agile, data-informed product engineering by talking about how to drive the right kind of data culture in your organization. The main message I want to get across is that data does not have to be perfect in order to be useful. I am basically going to expand on a theme that has been aptly discussed in Douglas Hubbard’s book “How to Measure Anything”.

An in-product telemetry system is very different from transaction processing systems like those used in banking, airline reservations or medical records management. Often the only thing they have in common is that they are big data systems. But that is hardly of much interest to us. As I have said before, data being “big” is simply a matter of scale and not necessarily a matter of quality or significance. The differences between transaction and telemetry systems, on the other hand, are of critical importance.

In a transactional system there is little to no tolerance for error of any kind. For example, if you deposit 20 checks in your bank and only 19 of them show up in your account, that is likely not acceptable to you. Even a minor mistake cannot be tolerated – like you paying in $200 and the system only acknowledging $199. Or suppose you book airline tickets on 20 occasions and once in a while the airline simply forgets your booking and sends you home from the airport – that is probably going to give you a very bad day too. Or suppose your doctor orders several important blood count measurements and the report is missing or mistaken about some of the counts – that could lead to a life-threatening situation. In all these cases the expectation of data quality is very high and the tolerance for error is very low.

But from the point of view of data-driven product engineering, these kinds of systems are over-engineered and downright boring in an intellectual sense. It is like buying a million-dollar car to go to the grocery store every day. Yes, the ride is great and catches attention, but it is a very poor use of resources. Businesses cannot afford to indulge in such luxury because it prevents them from investing resources in actual product innovation. Great cars are those that give you real bang for the buck. Those are the cars that have been engineered with care, where trade-offs have been made in a well-thought-out manner. Anybody can build a great car with a million dollars. It takes a talented engineering team to build a great car with 20K dollars.

Similarly, the aim of a telemetry system for agile product development is to gather enough data, and with enough fidelity, to make good and timely decisions. It is perfectly okay not to gather all the data (i.e. to sample your data) and even to make occasional errors in data collection (allowing some noise), as long as the answers to critical questions are not impacted. And it takes some significant data science to figure out what “enough” means and what the optimal sampling strategy is.
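As one concrete, purely illustrative example of a sampling strategy, here is a minimal Python sketch of deterministic, hash-based sampling keyed on a hypothetical user ID, so the same users stay in or out of the sample across sessions; the 1% rate and the `user_id` field are assumptions, not a recommendation for any particular product.

```python
import hashlib

SAMPLING_RATE = 0.01  # collect telemetry from ~1% of users (illustrative choice)

def in_sample(user_id: str, rate: float = SAMPLING_RATE) -> bool:
    """Deterministically decide whether a user's telemetry is collected.

    Hashing the user ID maps it to a pseudo-random but stable point in
    [0, 1); the same user is always in or out of the sample, which keeps
    per-user metrics (e.g. sessions per user) unbiased.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # stable value in [0, 1]
    return bucket < rate

# Example: only events from sampled users would be sent upstream.
if in_sample("user-12345"):
    pass  # emit telemetry event here
```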

Historically, telemetry and A/B experimentation originated in web services like search and music downloads. In these cases, since the back-end sees every transaction, it can do very rich logging on the server side itself. In such systems, if you can log 10% of the data you might as well log 100% of the data without too much overhead. (Storage is essentially free.) And the investment you make in achieving a certain data quality for 10% of the data immediately accrues to 100% data collection too, without additional cost.

But the equation changes completely when you are collecting data from software clients that run on machines in the end user’s hands, or more generally from physical products that are present in the user’s life – on his body, in his home, in his workplace, and generally in his environment. In these scenarios, the user really pays a cost for the telemetry being collected. For example, some of his network bandwidth is consumed, which, for metered devices, means actual dollars spent. It also consumes battery power and hence reduces the usable life on a single charge. Finally, it may actually degrade the user experience with the product, because the product is spending some of its resources, like CPU and memory, on collecting and transporting telemetry. So sampling the telemetry in a careful and well-thought-out manner is extremely critical for in-product telemetry.

But the good news is that, if done properly, heavily (down) sampled data can still give you the same quality of insights as the full data. Suppose you want to find out the average height of all the people who work in your organization. Do you really need to measure each of them from head to toe before answering the question? Obviously not. Why? Because you probably do not need accuracy beyond, say, +/- 1 cm, and that error margin can be achieved from a random sample of a much smaller size. This is elementary statistics, and yet it is often overlooked. (The “random” part is of course very important, because by choosing a non-random sample, such as only men, or only the people who work in the boiler room, you can easily bias your answer.) Asking for an unreasonably high level of accuracy and completeness in data, such as 99.999%, may sound like being very responsible and diligent. But if your hypothesis could be answered with, say, only a 1% sample, then your insistence on 99.999% is quite wrong and irresponsible!
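To make the height example concrete, here is a small Python sketch of the standard sample-size calculation for estimating an average to within a chosen margin of error at roughly 95% confidence; the 10 cm standard deviation is an assumed figure for illustration.

```python
import math

def required_sample_size(std_dev: float, margin: float, z: float = 1.96) -> int:
    """Sample size needed so the estimated mean is within +/- margin.

    Uses the textbook formula n = (z * sigma / margin)^2 for a simple
    random sample; z = 1.96 corresponds to ~95% confidence.
    """
    return math.ceil((z * std_dev / margin) ** 2)

# Assumed numbers: heights vary with a standard deviation of ~10 cm,
# and we are happy with +/- 1 cm accuracy on the average.
print(required_sample_size(std_dev=10.0, margin=1.0))  # ~385 people, regardless of org size
```

Note that the answer does not grow with the size of the organization, which is exactly why a small random sample goes such a long way.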

For one, it may be an unrealistically high standard to meet even after throwing state-of-the-art technology at the problem. Secondly, it can grossly delay your product life cycle and put you at a competitive disadvantage. And thirdly, insisting on an unrealistic level of completeness from existing signals may cause you to lose focus on collecting new, hitherto untouched, types of signals. A cardinal lesson from information theory and machine learning is that new types of signals often give more additional information than more densely sampled data from existing signals. For example, instead of trying to drive the fidelity of click telemetry in your app/web page to five nines, first consider other types of telemetry such as cursor hovers, swipes, pagination and other controls.

So next time someone says to you “we need 99.999% data completeness!”, do vigorously push back and ask them “why do you need 5 nines? What kind of question are you asking that requires this level of comprehensiveness?”. Almost always you will find that they have not thought about the scenario in much depth at all.

Selecting a good bouquet of signals for in-product telemetry, and choosing a good sampling rate and sampling strategy, is as much an art as it is a science. It is something that you get better and better at as you work in this field. You need to have a sense of the hypotheses that are being posed and the kind of accuracy and confidence that you need in your answers. You also need to have a good instinctive feel for how your users are actually using your product and what types of signals are most informative about their experience. And finally you need to have a good sense of the competitive landscape – how fast your competitors are evolving and what level of agility is needed to catch up and overtake them (if you are behind) or to maintain a safe lead (if you are already ahead). This is where you start putting on multiple hats – sometimes that of a data scientist, sometimes that of a product/program manager, and sometimes simply that of the end user of your product, because he/she does deserve your empathy and understanding!

Data Platform – the Muscle

Continuing with the topic of data-driven engineering, today I would like to talk about the Data Platform. In many ways this is the “muscle” of data-driven engineering, and it differentiates the men from the mice, so to speak.

Some people pursue bodybuilding for its own sake, to win competitions and earn bragging rights. Others work out and develop muscles for very particular purposes. Oarsmen develop thick, short power muscles in their thighs and lower back, sprinters develop their upper body for propulsion, and marathoners develop long, lean muscles in their calves. If the data platform is a muscle, then its development belongs in the second category. Companies and businesses do not win or lose battles for market share simply because they do or do not have a data platform on steroids. Rather, what matters is whether the data platform has been developed with the needs of their particular business in mind.

So let us first consider what a data platform actually is. As I mentioned in my previous post on instrumentation, clients as well as web services produce data in real time, in quantities that can range from minuscule to tidal proportions depending on the size of the service or application being instrumented. Essentially all this data arrives in the cloud in the form of “events”. An event has two parts (a minimal sketch of such an event follows this list):

  • Essential information applicable to all events, such as the time when the event was created, the place where it was created, and the device and/or user who created it. An event that lacks any of this information is considered malformed and need not even be processed further. (The rate of such malformed events is an important data quality metric to watch when assessing the quality of the data platform.)
  • Custom information that is specific to the event. For example, in a search engine the event of a query being formulated will have custom information pertaining to the text string of the query. In a software client like Microsoft Word, the event about opening a file may contain information about the file being opened, such as its name, type and indexed keywords. In the case of our car’s HUD, an event could be the occurrence of a fender bender (collision) detected by sensors in the bumpers, and the custom information could include the speed of the car when the event happened. Custom information, being “custom”, defies any standardized schema and is best described as a simple hierarchical property bag such as a JSON dictionary.
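As a rough illustration (not any particular product’s actual schema), a minimal event and a basic malformed-event check might look like this in Python; the field names are assumptions.

```python
from datetime import datetime, timezone

# Fields every event must carry; anything missing makes the event "malformed".
REQUIRED_FIELDS = {"timestamp", "device_id", "user_id", "event_name"}

def is_malformed(event: dict) -> bool:
    """True if the event lacks any of the essential, schema-wide fields."""
    return not REQUIRED_FIELDS.issubset(event)

# A hypothetical "file opened" event from a word processor:
event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "device_id": "device-42",
    "user_id": "user-12345",
    "event_name": "file_open",
    # Custom part: a free-form property bag specific to this event type.
    "properties": {"file_type": "docx", "keywords": ["budget", "q3"]},
}

print(is_malformed(event))  # False
```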

The first, public-facing part of the data platform is the endpoint where these events are pushed in from clients as well as web services. Ensuring that this endpoint can handle all the incoming events gracefully and quickly is obviously important. Dropping too many events because queues overflow, or an inability to route events to the proper downstream processors due to downstream congestion, would be signs of a creaking platform. A good design would ensure that as the platform gets loaded it systematically (down) samples according to some well-defined logic – down-sampling first those events that are not critical, while ensuring that this down-sampling information is recorded in the retained events as well as in diagnostic logs. Moreover, since storage has become so cheap, companies often also simply store all the raw events, at least for a few months, just in case they need to analyze them again in newer, as yet unconsidered, ways.
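A minimal sketch of what such priority-aware down-sampling under load could look like; the priority classes, the load threshold and the field names are all assumptions made for illustration.

```python
import random
from typing import Optional

# Hypothetical priorities: lower number = more critical, down-sampled last.
PRIORITY = {"crash": 0, "error": 1, "usage": 2, "heartbeat": 3}

def admit(event: dict, load: float) -> Optional[dict]:
    """Decide whether to keep an event as the endpoint gets loaded.

    Under light load everything is kept. As load rises, lower-priority
    events are dropped more aggressively, and kept events record the
    sampling rate so downstream analysis can re-weight the counts.
    """
    priority = PRIORITY.get(event["event_name"], 2)
    if load < 0.8 or priority == 0:
        keep_rate = 1.0                     # keep all critical traffic
    else:
        keep_rate = max(0.1, 1.0 - load) / (priority + 1)
    if random.random() < keep_rate:
        event["sampling_rate"] = keep_rate  # recorded in the retained event
        return event
    return None                             # dropped; count it in diagnostic logs
```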

Once the events have arrived, they can be processed in various ways, the most important example of which is constructing a user’s timeline. That is, we collect the events of each unique user, order them by time of occurrence, and then create chunks of usage called “sessions”. A session could simply be a clock hour’s worth of activity, or, more elaborately, it could be the activity pertaining to a particular task, which will be very specific to the business concerned. For example, in search land, a session may be the activity that a user does for a particular intent before disappearing for a prolonged time, say an hour. (The assumption here would be that when the user reappears after a long time his intent would be different.) In Word, a session could be the activity pertaining to a particular document (opening, reading, editing, sharing, closing). In our car HUD example, a session could be a single trip made by the car (between an engine-ON and an engine-OFF event, both inclusive). Identifying and grouping activity by session, while not terribly critical in a mathematical sense, can have far-reaching practical implications for how the data is viewed and exploited by the organization. A session will inevitably be considered a “single use” of the product, and before you know it, increasing the number of sessions will itself be a primary business metric. The number of sessions is also used as a very important normalizer for all kinds of metrics. In search, we may calculate successes per session, in Word we may calculate crashes per session, and for a car we may also want to know crashes (literally collisions!) per session. For this reason it is very important to align the definition of “session” with how the top management of the company views a single customer visit. Otherwise, be prepared for unending agony whenever you get confounding analysis results!
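Here is a minimal Python sketch of the inactivity-gap flavor of sessionization described above; the one-hour gap and the shape of the events (each carrying a datetime `timestamp`) are assumptions.

```python
from datetime import timedelta

SESSION_GAP = timedelta(hours=1)  # assumed inactivity threshold

def sessionize(user_events):
    """Group one user's events into sessions by inactivity gap.

    Events are sorted by timestamp; a new session starts whenever the
    gap since the previous event exceeds SESSION_GAP.
    """
    sessions = []
    current = []
    last_time = None
    for event in sorted(user_events, key=lambda e: e["timestamp"]):
        if last_time is not None and event["timestamp"] - last_time > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(event)
        last_time = event["timestamp"]
    if current:
        sessions.append(current)
    return sessions
```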

Once a user’s data has been partitioned into sessions, and this has been done for all the users, the data platform needs to enrich each session with metadata. This is data that may come from other events, be joined (“merged”) in from other internal/external sources of data, or even be computed or inferred from the incoming instrumented data itself. For example, in a search session, the metadata could include a unique session ID, a class indicator for the session based on the nature of the queries happening in it (education/commerce/porn etc.), and information regarding the market, demographics etc. of the user (if available). In Word, a session could be enriched with data about the nature of the device (tablet/PC/phone) and the type of network being used (WiFi/cellular). For the car, the session could be enriched with information about whether there were multiple people in the car or only the driver was present, and whether the weather was rainy or sunny. (This data could come from weather.com rather than from the car telemetry!) Session metadata is often used for pivoting the metrics reported from the data and, even more importantly, for segmenting an A/B analysis. (We will talk more about that later.)
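A rough sketch of this enrichment step, wrapping a session with derived and joined metadata; the weather lookup and the field names are hypothetical stand-ins for whatever sources a real platform would merge in.

```python
import uuid

def enrich_session(session_events, device_info, weather_by_day):
    """Wrap a list of session events with joined and derived metadata.

    device_info and weather_by_day stand in for other internal/external
    data sources that get merged onto the session.
    """
    start = session_events[0]["timestamp"]
    return {
        "session_id": str(uuid.uuid4()),              # derived metadata
        "start_time": start,
        "device_type": device_info.get("type"),       # joined from a device registry
        "network": device_info.get("network"),
        "weather": weather_by_day.get(start.date()),  # joined from e.g. weather data
        "events": session_events,                     # the bag of events itself
    }
```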

Other than session-based processing, other types of processing are also possible, say based on device or market. I won’t talk too much about them here.

Once all this data cooking has been done, the session stream is stored in a Hadoop-type distributed infrastructure where it can be processed by map-reduce systems. In this case the stream is “sharded” (i.e. split) over thousands of nodes, with redundancy. A stream is basically a row-set, where each row has a bunch of fields (columns) corresponding to the information about one session. There will be columns for the metadata, and typically a column for the bag of events comprising the session.
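To tie this back to the metrics mentioned earlier, here is a tiny sketch of computing crashes per session over such a row-set; in practice this would run as a map-reduce or similar distributed job over the sharded stream, and the column names are assumptions.

```python
def crashes_per_session(session_rows):
    """Each row is one session: metadata columns plus a bag-of-events column."""
    total_sessions = 0
    total_crashes = 0
    for row in session_rows:
        total_sessions += 1
        total_crashes += sum(1 for e in row["events"] if e["event_name"] == "crash")
    return total_crashes / total_sessions if total_sessions else 0.0
```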

As a general rule the size of the stream produced is smaller than the size of the raw events that compose it, because we typically discard a lot of information that is not pertinent to the “view” at hand (in our case, the session “view”). This allows analysts to analyze the streams far more quickly. (Analyzing terabytes of data can be quite time consuming even with a large map-reduce cluster.)

Alternatively, the data can also be made available in the form of real-time streams and used for complex event processing (CEP). Some kind of down-sampling (dropping some sessions) or pruning (dropping some columns) is usually needed here, because otherwise the real-time hose would be unmanageable. This is currently the cutting edge of data analysis and data science, and all indications are that many types of analysis that were hitherto considered possible only in an offline “map-reduce” sense will slowly transition to being done in real-time CEP systems. However, this distinction is mainly about technology and not really conceptual, so let us not worry too much about it for now. (It does have implications for agility – how quickly decisions can be made based on the data.)

One could say a lot more about how a good data platform should be architected, but that would probably need a whole book by itself. The main message that I want to get across here is that a good data platform needs to be agile and scalable. Agility comes not merely from the brute-force power used in receiving and processing events, but in more nuanced ways from the correct choice of concepts like the session, which affects the complexity of the entire downstream analysis. Similarly, scalability is not simply about a lot of storage and a lot of map-reduce horsepower. It also comes from the ease with which you can tweak your stream/view schema, the ease with which your system can on-board entirely new types of instrumentation, the ease with which you can join multiple streams (internal or external), and the ease with which you can add to or modify the data enrichment process. When making investments, management needs to understand very carefully what will give the most impact for the type of questions the business wants to answer. There is a danger that a sense of lethargy and fatalism can creep in – “we do things this way because that’s how we have done it so far and it has worked.” Data platforms, like the company’s customer-facing products, also need to evolve and remain relevant.

So while all data platforms are essentially equal, some indeed are more equal than others!

(Next time, I will talk about Controlled Experimentation – how to draw actionable insights while controlling for randomness and bias.)