Continuing with the topic of data-driven engineering, today I would like to talk about the Data Platform. In many ways this is the “muscle” of data-driven engineering, and it separates the men from the mice, so to speak.
Some people pursue body building for its own sake, to win competitions and earn bragging rights. Others work out and develop muscles for very particular purposes. Oarsmen develop thick short power muscles in their thighs and lower back, sprinters develop their upper body for propulsion, and marathoners develop long lean muscles in their calves. If Data Platform is a muscle, then its development should be considered to be in the second category. Companies and businesses do not win or lose battles for market share simply because they do or do not have a data platform on steroids. Rather what matters is whether the data platform has been developed with the needs of their particular business in mind.
So let us first consider what a data platform actually is. As I mentioned in my previous post on instrumentation, clients as well as web services produce data in real time, in quantities that can range from minuscule to tidal proportions depending on the size of the service or application being instrumented. Essentially all this data arrives in the cloud in the form of “events”. An event has two parts:
- Essential information applicable to all events, such as the time when the event was created, the place where it was created, and the device and/or user who created it. An event which lacks any of this information is considered malformed and need not even be processed further. (The rate of such malformed events is an important metric to watch when assessing the quality of the data platform.)
- Custom information that is specific to the event. For example, in a search engine, the event of a query being formulated will have custom information pertaining to the text string of the query. In a software client like Microsoft Word, the event about opening a file may contain information about the file being opened, such as its name, type and indexed keywords. In the case of our HUD display in a car, an event could be an occurrence of a fender bender (collision) detected by sensors in the bumpers, and the custom information could include the speed of the car when the event happened. Custom information, being “custom”, defies any standardized schema, and is best described as a simple hierarchical property bag such as a JSON dictionary.
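To make this concrete, here is a minimal sketch in Python of what such an event might look like, and how the malformed-event check could work. The field names (`event_time`, `source`, `device_id`) are my own illustrative choices, not a standard schema.

```python
from datetime import datetime, timezone

# Hypothetical "essential" fields common to all events; everything
# event-specific lives in the free-form "custom" property bag.
ESSENTIAL_FIELDS = ("event_time", "source", "device_id")

def is_well_formed(event: dict) -> bool:
    """An event missing any essential field is malformed and can be
    dropped early; the drop rate is itself a data-quality metric."""
    return all(event.get(f) is not None for f in ESSENTIAL_FIELDS)

event = {
    "event_time": datetime.now(timezone.utc).isoformat(),
    "source": "client",
    "device_id": "dev-123",
    "custom": {"query": "weather seattle", "result_count": 10},
}

assert is_well_formed(event)
assert not is_well_formed({"custom": {"query": "no essentials"}})
```

A real pipeline would also count how often the second case occurs, since that rate feeds the data-quality dashboards mentioned above.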
The first, public-facing, part of the data platform is the endpoint where these events are pushed in from clients as well as web services. Ensuring that this endpoint can handle all the incoming events gracefully and quickly is obviously important. Dropping too many events because queues overflow, or an inability to route events to the proper downstream processors due to downstream congestion, would be signs of a creaking platform. A good design would ensure that as the platform gets loaded it systematically (down-)samples according to some well-defined logic – down-sampling the non-critical events first, while ensuring that this down-sampling information is recorded in the retained events as well as in diagnostic logs. Moreover, since storage has now become so cheap, companies also often simply store all the raw events, at least for a few months, just in case they need to analyze them again in newer, as yet unconsidered, ways.
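A toy sketch of such priority-aware down-sampling might look like the following. The priority levels and keep-probabilities are invented for illustration; a real ingestion tier would adjust them dynamically with load. Note how each retained event is stamped with its sampling rate, so downstream counts can be re-weighted.

```python
import random

# Hypothetical priorities: 0 = critical (never dropped), higher = more droppable.
KEEP_PROBABILITY = {0: 1.0, 1: 0.5, 2: 0.1}

def downsample(events, keep_prob=KEEP_PROBABILITY, rng=random.random):
    """Drop non-critical events first; record the sampling rate in each
    retained event so analysts can correct counts later."""
    retained, dropped = [], 0
    for ev in events:
        p = keep_prob.get(ev.get("priority", 2), 0.1)
        if rng() < p:
            retained.append(dict(ev, sample_rate=p))  # stamp for diagnostics
        else:
            dropped += 1  # would also be counted in diagnostic logs
    return retained, dropped

events = [{"priority": 0, "name": "crash"}, {"priority": 2, "name": "hover"}]
# Fixed rng for a deterministic demo: 0.4 < 1.0 keeps the critical event,
# 0.4 >= 0.1 drops the low-priority one.
kept, n_dropped = downsample(events, rng=lambda: 0.4)
```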
Once the events have arrived, they can be processed in various ways, the most important example of which is constructing a user’s timeline. That is, we collect the events of each unique user, order them by time of occurrence, and then create chunks of usage called “sessions”. A session could simply be a clock hour’s worth of activity, or more elaborately it could be the activity pertaining to a particular task, which will be very specific to the business concerned. For example, in search land, a session may be the activity that a user performs for a particular intent before disappearing for a prolonged time, say an hour. (The assumption here is that when the user reappears after a long time, his intent will be different.) In Word, a session could be the activity pertaining to a particular document (opening, reading, editing, sharing, closing). In our car’s HUD example, a session could be a single trip made by the car (between an engine-ON and an engine-OFF event, both inclusive). Identifying and grouping activity by session, while not terribly critical in a mathematical sense, can have far-reaching practical implications for how the data is viewed and exploited by the organization. A session will inevitably be considered a “single use” of the product, and before you know it, increasing the number of sessions will itself be a primary business metric. The number of sessions is also used as a very important normalizer for all types of metrics. In search, we may calculate successes per session; in Word, we may calculate crashes per session; and for a car, we may also want to know crashes (literally collisions!) per session. For this reason it is very important to align the definition of “session” with how the top management of the company views a single customer visit. Otherwise, be prepared for unending agony whenever you get confounding analysis results!
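An inactivity-gap sessionizer, in the spirit of the search example above, could be sketched like this. The one-hour gap comes from the text; the field names are illustrative, and a real implementation would run as a distributed job rather than in memory.

```python
from itertools import groupby

SESSION_GAP = 3600  # seconds of inactivity that ends a session (assumed: 1 hour)

def sessionize(events):
    """Group events per user, order them by time, and split into sessions
    wherever the gap between consecutive events exceeds SESSION_GAP."""
    sessions = []
    events = sorted(events, key=lambda e: (e["user"], e["time"]))
    for user, user_events in groupby(events, key=lambda e: e["user"]):
        current, last_t = [], None
        for ev in user_events:
            if last_t is not None and ev["time"] - last_t > SESSION_GAP:
                sessions.append({"user": user, "events": current})
                current = []  # a new intent, a new session
            current.append(ev)
            last_t = ev["time"]
        if current:
            sessions.append({"user": user, "events": current})
    return sessions

evs = [
    {"user": "u1", "time": 0},
    {"user": "u1", "time": 100},
    {"user": "u1", "time": 8000},  # > 1h after the previous event
]
sessions = sessionize(evs)  # the long gap splits u1's activity into two sessions
```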
Once a user’s data has been partitioned into sessions, and this has been done for all users, the data platform needs to enrich each session with metadata. This is data that may come from other events, be joined (“merged”) in from other internal/external sources of data, or even be computed or inferred from the incoming instrumented data itself. For example, in a search session, the metadata could include a unique session ID, a class indicator for the session based on the nature of the queries occurring in it (education/commerce/porn, etc.), and information regarding the market, demographics, etc. of the user (if available). In Word, a session could be enriched with data about the nature of the device (tablet/PC/phone) and the type of network being used (WiFi/cellular). For the car, the session could be enriched with information about whether there were multiple people in the car or only the driver was present, and whether the weather was rainy or sunny. (This data could come from weather.com rather than from the car’s telemetry!) Session metadata is often used for pivoting the metrics reported from the data and, even more importantly, for segmenting an A/B analysis. (We will talk more about this later.)
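Enrichment can be sketched as a function that merges inferred and externally joined metadata into a session. The `weather_lookup` and `device_lookup` callables below are hypothetical stand-ins for real internal/external data sources such as the weather.com join mentioned above.

```python
def enrich_session(session, weather_lookup, device_lookup):
    """Attach metadata to a session: some computed from the session's own
    events (session ID, event count), some joined from external sources."""
    first_time = session["events"][0]["time"]
    meta = {
        # Inferred from the session itself:
        "session_id": f'{session["user"]}-{first_time}',
        "n_events": len(session["events"]),
        # Joined from (hypothetical) external sources:
        "weather": weather_lookup(first_time),
        "device_class": device_lookup(session["user"]),
    }
    return dict(session, meta=meta)

s = {"user": "u1", "events": [{"time": 0}, {"time": 42}]}
enriched = enrich_session(s, lambda t: "rainy", lambda u: "tablet")
```

In practice each piece of metadata here becomes a pivot or a segment in later analysis, which is why getting the enrichment step right matters so much.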
Other than session-based processing, other types of processing are also possible, say based upon device, or market. I won’t talk too much about them here.
Once all this data cooking has been done, the session stream is stored in a Hadoop-type distributed infrastructure where it can be processed by map-reduce systems. In this case the stream is “sharded” (i.e. split) over thousands of nodes, with redundancy. A stream is basically a row-set, where each row has a bunch of fields (columns) corresponding to the information about one session. There will be columns for the metadata, and typically a column for the bag-of-events comprising the session.
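For illustration, one row of such a session stream might look like the following. The column names are made up; real schemas will be business-specific, but the shape – flat metadata columns plus one nested bag-of-events column – is the point.

```python
# One row of the session stream: flat metadata columns, plus a single
# nested "events" column holding the raw events of that session.
session_row = {
    "session_id": "u1-0",        # metadata columns ...
    "user": "u1",
    "market": "en-US",
    "device_class": "tablet",
    "events": [                  # ... and the bag-of-events column
        {"time": 0, "custom": {"query": "weather seattle"}},
        {"time": 42, "custom": {"clicked_rank": 1}},
    ],
}

# Metrics can then be computed by scanning the flat columns, only
# cracking open the events bag when a deeper analysis demands it.
n_events = len(session_row["events"])
```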
As a general rule the size of the stream produced is smaller than the size of the raw events from which it is composed, because we typically discard a lot of information that is not pertinent to the “view” at hand (in our case, the session “view”). This allows analysts to analyze the streams far more quickly. (Analyzing terabytes of data can be quite time-consuming even on a large map-reduce cluster.)
Alternatively, the data can also be made available in the form of real-time streams and used for complex event processing. Some kind of down-sampling (dropping some sessions) or pruning (dropping some columns) is usually needed here, because otherwise the real-time hose would be unmanageable. This is currently the cutting edge of data analysis and data science, and all indications are that many types of analysis that were hitherto considered possible only in an offline “map-reduce” sense will slowly transition to being done in real-time CEP systems. However, this differentiation is mainly about technology and not really conceptual, so let us not worry too much about it for now. (It does have implications for agility – how quickly decisions can be made based on the data.)
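A minimal sketch of such stream thinning – sampling sessions and pruning columns before they reach the real-time consumers – might look like this. The default sample rate and retained columns are arbitrary choices for illustration.

```python
import random

def prune_stream(sessions, keep_columns=("session_id", "market"),
                 keep_prob=0.01, rng=random.random):
    """Generator that thins a real-time session stream: keep a small
    sample of sessions and drop all but a few columns, so downstream
    CEP consumers can keep up with the hose."""
    for s in sessions:
        if rng() < keep_prob:
            yield {k: s[k] for k in keep_columns if k in s}

# Demo: keep everything (keep_prob=1.0) to show the column pruning.
stream = ({"session_id": i, "market": "en-US", "events": []} for i in range(1000))
sample = list(prune_stream(stream, keep_prob=1.0))
```

The same generator shape composes naturally with whatever CEP engine sits downstream, since each thinned session is emitted as soon as it arrives.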
One could say a lot more about how a good data platform should be architected, but that would probably need a whole book by itself. The main message that I want to get across here is that a good data platform needs to be agile and scalable. Agility comes not merely from the brute-force power used in receiving and processing events, but in more nuanced ways from the correct choice of concepts like the session, which affects the complexity of the entire downstream analysis. Similarly, scalability is not simply about a lot of storage and a lot of map-reduce horsepower. It also comes from the ease with which you can tweak your stream/view schemas, the ease with which your system can on-board entirely new types of instrumentation, the ease with which you can join multiple streams (internal or external), and the ease with which you can add to or modify the data enrichment process. When making investments, management needs to understand very carefully what will give the most impact for the type of questions the business wants to answer. There is a danger that a sense of lethargy and fatalism can creep in – “we do things this way because that’s how we have done it so far and it has worked.” Data platforms, like a company’s customer-facing products, need to evolve and remain relevant.
So while all data platforms are essentially equal, some indeed are more equal than others!
(Next time, I will talk about Controlled Experimentation – how to draw actionable insights while controlling for randomness and bias.)