Data lakes and data streams: two of the hottest data buzzwords du jour and as likely as any pair to spark an argument between data scientists backing one or the other. But which really is better?
Firstly, what are these lakes and streams?
A data lake is still a fairly new concept that refers to the storage of large amounts of unstructured and semi-structured data. It addresses the need to store data in a more agile way than traditional databases and data warehouses, where a rigid data structure and data definition are required up front. The data is usually indexed so that it is searchable, either as text or by a tag that forms part of the schema. The flexibility comes from the fact that each new stream of data can arrive with no schema, or with its own schema, and either way can still be added to the data lake for future processing.
Why is this useful? Because businesses are producing increasing amounts of useful data, in various formats, speeds and sizes. To realise the full value of this data, it must be stored in such a way that people can dive into the data lake and pull out what they need there and then, without having to define a data dictionary and relational structure in advance. This increases the speed at which data can be captured and analysed, and makes it far easier to add new sources to the lake. That flexibility suits data scientists and business analysts, who are constantly looking for new ways to capture and analyse their data, and even to pour results back into the lake as new data sources. Perhaps someone has run an analysis to find anomalies within a subset of the data and has then contributed that analysis back to the data lake as a new source. However, to get the best out of a complex data lake, a data curator is still recommended, to create consistency and allow joins across data from different sources.
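To make the idea concrete, here is a minimal, hypothetical sketch in Python of the schema-on-read pattern described above: raw records of different shapes land in the lake untouched, tagged so they can be found later, and an analysis result is poured back in as a new source. A local folder stands in for the lake's object store, and the function, tag and source names are illustrative only.

```python
import json
from pathlib import Path

LAKE = Path("lake")          # a local folder standing in for the lake's object store
LAKE.mkdir(exist_ok=True)

def add_source(name: str, records: list[dict], tags: list[str]) -> None:
    """Land raw records as-is: no upfront schema, just a name and searchable tags."""
    (LAKE / f"{name}.json").write_text(json.dumps({"tags": tags, "records": records}))

# Two sources with different (and undeclared) shapes can sit side by side.
add_source("trades",  [{"cpty": "A", "qty": 100}, {"cpty": "B", "qty": 50, "venue": "X"}],
           tags=["trading", "raw"])
add_source("tickets", [{"id": 1, "text": "login failed"}],
           tags=["support", "raw"])

def query(tag: str) -> list[dict]:
    """Schema-on-read: find sources by tag, work out the fields when you pull them out."""
    out = []
    for path in LAKE.glob("*.json"):
        doc = json.loads(path.read_text())
        if tag in doc["tags"]:
            out.extend(doc["records"])
    return out

# An analysis result can be poured back into the lake as a brand-new source.
big_trades = [r for r in query("trading") if r.get("qty", 0) > 75]
add_source("big_trades", big_trades, tags=["trading", "derived"])
```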
A data stream, on the other hand, is an even newer concept in the general data science world (except for people who use Complex Event Processing engines, which work on streaming data). In contrast to deep storage, it is a response to the growing requirement to process and analyse data in real time as it arrives. Highly scalable real-time analysis is a challenge that very few technologies out there can truly deliver on…yet. The value of the data stream (versus the lake) is the speed and continuous nature of the analysis, without having to store the data first. Data is analysed ‘in motion’.
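As a rough illustration of analysing data ‘in motion’, the sketch below (hypothetical, using only the Python standard library) consumes events as they arrive and maintains a rolling view over a sliding time window, without persisting anything first. The event source and field names are assumptions, not a reference to any particular streaming technology.

```python
from collections import Counter, deque
import time

def analyse_in_motion(events, window_seconds=60):
    """Consume events as they arrive and keep only a rolling window in memory --
    nothing is written to storage first."""
    window = deque()                      # (timestamp, counterparty) pairs inside the window
    for event in events:                  # `events` could be any iterator over live messages
        now = time.time()
        window.append((now, event["cpty"]))
        while window and now - window[0][0] > window_seconds:
            window.popleft()              # drop anything that has aged out of the window
        counts = Counter(cpty for _, cpty in window)
        yield counts                      # continuously updated view of trades per counterparty
```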
The data stream can then also be stored. This gives the ability to add further context, or to compare the real-time data against your historical data to provide a view of what has changed – and perhaps even why (which, depending on your solution, may impact responsiveness). For example, comparing real-time data on trades per counterparty against historical data could show that a counterparty who usually submits a given number of trades a day has not submitted as many trades as expected. A business can then investigate why this is the case and act in real time, rather than retroactively or at the end of the day. Is it a connection problem with the counterparty? Is the problem on the business’s side or the client’s? Is it a problem with the relationship? Perhaps they’ve got a better price elsewhere? All useful insight when it comes to shaping trading strategy and managing counterparty relationships.
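A hedged sketch of that counterparty check might look like the following: a historical baseline of trades per day (hard-coded here, but in practice pulled from the lake) is compared against the intraday count arriving on the stream, and a shortfall beyond some tolerance is flagged for investigation. The counterparty names, numbers and threshold are purely illustrative.

```python
from statistics import mean

# Hypothetical historical baseline: trades per day for each counterparty, taken from the lake.
historical_daily_trades = {"CPTY_A": [240, 255, 260, 248], "CPTY_B": [90, 105, 98, 101]}
baseline = {cpty: mean(days) for cpty, days in historical_daily_trades.items()}

def check_counterparty(cpty: str, trades_so_far: int, fraction_of_day: float, tolerance=0.5):
    """Flag a counterparty whose intraday trade count is well below its historical pace."""
    expected_by_now = baseline[cpty] * fraction_of_day
    if trades_so_far < tolerance * expected_by_now:
        return f"{cpty}: {trades_so_far} trades vs ~{expected_by_now:.0f} expected -- investigate"
    return None

# Half-way through the day, CPTY_B has only sent 20 trades against a ~99-per-day baseline.
print(check_counterparty("CPTY_B", trades_so_far=20, fraction_of_day=0.5))
```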
The availability of these new ways of storing and managing data has created a need for smarter, faster analytics tools to keep up with its scale and speed. There is also a much broader set of users who want to be able to ask questions of their data themselves, to aid their decision-making and drive their trading strategy in real time rather than weekly or quarterly, without relying on or waiting for a dedicated business analyst or other limited resource to do the analysis for them. This increased capability and accessibility is creating whole new sets of users and completely new use cases, as well as transforming old ones.
Look at IT capacity management, for example. Until recently it was limited to reviewing sampled historical data in a tool like a spreadsheet and trying to identify issues and opportunities in the IT estate. Now it is possible to compare real-time and historical server data with trading data, i.e. what volume of trades generated what load on the applications processing those trades. It is also possible to spot unusual IT loads before they cause an issue. Imagine an upgrade to a key application: modern capacity management tools can detect that the servers are showing unusually high load given the volume of trades going through the application, catching a degradation in application performance before a high trading load causes an outage. In the future, by feeding in more varied and richer sources of data (particularly combining IT and business data) and applying machine learning algorithms, it will be possible to accurately predict server outages, or market moves that could trigger significant losses if not caught quickly.
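One simple way such a check could work, sketched below under the assumption that load scales roughly linearly with trade volume, is to derive an expected load per trade from historical samples and flag any reading that sits well above it. The figures and the headroom factor are invented for illustration; a real tool would use richer models and live feeds.

```python
from statistics import mean

# Hypothetical history: (trades processed, average CPU load %) samples for one application.
history = [(1000, 20.0), (2000, 39.0), (4000, 81.0), (3000, 60.0)]
load_per_trade = mean(load / trades for trades, load in history)

def check_capacity(current_trades: int, current_load: float, headroom=1.3):
    """Flag load that is unusually high for the volume of trades going through the app."""
    expected = load_per_trade * current_trades
    if current_load > headroom * expected:
        return (f"load {current_load:.0f}% vs ~{expected:.0f}% expected for "
                f"{current_trades} trades -- possible degradation after the upgrade")
    return None

# After the upgrade, 1500 trades are producing 55% load instead of the usual ~30%.
print(check_capacity(current_trades=1500, current_load=55.0))
```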
So: which is better, a data lake or a data stream? The answer is both. Businesses need to process and analyse data at increasingly large volumes and speeds, across a growing number of sources, as the data arrives in a stream – along with the ability to access and analyse that data easily and quickly from a data lake. Historically, the problem has been that standard tooling doesn’t easily allow these two paradigms to be mixed – but the world is changing!