Recent years have seen many stories touting the impact of in-memory database management systems (IMDBMSes), which load all data into memory, as the new direction for ultra-fast, real-time processing. Each week a new VC-funded entrant seems to arrive with claims of being the fastest database ever and accelerating processing by 100X, 1000X, 10000X! Late last year Gartner put out a Market Guide for In-Memory DBMS that advises IT decision-makers to investigate this disruptive technology. Poor old relational database management systems are slagged as relics, dinosaurs optimized for on-disk processing.
There’s no doubt that every DBMS performs better when it is given enough memory to hold all of its data. Proponents present IMDBMSes as gazillions of times faster than traditional databases (even when the latter are given enough memory to load all the data into the cache) because IMDBMSes are designed from the ground up to optimize for in-memory operations rather than disk-based processing.
It seems to me, however, that the optimizations that really account for the biggest performance gains can be, and in many cases have been, successfully built into our aforementioned poor “traditional” RDBMSes. In the case of analytics, I would say it is the columnar data structure that is really the game-changer, because it facilitates efficient aggregation and calculation on individual columns. DB2 10.5 added BLU Acceleration, which lets you create columnar storage tables in the database. You can provision enough memory for all of them to be loaded, but BLU is also optimized for working on large data sets that do not fit into memory, minimizing I/O from disk. When SAP HANA, one of the supposed “pure” in-memory database players, first came out, SAP said that all data should and would be in memory. They have since backtracked with SAP IQ for Near-Line Storage for HANA and, more recently, HANA Dynamic Tiering for Extended Storage (for “non-active” data). The customer performance results I have seen with BLU are no less impressive than those of HANA, and are typically accomplished with a ton less hardware. Microsoft SQL Server and Oracle have also come out with their own columnar store offerings, the Columnstore Index and Database In-Memory, respectively.
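To make the columnar point concrete, here is a minimal Python sketch of the idea. It is not how BLU, HANA, or any other product is implemented; the sample records and the function names (`to_columns`, `sum_amount_row_store`, `sum_amount_column_store`) are made up for illustration. It only shows why an aggregate over one column has less to touch when each column is stored contiguously.

```python
from array import array

# Hypothetical sales records, used only for illustration.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.5},
    {"order_id": 3, "region": "east", "amount": 42.25},
]


def to_columns(records):
    """Pivot row-oriented records into one contiguous sequence per column."""
    return {
        "order_id": array("q", (r["order_id"] for r in records)),
        "region": [r["region"] for r in records],
        "amount": array("d", (r["amount"] for r in records)),
    }


def sum_amount_row_store(records):
    # Row layout: every whole record is visited just to read one field.
    return sum(r["amount"] for r in records)


def sum_amount_column_store(columns):
    # Column layout: only the "amount" column is scanned; the other
    # columns are never read.
    return sum(columns["amount"])


columns = to_columns(rows)
row_total = sum_amount_row_store(rows)
col_total = sum_amount_column_store(columns)
```

Both functions return the same total. The difference is in how much data each has to walk through, which is where a columnar engine saves memory bandwidth and, when the data does not fit in memory, disk I/O.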
On the OLTP side of the house, I have not seen any real evidence that in-memory optimizations have resulted in tremendous performance acceleration. The major difference maker, which is nothing new, is turning off logging so that you avoid touching disk altogether, but this is not applicable to all application uses. While grandiose claims of XXXX times speedup for OLTP are common from IMDBMS vendors, substantiated results are harder to find. I would refer you to a very funny old article on how these claims or “benchmarks” are often put together: “MySQL is bazillion times faster than MemSQL”, a response to a MemSQL “result”. SQL Server In-Memory OLTP, an offering by a traditional vendor, has received some plaudits. But from what I have seen, the acceleration is not broadly applicable to an entire workload, is quite limited without the use of natively compiled stored procedures (not exactly an in-memory optimization or even a new idea), and perhaps has something to do with addressing SQL Server locking issues. I absolutely understand that there are optimizations that can be and have been made for more efficient in-memory OLTP processing; I just don’t know that they have delivered a game-changing performance boost.
The unstructured and semi-structured data that an organization generates has tremendous value. It may be less valuable on, say, a per-MB basis than the information stored in your data warehouse, and it may take some time after you begin collecting the data before the value is realized, but there’s no doubt that the details of the minute interactions between customers and systems can be leveraged to transform a business. Moreover, Hadoop is increasingly positioned as a landing zone for ALL of an organization’s data, structured and unstructured, where exploratory analysis can be performed, as well as an archive for aged data from a data warehouse. See this blog post by a colleague to see where Hadoop fits in the IBM Watson Foundations vision. It is obvious, then, that an organization’s Hadoop store will generally contain sensitive data. It could be sensitive personal information governed by regulation or simply valuable and proprietary information, but it needs to be secured just as it would be in a traditional relational data store.
As I hinted in my last post, IBM considered the governance of Big Data initiatives early on in the development of BigInsights. Fortunately, IBM already had leading capabilities in-house for security and data privacy and extended these capabilities to the Big Data space. InfoSphere Data Privacy for Hadoop allows an organization to secure its Hadoop environments by:
- Defining and sharing big data project blueprints and data definitions – define a big data glossary of terms, and define sensitive data definitions and policies
- Discovering and classifying sensitive big data – discover sensitive data and classify it
- Masking and redacting sensitive data within and for Hadoop systems – de-identify sensitive data either at the source or within Hadoop, and obfuscate data whether structured or unstructured
- Monitoring Hadoop Data Activity – monitor big data sources and the entire Hadoop stack and issue alerts as necessary, gather audit information for reporting purposes
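As a rough illustration of the discover, classify, and mask steps in the list above, here is a small Python sketch. It is emphatically not the InfoSphere Data Privacy for Hadoop API; the patterns, the `classify`, `pseudonymize`, and `redact` functions, the salt, and the sample record are all assumptions made up for this example. Real products use far richer classifiers and policy-driven masking rules.

```python
import hashlib
import re

# Simplified, illustrative patterns; real classifiers are much more thorough.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def classify(text):
    """Discover which kinds of sensitive data appear in a piece of text."""
    labels = set()
    if EMAIL_RE.search(text):
        labels.add("email")
    if SSN_RE.search(text):
        labels.add("ssn")
    return sorted(labels)


def pseudonymize(value, salt="demo-salt"):
    """Swap a value for a stable token so masked records can still be joined.

    The salt is a hypothetical placeholder; a real deployment would manage
    it as a secret.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "tok_" + digest[:12]


def redact(text):
    """Mask e-mail addresses with tokens and blank out SSN-like numbers."""
    text = EMAIL_RE.sub(lambda m: pseudonymize(m.group(0)), text)
    return SSN_RE.sub("***-**-****", text)


record = "Customer jane.doe@example.com called about SSN 123-45-6789."
found = classify(record)
masked = redact(record)
```

The e-mail address is replaced with a deterministic token, so the same customer still lines up across data sets after masking, while the SSN-like number is blanked outright. Which of the two treatments a field gets is exactly the kind of decision the sensitive data policies in the first bullet would capture.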
The revelation earlier this year that hundreds of thousands of Facebook users were unknowingly subjects in a psychology experiment in 2012 caused a widespread negative reaction. According to this WSJ article, “Researchers from Facebook and Cornell University manipulated the news feed of nearly 700,000 Facebook users for a week in 2012 to gauge whether emotions spread on social media.” Another interesting read comes from Doug Henschen of InformationWeek, titled “Mining WiFi Data: Retail Privacy Pitfalls”. In this article Doug speaks to the value that retailers can realize by mining WiFi data, but also to the potential pitfalls of being able to track and store the minute behaviors of individuals.
Of course, Facebook is not the only organization with a burgeoning wealth of personal customer data; every business looking to gain an edge in its industry wants to store every piece of data it generates (including data on every single customer interaction) and at some point gain valuable insight from it. Every business with a Big Data initiative needs to carefully consider the data privacy and security ramifications. Beyond the ethical decisions around the use of data lies the question of how technology supports governance of that data: how is access to data limited and tracked, how do you know what personal data you are storing, and how do you mask it?
The critical importance of governance to the success of a Big Data initiative is something IBM recognized very early and has invested heavily in for its BigInsights Hadoop offering. I want to spend a few posts taking a closer look at the governance capabilities included in BigInsights: where they come from, how they work, and the business problems they address.