Hadoop, and now Spark, can be made out to be the answer to every question, the cure for cancer: the Kool-Aid is very powerful. It's important to remember that while these technologies are certainly game-changers, they do not solve every problem (yet!) and they come with familiar challenges.
It seems that when it comes to Hadoop and Spark, considerations around ease of implementation and total cost of ownership can easily go out the window. Some businesses believe they will stand up a cluster and reap tremendous insights within a week. Others launch a Hadoop initiative without having defined any use cases for their business. The reality is that with just the core components of the platform, you will probably need to hire an army of data scientists and developers to get value out of your data. That's not an issue for the Googles and Amazons of the world, but it is for most everyone else.
Two things can make Hadoop a more palatable and realistic proposition for the masses:
- Analytics accelerators, which make analytics on Hadoop accessible to people without a PhD in mathematics. IBM offers these kinds of "value-adds" – for example BigSheets, which lets you analyze data on Hadoop through a spreadsheet-like interface. SQL on Hadoop is another easy win for the platform, allowing data discovery or the querying of archived data directly on Hadoop with familiar SQL tools. IBM's BigSQL has shown itself to be a strong contender in this area.
- Cloud – eliminating the upfront setup of a cluster, as well as the ongoing administration, is huge. Few of the major Hadoop vendors have a true SaaS offering for Hadoop, yet nearly every customer is considering moving existing or new workloads to the cloud, so Hadoop-as-a-service makes a lot of sense. IBM's "eHaaS" offering is BigInsights on Cloud.
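To make the first point concrete: the appeal of SQL on Hadoop is that an analyst writes ordinary SQL and never touches MapReduce or Spark code. The sketch below uses Python's built-in sqlite3 as a stand-in engine (BigSQL or any SQL-on-Hadoop engine would accept essentially the same query); the `weblogs` table and its columns are hypothetical names chosen for illustration, not part of any IBM product.

```python
import sqlite3

# sqlite3 stands in here for a SQL-on-Hadoop engine such as BigSQL:
# the point is that the query itself is plain SQL, whether the data
# lives in a warehouse or in files on HDFS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical archived web-log data an analyst might explore ad hoc.
cur.execute("CREATE TABLE weblogs (url TEXT, status INTEGER, bytes INTEGER)")
cur.executemany(
    "INSERT INTO weblogs VALUES (?, ?, ?)",
    [("/home", 200, 5120), ("/home", 404, 0), ("/about", 200, 2048)],
)

# Data discovery with standard SQL: hits and error counts per page,
# no custom MapReduce job or PhD required.
cur.execute(
    """
    SELECT url,
           COUNT(*) AS hits,
           SUM(CASE WHEN status >= 400 THEN 1 ELSE 0 END) AS errors
    FROM weblogs
    GROUP BY url
    ORDER BY hits DESC
    """
)
rows = cur.fetchall()
for row in rows:
    print(row)
```

The same aggregate-and-group pattern is exactly what a BI tool would generate behind the scenes, which is why a SQL interface makes archived Hadoop data reachable by existing skills and tooling.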