A common reason a new project is started on Hadoop, one that seems to duplicate already existing capability, is that the existing solution simply isn't built to scale to large data volumes. Often that's a valid argument, but in the case of Data Integration/Data Quality there are many mature solutions already on the market. Are they all hamstrung when it comes to big data integration?
IBM’s Information Server, a well-established Data Integration solution, initially featured a capability that pushed its workload down to Hadoop via MapReduce. Of course, MapReduce has since been shown not to be the most performant tool and has essentially been superseded by Spark’s in-memory engine. But customers have been using the Information Server engine itself, in its scale-out configuration, for big data transformation for many years, on very large clusters. From this reality, I surmise, came the decision to unleash the Information Server engine directly as an application on YARN, as BigIntegrate and BigQuality. The diagram below shows how the engine runs on YARN; at the core of it is an Information Server Application Master, which negotiates resources for Information Server processes with the ResourceManager.
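To make the Application Master's role concrete, here is a minimal sketch of how any YARN application negotiates containers with the ResourceManager using Hadoop's AMRMClient API. This is not IBM's actual implementation, which isn't public; the class name, container sizing, and count are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// Hypothetical skeleton of an Application Master's negotiation loop.
public class SketchApplicationMaster {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Open the channel to the ResourceManager.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();

        // Tell the RM this AM is alive (host/port/tracking URL omitted here).
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for containers to host engine processes:
        // e.g. 4 GB and 2 vcores each, at default priority, on any node.
        Resource capability = Resource.newInstance(4096, 2);
        Priority priority = Priority.newInstance(0);
        for (int i = 0; i < 4; i++) {
            rmClient.addContainerRequest(
                new ContainerRequest(capability, null, null, priority));
        }

        // Heartbeat until the RM has granted all requests; each allocate()
        // call both reports progress and receives newly assigned containers.
        int granted = 0;
        while (granted < 4) {
            granted += rmClient.allocate(0.1f).getAllocatedContainers().size();
            Thread.sleep(1000);
        }

        // A real AM would now launch its worker processes in the granted
        // containers via NMClient, then unregister when the job completes.
        rmClient.unregisterApplicationMaster(
            FinalApplicationStatus.SUCCEEDED, "done", "");
        rmClient.stop();
    }
}
```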
How have other integration vendors designed their big data solutions? Talend, which initially also pushed workload down into MapReduce, has switched over to converting its jobs to Spark. This is logical, since Spark is much faster than MapReduce, but I expect it also involved significant coding effort to get right. Informatica’s approach seems more confused, or perhaps more nuanced – they promote their “Blaze” Informatica engine, also running on YARN, but suggest that their solution “supports multiple processing paradigms, such as MapReduce, Hive on Tez, Informatica Blaze, and Spark to execute each workload on the best possible processing engine” – link. I suspect this is because, at the end of the day, the Informatica engine wasn’t built to handle true big data volumes.
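For a sense of what a converted job targets, here is a minimal sketch, my own illustration rather than Talend-generated code, of a typical integration step (read, cleanse, aggregate, write) expressed against Spark's Java API; the paths and column names are assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

// Hypothetical example of a simple cleanse-and-aggregate integration step.
public class CustomerRollup {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("customer-rollup")
            .getOrCreate();

        // Read raw records; schema inference keeps the sketch short.
        Dataset<Row> orders = spark.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("hdfs:///landing/orders"); // assumed input path

        // Cleanse: drop rows with no customer key, trim the country code.
        Dataset<Row> cleansed = orders
            .filter(col("customer_id").isNotNull())
            .withColumn("country", trim(col("country")));

        // Aggregate in memory across the cluster - the step MapReduce
        // would have spilled to disk between the map and reduce phases.
        Dataset<Row> rollup = cleansed
            .groupBy("customer_id", "country")
            .agg(sum("amount").alias("total_amount"),
                 count("*").alias("order_count"));

        rollup.write().mode("overwrite")
            .parquet("hdfs:///curated/customer_rollup"); // assumed output path
        spark.stop();
    }
}
```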
There’s always the option of doing data integration directly with Hadoop itself, but there isn’t much of an end-to-end solution there. You can use Sqoop to bring data in or out, but you’ll still end up writing HiveQL and hundreds of scripts.
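To illustrate that hand-rolled route, here is a minimal sketch of driving one such HiveQL transformation from Java over Hive's JDBC interface (in practice this logic usually ends up scattered across shell scripts instead); the connection URL, credentials, tables, and query are all assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical example: one of the many hand-written HiveQL steps you end
// up maintaining when stitching the integration together yourself.
public class HiveTransformStep {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, database, and user are assumptions.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "etl", "");
             Statement stmt = conn.createStatement()) {

            // A typical staging-to-curated step: filter out bad rows and load.
            stmt.execute(
                "INSERT OVERWRITE TABLE curated.orders " +
                "SELECT order_id, customer_id, amount, country " +
                "FROM staging.orders " +
                "WHERE order_id IS NOT NULL");
        }
        // Multiply this by every source, target, and quality rule, and the
        // script count grows quickly - which is the gap the vendor tools fill.
    }
}
```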