Data warehouse automation in the age of Hadoop-based data lakes
A number of vendors including WhereScape, TimeXtender, BIReady (now owned by Attunity) and Kalido (now owned by Magnitude Software) established the need for data warehouse automation (DWA) technology, providing design and development software for automating the creation of business model-driven data marts and data warehouses. As the era of the data warehouse gives way to the age of the Hadoop-based data lake, is there a continuing role for DWA, and how are these vendors adjusting their products and strategies for the next phase of enterprise data processing and analytics platforms?
The 451 Take
It is clearly early days for enterprises taking advantage of DWA technologies in relation to Hadoop and Hadoop-based data lakes. However, it is also clear that DWA and Hadoop are not chalk and cheese. The key advantage of DWA in relation to Hadoop is likely to be driven initially by automating data integration and data movement and reducing the complexity of configuring and maintaining a data lake. As enterprises increase the breadth and depth of their data lake environments, however, there is also likely to be an increasing number of individual application use cases with relatively fixed schemas that could benefit from the automation of predefined models.
In its 2009 Warehouse Optimization report, 451 Research made the case for model-driven approaches to data warehouse generation in overcoming the inflexibility of predefined schema. Much has changed in the last eight years, not least the emergence of new terminology – DWA – to describe the model-driven approach to data warehouse generation, while two of the early DWA pioneers – Kalido and BIReady – have been acquired (by Magnitude and Attunity, respectively).
Another significant change since 2009 has been the emergence of the Apache Hadoop ecosystem and Hadoop-based data lakes as inherently more flexible alternatives to traditional data warehouses based on analytic relational databases. Whereas traditional data warehouses relied on a predefined schema-on-write approach to improve query performance, with the tradeoff being inflexibility to change, Hadoop's schema-on-read approach gave it the flexibility to adapt to change, and also to support multiple applications simultaneously (as a so-called data lake). The rise of Hadoop, and its long-term threat to the incumbent data warehouse providers, could have significant implications for the DWA specialists. DWA remains a niche market, with only a small proportion of overall data warehouse users choosing to take the DWA approach to automate the generation of their data warehouse environments. Hadoop and data lakes are by no means set to eclipse traditional data warehouses in the foreseeable future. However, they arguably pose a more immediate threat to the DWA specialists. After all, if the primary point of DWA is to overcome the inflexibility of analytic databases, does DWA still have a role alongside the inherently more flexible data lake?
Interestingly, the four major DWA specialists all have slightly different perspectives on that question:
- WhereScape told us in October 2016 that it was seeing increasing demand for Hadoop-based deployments, having introduced native connectors for the Hadoop Distributed File System (HDFS) and Hive in October 2015.
- In comparison, while Attunity sees Hadoop-driven interest in its Replicate and Visibility products, it told us in March that it has yet to see customers apply its Compose DWA software to Hadoop and data lake environments to a significant degree.
- TimeXtender has yet to see demand for Hadoop-based deployments, which is hardly surprising given that it has been focused to date exclusively on DWA for Microsoft's SQL Server database. However, the company told us in March that it is looking to expand its addressable market beyond Microsoft SQL Server to address cloud- and/or Hadoop-based data lake environments via its new Discovery Hub product.
- Magnitude, meanwhile, reports that it is seeing customer demand for its Dynamic Information Warehouse software to play a role in the coexistence of Hadoop and data warehouse environments, and foresees a greater role for DWA as a direct complement to Hadoop in the future.
To some extent, these different perspectives are born out of the different strategies of the various DWA vendors, and it is worth taking a closer look at each to understand the context for their perspectives. In doing so, it will become clear that despite these different strategies, they are, in fact, broadly in agreement on the potential role for DWA as a complement to Hadoop-based data lakes.
As noted above, WhereScape is the most advanced of the DWA providers in embracing Hadoop, having delivered native connectors for HDFS and Hive in October 2015. While the majority of its clients are still experimenting with building data warehouses in Hadoop, the company reports that about 20% are actually using it, mostly for offload projects.
The company's recently appointed CEO, Mark Budzinski, told us he sees WhereScape's role as satisfying the needs of data administrators to address multiple analytics environments, including cloud- and Hadoop-based, as well as on-premises data warehouses. Specifically, he sees the importance of automated modeling and ETL workloads in enabling self-service analytics. Budzinski confirmed that the company already has customers automatically loading data into HDFS and generating Hive instances, primarily for pre-processing of large data sets. In the longer term, he sees the Hadoop ecosystem becoming just another target for automated ETL pipelines, with Spark emerging as the primary platform for ETL processing.
Attunity acquired BIReady in 2014 and in 2016 relaunched its Compose product to combine BIReady's data model design and automation functionality with the data-loading functionality of Attunity's existing Replicate product. Replicate remains available as a stand-alone offering, and as we recently reported, the company is seeing some success with Replicate for Hadoop being used to ingest data into Hadoop-based data lake projects. Attunity also has some interest in Hadoop via its Visibility product, which provides an analytics environment for analyzing data usage and activity trends across multiple environments.
Specifically, in relation to Hadoop, Visibility provides users with the ability to measure and understand storage usage and requirements related to the Hadoop Distributed File System, as well as data-processing engines including MapReduce, Tez, Hive and Cloudera Impala. Attunity Compose is primarily designed to automatically generate data structures and ETL processes to eliminate manual ETL coding for the creation of data warehouses and data marts on Teradata, Oracle Database, Microsoft SQL Server and Amazon Redshift. The automation of ETL development and deployment could, theoretically at least, be equally applied to Hadoop-based data lakes, and while Attunity has not seen significant demand to date from customers for Compose on Hadoop, given recent large-scale data lake-related contract wins for Replicate and Visibility, it is surely only a matter of time.
As we recently explained, to create its new Discovery Hub offering, TimeXtender essentially dismantled its former TX DWA product and rebuilt it with the addition of a semantic layer that enables users to build out and reuse models of data based on metadata. It also takes a more flexible, loosely normalized approach, as opposed to its traditional focus on strict adherence to a predefined dimensional model and star schema. The fact that TimeXtender supported only SQL Server was fortunate in that the company didn't have to test and certify the new Discovery functionality with multiple databases.
Having developed it, however, the company is now in a better position to expand its focus beyond SQL Server to multiple environments, including Hadoop and cloud data stores. TimeXtender is planning to do just that later this year, taking advantage of the ability to build a semantic data model that can automate the creation of data warehouses to be applied to multiple environments. Again, the potential advantage is based on automating the generation of ETL code, as well as the building and development of the warehouse, the generation of documentation and the building of any required OLAP cubes for use by self-service visualization environments.
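To make the idea of one model driving several targets more concrete, the following is a minimal sketch of generating table definitions for two environments from a single logical model. It is not TimeXtender's or any other vendor's implementation: the `Column` and `Table` classes, the `TYPE_MAP` table, the `generate_ddl` function and the `fact_sales` example are all hypothetical names introduced here for illustration, and the type mappings are simplified assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    logical_type: str  # one of: "integer", "string", "decimal", "date"


@dataclass(frozen=True)
class Table:
    name: str
    columns: tuple  # tuple of Column


# Hypothetical, simplified mapping from logical types to physical types.
TYPE_MAP = {
    "sqlserver": {
        "integer": "INT",
        "string": "NVARCHAR(255)",
        "decimal": "DECIMAL(18,2)",
        "date": "DATE",
    },
    "hive": {
        "integer": "INT",
        "string": "STRING",
        "decimal": "DECIMAL(18,2)",
        "date": "DATE",
    },
}


def generate_ddl(table: Table, target: str) -> str:
    """Render a CREATE TABLE statement for the given target from the model."""
    types = TYPE_MAP[target]
    cols = ",\n".join(
        f"  {col.name} {types[col.logical_type]}" for col in table.columns
    )
    ddl = f"CREATE TABLE {table.name} (\n{cols}\n)"
    if target == "hive":
        # Assumption: store the Hive table in a columnar format.
        ddl += "\nSTORED AS ORC"
    return ddl


# A single logical model (illustrative only).
fact_sales = Table(
    "fact_sales",
    (
        Column("sale_id", "integer"),
        Column("customer_name", "string"),
        Column("amount", "decimal"),
        Column("sale_date", "date"),
    ),
)

for target in ("sqlserver", "hive"):
    print(f"-- {target}")
    print(generate_ddl(fact_sales, target))
```

The point of the sketch is that a change to the model (adding a column, say) is made once and the physical definitions for each environment are regenerated, rather than being hand-edited per platform.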
Magnitude was formed in April 2014 by the merger of data warehousing automation and master data management (MDM) software provider Kalido with corporate performance management (CPM) specialist Noetix, and further expanded through the acquisition of ERP-focused data warehousing provider Datalytics in May 2016 and data-access connectivity specialist Simba Technologies in August 2016. The company explained to us earlier this year how it was combining these assets to provide a single platform that enables enterprises to deliver data and analytics to fuel CPM use cases.
Magnitude sees Hadoop fitting into this picture in two ways. The first is the use of Hadoop as a data source for raw data, to be pulled into the data warehouse and treated to data validation, integrity and defined dimensional structures and hierarchies as appropriate to complement governed data sources. The second, and emerging, role is to publish enterprise data into a data lake environment and to use the Magnitude Dynamic Information Warehouse (DIW) to create a DWA- and MDM-driven data warehousing environment in Hadoop using SQL-on-Hadoop environments such as Hive, Impala or Spark SQL. There is no inherent advantage in using Hadoop for the latter use case – indeed, Magnitude admits that it is in opposition to the flexibility of the schema-on-read approach – but it would make sense if a company has chosen the data lake architecture as its strategic analytics platform. Either way, Magnitude is eager to point out that it is not suggesting that enterprises apply defined dimensional structures and hierarchies and MDM to all data within the data lake, but only where it makes sense to do so for the given use case(s). This combination of DIW and Hadoop should enable a flexible modeling environment that automates database management and supports dynamic changes to the model.
DWA may at first glance seem to be inappropriate for a Hadoop environment, but keep in mind that 'data warehouse' is a concept that is independent of the underlying data processing and storage technology, so there is no reason, theoretically at least, that a data warehouse shouldn't be based on Hadoop rather than a relational database. Keep in mind also that SQL-on-Hadoop projects such as Hive, Impala and Spark SQL serve to bring some of the benefits of SQL-based processing to Hadoop. We have also seen the emergence of a number of OLAP-on-Hadoop providers (AtScale, Kyvos Insights and Arcadia Data, for example), as well as visual analytics for Hadoop.
What is alien to Hadoop, given the flexibility advantages of its schema-on-read approach, is the use of predefined dimensional models and schema, but again there may be use cases, particularly in a multi-function data lake environment, where the schema is relatively fixed. Either way, the key potential benefit of DWA to Hadoop-based data lakes lies not in 'data warehouse,' but 'automation.' Hadoop-based data lakes are complex to configure and maintain, and the ability to automate modeling and ETL pipeline development and deployment, where it makes sense to do so, is likely to become increasingly appealing as data lake environments are deployed to serve a greater breadth of enterprise analytic requirements.
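For readers less familiar with the schema-on-write and schema-on-read distinction that runs through this report, the contrast can be sketched in a few lines of plain Python. This is an illustration of the general idea only, not of how Hadoop or any analytic database is implemented; `SCHEMA`, `write_validated`, `read_with_schema` and the sample records are hypothetical.

```python
import json

# Hypothetical predefined schema for the schema-on-write path.
SCHEMA = {"order_id": int, "amount": float}


def write_validated(store: list, record: dict) -> None:
    """Schema-on-write: records that do not fit the schema are rejected at load."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} is not {expected.__name__}")
    store.append(record)


def read_with_schema(raw_lines: list, fields: list):
    """Schema-on-read: raw data is stored as-is; structure is applied per query."""
    for line in raw_lines:
        doc = json.loads(line)
        # Fields missing from a record come back as None rather than failing.
        yield {field: doc.get(field) for field in fields}


# Raw records as they might land in a data lake; the second has an extra field.
raw = [
    '{"order_id": 1, "amount": 9.5}',
    '{"order_id": 2, "amount": 3.0, "channel": "web"}',
]

warehouse = []
for line in raw:
    try:
        write_validated(warehouse, json.loads(line))
    except ValueError as err:
        print("rejected at load time:", err)

# The same raw data can serve a query that needs the new field.
print(list(read_with_schema(raw, ["order_id", "channel"])))
```

In the sketch, the record carrying the new `channel` field is rejected by the fixed schema until the schema is changed, whereas the schema-on-read path can serve it immediately at the cost of doing the structuring work at query time. Where the schema is relatively fixed, the first path is the one that automation of predefined models is aimed at.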