This guest post comes courtesy of Tony Baer's OnStrategies blog. Tony is senior analyst at Ovum.
By Tony Baer
With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, the Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.
The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where a programmatic approach is far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships. Programmatic access also underpins in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces and others, where programmatic development (usually in Java) proved more efficient for accessing highly changeable data for web applications in which traditional paths to the database would have been I/O-constrained. Conversely, Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, sometimes using a Java programmatic framework is a more efficient way to rapidly slice through volumes of data.
Until now, Hadoop has not been for the SQL-minded. The initial path was to find someone to do data exploration inside Hadoop, but once you’re ready to do repeatable analysis, ETL (or ELT) it into a SQL data warehouse. That’s been the pattern with Oracle Big Data Appliance (use Oracle loader and data integration tools) and most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, like Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potential of HCatalog, in actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL-friendly view of data residing inside Hadoop.
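To make that SQL-friendly view concrete, here is a minimal sketch of what querying Hadoop data through Hive already looks like to a SQL developer, using the Hive JDBC driver of this era (HiveServer1, default port 10000). The connection details and the "weblogs" table are assumptions for illustration, not a reference configuration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: a SQL developer's view of data sitting in HDFS.
// Assumes a HiveServer1 instance on localhost:10000; the "weblogs"
// table and its columns are hypothetical placeholders.
public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer1-era JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Hive compiles this into MapReduce jobs behind the scenes
        ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        conn.close();
    }
}
```

The catch, as this week’s announcements acknowledge, is that those behind-the-scenes MapReduce jobs make the experience batch-oriented rather than interactive.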
But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach of programmatic transformation approaches that in the long run could likely be automated or packaged inside a tool.
At Ovum, we have long believed that for Big Data to cross over to the mainstream enterprise, it must become a first-class citizen within IT and the data center. The early pattern of skunk-works projects, led by elite, highly specialized teams of software engineers from Internet firms, was aimed at Internet-style problems (e.g., ad placement, search optimization, customer online experience) that are not the problems of mainstream enterprises. Nor is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations. It means that Big Data must be consumable by the mainstream of SQL developers.
Making Hadoop more SQL-like is hardly new
Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively. HBase emerged because of the need for a table store to provide a more interactive face – although, as a very sparse, rudimentary column store, it does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects did not exactly make Hadoop look like SQL, they provided building blocks that many of this week’s announcements leverage.
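As a hedged illustration of that more interactive face, the sketch below uses the HBase client API of this era (HTable-style calls) to write and immediately read back a single cell – the kind of row-level access that raw HDFS files and batch MapReduce don’t offer. The table name, column family, and values are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Minimal sketch: row-level reads and writes against HBase.
// Table "user_profiles" and column family "info" are hypothetical.
public class HBaseInteractiveSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_profiles");

        // Write a single cell keyed by row, column family, and qualifier
        Put put = new Put(Bytes.toBytes("user42"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("last_login"),
                Bytes.toBytes("2012-10-24"));
        table.put(put);

        // Read it back immediately -- no batch job required
        Get get = new Get(Bytes.toBytes("user42"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("info"),
                Bytes.toBytes("last_login"));
        System.out.println(Bytes.toString(value));

        table.close();
    }
}
```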
Progress marches on
One train of thought is that if Hadoop can look more like a SQL
database, more operations could be performed inside Hadoop. That’s the
theme behind Informatica’s long-awaited enhancement of its PowerCenter
transformation tool to work natively inside Hadoop. Until now,
PowerCenter could extract data from Hadoop, but the extracts would have
to be moved to a staging server where the transformation would be
performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.
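The ELT idea is easy to picture in code. What follows is not what PowerCenter generates, but a minimal hand-written sketch of a transformation pushed into Hadoop as a map-only MapReduce job: parse raw delimited records, drop malformed rows, and write cleaned output back to HDFS, ready for loading into a warehouse (or HBase). The pipe-delimited layout and paths are assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Map-only job: the "T" of ELT runs inside Hadoop instead of on a staging
// server. Assumes pipe-delimited input with at least three fields; the
// field layout and paths are hypothetical.
public class CleanseRecordsJob {

    public static class CleanseMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\\|");
            if (fields.length < 3 || fields[1].isEmpty()) {
                return; // drop malformed rows
            }
            // Normalize to tab-delimited output ready for a warehouse load
            String cleaned = fields[0].trim() + "\t"
                    + fields[1].trim().toLowerCase() + "\t"
                    + fields[2].trim();
            context.write(NullWritable.get(), new Text(cleaned));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cleanse-records");
        job.setJarByClass(CleanseRecordsJob.class);
        job.setMapperClass(CleanseMapper.class);
        job.setNumReduceTasks(0); // map-only: transform in place, no shuffle
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because there is no reduce step, the transformation scales out with the data and never leaves the cluster, which is exactly the point of ELT over ETL here.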
There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of the gate with DCA (Data Computing Appliance), which bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software-only product that is accompanied by a MapR Hadoop distro). Teradata Aster has just joined the fray with the Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop distribution; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces that make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like another Aster data table.
Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they are from emerging players like Datameer (providing a spreadsheet-like metaphor, and which just announced an app store-like marketplace for developers), Karmasphere (providing an application development tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high-performance fractal index).
Yet, even with Hadoop analytic tooling, there will still be a desire
to disguise Hadoop as a SQL data store, and not just for data mapping
purposes.
Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer, as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.
Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL-92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple, concurrent, pipelined tasks where, at each step along the way, a task reads data, processes it, writes it back to disk, and then passes it to the next task. Conversely, Impala takes a shared-nothing, MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera claims roughly 4x performance over MapReduce; if the data is in HBase, Cloudera claims performance multiples up to a factor of 30. For now, Impala only supports row-based views, but with columnar storage (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.
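To see why MapReduce reads as a blunt instrument next to SQL, compare the two routes to the same answer. Against a Hive or Impala table, the query is one declarative line, for example SELECT page, COUNT(*) FROM weblogs GROUP BY page. Written directly against the MapReduce API, the equivalent is a hand-coded job like the hedged sketch below, in which intermediate map output is spilled to disk and shuffled before the reduce step runs; the table layout and field positions are hypothetical, as before.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hand-written equivalent of "SELECT page, COUNT(*) FROM weblogs GROUP BY page".
// Map output is written to local disk, shuffled across the network, and only
// then reduced -- the multi-step, disk-bound pipeline described above.
public class PageHitsJob {

    public static class PageMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 1) {
                page.set(fields[1]); // hypothetical: second field is the page URL
                context.write(page, ONE);
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long hits = 0;
            for (LongWritable v : values) {
                hits += v.get();
            }
            context.write(key, new LongWritable(hits));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page-hits");
        job.setJarByClass(PageHitsJob.class);
        job.setMapperClass(PageMapper.class);
        job.setCombinerClass(SumReducer.class); // partial sums before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Impala’s counter-pitch is that the declarative version is executed by its shared-nothing MPP engine directly against HDFS or HBase, which is where the claimed 4x-to-30x speedups come from.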
By contrast, Teradata Aster and Hortonworks promote a SQL-MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive and that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance and, with that appliance, has the opportunity to counter with hardware optimization.
The road to SQL/programmatic convergence
Either way – and this is of interest only to purists – any SQL extension
to Hadoop will be outside the Hadoop project. But again, that’s an
argument for purists. What’s more important to enterprises is getting
the right tool for the job – whether it is the flexibility of SQL or the raw power of programmatic approaches.
SQL convergence is the next major battleground for Hadoop. Cloudera
is for now shunning HCatalog, an approach backed by Hortonworks and
partner Teradata Aster. The open question is whether Hortonworks can
instigate a stampede of third parties to overcome Cloudera’s resistance.
It appears that beyond Hive, the SQL face of Hadoop will become a
vendor-differentiated layer.
Part of the convergence will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross-train to pick up some of the Java- or Java-like programmatic frameworks that will be emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary.
And for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute-force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. But YARN is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce. And that means that YARN is not yet ready for Impala, or vice versa.
Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; those people will need some new skills and will have to grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing down the gauntlet to the usual Apache suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance.
On the horizon, EMC Isilon and NetApp are proposing alternatives that promise a more efficient file system, but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10 – 15 percent utilization. Keep an eye on VMware’s Project Serengeti. Big Data platforms must be good citizens in data centers that need to maximize resource utilization (e.g., through virtualization and optimized storage); they must comply with existing data stewardship policies and practices; and they must fully support existing enterprise data and platform security practices. These are all topics for another day.