Wednesday, January 13, 2016

Will Hadoop replace existing DBMS and kill the mainframe?

Hadoop: the hype and the reality

Over the past few years there has been a lot of discussion about the importance of Hadoop for modern computing. Many people believed Hadoop would replace current relational and other database management systems. Others saw through the hype and argued that Hadoop was just another product: useful for processing vast amounts of data, but never a replacement for classic DBMS systems.

Hadoop and some of its ecosystem components caught my interest a few years ago. Much has changed since the initial release, and it is now becoming clear what role Hadoop could play in the near future.

Hadoop has finally found the place where it belongs. On its own, Hadoop is not a full-blown database management system. Technically it qualifies as one, since it guarantees the safekeeping of data files and a uniform retrieval of data, but in practice HDFS and YARN, which form the core of the Hadoop distribution, do not provide logical metadata for analysis. That is a sensible choice when you want to deal with 'unstructured' data, but it does mean that additional logic is required to map logical metadata onto the physical storage.
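As an illustration, this is how Hive applies such logical metadata on top of files that already sit in HDFS. A minimal sketch using the PyHive client, where the host, path and column layout are hypothetical:

    from pyhive import hive

    # Connect to a HiveServer2 endpoint (hypothetical host).
    conn = hive.Connection(host='hive.example.com', port=10000)
    cur = conn.cursor()

    # An EXTERNAL table adds a logical schema to existing HDFS files
    # without moving or rewriting them (schema-on-read).
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
            ip  STRING,
            ts  STRING,
            url STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        STORED AS TEXTFILE
        LOCATION '/data/raw/web_logs'
    """)

The files themselves stay untouched; only the metastore learns how to interpret them.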

Hadoop and DBMS

It became obvious that a 35-year legacy of relational databases could not be ignored. Where SQL was previously considered marginal in the world of Big Data, it has become one of its main assets. Its importance keeps growing: enterprise integration demands ANSI-compliant SQL, and performance has become a key criterion when choosing one analysis solution over another. The success and evolution of Apache Hive, Apache Phoenix, Pivotal HAWQ, IBM Big SQL and many other SQL-on-Hadoop implementations prove that this evolution will not stop soon.
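To make this concrete: a standard ANSI join runs unchanged against Hive. In this sketch (hypothetical host and tables, PyHive client), only the connection reveals that Hadoop is underneath:

    from pyhive import hive

    conn = hive.Connection(host='hive.example.com', port=10000)
    cur = conn.cursor()

    # Plain ANSI SQL: nothing in the statement itself is Hadoop-specific.
    cur.execute("""
        SELECT c.name, SUM(o.amount) AS total
        FROM customers c
        JOIN orders o ON o.customer_id = c.customer_id
        GROUP BY c.name
    """)
    for name, total in cur.fetchall():
        print(name, total)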

Much has been done to secure Hadoop too. With Apache Knox acting as a fully audited gateway at the cluster perimeter and Apache Ranger providing fine-grained, label-based authorization policies, secure connectivity is now possible across the ecosystem. These policies can be defined dynamically or statically, just as you would expect from a relational database.
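For example, HDFS traffic can be routed through Knox's REST gateway, so that authentication and auditing happen at one controlled entry point. A small sketch, with hypothetical gateway host, topology name and credentials:

    import requests

    # List an HDFS directory via WebHDFS, proxied through the Knox gateway.
    resp = requests.get(
        'https://knox.example.com:8443/gateway/default/webhdfs/v1/data/raw',
        params={'op': 'LISTSTATUS'},
        auth=('alice', 'alice-secret'),        # Knox authenticates the call
        verify='/etc/ssl/certs/knox-ca.pem',   # gateway TLS certificate
    )
    resp.raise_for_status()
    for entry in resp.json()['FileStatuses']['FileStatus']:
        print(entry['pathSuffix'], entry['type'])

The client never talks to the NameNode directly; Knox audits the request and forwards it.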

Hadoop and the mainframe

I could go on and describe many more features that are being introduced in Hadoop. I wouldn't be able to give you all the details, and even if I could, they would be outdated the minute after I saved this blog post. The bottom line of the statement I want to make is that the Hadoop ecosystem currently consists of many functionally segregated software solutions that can interact with each other and together provide a fully equipped application environment.

This sounds familiar! On the mainframe we see a similar phenomenon: for each OS function, a dedicated software component exists: RACF for security, RMF for resource measurement, DB2 for data management, SMP/E for software distribution... As on Hadoop, many of these components can be replaced by other software components, such as ACF2 for security, without losing the benefit of the platform integration. This idea was long forgotten while IT focused on programs running on personal computers. Now that the network has regained its original importance, this integration has become a primary concern.

Conclusion 

Will Hadoop replace the mainframe? 

Although mainframe and Hadoop technology are comparable on a certain level, they serve two completely opposite purposes. Hadoop is about integrating vast amounts of unstructured data from a wide variety of sources into a single interface, whereas the mainframe is about the exact processing of well-determined data sets within an enterprise context. As such, Hadoop can serve the enterprise by providing newly structured data to its core systems. The enterprise core system should still run on a platform that primarily focuses on the exact processing of data, possibly a mainframe. From this perspective, Hadoop and the mainframe are complementary.

Will Hadoop replace classic DBMS systems? 

As the Hadoop ecosystem provides more and more DBMS-like features, one might wonder whether classic DBMS will soon be replaced by Hadoop. The answer is twofold. First, the Hadoop ecosystem itself makes intensive use of (R)DBMS systems for many of its elements, so it will not remove the need for a classic DBMS any time soon. Second, as mentioned above, the semantics of Hadoop are primarily oriented toward 'unstructured' or 'semi-structured' data rather than structured data. This means that the Hadoop ecosystem is, once more, a complementary technology that can be added to a current data warehouse environment focused on well-structured and exact data.
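Hive is a good example of that first point: its metastore, the component that holds the logical metadata discussed earlier, is conventionally kept in a classic RDBMS. A minimal hive-site.xml fragment pointing it at a hypothetical PostgreSQL host could look like this:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://metastore-db.example.com:5432/hive_metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>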

The next step is integration!

Reading the above conclusions, you could argue that there is a very thin line between structured, semi-structured and unstructured data. Badly designed tables in an RDBMS can also hold semi-structured data, and well-formed JSON can hold fully structured data. Both situations occur and could be regarded as undermining the above argumentation. The argumentation still holds, though, as it is the primary focus of the chosen platform that matters, not what your data looks like.
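A tiny sketch makes the JSON case obvious: the record below (field names invented for illustration) is as rigidly structured as any table row, even though it travels as JSON:

    import json

    # The same customer record, once as a JSON document...
    doc = '{"customer_id": 42, "name": "Acme NV", "country": "BE"}'
    record = json.loads(doc)

    # ...and once as the tuple an RDBMS would store. Identical structure,
    # different serialization.
    row = (record['customer_id'], record['name'], record['country'])
    print(row)  # (42, 'Acme NV', 'BE')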

An important next step is the in-depth integration of Hadoop with classic enterprise systems. Examples can already be found in software like Teradata (allowing predicate pushdown to Hive), IBM Fluid Query (creating a virtual data warehouse with data from any source) and Splice Machine (establishing a full RDBMS on Hadoop). I truly believe that this evolution will continue and that we will evolve toward a situation where systems can interact with each other the way Lego blocks can be put together. In the long term, we should be able to choose the platform for our application according to our business needs, without regard to where its data sources reside.
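Predicate pushdown, the first of those examples, is easy to picture. In this sketch (hypothetical host and table, PyHive client), the second query ships the filter to Hive so that only matching rows ever leave the cluster:

    from pyhive import hive

    conn = hive.Connection(host='hive.example.com', port=10000)
    cur = conn.cursor()

    # Without pushdown: every row travels to the client, filtering is local.
    cur.execute("SELECT ip, url FROM web_logs")
    local = [r for r in cur.fetchall() if r[1].startswith('/checkout')]

    # With pushdown: the predicate is evaluated inside Hive itself.
    cur.execute("SELECT ip, url FROM web_logs WHERE url LIKE '/checkout%'")
    pushed = cur.fetchall()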

Friday, June 12, 2015

Welcome to the Big Data Dig!

This new blog will inform you about trends, evolutions and especially solutions in Information Management: the discipline of storing, integrating, transforming and retrieving data.

'The Big Data Dig' has a wide mission:
  • Inform you on how your data can be stored, and in which situations you should choose one solution over another.
  • Inform you on solutions that enable you to integrate your various data repositories into value-adding interfaces that are fit for use and purpose.
The Big Data Dig holds a few ideas in its title:
  • Big, because it will deal with a wide range of subjects within the information management domain.
  • Data, because it will discuss data storage, integration, transformation and retrieval solutions.
  • Big Data, because it will also discuss Big Data solutions, alongside other data solutions such as RDBMS, XML and other database paradigms.
  • Dig, because we will be digging up a lot of information (and as an archaeologist, I should have plenty of experience with digging).
Feel free to comment on the entries and to start new discussions on this subject.

Cheers
Ludovic