Integrating SAP with Hadoop – it’s possible, but…


Background story

Nearly everybody has heard of SAP, but most IT professionals have a hard time understanding what the fuss about SAP software and the data in SAP systems is all about. After all, SAP has a reputation for being a black box that produces cryptic, hard-to-use software at a high price. Let’s take this step by step.

SAP is a software company headquartered in Germany that specializes in business software. Founded more than 40 years ago, it is a dinosaur of the IT industry, though a very important, agile, and innovative one. Just imagine: nearly 76% of all business transactions on our planet run through an SAP system[1]! Starting from a strong focus on standard business software, SAP has made its way into various industries and technologies. In recent years, SAP has spearheaded innovation in data processing with its in-memory columnar database SAP HANA, which enables extremely fast and efficient processing of data (albeit at a high cost).

SAP is a lone wolf in the data world

Integrating SAP systems and SAP data with non-SAP software is possible, but depending on the requirements, the type of data, and the technology used, it can prove challenging. There are two reasons for this:

  1. SAP provides various ways to build interfaces, but most of these were designed with a business purpose in mind (e.g. EDI, which connects different SAP systems and non-SAP systems in a message-style interface of business processes).
  2. It is difficult to find and identify the relevant data in SAP systems – especially if you’re not deeply familiar with SAP systems and the SAP ERP data model.

In this blog post, I will explain how to tackle both challenges: technical integration of SAP and Big Data platforms, as well as how to identify and find the right data in SAP systems.

Why would you want to integrate SAP data?

There are several good reasons to integrate data lakes or Big Data applications with SAP:

  • TCO Reduction: Offloading (or archiving) aged SAP data reduces the cost of SAP systems by storing that data on cheaper technology.
  • Data Integration: SAP data can offer treasures for data scientists when integrated into Big Data scenarios. Think fraud detection, predictive maintenance, anomaly detection, machine learning to optimize production, customer 360…

A wide range of scenarios requires the integration of business data from SAP systems with Hadoop data.


The integration challenge

When integrating SAP with Big Data applications, you will run into several challenges. We have identified three major gaps:

  • The IT culture gap, i.e. the way development and software logistics are performed. There is a fundamental difference between the worlds of Big Data and SAP.
  • The Operations gap, i.e. running applications, job scheduling, orchestration, and the handling of patches and upgrades.
  • The Data Integration gap, which is all about ETL, i.e. loading and transforming data.

Remember that SAP is a heavily controlled and regulated environment that has to be fully SOX-compliant (SOX stands for Sarbanes-Oxley, a set of rules that enterprises have to comply with for financial reporting).

To overcome those challenges, we developed a technology that tightly integrates data on the SAP side. It connects SAP and Hadoop seamlessly – no additional server for any ETL or integrations is required.

Glue: Datavard’s integration technology on the SAP side

Our technology for integrating SAP and Hadoop is called Datavard Glue. Using Datavard Glue, data scientists can easily identify data in SAP systems. They can set up data replication from SAP to Hadoop and use SAP data in their Big Data applications (e.g. using the Data Science Workbench). Of course, it is also possible to consume data from Cloudera Hadoop in SAP applications.

By integrating SAP and Hadoop closely and embedding the integration on the SAP side, we solve several challenges. The orchestration of jobs, scheduling, and access follows SAP standard rules. This helps with overall integration, but also with access control. The SAP authorization concept can be leveraged to fine-tune access both to the integration application itself and to SAP application data. Glue is also embedded into SAP’s software logistics (i.e. the so-called Transport Management System), so the Big Data integration can be built following SAP’s rules of software deployment, which use a multi-system landscape with dedicated development and test systems.

The following figure illustrates our solution and how it is embedded.



Using a central component called “Storage Management”, we connect SAP and Cloudera Hadoop. Storage Management includes various interfaces that we use to bridge the different technologies: JDBC/ODBC, REST, SAP RFC, and others. With Storage Management as middleware, Datavard Glue allows you to develop data models on Hadoop, to set up ETL processes that extract and transform data from SAP, and to consume Hadoop data from SAP.

Based on Datavard Storage Management we have built two applications:

  • Datavard OutBoard designed for offloading (called Nearline Storage or simply archiving) in the SAP universe. Archiving includes data verification and data deletion from the SAP system. This requires specialized interfaces, and is therefore a specialized (technical) use case of any Big Data integration. Offloading is a one-way road: Data is extracted and removed from SAP, but it is still consumed exclusively in SAP applications (even though it is stored elsewhere, e.g. on Hive with Impala). We often refer to this scenario as cut-and-paste of data.
  • Datavard Glue opens Storage Management for true integration, i.e. a copy-and-paste style of preparing data, enriching data, and making data available for Hadoop.
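
The difference between the two styles can be sketched in a few lines of Python. This is only an illustration and not Datavard code: plain lists stand in for an SAP table and for the Hadoop-side stores, and the names checksum, replicate, offload, and is_aged are hypothetical. The point it shows is the order of operations: offloading deletes from the source only after the copy has been verified, while replication leaves the source untouched.

```python
import hashlib
import json


def checksum(rows):
    """Order-independent digest of a list of row dicts."""
    encoded = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(encoded).encode("utf-8")).hexdigest()


def replicate(source, target, predicate):
    """Copy-and-paste: copy matching rows; the source stays as it is."""
    selected = [dict(row) for row in source if predicate(row)]
    target.extend(selected)
    return len(selected)


def offload(source, target, predicate):
    """Cut-and-paste: copy matching rows, verify, then delete from the source."""
    selected = [dict(row) for row in source if predicate(row)]
    start = len(target)
    target.extend(selected)
    # Verify before deleting: what was written must match what was selected.
    if checksum(target[start:]) != checksum(selected):
        del target[start:]
        raise RuntimeError("verification failed; source left unchanged")
    source[:] = [row for row in source if not predicate(row)]
    return len(selected)


def is_aged(row):
    """Hypothetical rule: documents from before 2020 count as aged."""
    return row["year"] < 2020


# In-memory stand-ins (assumptions): one SAP table and two Hadoop-side stores.
sap_table = [
    {"doc": "1000", "year": 2012, "amount": 150.0},
    {"doc": "1001", "year": 2013, "amount": 75.5},
    {"doc": "1002", "year": 2023, "amount": 310.0},
]
data_lake = []
offload_store = []

replicated = replicate(sap_table, data_lake, lambda row: True)
offloaded = offload(sap_table, offload_store, is_aged)
print(replicated, offloaded, len(sap_table))
```

In a real landscape the verification step would compare what was actually written on the Hadoop side with what was read from SAP; the digest comparison here merely marks where that check belongs.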

The following figure illustrates how Storage Management helps us connect SAP applications built in SAP’s native programming language, ABAP, with Hadoop.


However, this is just technology. Since an average SAP system comes with close to 100,000 database tables (some of which will be extended with custom fields, in addition to entirely custom tables), it is extremely hard to even find the data you need. This difficulty is addressed by our Business Process Library (BPL). Using the BPL, you can work with business objects instead of cryptic tables, e.g. you would work with the object “customer”, which encapsulates all tables in SAP that contain customer master data. The BPL allows data scientists to find data in SAP, and even to analyse custom data models and custom code. Furthermore, it provides an easy way of translating cryptic technical field names and values into human-readable ones.
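
To make the idea of a business object concrete, here is a minimal Python sketch. It is not the BPL’s actual interface: the class BusinessObject, the CATALOG dictionary, and the functions tables_for and to_readable are hypothetical names chosen for illustration. The table names (KNA1, KNB1, KNVV) and field names (KUNNR, NAME1, LAND1, ORT01) are standard SAP customer master identifiers; the readable labels are assumptions.

```python
from dataclasses import dataclass


@dataclass
class BusinessObject:
    """A named business object that bundles tables and readable field labels."""
    name: str
    tables: tuple
    field_labels: dict


# Hypothetical catalog entry for the "customer" object.
CUSTOMER = BusinessObject(
    name="customer",
    tables=("KNA1", "KNB1", "KNVV"),  # general, company code, and sales data
    field_labels={
        "KUNNR": "customer_number",
        "NAME1": "name",
        "LAND1": "country",
        "ORT01": "city",
    },
)

CATALOG = {CUSTOMER.name: CUSTOMER}


def tables_for(object_name):
    """Return the technical tables behind a business object."""
    return CATALOG[object_name].tables


def to_readable(object_name, row):
    """Rename technical field names; fields without a label are kept as-is."""
    labels = CATALOG[object_name].field_labels
    return {labels.get(key, key): value for key, value in row.items()}


raw = {"KUNNR": "0000100001", "NAME1": "Example GmbH", "LAND1": "DE", "ERDAT": "20150312"}
readable = to_readable("customer", raw)
print(tables_for("customer"))
print(readable)
```

A data scientist would then ask for “customer” rather than for individual table names, and would receive rows whose columns can be read without an SAP data dictionary at hand.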



[1] Source:
