Integrating SAP with Hadoop – it’s possible, but…


Background story

Nearly everybody has heard of SAP, but most IT professionals have a hard time understanding all the fuss about SAP software and the data in SAP systems. After all, SAP has a reputation for being a black box, producing cryptic, hard-to-use software that comes at a high price. Let’s take this step by step.

SAP is a software company headquartered in Germany that specializes in business software. Founded more than 40 years ago, SAP is a dinosaur in the IT industry. A very important, agile, and innovative dinosaur, though. Just imagine: nearly 76% of all business transactions on our planet run through an SAP system[1]! Coming from a strong focus on standard business software, SAP has made its way into various industries and technologies. In recent years SAP has spearheaded innovation in data processing with its in-memory columnar database SAP HANA, enabling extremely fast and efficient processing of data (albeit at a high cost).

SAP is a lone wolf in the data world

Integrating SAP systems and SAP data with non-SAP software is possible, but depending on the requirements, the type of data, and the technology used, it may prove challenging. There are two reasons for this:

  1. SAP provides various ways to build interfaces, but most of these were designed with a business purpose in mind (e.g. EDI, which connects SAP systems and non-SAP systems in a message-style interface for business processes).
  2. It is difficult to find and identify the relevant data in SAP systems – especially if you’re not deeply familiar with SAP systems and the SAP ERP data model.

In this blog post, I will explain how to tackle both challenges: technical integration of SAP and Big Data platforms, as well as how to identify and find the right data in SAP systems.

Why would you want to integrate SAP data?

There are several good reasons why one would want to integrate data lakes or Big Data applications with SAP:

  • TCO Reduction: Offloading (or archiving) aged SAP data is a smart way to reduce the cost of SAP systems by simply storing that data on cheaper technology.
  • Data Integration: SAP data can offer treasures for data scientists when integrated into Big Data scenarios. Think fraud detection, predictive maintenance, anomaly detection, machine learning to optimize production, customer 360…

There is a wide range of scenarios that require the integration of business data from SAP systems with Hadoop data.


The integration challenge

When integrating SAP with Big Data applications, you will run into several challenges. We have identified three major gaps:

  • The IT culture gap, i.e. the way development and software logistics are performed. Here there is a fundamental difference between the Big Data and SAP worlds.
  • The Operations gap, i.e. running applications, job scheduling, orchestration, and the handling of patches and upgrades.
  • The Data Integration gap, which is all about ETL: loading and transforming data.

Remember that SAP is a heavily controlled and regulated environment, which has to be fully SOX-compliant (SOX being the Sarbanes-Oxley Act, a set of rules enterprises must comply with for financial reporting).

To overcome these challenges, we developed a technology that integrates tightly on the SAP side. It connects SAP and Hadoop seamlessly; no additional server for ETL or integration is required.

Glue: Datavard’s integration technology on the SAP side

Our technology for integrating SAP and Hadoop is called Datavard Glue. Using Datavard Glue, data scientists can easily identify data in SAP systems. They can set up data replication from SAP to Hadoop and use SAP data in their Big Data applications (e.g. using the Data Science Workbench). Of course, it is also possible to consume data from Cloudera Hadoop in SAP applications.

By integrating SAP and Hadoop closely and embedding the integration on the SAP side, we solve several challenges. The orchestration of jobs, scheduling, and access follows standard SAP rules. This helps with the overall integration, but also with access control: the SAP authorization concept can be leveraged, making it possible to fine-tune access both to the integration application itself and to SAP application data. Glue is also embedded in SAP’s software logistics (the so-called Transport Management System), so the Big Data integration can be built following SAP’s rules of software deployment, using a multi-system landscape with dedicated development and test systems.

The following figure illustrates our solution and how it is embedded.



Using a central component called “Storage Management”, we connect SAP and Cloudera Hadoop. Storage Management includes various technologies which we use to bridge the two worlds: JDBC/ODBC, REST, SAP RFC, and others. Using Storage Management as middleware, Datavard Glue allows you to develop data models on Hadoop, set up ETL processes to extract and transform data from SAP, and consume Hadoop data from SAP.
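To make the RFC path concrete, here is a minimal sketch of pulling rows from an SAP table into Python with SAP’s open-source PyRFC library and the classic (if limited) `RFC_READ_TABLE` function module. The connection parameters and the choice of table (`KNA1`, customer master) are illustrative assumptions, not part of Glue itself; only the small row-parsing helper is plain, SAP-independent Python:

```python
def parse_rfc_rows(field_names, raw_rows, delim="|"):
    """Split the delimited WA strings returned by RFC_READ_TABLE into dicts."""
    return [
        dict(zip(field_names, (value.strip() for value in row["WA"].split(delim))))
        for row in raw_rows
    ]

# The actual call needs a live SAP system and the pyrfc package
# (credentials below are placeholders):
#
# from pyrfc import Connection
# conn = Connection(ashost="sap-host", sysnr="00", client="100",
#                   user="RFC_USER", passwd="secret")
# fields = ["KUNNR", "NAME1", "LAND1"]   # customer number, name, country key
# result = conn.call("RFC_READ_TABLE", QUERY_TABLE="KNA1", DELIMITER="|",
#                    FIELDS=[{"FIELDNAME": f} for f in fields], ROWCOUNT=100)
# customers = parse_rfc_rows(fields, result["DATA"])

# Offline demo using the row structure RFC_READ_TABLE returns:
sample = [{"WA": "0000004711|ACME Corp |DE"}]
print(parse_rfc_rows(["KUNNR", "NAME1", "LAND1"], sample))
# → [{'KUNNR': '0000004711', 'NAME1': 'ACME Corp', 'LAND1': 'DE'}]
```

`RFC_READ_TABLE` has well-known limits (row width, type handling), which is one reason a dedicated integration layer is needed for anything beyond small lookups.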

Based on Datavard Storage Management we have built two applications:

  • Datavard OutBoard, designed for offloading (called Nearline Storage, or simply archiving, in the SAP universe). Archiving includes data verification and data deletion from the SAP system. This requires specialized interfaces and is therefore a specialized (technical) use case of Big Data integration. Offloading is a one-way road: data is extracted and removed from SAP, but it is still consumed exclusively in SAP applications (even though it is stored elsewhere, e.g. on Hive with Impala). We often refer to this scenario as cut-and-paste of data.
  • Datavard Glue opens Storage Management up for true integration, i.e. a copy-and-paste style of preparing data, enriching it, and making it available to Hadoop.

The following figure illustrates how Storage Management helps us connect SAP applications, built in SAP’s native programming language ABAP, with Hadoop.


However, this is just technology. Since an average SAP system comes with close to 100,000 database tables (some of which will be extended with custom fields, or even accompanied by additional custom tables), it is extremely hard even to find the data you need. This difficulty is addressed by our Business Process Library (BPL). Using the BPL, you can work with business objects instead of cryptic tables; e.g. you would work with the object “customer”, which encapsulates all SAP tables containing customer master data. The BPL allows data scientists to find data in SAP, and even to analyse custom data models and custom code. Furthermore, it provides an easy way of translating technical, cryptic field names and values into meaningful, readable data.
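To illustrate what such a translation amounts to, here is a minimal hand-rolled sketch. The field names are genuine columns of the SAP customer-master table KNA1, but the mapping and code are a toy illustration of the idea, not the BPL itself (which ships far richer, curated content):

```python
# Illustrative mapping from cryptic KNA1 fields to readable names.
KNA1_FIELD_MAP = {
    "KUNNR": "customer_number",
    "NAME1": "customer_name",
    "LAND1": "country_key",
    "ORT01": "city",
}

def translate_record(record, field_map):
    """Rename cryptic technical fields to readable ones; unknown fields
    (e.g. customer-specific Z-fields) are kept as-is."""
    return {field_map.get(key, key): value for key, value in record.items()}

raw = {"KUNNR": "0000004711", "NAME1": "ACME Corp", "LAND1": "DE", "ZZSEGMENT": "A1"}
print(translate_record(raw, KNA1_FIELD_MAP))
# → {'customer_number': '0000004711', 'customer_name': 'ACME Corp',
#    'country_key': 'DE', 'ZZSEGMENT': 'A1'}
```

In the real product this kind of mapping also covers value translation (e.g. resolving customizing keys into descriptive texts), which a plain rename cannot do.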

Download e-book on integrating SAP with Hadoop


[1] Source:

5 comments on “Integrating SAP with Hadoop – it’s possible, but…”

    • varun on

      really? you understood about sap based on this one pager. hmm,,

i am a sap guy with 12+ years of exp in implementing various end-to-end processes of sap like otc, ptp, rtr, etc.

i can tell you, it is easy for us sap people to learn big data tools and deliver. it is never going to be possible for people working in big data to learn sap and deliver anything meaningful,, dream on. 🙂

      hell, we don’t even follow sdlc, agile or any such. asap (accelerated sap) is our bible for implementation. sap is a different and high profile world where you work with people reporting cios, ctos, plant incharge,,

      • Goetz Lessmann on

        Hi Varun,

not sure what exactly you want to tell me ;). If you refer to our BPL with the “one pager”, then let me comment: in general, a one-pager will not teach anybody all about SAP, but it gives them some pointers. This helps in projects, but of course people still need to think for themselves. The real value of the BPL is the business content we have prepared and ship with it. It is a considerable accelerator for building data flows in our solution: you can very quickly create data models and expose SAP data to Hadoop, with automatically generated lookups into customizing, and with the data “de-SAP-ified” to make it more readable and useful for data scientists.

        I wouldn’t use ASAP for a Hadoop deployment, though.


        • Sam on

Think you didn’t read your own paper. With 100,000 tables it is impossible to expose these through a third-party tool. SAP isn’t sure what all of the tables are used for.

Wouldn’t you want to use SAP BW’s DSOs? They are really cool; they are built in and allow you to work with objects instead of cryptic tables.

          Also SAP is a company with many different products. Which ones do you interface with?

          • Goetz Lessmann on

            Hi Sam,

some of our customers use DSOs as a pass-through for extraction from ECC / S/4HANA. However, this comes at a price in terms of footprint, hardware, and complexity, and some extractions for deltas or real-time streaming won’t work with that architecture. Of course, nobody will extract all 100k tables of SAP. The trick is to focus on the few tables which are relevant for a particular use case (e.g. enriching a dashboard for predictive maintenance with asset or equipment data). What we do with our solution is provide a lightweight, extremely flexible ETL and workbench to identify relevant data, create data models on Hive, and populate them with data. As such, we support all SAP products, especially the business applications built on NetWeaver and HANA (both on premise and in the cloud). The challenges cover a wide range: ETL, data federation, delta handling, data modelling, security, and integration with SAP software logistics. Other cloud-based SAP solutions such as SuccessFactors can be integrated via procedural extractors in our solution that tap into the REST APIs the cloud offerings provide. Please let me know if this information helps you.

            Many thanks,
