
Glue: Hadoop and Big Data integration with SAP

SAP and Hadoop integration

In this post, I’m following up on my presentation at Datavard’s April 2016 CIC, our “Customer Innovation Circle” event where we meet with our clients and partners to discuss the latest developments in the SAP world. There we unveiled a new technology which helps our customers tightly integrate the SAP universe with their Big Data applications.

Within the last decade, the Big Data wave has swept over the IT world. Clusters with tens of thousands of nodes crunch through data using completely new processing paradigms. SAP’s answer to this has always been HANA. Of course, an in-memory database with a column store, R integration, etc. must be the ultimate tool to be faster and better. Or not!? Well, there is a scalability issue, maybe not so much in technology, but definitely in cost. Hadoop is therefore gaining more and more interest among SAP customers. For a long time (at least until about 12 months ago), SAP’s offering for integrating Hadoop was weak. Since then, SAP has provided several ways and technologies of integration.

 

SDA, DT, NLS, and Vora

 

SAP is unfortunately sending mixed messages which confuse customers, developers, and architects. The options on the table are NLS (which screams “let’s go for IQ, and maybe Hadoop if you really have to”), SDA (Smart Data Access, which can use Hadoop as a remote source), Dynamic Tiering (SAP’s way of implementing an NLS-like data management tool on IQ to remove the need to even consider Hadoop), and of course HANA Vora, SAP’s new way of embracing Hadoop.
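Of these options, SDA is perhaps the easiest to picture. The following is a minimal sketch of exposing a Hive table to HANA as a virtual table over JDBC; the host, credentials, ODBC DSN, and all schema and table names are assumptions made for this example, and the exact remote-source syntax depends on the HANA revision, so consult the SDA documentation rather than copying this verbatim.

    import java.sql.DriverManager

    // Sketch: make a Hive table queryable from HANA via Smart Data Access.
    object SdaHiveSketch {
      def main(args: Array[String]): Unit = {
        // Assumes the HANA JDBC driver (ngdbc.jar) is on the classpath
        val conn = DriverManager.getConnection(
          "jdbc:sap://hana-host:30015", "SYSTEM", "secret")
        val stmt = conn.createStatement()

        // Register the Hive system as a remote source via an ODBC DSN
        stmt.execute(
          """CREATE REMOTE SOURCE "HADOOP_HIVE" ADAPTER "hiveodbc"
            |CONFIGURATION 'DSN=HIVE_DSN'
            |WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=hive;password=secret'
            |""".stripMargin)

        // Create a virtual table pointing at a Hive table; HANA can now query
        // and join it like a local table while the data stays in Hadoop
        stmt.execute(
          """CREATE VIRTUAL TABLE "MYSCHEMA"."VT_SALES"
            |AT "HADOOP_HIVE"."HIVE"."default"."sales"
            |""".stripMargin)

        val rs = stmt.executeQuery("SELECT COUNT(*) FROM MYSCHEMA.VT_SALES")
        while (rs.next()) println(s"rows visible through SDA: ${rs.getLong(1)}")
        conn.close()
      }
    }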

 

Some of these solutions are heavily scenario-dependent, and all of them come with different limitations and restrictions. What they have in common is that they represent SAP’s struggle to integrate SAP and HANA with the new Big Data world. In general, to address data growth within HANA, SAP is pushing the IQ database heavily, both with NLS and DT. Funnily enough, other NLS implementations such as our OutBoard have been supporting various storage types, including Hadoop, for five years now.

 

Datavard’s NLS implementation OutBoard for Analytics offers native Hadoop support with both HDFS and HIVE implementations, plus additional functions such as mass archiving, the NLS writer, storage management, data ageing, etc. As such, OutBoard is a good first step to pick the low-hanging fruit and quickly integrate SAP BW with Hadoop.

 

Three Gaps between SAP and Hadoop

 

However, we see three gaps between the Hadoop Big Data world and SAP landscapes.

 

There is an IT culture gap: the development strategies for SAP and Hadoop differ fundamentally. SAP runs multi-tier landscapes with development, test, and production systems. Hadoop developers work on a local simulated cluster, and data scientists work on production, because that is the only place where real data can be found. There are no software logistics on Hadoop (unless one considers copying JAR files “logistics”).

 

There is a data integration gap: ETL (Extraction, Transformation, Loading) works completely differently on SAP than on Hadoop. Integration is therefore difficult and often involves flat files as an intermediate format. SAP bridged such gaps for its own software decades ago with the IDoc standard, with RFC calls, even with complex data migration tools such as LSMW, and later on with various middlewares like CIF, the CRM middleware, and the BW extractors. Between Hadoop and SAP, however, data still does not flow natively today. HANA Vora aims at helping with this gap, but more on this later.
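To illustrate the flat-file handoff, here is a minimal sketch of the Hadoop side of such an exchange, assuming SAP has already exported a CSV extract to HDFS. The file path, delimiter, and table names are assumptions made for this example, and parsing CSV in Spark 1.x requires the separate spark-csv package.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // Load an SAP CSV extract into Hive so Hadoop-side tools can query it.
    object CsvToHive {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sap-csv-to-hive"))
        val hiveContext = new HiveContext(sc) // requires a configured Hive metastore

        // Read the extract; header and delimiter settings depend on how SAP exported it
        val sapExtract = hiveContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("delimiter", ";")
          .load("hdfs:///staging/sap/sales_extract.csv")

        // Persist as a Hive table so the data is queryable from HIVE / Impala
        sapExtract.write.mode("overwrite").saveAsTable("staging.sap_sales")

        sc.stop()
      }
    }

The reverse direction works the same way in principle: results are written back to a file or a Hive table, and the SAP side has to pick them up from there.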

 

Finally, there is an operational gap. Hadoop is already pretty good where 24x7 operation and high availability are concerned. However, running a Hadoop cluster on a daily basis differs fundamentally from running an SAP landscape with its rigid rules regarding data governance, authorizations, keeping data for 10+ years, software logistics, etc. From the Big Data world’s perspective, keeping data for ten years for legal reasons may sound strange; from the SAP side, the fast-paced evolution of the Hadoop ecosystem, with new technologies and engines emerging every week, appears alien. In short, the two kinds of systems are operated in considerably different ways.


SAP HANA Vora

 

HANA Vora is primarily a Java-driven query engine for Hadoop (and Spark). One would expect Vora to bridge all the gaps we see between Hadoop, the Java world, and the SAP ABAP world, and indeed it helps address the data flow gap. Today, however, Vora is a very technical, library-style engine which operates on top of Spark, whereas SAP’s real technical strength has always been to provide tools with structure and to build transparent, solid applications on top.
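To illustrate what “library-style” means in practice: in the Vora 1.x Spark integration, one works through an SAP-provided SQL context and registers Vora tables against the com.sap.spark.vora data source. The sketch below follows that pattern, but the exact class name and the required options varied between Vora releases, so treat it as a hedged sketch rather than reference code; file and table names are invented.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SapSQLContext

    // A rough sketch of registering and querying a Vora table from Spark.
    object VoraSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("vora-sketch"))
        val vc = new SapSQLContext(sc) // Vora's extension of the plain SQLContext

        // Register a file on HDFS as a Vora table, served by the Vora engines
        // (early releases needed further options, such as hosts and zkurls)
        vc.sql(
          """CREATE TABLE sales (id INT, amount DOUBLE)
            |USING com.sap.spark.vora
            |OPTIONS (tableName "sales", paths "/user/vora/sales.csv")
            |""".stripMargin)

        // From here, the table is queryable through plain Spark SQL
        vc.sql("SELECT id, SUM(amount) FROM sales GROUP BY id").show()
      }
    }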

 

Vora is a good first step, but it is nowhere near what SAP provides in the ABAP world. To make it work, deep technical skills in various technologies from Unix to Java are required. Sadly, this reminds me of the old R/2 days, when customers complained that in order to build anything for R/2 you needed two people: an assembler programmer and a business process specialist. With R/3 and ECC, the business process specialists could at least sometimes help themselves by programming some simple ABAP into a smart form or a user exit. Nowadays, apparently, one needs three or four people: a Java programmer, a data scientist, a business process specialist, and maybe a Unix/Hadoop expert. When it comes to SAP integration, add an ABAP programmer or an SAP BW developer/analyst.

 

Integrating SAP ABAP and Hadoop

 

Therefore, we at Datavard are looking into ways of integrating HANA Vora functionality into ABAP developments and proven technology such as the Data Dictionary, TMS, software logistics, BW process chains, and BW InfoProviders. This way, we believe, applications can be built more easily and faster, with fewer different experts needed to implement them, while existing and proven technologies such as the SAP TMS and ABAP are leveraged. The platform we use for this glues the different worlds together – therefore we call it “Glue”.

 


 

Datavard Glue functions as a middleware. It is built in ABAP, but it reaches into Hadoop to move data back and forth between SAP and Hadoop (HIVE / Impala, to be precise). Users can create code (e.g. in Pig Latin) from within SAPGUI, move it with the SAP TMS from development to production, and run it in Hadoop; Hadoop activities and jobs can be monitored in Oozie. Glue can push data from SAP into Hadoop and process it within Hadoop, leveraging the computational power of all Hadoop nodes instead of running on the SAP application server only (as would be the case with SAP’s SDA). Subsequently, Glue collects the results and makes them available, for example as virtual InfoProviders in SAP BW, as simple tables within SAP, or as direct-update DSOs in SAP BW. Hadoop jobs can be seamlessly embedded in SM37 jobs or in BW process chains.
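Glue’s own interface is ABAP and proprietary, so the following is not its API. It is merely a sketch, in Spark terms, of the shape of the Hadoop-side step in such a round trip: read what SAP pushed down, aggregate it on the cluster, and write the result to a place where the SAP side can collect it. All table and column names are assumptions made for this example.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    // The Hadoop-side step of an SAP round trip: aggregate pushed-down data
    // on the cluster and stage the result for collection by SAP.
    object HadoopSideJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("glue-style-roundtrip"))
        val hc = new HiveContext(sc)

        // The heavy lifting runs on the cluster, not on the SAP application server
        val result = hc.sql(
          """SELECT customer_id, SUM(amount) AS total
            |FROM staging.sap_sales
            |GROUP BY customer_id
            |""".stripMargin)

        // Results land in a Hive table that the SAP side reads back, e.g. as a
        // virtual InfoProvider or a direct-update DSO in BW
        result.write.mode("overwrite").saveAsTable("results.sales_totals")
        sc.stop()
      }
    }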

 





5 comments on “Glue: Hadoop and Big Data integration with SAP”

    • Goetz Lessmann on

      Once Vora is up and running, it is a beautiful starting point, of course. However, looking at SAP note #2213226, the bridge does not seem to be very wide yet. There are numerous problems with integration on a technical level; I’ll elaborate in a future blog post. Not even Kerberos is supported yet (and this is now half a year after the original blog post).

    • Goetz Lessmann on

      Thanks for replying!

      We have high hopes for Vora, but even now, nearly six months after my original blog post, Vora is not functioning as a true integration bridge. It works beautifully as a technology on Spark and Hadoop, but it does not easily integrate with the ABAP stack, e.g. in terms of ETL, and not at all with the STMS.

