A Strong Trend Towards Integrated Data Lakes


Bringing together different views on data reporting

Business reporting without data coming from SAP systems is hardly possible. However, there are two kinds of professionals in the area of Business Reporting and Data Warehousing: those coming from an SAP background, and those who come from a technical or non-SAP background.

If you come from an SAP background (like myself), you will be very familiar with the complex relational data model of what used to be called SAP R/3 (nowadays S/4HANA), as well as with business processes and enterprise use cases. If you have a non-SAP background, you will probably know more about streaming data, sensor data, maybe even machine learning, and data lakes.

Personally, I have an SAP background. For all kinds of reporting, there used to be strong SAP-based solutions. In the old days, these were ABAP Query, LIS, and CO-PA, followed by the SAP Business Warehouse, which of course was a giant leap forward. Nowadays, a similar leap forward is happening in the integration of SAP data and data lakes.

Clearing up doubts about data lakes

When I was looking at the challenge of integrating data lakes with SAP data, I was a bit dumbfounded by two strong opinions which struck me as… let’s simply say odd:

  • I found it strange that people were doubting the use of a data lake for reporting purposes.
  • I found it equally strange that some people were even considering using the SAP Business Warehouse or SAP HANA for non-SAP data.

The second scenario is interesting: SAP BW is built for highly structured data coming from SAP transactional systems. SAP HANA is lightning fast and can handle all kinds of data (including unstructured data). However, using such technologies to basically do the job of a data lake cannot be right. TCO, available tools and platforms, and skills on the market are just some of the strong reasons for bringing SAP data to a data lake (for example, on the Hadoop distribution of your choice, such as Cloudera), and not the other way round (loading data lake data into BW or HANA). Of course, if you are thinking “SAP data should stay within the SAP technology stack”, you may want to consider SAP BDS (formerly Altiscale) or HANA Vora.

How to best work with SAP and non-SAP data

All things considered, there is only one real option: using SAP data generated by transactional business processes within data lakes. This is where SAP and non-SAP data should be combined. Such an integrated data lake can be used for advanced data processing, for data marts, and as a next-generation Business Warehouse.
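
To make the idea of combining SAP and non-SAP data a bit more tangible, here is a minimal sketch in plain Python. It assumes a hypothetical extract of SAP equipment master data and a hypothetical stream of sensor readings; all field names and values are invented for illustration and do not correspond to real SAP tables. On a real data lake, the same join would typically run on a distributed engine rather than in memory.

```python
from collections import defaultdict

# Hypothetical extract of SAP equipment master data (illustrative field names).
sap_equipment = [
    {"equipment_id": "EQ-100", "plant": "1000", "cost_center": "CC-10"},
    {"equipment_id": "EQ-200", "plant": "2000", "cost_center": "CC-20"},
]

# Hypothetical sensor readings coming from a non-SAP source.
sensor_readings = [
    {"equipment_id": "EQ-100", "temperature_c": 71.5},
    {"equipment_id": "EQ-100", "temperature_c": 74.5},
    {"equipment_id": "EQ-200", "temperature_c": 65.0},
    {"equipment_id": "EQ-999", "temperature_c": 80.0},  # no SAP master record
]


def average_temperature_by_plant(equipment, readings):
    """Join readings to SAP master data and average the temperature per plant."""
    plant_of = {row["equipment_id"]: row["plant"] for row in equipment}
    temperatures = defaultdict(list)
    for reading in readings:
        plant = plant_of.get(reading["equipment_id"])
        if plant is None:
            continue  # skip readings that cannot be matched to SAP data
        temperatures[plant].append(reading["temperature_c"])
    return {plant: sum(values) / len(values) for plant, values in temperatures.items()}


result = average_temperature_by_plant(sap_equipment, sensor_readings)
print(result)
```

The sensor data alone says nothing about plants or cost centers, and the SAP data alone says nothing about temperatures. Only the combination answers a business question.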

Hortonworks, recently (much to my delight) acquired by Cloudera, has been promoting such an approach for a while now. This seems like the natural way to go: establish a multi-purpose data lake and use this lake for scenarios ranging from pure data storage (e.g. archived data, aged from SAP) to data science, machine learning, and AI-driven data computations. Along the way, reporting capabilities are a natural must-have to visualize and consume the data.

When it comes to the integration of SAP data and data lakes, a typical first step is to use the data lake simply for storing aged SAP data in order to reduce the overall SAP TCO. This is a good first step, because you deliver results fast by “picking the low-hanging fruit”. You can easily use a data lake for ERP archives and aged SAP BW data, for example.

This diagram shows the required connections and building blocks for a modern integration architecture between SAP and Data Lakes:

[Figure: SAP and data lake architecture (Datavard)]

Offloading aged SAP data in this way is relatively easy and fast to achieve and provides tangible TCO savings. As a positive side effect, once this kind of integration is established, the ground is prepared for more advanced use cases: the technology integration is in place, and you are ready to enrich your data lake with SAP data.
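
The selection logic behind offloading aged data can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the records, the field names, and the cutoff date are hypothetical, and a real archiving run would also have to respect business rules such as document status and retention requirements.

```python
from datetime import date

# Hypothetical document headers; field names are illustrative, not SAP table fields.
documents = [
    {"doc_id": "1", "posting_date": date(2012, 3, 31)},
    {"doc_id": "2", "posting_date": date(2016, 1, 1)},
    {"doc_id": "3", "posting_date": date(2018, 11, 15)},
]


def split_by_cutoff(records, cutoff):
    """Split records into (hot, aged): aged records were posted before the cutoff."""
    hot, aged = [], []
    for record in records:
        if record["posting_date"] < cutoff:
            aged.append(record)  # candidate for offloading to the data lake
        else:
            hot.append(record)  # stays in the SAP system
    return hot, aged


hot, aged = split_by_cutoff(documents, date(2016, 1, 1))
print("stay in SAP:", [r["doc_id"] for r in hot])
print("move to data lake:", [r["doc_id"] for r in aged])
```

A document posted exactly on the cutoff date stays in the SAP system here. That is a design choice of this sketch, not a rule.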

Integration challenges for SAP customers

SAP and data lake integration poses several challenges for SAP customers:

  1. There is a major knowledge gap between experts in the SAP world and the Big Data world. For SAP experts who need to work with a data lake, this gap is mainly of a technical nature, which is a big challenge in itself. In the opposite direction, the challenge is even bigger: on top of the technology, Big Data experts have to deal with customer-specific business processes combined with SAP’s complex transactional data model.
  2. There are major technology gaps which need to be bridged: processing engines, different storage technologies, security, software logistics, and the question of where to run the platform: in the cloud or on-premise.
  3. Finally, there is the challenge of identifying and accessing relevant data from SAP. There are several ways of doing this, of course, each with its own pros and cons. You can access ERP directly, or you can use SAP BW as a kind of pass-through. You can also use a combination of SLT with SAP Data Services, or a third-party ETL tool. In all cases, the challenge remains of identifying the right data and treating it correctly when it comes to access control, authorizations, and personal data protection.
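
As a small illustration of the last point, personal data can be pseudonymized before a record ever reaches the data lake. The sketch below uses a keyed hash (HMAC-SHA256) from the Python standard library. The field names, the set of personal fields, and the key are assumptions made up for this example. It is one possible technique, not a description of any particular product, and key management is deliberately left out.

```python
import hashlib
import hmac

# Hypothetical set of fields that are considered personal data.
PERSONAL_FIELDS = frozenset({"customer_name", "email"})


def pseudonymize(record, secret_key, personal_fields=PERSONAL_FIELDS):
    """Return a copy of the record with personal fields replaced by keyed hashes."""
    masked = {}
    for field, value in record.items():
        if field in personal_fields:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
        else:
            masked[field] = value  # non-personal fields pass through unchanged
    return masked


# Hypothetical sales order record with illustrative values.
order = {
    "order_id": "4711",
    "customer_name": "Jane Doe",
    "email": "jane.doe@example.com",
    "net_value": 129.90,
}

demo_key = b"demo-key-not-for-production"
masked_order = pseudonymize(order, demo_key)
print(masked_order)
```

Because the hash is keyed and deterministic, the same customer always maps to the same pseudonym, so records can still be joined and counted in the data lake without exposing the original values.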

In the following blog posts, I will discuss different details of this technology and integration. I’ll cover all challenges in more detail and show how we at Datavard tackle them.