How to Identify the SAP Data You Need on Your Data Lake

SAP and Data Lake are a launching pad for Big Data scenarios

Having SAP-based data in a data lake is very helpful when implementing scenarios such as fraud detection, churn reduction, or predictive maintenance. For example, you can identify equipment, such as machinery or trucks, that should be serviced by combining its maintenance history (stored in SAP, of course) with sensor data (streaming into your Hadoop cluster). You can then generate new maintenance orders in SAP automatically, based on a machine learning prediction scenario implemented on your cluster in Spark and Scala using MLlib.

To implement such scenarios, you need to put together several pieces of technology and data. For the infrastructure, you can use any ETL tool to bring SAP data to external targets. However, it may be worth considering a lightweight, natively integrated solution such as Datavard Glue as the engine for the integration. (In the next blog posts of this series, I will show how this works and how SAP customers can derive value from it.)

In general, you need to consider several things when preparing both the data integration and the functional integration of your data lake and SAP:

  1. SAP Data Discovery
  2. Self-service BI and self-service data acquisition
  3. Contextualization of data
  4. Leveraging existing content, such as the BPL, to identify SAP tables and relevant fields
  5. Deciding on the source of SAP data: ERP or BW?

SAP Data Discovery

SAP Data Discovery is the process of identifying which data is available in SAP and where it is stored. The goal of the overall scenario should be considered here, because different scenarios may call for different types of data and different levels of detail.

Self-service BI

In recent years, we have seen a trend towards self-service BI on data lakes. This is a beautiful concept where, in theory, data scientists and even reporting users can request data from SAP and other data sources by populating a “shopping cart”. The idea is that the requested data then appears on the data lake. In real life, there is typically a workflow that includes human decisions and double checking, for example to decide on the relevance of the data request and to evaluate its data protection and performance implications.
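The review step in such a workflow can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular product implements it: the `DataRequest` class, the `review` function, and the three check names are hypothetical and simply mirror the checks mentioned above (relevance, data protection, performance).

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class DataRequest:
    """A hypothetical self-service request: who asks, and what is in the cart."""
    requester: str
    items: list                      # the "shopping cart", e.g. SAP table names
    status: Status = Status.REQUESTED
    notes: list = field(default_factory=list)


def review(request, is_relevant, passes_data_protection, acceptable_load):
    """Record the outcome of the human checks and approve or reject the request."""
    checks = {
        "relevance": is_relevant,
        "data protection": passes_data_protection,
        "performance": acceptable_load,
    }
    failed = [name for name, ok in checks.items() if not ok]
    request.notes.extend(f"failed check: {name}" for name in failed)
    request.status = Status.REJECTED if failed else Status.APPROVED
    return request


# One request passes all checks, the other fails the data protection review.
approved = review(DataRequest("data.scientist", ["VBAK", "VBAP"]), True, True, True)
rejected = review(DataRequest("report.user", ["PA0002"]), True, False, True)
```

Only an approved request would then trigger the actual extraction to the data lake. The notes keep a trace of why a request was turned down.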

Contextualization of data

Contextualization is important for making sense of and interpreting SAP data on a data lake. Data from transactional SAP systems is quite cryptic, with a ton of fields carrying abbreviations, flags, and pointers to other tables. Such data should be translated into user-friendly values that can be used directly, without constant lookups into other tables. After all, you will not want to rebuild the complete SAP data model by joining dozens of tables together on Hadoop when querying the data. The technology used for tapping into SAP data clearly needs to support such contextualization, which is sometimes also referred to as de-normalization of data.
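To make the idea of de-normalization concrete, here is a minimal sketch in plain Python. It assumes a few illustrative rows from the material master (MARA) and the material description table (MAKT); the sample values, the `material_type_texts` lookup, and the `denormalize` function are made up for this example, and real tables carry many more fields.

```python
# Illustrative rows only. MARA holds the material master, MAKT the
# language-dependent material descriptions.
mara = [
    {"MATNR": "000000000000001234", "MTART": "FERT"},
    {"MATNR": "000000000000005678", "MTART": "ROH"},
]
makt = [
    {"MATNR": "000000000000001234", "SPRAS": "E", "MAKTX": "Hydraulic pump"},
    {"MATNR": "000000000000001234", "SPRAS": "D", "MAKTX": "Hydraulikpumpe"},
    {"MATNR": "000000000000005678", "SPRAS": "E", "MAKTX": "Steel sheet 2mm"},
]
# Hypothetical lookup of material type texts (in SAP these sit in a separate text table).
material_type_texts = {"FERT": "Finished product", "ROH": "Raw material"}


def denormalize(materials, descriptions, type_texts, language="E"):
    """Resolve the lookups once, so that queries need no further joins."""
    text_by_material = {
        row["MATNR"]: row["MAKTX"]
        for row in descriptions
        if row["SPRAS"] == language
    }
    return [
        {
            **row,
            # Fall back to the raw code if no text is known.
            "MTART_TEXT": type_texts.get(row["MTART"], row["MTART"]),
            "MAKTX": text_by_material.get(row["MATNR"], ""),
        }
        for row in materials
    ]


flat = denormalize(mara, makt, material_type_texts)
```

Each row in `flat` now carries the material type text and the description next to the cryptic codes, which is the shape you would want to land on the data lake.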

Leveraging existing content, such as the BPL, to identify SAP tables and relevant fields

BPL is short for “Business Process Library”, a collection of various types of content that accelerates the identification of data and data sources and helps to implement such integration scenarios rapidly. BPL content includes, for example, pre-defined data models (including data extraction flows for various scenarios) and content for mapping SAP field names and data to “Friendly Fields”. Finally, the BPL includes Business Functions, which allow for contextualization and the implementation of data lookups without the need for ABAP skills, programming, or even knowledge of table names.
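The idea of mapping SAP field names to friendly fields can be illustrated with a short Python sketch. This is not the BPL content itself: the `FRIENDLY_FIELDS` mapping and the `to_friendly` function are hypothetical, and the sample row only uses a handful of fields from the sales document header table VBAK.

```python
from datetime import datetime

# Hypothetical mapping from technical VBAK field names to friendly names.
FRIENDLY_FIELDS = {
    "VBELN": "sales_document",
    "ERDAT": "created_on",
    "KUNNR": "customer_number",
    "NETWR": "net_value",
    "WAERK": "currency",
}
# Fields that hold SAP dates, which arrive as YYYYMMDD strings.
DATE_FIELDS = {"ERDAT"}


def to_friendly(row, mapping=FRIENDLY_FIELDS):
    """Rename known fields and convert SAP dates to ISO format."""
    result = {}
    for name, value in row.items():
        if name in DATE_FIELDS:
            value = datetime.strptime(value, "%Y%m%d").date().isoformat()
        # Unmapped fields keep their technical name.
        result[mapping.get(name, name)] = value
    return result


sap_row = {
    "VBELN": "0000012345",
    "ERDAT": "20170315",
    "KUNNR": "0000100001",
    "NETWR": 1499.90,
    "WAERK": "EUR",
}
friendly_row = to_friendly(sap_row)
```

Keeping the mapping as data rather than code means that new tables can be covered by extending the dictionary, without touching the conversion logic.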

At Datavard, we help our customers find their way through the jungle of SAP data, both on ERP and on SAP BW. We do this by means of the BPL component in Datavard Glue, our solution for integrating SAP and Big Data.

This screenshot shows Glue’s self-service data request application, where SAP users can trigger requests for data to be made available on their data lakes:

[Screenshot: Datavard Glue, SAP and Hadoop data lake]

The BPL includes a workflow application which allows for self-service data requests for any item of BPL content.

Deciding on the source of SAP data: ERP or BW?

Finally, there is a decision to make about the source of the data. Wouldn’t it be much easier to simply access data through SAP’s Business Warehouse? After all, some of the required data discovery, contextualization, and cleansing is already performed when ERP data is loaded into SAP BW. There are advantages and disadvantages to using either ERP or BW as the source of SAP data. I will discuss these in the next blog post in this series.