Intro to Apache Atlas




We all know that nowadays, organizations do not rely on a single vendor but tend to choose service offerings from multiple vendors. For example, enterprises can adopt databases from Oracle, ETL tools from Informatica, data platforms from Cloudera, and so on, each depending on what technology suits them best from both the functionality and cost perspectives. To still be able to build connected governance across all these heterogeneous services, they need something like Apache Atlas.

Apache Atlas is a tool for metadata management and data governance that helps track and manage changes to the metadata of data sets. It provides a solution for collecting, processing, storing, and maintaining metadata about data objects. It also exposes a rich REST interface for a multitude of operations, for example, creating object types using REST calls.
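As a minimal sketch of the type-creation workflow, the snippet below builds a payload for Atlas's v2 typedefs REST endpoint. The endpoint URL, credentials, type name, and attribute names are assumptions for illustration; adjust them for your cluster.

```python
import json

# Hypothetical Atlas endpoint; adjust host, port, and credentials for your cluster.
ATLAS_TYPEDEFS_URL = "http://localhost:21000/api/atlas/v2/types/typedefs"

def build_entity_typedef(name, attributes, supertype="DataSet"):
    """Build a minimal Atlas v2 entity type definition payload."""
    return {
        "entityDefs": [{
            "name": name,
            "superTypes": [supertype],
            "attributeDefs": [
                {"name": attr, "typeName": "string",
                 "isOptional": True, "cardinality": "SINGLE"}
                for attr in attributes
            ],
        }]
    }

# "customer_table", "owner", and "region" are made-up example names.
payload = build_entity_typedef("customer_table", ["owner", "region"])
print(json.dumps(payload, indent=2))

# To actually register the type (requires a running Atlas server):
# import requests
# requests.post(ATLAS_TYPEDEFS_URL, json=payload, auth=("admin", "admin"))
```

The HTTP call is left commented out so the sketch stays runnable without a live Atlas instance.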

In Atlas, incoming data can be classified as public or private (internal and confidential), and it can also be categorized based on patterns and regular expressions, which can aid in determining whether the data belongs to a specific category. For example, classifications can be built based on numerous patterns of a phone number, zip code, vehicle registration number, and so on.
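To make the pattern-based idea concrete, here is a small sketch of regex-driven categorization. The patterns below are simplified US-style phone and zip formats chosen purely for illustration; a real deployment would define these as Atlas classifications and apply them during ingest.

```python
import re

# Illustrative patterns only, not a complete set of real-world formats.
PATTERNS = {
    "us_phone": re.compile(r"^\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
    "us_zip": re.compile(r"^\d{5}(-\d{4})?$"),
}

def classify(value: str):
    """Return the names of all patterns a value matches."""
    return [name for name, pat in PATTERNS.items() if pat.match(value)]

print(classify("555-867-5309"))  # ['us_phone']
print(classify("90210"))         # ['us_zip']
```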

Atlas is also able to handle metadata about object relationships, thus providing powerful lineage capabilities as well. Data lineage illustrates the origin, movement, transformation, and destination of data. It describes and depicts the life cycle of data: where the data originated, how it progressed through the system(s), what transformations it went through, and where it finally settled. As seen in the picture below, data lineage essentially gives a map of the data journey that includes all steps along the route.


Figure 1: Showing data lineage of a file

In the example above, the employee data is loaded from a Hive database and is then segregated based on location, i.e., us_employees and uk_employees. Once the data is segregated, two different views are created, one for each of these two groups. All of this is shown graphically in the data lineage diagram and can be easily interpreted. Understanding the origin of data sources is beneficial for various reasons:

  1. Assessing the dependability of data based on its provenance.
  2. Recognizing and fixing error causes.
  3. Recognizing improper data assumptions which may affect the analysis.
  4. Maintaining audit trails for data governance and regulatory compliance.
  5. Ensuring that data transfers are secure and unaffected by changes.
  6. Identifying and minimizing data redundancy in order to streamline processes and cut costs.
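Lineage like the employee example above can be retrieved programmatically from Atlas's `/api/atlas/v2/lineage/{guid}` endpoint, whose response maps entity GUIDs to headers (`guidEntityMap`) and lists directed edges (`relations`). The GUIDs and table names below are made up for illustration; the walk shows how a view can be traced back to its source table.

```python
# Illustrative response shaped like Atlas's v2 lineage API; the GUIDs
# and display names are invented for this example.
lineage = {
    "baseEntityId": "guid-3",
    "guidEntityMap": {
        "guid-1": {"displayText": "employees"},
        "guid-2": {"displayText": "us_employees"},
        "guid-3": {"displayText": "us_employees_view"},
    },
    "relations": [
        {"fromEntityId": "guid-1", "toEntityId": "guid-2"},
        {"fromEntityId": "guid-2", "toEntityId": "guid-3"},
    ],
}

def upstream_chain(lineage, guid):
    """Walk relations backwards from an entity to its ultimate source."""
    parents = {r["toEntityId"]: r["fromEntityId"] for r in lineage["relations"]}
    names = [lineage["guidEntityMap"][guid]["displayText"]]
    while guid in parents:
        guid = parents[guid]
        names.append(lineage["guidEntityMap"][guid]["displayText"])
    return list(reversed(names))

print(upstream_chain(lineage, lineage["baseEntityId"]))
# ['employees', 'us_employees', 'us_employees_view']
```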

Apache Atlas Architecture

Atlas components are classified into four major categories: Core, Integration, Metadata Sources, and Applications (Apps).


Figure 2: Atlas High-Level Architecture — Overview

Atlas Core - The central component, which interfaces with the backend layer and feeds metadata and lineage information into the Atlas database. Atlas stores information in the backend using the HBase database and builds indexes using Apache Solr. The Core component consists primarily of a Type System, which allows users to build and manage types and entities; a Graph Engine, which handles relationships between metadata objects; and Ingest/Export, which adds metadata and raises an event whenever metadata changes.

Integration - The layer users utilize to connect with Atlas. Atlas users can manage metadata in two ways:

  1. API: Atlas's entire functionality is exposed to end users via a REST API, which allows types and entities to be created, changed, and deleted.
  2. Messaging: In addition to the API, users can interact with Atlas via a Kafka-based messaging interface.
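As a sketch of the REST route for managing entities, the snippet below builds a minimal `hive_table` entity payload for Atlas's v2 entity endpoint. The endpoint URL, credentials, and the `qualifiedName` value are assumptions for illustration; the Kafka route publishes similar JSON notifications instead of calling HTTP directly.

```python
import json

# Hypothetical endpoint; adjust for your cluster.
ATLAS_ENTITY_URL = "http://localhost:21000/api/atlas/v2/entity"

# "employees" and "default.employees@cluster1" are made-up example values.
entity = {
    "entity": {
        "typeName": "hive_table",
        "attributes": {
            "name": "employees",
            "qualifiedName": "default.employees@cluster1",
        },
    }
}

print(json.dumps(entity, indent=2))

# With a running Atlas server:
# import requests
# requests.post(ATLAS_ENTITY_URL, json=entity, auth=("admin", "admin"))
```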

Metadata Sources - This layer contains the data sources that are supported by default in Apache Atlas. This means that if the user has a data source of any such type, they can begin recording its metadata in Atlas right away via REST APIs. Atlas natively supports the following data sources: HBase, Hive, Sqoop, Storm, and Kafka.
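For the Hive source, for example, metadata capture is typically enabled by registering Atlas's Hive hook as a post-execution hook and pointing it at the Atlas-managed Kafka brokers. The hostnames below are placeholders; a sketch, assuming a Hadoop-style deployment:

```properties
# hive-site.xml equivalent property: run the Atlas hook after each Hive query
hive.exec.post.hooks=org.apache.atlas.hive.hook.HiveHook

# atlas-application.properties (placeholder hosts, adjust for your cluster)
atlas.rest.address=http://atlas-host:21000
atlas.kafka.bootstrap.servers=kafka-host:9092
```

With this in place, Hive DDL and query operations emit notifications that Atlas consumes to keep its metadata and lineage up to date.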