AWS Cloud Data Analytics Guide

Software
Cloud Computing
Data Science
Published

March 13, 2024

What is Data Analytics

There are a few primary definitions:

  • Analysis is the detailed examination of something in order to understand its nature or determine its essential features and properties.
  • Data Analysis is the process of compiling, processing, and analyzing data, and using the extracted properties to make a business decision.
    • Meta-analysis is, in contrast, analysis performed by collecting, processing, and analyzing existing research studies.
  • Analytics is the systematic analysis of data (i.e., a framework for how to perform data analysis).
    • It tells you what to look for.
    • It tells you what step to take next.
    • It ensures the next step is scientifically valid and meaningful, so that logical conclusions can be drawn.
  • Data Analytics is the specific analytical process (i.e., one particular choice of framework) being applied.

To draw a clearer distinction:

  • Data Analysis is analyzing data to derive meaningful insights from it. It works on a single dataset.
  • Data Analytics is broader: it applies multiple data analysis techniques to different datasets, explores their relationships, and finally provides some business value.

Thus, an effective data analytics solution should combine 3 main features:

  • Scalability: Analyze small or vast data sets of different types.
  • Speed: Analysis performed in near real time, keeping up as new data arrives.
  • Goal: Analysis should be good enough to derive business decisions from and yield high-value returns.

Thus, a data analytics solution requires 4 components.

  • Ingest or Collect the data in various forms.
  • Store the data as necessary.
  • Process or analyze the data in fast speed.
  • Consume the analyzed or processed data in terms of query / dashboard / reports to provide business value or insights.
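The four stages above can be sketched as a toy pipeline. All names and data here are illustrative, not AWS APIs; in a real solution each stage would be backed by a managed service.

```python
# Toy four-stage analytics pipeline: ingest -> store -> process -> consume.

def ingest(raw_records):
    # Collect data in various forms (here: raw "item,amount" strings).
    return [r.strip() for r in raw_records if r.strip()]

def store(records, datastore):
    # Persist the ingested records (an in-memory list standing in for real storage).
    datastore.extend(records)
    return datastore

def process(datastore):
    # Analyze: parse each "item,amount" record and total the amounts per item.
    totals = {}
    for record in datastore:
        item, amount = record.split(",")
        totals[item] = totals.get(item, 0) + int(amount)
    return totals

def consume(totals):
    # Consume: render a tiny "report" for decision making.
    return [f"{item}: {total}" for item, total in sorted(totals.items())]

raw = ["books,3", "pens,5", "books,2", "  "]
report = consume(process(store(ingest(raw), [])))
print(report)  # ['books: 5', 'pens: 5']
```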

Types of Data Analytics

Data Analytics comes in 4 different forms.

  • Descriptive - The exploratory data analysis part; it performs summarization using charts, summary statistics, etc. It answers the question “What happened?”
  • Diagnostic - It aims to answer the question “Why did it happen?”. It looks at past historical data, compares it to find patterns and disparities, and performs correlation analysis.
  • Predictive - It aims to answer “What might happen in the future?”. Typically, it uses time series analysis to discover historical trends and predict future ones.
  • Prescriptive - It recommends actionable insights to stakeholders based on the predictive analytics results; it answers questions like “What should we do to make the future look better?”
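Descriptive analytics, the simplest of the four, can be illustrated with Python's standard statistics module (the weekly sales figures here are made up):

```python
import statistics

# Descriptive analytics: summarize "what happened" in a (made-up) weekly sales series.
weekly_sales = [120, 135, 128, 150, 142, 160, 155]

summary = {
    "mean": statistics.mean(weekly_sales),      # central tendency
    "median": statistics.median(weekly_sales),  # robust middle value
    "stdev": statistics.stdev(weekly_sales),    # spread of the week's sales
    "min": min(weekly_sales),
    "max": max(weekly_sales),
}
print(summary)
```

Diagnostic analytics would go a step further, e.g. correlating this series against marketing spend to explain why the numbers moved.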

The Five V’s of Big Data

As businesses begin to implement data analytics solutions, challenges arise. These challenges are based on the characteristics of the data and the analytics required for the use case. They come in various forms, commonly described as the five Vs.

  • Volume: The data could be extremely large.
  • Variety: The data may come in various forms such as structured, semi-structured or unstructured form, from different sources.
  • Velocity: The data must be processed fast, with low latency, in near real time.
  • Veracity: The data must be validated for accuracy, and any outliers must be removed or handled properly. The solution should be able to fix errors where possible.
  • Value: The processed and analyzed data must be able to provide business insights and value for decision making.

Most machine learning systems focus only on the veracity and value parts. But on AWS, we should be able to provide a comprehensive treatment of all five using cloud machinery.

Not all organizations experience challenges in every area. Some organizations struggle with ingesting large volumes of data rapidly. Others struggle with processing massive volumes of data to produce new predictive insights. Still others have users who need to perform detailed data analysis on the fly over enormous data sets. Before beginning your data analytics solution, you must first check which of these 5 Vs are present in the business problem and design your solution accordingly.

Volume

One of the first things we require for performing data analysis is storage. Global data creation is projected to grow to 180 zettabytes (1,000 GB = 1 terabyte, 1,000 TB = 1 petabyte, 1,000 PB = 1 exabyte, 1,000 EB = 1 zettabyte) by 2025.
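The unit ladder can be checked with a few lines (using decimal, not binary, prefixes):

```python
# Decimal storage units: each step up the ladder is a factor of 1,000.
GB = 10**9
TB = 1_000 * GB   # terabyte
PB = 1_000 * TB   # petabyte
EB = 1_000 * PB   # exabyte
ZB = 1_000 * EB   # zettabyte

# 180 ZB expressed in gigabytes:
print(180 * ZB // GB)  # 180000000000000
```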

There are usually 3 kinds of data you will have in terms of storage burden.

  • Transactional data - Very important data, like user details, customer details, purchases, etc.
  • Temporary data - Moves you make in a game, website scrolling data, browser cache, etc.
  • Objects - Emails, text messages, social media content, videos, etc., which you cannot store transactionally and which require object-level storage.

There are mainly 3 types of data in terms of storage schema.

  • Structured data is organized and stored in the form of rows and columns of tables.
    • Examples are relational databases.
  • Semi-structured data is like key-value pairs, with proper groupings / tags inside a file.
    • Examples include CSV, XML, JSON, etc.
  • Unstructured data is data without a specific consistent structure.
    • Logs files
    • Text files
    • Audio or video, etc.

Most of the data present in business is unstructured.

AWS S3

The most popular, versatile solution for storage in AWS is Simple Storage Service (S3). AWS S3 is an object storage service with a file-system-like interface. Basically, it is like a key-value store:

  • Keys are the paths of the files / objects.
  • Values are the object data itself.

Every PUT call to an existing key replaces the entire object at once; there are no block-level updates.

S3 is highly available and highly durable storage.

  • It has unlimited scalability.
  • It is natively online, accessed over HTTP requests, so multiple machines can access the data simultaneously.
  • AWS provides built in encryption security.
  • 99.999999999% (11 nines) durability.

In AWS S3, there are a few concepts.

  • Buckets are the first concept. A bucket is like a system disk. Each bucket may contain different objects and be used for different purposes. The access pattern of the objects is configured at the bucket level.
  • Each object stored inside a bucket has associated metadata. The metadata is simply the prefix (the folder path) in the bucket, the name of the object (the object key), and a version number.

An object key is the unique identifier for an object in a bucket. Because the combination of a bucket, key, and version ID uniquely identifies each object, you can think of Amazon S3 as a basic data map between “bucket + key + version” and the object itself. Every object in Amazon S3 can be uniquely addressed through the combination of the web service endpoint, bucket name, key, and (optionally) version.
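This "bucket + key + version" data map can be modeled as a nested dictionary. The class below is a toy in-memory stand-in, not the S3 API (real code would use an SDK such as boto3); it only illustrates that each PUT writes a whole new object version with no partial updates.

```python
# Toy model of S3's data map: (bucket, key, version) -> object bytes.
class ToyObjectStore:
    def __init__(self):
        self._data = {}  # {bucket: {key: [version1_bytes, version2_bytes, ...]}}

    def put(self, bucket, key, body: bytes) -> int:
        versions = self._data.setdefault(bucket, {}).setdefault(key, [])
        versions.append(body)     # whole-object write creates a new version
        return len(versions)      # simplified 1-based version id

    def get(self, bucket, key, version=None) -> bytes:
        versions = self._data[bucket][key]
        return versions[-1] if version is None else versions[version - 1]

store = ToyObjectStore()
store.put("reports", "2024/03/sales.csv", b"item,amount\nbooks,3\n")
v2 = store.put("reports", "2024/03/sales.csv", b"item,amount\nbooks,5\n")
print(v2)                                            # 2
print(store.get("reports", "2024/03/sales.csv"))     # latest version
print(store.get("reports", "2024/03/sales.csv", 1))  # original version
```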

Using S3 provides many benefits; a few are:

  • It makes the storage decoupled from the processing / compute nodes.
  • It is a centralized place for all of your data.
  • AWS provides built in integration with clusterless and serverless computing.
  • S3 has a standard API with which you can access objects via simple network requests.
  • Built in security and encryption.

Types of Data Stores

In Data Analytics literature, there are a few types of data stores based on their purpose.

  • Data Lake is the storage layer where you can store unstructured, structured, or semi-structured data for almost everything. Basically, all of your business applications dump their data into a data lake. Starting from the data lake, we add ingestion and other processing logic to convert this data into something useful.
    • A data lake is a centralized repository that allows you to store structured, semistructured, and unstructured data at any scale.
    • It gives you a single source of truth for any data.
    • The data lake should also contain an index or tags for the data, producing a catalogue for all data that exists in the data lake.
    • To implement a data lake, one can use AWS S3.
      • Integration with AWS Glue can provide the metadata and cataloguing service for all data present in S3.
      • One can also use AWS Lake Formation to set up a secure data lake in days. It also performs analysis using machine learning to understand the structure / format of the data and create schema or field descriptions for cataloguing purposes.

Data lakes are an increasingly popular way to store and analyze both structured and unstructured data. If you want to build your own custom Amazon S3 data lake, AWS Glue can make all your data immediately available for analytics without moving the data.

  • Data Warehouse is a central repository containing all your structured data from many different sources. This data is processed and ready to be ingested into business reporting tools for decision making.
    • The data is transformed, cleaned, aggregated, and prepared beforehand using data analytics tools.
    • The data here is structured, using a relational database. Hence, before storing the data, one must define the schema and the constraints on the stored data.
    • The data is kept in a form optimized for read operations, delivering query results at blazing speeds to thousands of users concurrently.
    • You can use AWS Redshift for a data warehouse solution in AWS.
      • It is up to 10x faster than other existing data warehouse solutions.
      • Easy to set up, deploy, and manage.
      • AWS provides built-in security.

  • Data Mart is a subset of a data warehouse specific to a particular department or part of your organization. It is a restricted-access subset of your data warehouse. Small transformations are allowed when creating a data mart.

| Characteristics | Data Warehouse | Data Lake |
| --- | --- | --- |
| Data | Relational data from transactional systems, operational databases, and line-of-business applications | Relational and non-relational data from IoT devices, websites, mobile apps, social media, and enterprise applications |
| Schema | Defined before loading (schema-on-write) | Applied during analysis (schema-on-read) |
| Price / Performance | Fastest query performance using higher-cost storage | Improving query performance using low-cost storage |
| Data Quality | Highly curated, serving as the trusted single source of truth | Any data, raw or curated |
| Users | Business analysts | Data scientists, data engineers, and analysts (with curated data) |
| Analytics | Batch reporting, BI, dashboards, and visualizations | Machine learning, predictive analytics, data exploration, and profiling |
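The schema distinction above is the crux of the comparison. It can be demonstrated with sqlite3 (schema-on-write, warehouse style) versus raw JSON lines (schema-on-read, lake style); the data is illustrative.

```python
import json
import sqlite3

# Schema-on-write (warehouse style): the table schema is declared before
# loading, and a load fails if the data does not match it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT NOT NULL, amount INTEGER NOT NULL)")
conn.execute("INSERT INTO sales VALUES (?, ?)", ("books", 3))
conn.commit()

# Schema-on-read (lake style): raw records are stored as-is, and structure is
# imposed only at analysis time.
raw_lines = [
    '{"item": "books", "amount": 3}',
    '{"item": "pens"}',              # a missing field is fine at ingest time
]
parsed = [json.loads(line) for line in raw_lines]
amounts = [r.get("amount", 0) for r in parsed]  # schema applied during analysis
print(sum(amounts))  # 3
```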

Amazon EMRFS

Amazon EMR provides an alternative to HDFS: the EMR File System (EMRFS). EMRFS can help ensure that there is a persistent “source of truth” for HDFS data stored in Amazon S3. When implementing EMRFS, there is no need to copy data into the cluster before transforming and analyzing the data as with HDFS. EMRFS can catalogue data within a data lake on Amazon S3. The time that is saved by eliminating the copy step can dramatically improve the performance of the cluster.

Velocity

When businesses need rapid insights from the data collected, they have a velocity requirement.

Data Processing in the context of the velocity problem means two things.

  • Data collection must be rapid, so that systems can ingest high volumes of data or data streams quickly.
  • Data processing must be fast enough to provide insights at the required level of latency.

Types of Data Processing

There are 2 major types of data processing.

  • Batch processing - Refers to systems where data collection is done separately from the data analysis tasks. Only once a certain amount of data has been collected do the data analysis tasks kick in.
    • Scheduled batch processing - Runs on a specific regular schedule and has predictable workloads.
    • Periodic batch processing - Runs only when a certain amount of data has been collected, which may not be at a regular interval, so the workload is unpredictable.
  • Stream processing - Refers to systems where data collection and analysis are tightly coupled. The insights must be delivered very quickly as the data is consumed.
    • Real-time - Insights must be delivered within milliseconds, as in autonomous cars.
    • Near real-time - Insights must be delivered within minutes.

|  | Batch Data Processing | Stream Data Processing |
| --- | --- | --- |
| Data scope | Processes all or most of the data in a dataset | Processes data in a rolling time window or only the most recent records |
| Data size | Large batches of data | Individual records or micro-batches of a few records |
| Latency | Minutes to hours | Seconds or milliseconds |
| Analysis | Complex analytics | Simple response functions, aggregates, rolling metrics |
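The data-scope difference — whole dataset versus rolling window — can be sketched with a deque whose maxlen discards old records automatically:

```python
from collections import deque

# Batch processing: operate on the whole dataset at once.
def batch_average(dataset):
    return sum(dataset) / len(dataset)

# Stream processing: keep only a rolling window of the most recent records
# and update the metric as each new record arrives.
def stream_averages(records, window_size=3):
    window = deque(maxlen=window_size)   # old records fall out automatically
    for record in records:
        window.append(record)
        yield sum(window) / len(window)  # rolling average over the window

data = [10, 20, 30, 40, 50]
print(batch_average(data))           # 30.0
print(list(stream_averages(data)))   # [10.0, 15.0, 20.0, 30.0, 40.0]
```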

Distributed Processing

There are many distributed processing frameworks. The most popular ones are

  • Apache Hadoop
  • Apache Spark

Apache Hadoop

Apache Hadoop is a distributed computing system designed on the principle of delegating the data processing workload to several servers, with workload configuration managed by a single master server. Each of these servers is called a node.

At its core, Hadoop implements a specialized file system called the Hadoop Distributed File System (HDFS), which stores data across multiple nodes, replicating it redundantly to make it failsafe.

In Hadoop, most of the servers are configured as DataNodes; these process and store the data. A few nodes are NameNodes; these are where the metadata management happens, storing the map of where the actual data resides.

A client contacts the NameNode for file metadata or file modifications, but performs the I/O of actual file updates against the DataNodes.

Let’s say a client issues a write request to the NameNode. The NameNode performs initial checks to make sure the file does not already exist and that the client has the required permissions. Then it returns the locations where each block of the data should be written.

For each block, the client performs I/O directly against the assigned DataNode.

As soon as the client finishes writing a data block, that DataNode starts copying the block to another DataNode for redundancy. Note that the client is not involved anymore; the copying happens internally.

The NameNode maintains the mapping of all the DataNodes on which each particular data block resides.
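The NameNode's block map can be sketched as a toy in-memory model. This is not the HDFS API; the block size, replication factor, and node names are illustrative (real HDFS defaults to 128 MB blocks and 3 replicas).

```python
import itertools

# Toy NameNode: maps each file to its blocks, and each block to the DataNodes
# holding a replica.
DATANODES = ["dn1", "dn2", "dn3", "dn4"]
BLOCK_SIZE = 4      # bytes; tiny, for illustration only
REPLICATION = 2     # replicas per block

def write_file(name, data: bytes, block_map: dict) -> None:
    node_cycle = itertools.cycle(DATANODES)
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        # Assign REPLICATION DataNodes to hold this block's replicas.
        replicas = [next(node_cycle) for _ in range(REPLICATION)]
        blocks.append({"data": data[i:i + BLOCK_SIZE], "replicas": replicas})
    block_map[name] = blocks

block_map = {}
write_file("logs.txt", b"abcdefghij", block_map)
print(len(block_map["logs.txt"]))            # 3 blocks (4 + 4 + 2 bytes)
print(block_map["logs.txt"][0]["replicas"])  # ['dn1', 'dn2']
```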

Hadoop comprises 4 modules.

  • HDFS - Hadoop’s native file system.
  • Yet Another Resource Negotiator (YARN) - a module for managing cluster resources and scheduling tasks.
  • Hadoop MapReduce - The main computing module, which allows programs to break large data processing tasks into smaller ones and run them in parallel on multiple servers.
  • Hadoop Common or Hadoop Core - The core Java libraries shared by the other modules.
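The MapReduce model can be illustrated with the classic word count, written here in plain Python rather than the Hadoop Java API: a map phase emits (word, 1) pairs, a shuffle groups pairs by key, and a reduce phase sums each group.

```python
from collections import defaultdict

# Classic MapReduce word count, sketched in plain Python (not the Hadoop API).

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values (here: sum the counts).
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big insights", "big value"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 1, 'insights': 1, 'value': 1}
```

In a real cluster, the map and reduce calls run in parallel on different DataNodes; only the shuffle moves data between them.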

Hadoop provides the following features:

  • It has built-in security and runs encryption on data in transit and at rest.
  • The MapReduce module supports only relatively simple tasks; complex machine learning workloads are a poor fit.
  • Hadoop uses external storage, so it integrates well with network-based storage like AWS S3. This is provided in AWS EMR (Elastic MapReduce).
  • Since Hadoop relies on disk and network access, it is usually slow, but capable of scaling to very large amounts of data.
  • Hadoop is very affordable for processing big data.
  • Hadoop is used mostly for batch processing.

Apache Spark

Spark works on the principle of processing data in local RAM instead of using network storage. Spark does not provide a native file system; you can integrate it with any distributed file system.

Spark has the following components:

  • Spark Core contains basic functionality like memory management, task scheduling, etc.
  • Spark SQL allows you to query structured data in Spark using SQL.
  • Spark Streaming and Structured Streaming allow Spark to process streaming data efficiently in near real time by separating the data into tiny continuous blocks.
  • Machine Learning Library (MLlib) provides built-in machine learning algorithms implemented in a distributed fashion.
  • GraphX allows you to analyze and visualize graph-structured data.
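The "tiny continuous blocks" idea behind Spark's streaming components is micro-batching. It can be sketched in plain Python (a real job would use the PySpark Structured Streaming API; this only shows the concept):

```python
# Micro-batching: chop an unbounded stream into small fixed-size batches and
# process each batch with ordinary batch-style logic.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:              # flush the final partial batch
        yield batch

def process_batch(batch):
    return sum(batch)      # any batch computation works per micro-batch

stream = iter([3, 1, 4, 1, 5, 9, 2])
results = [process_batch(b) for b in micro_batches(stream, batch_size=3)]
print(results)  # [8, 15, 2]
```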

Following are a few features:

  • Spark stores and processes data in memory.
  • It is usually more expensive than Hadoop.
  • Scaling is harder, as it often requires vertical scaling by adding more RAM.
  • It has built-in machine learning algorithms.
  • Spark has only basic security; it is not opinionated about that.
  • It processes data very fast, making it ideal for real-time or near real-time processing workloads.

Batch Data Processing

For batch data processing, we can use AWS Elastic MapReduce (AWS EMR) with Hadoop and Apache Spark integrated on top of it.

So, this could be one architecture for a data analytics solution.

The architecture diagram below depicts the same data flow as above but uses AWS Glue for aggregated ETL (heavy lifting, consolidated transformation, and loading engine). AWS Glue is a fully managed service, as opposed to Amazon EMR, which requires management and configuration of all of the components within the service.