StreamTech Knowledge

Data Engineering on Microsoft Azure

Course Code

DP-203T00

Level

Intermediate

Duration

4 Days

Revision

DP-203T00

Overview

This course teaches data engineers how to design and implement data solutions on Azure. Students learn how to ingest, transform, serve, and analyze data using Azure services such as Azure Data Lake Storage Gen2, Azure Synapse Analytics, Azure Databricks, Azure Data Factory, and Azure Stream Analytics. The course emphasizes best practices for building scalable, secure, and reliable data solutions.

Audience

Data Engineers, Data Architects, and individuals responsible for building and maintaining data solutions on Azure.

Prerequisites

  • Basic understanding of data concepts and data processing.
  • Familiarity with cloud computing concepts.
  • Experience with at least one programming language (e.g., Python, Scala, SQL).
  • Basic understanding of Azure services is beneficial but not strictly required.

Outline

Module 1: Introduction to Azure Data Engineering

  • Overview of Azure Data Engineering: Covers the role of a data engineer, key concepts, and the Azure data engineering landscape.
  • Core Data Engineering Concepts: Discusses data ingestion, transformation, storage, and serving, along with relevant design patterns.
  • Introduction to Azure Data Services: Briefly introduces the key Azure services used in data engineering solutions, such as Azure Data Lake Storage Gen2, Azure Synapse Analytics, Azure Databricks, Azure Data Factory, and Azure Stream Analytics.
  • Setting up the Azure Environment: Guides students through setting up an Azure subscription and configuring the necessary resources for the course labs.

Module 2: Working with Azure Data Lake Storage Gen2

  • Introduction to Azure Data Lake Storage Gen2: Explains the benefits and features of ADLS Gen2, including its hierarchical namespace and integration with other Azure services.
  • Securing Azure Data Lake Storage Gen2: Covers access control mechanisms, including Azure RBAC and ACLs, to secure data within the data lake.
  • Developing with Azure Data Lake Storage Gen2: Demonstrates how to programmatically interact with ADLS Gen2 using different SDKs and APIs.
  • Working with Data in Azure Data Lake Storage Gen2: Explores techniques for storing, retrieving, and processing various data formats (e.g., Parquet, Avro, CSV) within ADLS Gen2.

Module 3: Building Batch Data Pipelines with Azure Data Factory

  • Introduction to Azure Data Factory: Explains the capabilities of Azure Data Factory for building and orchestrating data pipelines.
  • Creating Pipelines and Activities: Covers how to create pipelines, define activities (e.g., Copy Data, Data Flow), and link them together.
  • Working with Datasets and Linked Services: Explains how to define datasets representing data sources and sinks, and how to create linked services to connect to those sources.
  • Monitoring and Managing Pipelines: Demonstrates how to monitor pipeline execution, troubleshoot errors, and manage pipeline schedules.
  • Data Flows in Azure Data Factory: Takes a deep dive into using Data Flows to build data transformation logic visually, without writing code.

Module 4: Building Real-Time Data Pipelines with Azure Stream Analytics

  • Introduction to Azure Stream Analytics: Explains the capabilities of Azure Stream Analytics for processing real-time streaming data.
  • Developing Stream Analytics Queries: Covers the Stream Analytics Query Language (SAQL) for filtering, aggregating, and transforming streaming data.
  • Ingesting Data into Stream Analytics: Demonstrates how to connect Stream Analytics to various data sources, such as Event Hubs and IoT Hub.
  • Outputting Data from Stream Analytics: Explains how to send processed streaming data to different destinations, such as Azure SQL Database and Power BI.
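
As a rough illustration of the filtering and aggregating that a streaming query performs, the following Python sketch groups events into fixed, non-overlapping (tumbling) time windows and averages a reading per device. It is plain Python, not the Stream Analytics query language, and the event fields (device, ts, temp), the function name, the threshold, and the 10-second window are all hypothetical choices made for this example.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds=10, min_temp=0.0):
    """Average 'temp' per (window start, device) over tumbling windows.

    Events with a temperature below min_temp are filtered out first.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for event in events:
        if event["temp"] < min_temp:
            continue  # filter step
        # Each timestamp falls into exactly one fixed-size window.
        window_start = (event["ts"] // window_seconds) * window_seconds
        bucket = sums[(window_start, event["device"])]
        bucket[0] += event["temp"]
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


# Hypothetical sample events; 'ts' is seconds since the start of the stream.
events = [
    {"device": "a", "ts": 1, "temp": 20.0},
    {"device": "a", "ts": 4, "temp": 22.0},
    {"device": "b", "ts": 7, "temp": 30.0},
    {"device": "a", "ts": 12, "temp": 24.0},
    {"device": "b", "ts": 15, "temp": -5.0},  # dropped by the filter
]
result = tumbling_window_avg(events)
print(result)
```

A real streaming job processes events continuously and must also handle late or out-of-order arrivals; this sketch works on a finished list only to show the windowing idea.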

Module 5: Building the Data Warehouse with Azure Synapse Analytics

  • Introduction to Azure Synapse Analytics: Explains the architecture and components of Azure Synapse Analytics, including dedicated SQL pools and serverless SQL pools.
  • Developing with Dedicated SQL Pools: Covers how to create and manage dedicated SQL pools, load data into them, and query the data using T-SQL.
  • Working with Serverless SQL Pools: Explains how to use serverless SQL pools to query data in various formats stored in Azure Data Lake Storage Gen2 without provisioning dedicated resources.
  • Data Modeling for Azure Synapse Analytics: Discusses best practices for designing data models for analytical workloads in Azure Synapse Analytics.

Module 6: Integrating Data with Azure Databricks

  • Introduction to Azure Databricks: Explains the capabilities of Azure Databricks for data processing and analysis using Apache Spark.
  • Working with Spark in Azure Databricks: Covers how to use the Spark APIs from languages such as Python, Scala, and SQL to process and transform data.
  • Data Engineering with Databricks: Demonstrates how to use Databricks for various data engineering tasks, such as data cleansing, transformation, and enrichment.
  • Integrating Databricks with other Azure Services: Explains how to connect Databricks to other Azure services, such as Azure Data Lake Storage Gen2 and Azure Synapse Analytics.

Module 7: Orchestrating Data Engineering Workloads with Azure Data Factory

  • Advanced Data Factory Concepts: Covers more advanced topics in Azure Data Factory, such as control flow activities, parameters, and variables.
  • Building End-to-End Data Pipelines: Walks through the process of building complete data pipelines that integrate various Azure services.
  • Best Practices for Data Factory Development: Discusses best practices for designing, developing, and deploying data pipelines in Azure Data Factory.
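
The idea behind chaining activities into an end-to-end pipeline can be sketched in plain Python: each activity declares which activities it depends on, and the orchestrator derives a valid run order. This is not the Data Factory SDK or its pipeline definition format; the activity names and the run_order helper are hypothetical and serve only to illustrate dependency-driven ordering.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    """Return activity names in an order that respects their dependencies.

    'dependencies' maps each activity to the set of activities that must
    finish before it can start.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical four-step pipeline: ingest, clean, load, then refresh a report.
pipeline = {
    "ingest_raw": set(),
    "clean": {"ingest_raw"},
    "load_warehouse": {"clean"},
    "refresh_report": {"load_warehouse"},
}
order = run_order(pipeline)
print(order)
```

Declaring dependencies rather than a fixed sequence lets independent activities run in parallel and makes a circular dependency detectable before anything runs (TopologicalSorter raises CycleError in that case).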

Module 8: Monitoring and Managing Azure Data Solutions

  • Monitoring Data Pipelines: Covers techniques for monitoring the performance and health of data pipelines using Azure Monitor and other tools.
  • Managing Data Security: Discusses best practices for securing data in Azure data solutions, including access control, encryption, and auditing.
  • Troubleshooting Data Solutions: Explains how to troubleshoot common issues that may arise in data engineering solutions.

Module 9: Designing and Implementing Data Solutions on Azure

  • Data Solution Architectures: Discusses common data solution architectures and design patterns on Azure.
  • Designing for Scalability and Performance: Covers best practices for designing scalable and performant data solutions.
  • Designing for Security and Compliance: Explains how to design data solutions that meet security and compliance requirements.