Databricks Data Engineering online training → What is Data Engineering & Role of a Data Engineer

20 Sections
164 Lessons
8 Weeks

Module 1: Data Engineering Fundamentals
9
- 1.1
  What is Data Engineering & Role of a Data Engineer
- 1.2
  OLTP vs OLAP Systems
- 1.3
  Data Warehouse vs Data Lake vs Lakehouse
- 1.4
  Batch Processing vs Stream Processing
- 1.5
  Modern Data Engineering Architecture
- 1.6
  Data Engineering Lifecycle (Ingestion → Storage → Processing → Serving)
- 1.7
  Medallion Architecture (Bronze → Silver → Gold)
- 1.8
  Data Modeling Basics — Star Schema & Snowflake Schema
- 1.9
  File Formats — CSV, JSON, Parquet, Avro, ORC, Delta, Iceberg
Module 2: Databricks Platform Fundamentals
10
- 2.1
  Databricks Workspace Overview
- 2.2
  Databricks Architecture — Control Plane vs Data Plane
- 2.3
  Workspace Components — Notebooks, Clusters, Jobs, Repos
- 2.4
  Creating and Managing Clusters
- 2.5
  Cluster Types — All-Purpose vs Job Clusters
- 2.6
  Databricks Runtime — Standard & ML
- 2.7
  Databricks Lakehouse Platform Overview
- 2.8
  Cluster Policies & Auto-Termination
- 2.9
  Databricks Utilities (dbutils) — File, Secrets, Widgets
- 2.10
  Notebook Collaboration & Magic Commands (%sql, %md, %sh)
Module 3: PySpark Fundamentals
8
- 3.1
  Introduction to Apache Spark
- 3.2
  Spark Architecture — Driver, Executors, Cluster Manager
- 3.3
  SparkSession vs SparkContext
- 3.4
  RDD vs DataFrame vs Dataset
- 3.5
  Lazy Evaluation & DAG (Directed Acyclic Graph)
- 3.6
  Spark Execution Plan — Logical vs Physical Plan
- 3.7
  Transformations vs Actions — Deep Dive
- 3.8
  Reading Data from DBFS (Databricks File System)
Module 4: PySpark DataFrame Operations
10
- 4.1
  DataFrame Transformations — select, filter, withColumn, drop
- 4.2
  Different Ways to Create DataFrames
- 4.3
  Reading Data — CSV, JSON, Parquet, Delta
- 4.4
  Writing Data — Overwrite, Append, Partitioned Writes
- 4.5
  RDD Transformations & Actions
- 4.6
  Schema Definition — StructType & StructField
- 4.7
  InferSchema vs Defined Schema — Best Practices
- 4.8
  Working with Nested, Complex JSON Data & Array Columns
- 4.9
  Spark Date & Window Functions
- 4.10
  Important Functions: explode(), flatten(), struct(), udf()
Module 5: PySpark Data Transformations
7
- 5.1
  Handling Null Values — dropna(), fillna(), coalesce()
- 5.2
  Date and Timestamp Functions
- 5.3
  Aggregations — groupBy, sum, avg
- 5.4
  Type Casting & Data Type Conversions
- 5.5
  User Defined Functions (UDFs) & Pandas UDFs
- 5.6
  Pivot & Unpivot Operations
- 5.7
  Working with Multiple DataFrames — union, unionByName
Module 6: Joins and Window Functions
7
- 6.1
  Types of Joins — Inner, Left, Right, Full
- 6.2
  Window Functions — row_number, rank, lead, lag
- 6.3
  Cross Join & Self Join Use Cases
- 6.4
  Optimizing Joins: Broadcast Join, Sort-Merge Join
- 6.5
  Handling Duplicate Records After Joins
- 6.6
  dense_rank(), ntile(), percent_rank()
- 6.7
  Running Totals & Moving Averages with Window Functions
Module 7: Delta Lake Fundamentals
9
- 7.1
  What is Delta Lake & Why It Matters
- 7.2
  Delta Operations — Update, Delete, Merge (Upsert)
- 7.3
  Delta Lake Architecture & Transaction Log
- 7.4
  ACID Transactions (Update, Delete)
- 7.5
  Delta Table Creation & Convert Parquet to Delta
- 7.6
  Delta Lake vs Apache Iceberg vs Parquet — Comparison
- 7.7
  Managed vs External Delta Tables: Pros and Cons
- 7.8
  Schema Enforcement vs Schema Evolution
- 7.9
  Writing Idempotent Pipelines with Delta
Module 8: Advanced Delta Lake
9
- 8.1
  Time Travel — Query Historical Versions
- 8.2
  Vacuum — Removing Old Files
- 8.3
  Delta Table Optimization — OPTIMIZE & Z-Ordering
- 8.4
  Change Data Feed (CDF) vs Change Data Capture (CDC)
- 8.5
  Liquid Clustering (Latest Databricks Feature)
- 8.6
  Deletion Vectors for Faster Deletes
- 8.7
  Row-Level Concurrency
- 8.8
  Medallion Architecture with Delta (Bronze → Silver → Gold)
- 8.9
  Auto Loader with Delta — Incremental File Ingestion
Module 9: Data Engineering Pipelines
9
- 9.1
  Batch Data Pipelines
- 9.2
  Incremental Data Processing
- 9.3
  SCD Type 1 & Type 2 Implementation
- 9.4
  ETL vs ELT
- 9.5
  Auto Loader — cloudFiles Source for S3/ADLS
- 9.6
  Watermarking for Late-Arriving Data
- 9.7
  Idempotent & Fault-Tolerant Pipeline Design
- 9.8
  Full Load vs Incremental Load Strategies
- 9.9
  Data Quality Checks in Pipelines
Module 10: Delta Live Tables (DLT)
7
- 10.1
  What is Delta Live Tables
- 10.2
  DLT Pipeline Architecture
- 10.3
  Creating Streaming & Batch DLT Tables
- 10.4
  Triggered vs Continuous Pipeline Modes
- 10.5
  Data Quality with Expectations (@dlt.expect)
- 10.6
  Monitoring DLT Pipeline Runs
- 10.7
  DLT with Medallion Architecture — End to End
Module 11: Databricks SQL
8
- 11.1
  SQL Warehouses — Serverless vs Classic
- 11.2
  Running SQL Queries
- 11.3
  Creating Views & Materialized Views
- 11.4
  Query Optimization
- 11.5
  Databricks SQL Dashboards & Visualisations
- 11.6
  Query History & Query Profile Analysis
- 11.7
  Databricks SQL Alerts
- 11.8
  Connecting BI Tools — Power BI, Tableau to Databricks SQL
Module 12: Spark Performance Optimization
11
- 12.1
  Partitioning Strategy
- 12.2
  Repartition vs Coalesce – When to Use Each
- 12.3
  Broadcast Joins – Different Use Cases
- 12.4
  Caching and Persistence
- 12.5
  Memory Management
- 12.6
  Executor Memory, Driver Memory & Cores: Proper Usage
- 12.7
  Skew Handling — Salting Technique
- 12.8
  Reading Spark UI — Jobs, Stages, Tasks
- 12.9
  Spill to Disk — Causes & Fixes
- 12.10
  File Size Optimization — Small File Problem
- 12.11
  Predicate Pushdown & Column Pruning
Module 13: Structured Streaming
9
- 13.1
  Introduction to Streaming Concepts
- 13.2
  Batch vs Streaming Architecture Differences
- 13.3
  Reading Streaming Data — Kafka, Files
- 13.4
  Writing Streaming Data — Delta, Kafka
- 13.5
  Trigger Modes — Once, Fixed Interval, Continuous
- 13.6
  Stateful vs Stateless Streaming
- 13.7
  Checkpointing & Fault Recovery
- 13.8
  Streaming with Auto Loader
- 13.9
  Practical: Real-Time Order Processing Pipeline
Module 14: Databricks Workflow Orchestration
9
- 14.1
  Databricks Jobs & Scheduling
- 14.2
  Multi-Task Workflows
- 14.3
  Error Handling & Retries
- 14.4
  Job Clusters vs All-Purpose Clusters in Jobs
- 14.5
  Parameterised Jobs with Widgets
- 14.6
  Task Dependencies — Sequential & Parallel
- 14.7
  Email & Webhook Notifications on Job Failure
- 14.8
  Monitoring Jobs via Job Run History
- 14.9
  Integrating with Apache Airflow (Overview)
Module 15: Unity Catalog & Data Governance
9
- 15.1
  What is Unity Catalog
- 15.2
  Data Governance Fundamental Concepts
- 15.3
  Access Control — Table Level & Column Level
- 15.4
  Three-Level Namespace — Catalog → Schema → Table
- 15.5
  Row-Level Security with Row Filters
- 15.6
  Data Lineage Tracking
- 15.7
  Tagging & Data Classification
- 15.8
  Audit Logs in Unity Catalog
- 15.9
  External Locations & Storage Credentials
Module 16: Databricks with Cloud Platforms
7
- 16.1
  Databricks on AWS — S3 Integration, IAM Roles
- 16.2
  Databricks on Azure — ADLS Integration, Service Principals
- 16.3
  Databricks on GCP — Overview
- 16.4
  AWS Glue vs Databricks — When to Use What
- 16.5
  Azure Data Factory + Databricks Integration
- 16.6
  Secrets Management — AWS Secrets Manager / Azure Key Vault
- 16.7
  Mounting Cloud Storage in Databricks
Module 17: Real-Time Data Engineering
7
- 17.1
  Kafka Integration
- 17.2
  Kafka Architecture — Topics, Partitions, Consumer Groups
- 17.3
  Kafka vs Confluent Kafka
- 17.4
  Producing & Consuming Messages from Databricks
- 17.5
  Exactly-Once Semantics with Kafka + Delta
- 17.6
  Practical: Real-Time Clickstream Analytics
- 17.7
  Practical: Connecting CData REST API and NiFi to Databricks
Module 18: CI/CD and Production Deployment
7
- 18.1
  Git Integration with Databricks Repos
- 18.2
  CI/CD Pipelines
- 18.3
  Databricks Asset Bundles (DAB) — Latest Feature
- 18.4
  GitHub Actions + Databricks Workflow
- 18.5
  Environment Management — Dev → QA → Prod
- 18.6
  Logging & Monitoring in Production
- 18.7
  Notebook Testing with Great Expectations
Module 19: End-to-End Data Engineering Projects
5
- 19.1
  Project 1 — Batch Pipeline with Medallion Architecture
- 19.2
  Project 2 — Real-Time Streaming Pipeline with Kafka
- 19.3
  Project 3 — SCD Type 2 Pipeline with Delta Lake
- 19.4
  Project 4 — Delta Live Tables End-to-End Pipeline
- 19.5
  Project 5 — Cloud Integration Project (AWS / Azure)
Module 20: Databricks Certification & Interview Preparation
7
- 20.1
  Practice Questions — Full Mock Tests
- 20.2
  Databricks Certified Data Engineer Associate — Exam Overview
- 20.3
  Interview Tips & Resume Preparation
- 20.4
  Top 50 Databricks Interview Questions & Answers
- 20.5
  Generative AI (GitHub Copilot) for Code Generation
- 20.6
  Claude AI for Code Generation
- 20.7
  LinkedIn Tips for Finding and Getting a Job
