- 20 Sections
- 164 Lessons
- 8 Weeks
- Module 1: Data Engineering Fundamentals (9 lessons)
- 1.1 What is Data Engineering & Role of a Data Engineer
- 1.2 OLTP vs OLAP Systems
- 1.3 Data Warehouse vs Data Lake vs Lakehouse
- 1.4 Batch Processing vs Stream Processing
- 1.5 Modern Data Engineering Architecture
- 1.6 Data Engineering Lifecycle (Ingestion → Storage → Processing → Serving)
- 1.7 Medallion Architecture (Bronze → Silver → Gold)
- 1.8 Data Modeling Basics — Star Schema & Snowflake Schema
- 1.9 File Formats — CSV, JSON, Parquet, Avro, ORC, Delta, Iceberg
- Module 2: Databricks Platform Fundamentals (10 lessons)
- 2.1 Databricks Workspace Overview
- 2.2 Databricks Architecture — Control Plane vs Data Plane
- 2.3 Workspace Components — Notebooks, Clusters, Jobs, Repos
- 2.4 Creating and Managing Clusters
- 2.5 Cluster Types — All-Purpose vs Job Clusters
- 2.6 Databricks Runtime — Standard & ML
- 2.7 Databricks Lakehouse Platform Overview
- 2.8 Cluster Policies & Auto-Termination
- 2.9 Databricks Utilities (dbutils) — File, Secrets, Widgets
- 2.10 Notebook Collaboration & Magic Commands (%sql, %md, %sh)
- Module 3: PySpark Fundamentals (8 lessons)
- 3.1 Introduction to Apache Spark
- 3.2 Spark Architecture — Driver, Executors, Cluster Manager
- 3.3 SparkSession vs SparkContext
- 3.4 RDD vs DataFrame vs Dataset
- 3.5 Lazy Evaluation & DAG (Directed Acyclic Graph)
- 3.6 Spark Execution Plan — Logical vs Physical Plan
- 3.7 Transformations vs Actions — Deep Dive
- 3.8 Reading Data from DBFS (Databricks File System)
- Module 4: PySpark DataFrame Operations (10 lessons)
- 4.1 DataFrame Transformations — select, filter, withColumn, drop
- 4.2 Different Ways to Create DataFrames
- 4.3 Reading Data — CSV, JSON, Parquet, Delta
- 4.4 Writing Data — Overwrite, Append, Partitioned Writes
- 4.5 RDD Transformations & Actions
- 4.6 Schema Definition — StructType & StructField
- 4.7 InferSchema vs Defined Schema — Best Practices
- 4.8 Working with Nested, Complex JSON Data & Array Columns
- 4.9 Spark Date & Window Functions
- 4.10 Important Functions: explode(), flatten(), struct(), udf
- Module 5: PySpark Data Transformations (7 lessons)
- Module 6: Joins and Window Functions (7 lessons)
- 6.1 Types of Joins — Inner, Left, Right, Full
- 6.2 Window Functions — row_number, rank, lead, lag
- 6.3 Cross Join & Self Join Use Cases
- 6.4 Optimizing Joins: Broadcast and Sort-Merge Joins
- 6.5 Handling Duplicate Records After Joins
- 6.6 dense_rank(), ntile(), percent_rank()
- 6.7 Running Totals & Moving Averages with Window Functions
- Module 7: Delta Lake Fundamentals (9 lessons)
- 7.1 What is Delta Lake & Why It Matters
- 7.2 Delta Operations — Update, Delete, Merge (Upsert)
- 7.3 Delta Lake Architecture & Transaction Log
- 7.4 ACID Transactions (Update, Delete)
- 7.5 Delta Table Creation & Convert Parquet to Delta
- 7.6 Delta Lake vs Apache Iceberg vs Parquet — Comparison
- 7.7 Managed vs External Delta Tables: Pros and Cons
- 7.8 Schema Enforcement vs Schema Evolution
- 7.9 Writing Idempotent Pipelines with Delta
- Module 8: Advanced Delta Lake (9 lessons)
- 8.1 Time Travel — Query Historical Versions
- 8.2 Vacuum — Removing Old Files
- 8.3 Delta Table Optimization — OPTIMIZE & Z-Ordering
- 8.4 Change Data Feed (CDF) vs Change Data Capture (CDC)
- 8.5 Liquid Clustering (Latest Databricks Feature)
- 8.6 Deletion Vectors for Faster Deletes
- 8.7 Row-Level Concurrency
- 8.8 Medallion Architecture with Delta (Bronze → Silver → Gold)
- 8.9 Auto Loader with Delta — Incremental File Ingestion
- Module 9: Data Engineering Pipelines (9 lessons)
- 9.1 Batch Data Pipelines
- 9.2 Incremental Data Processing
- 9.3 SCD Type 1 & Type 2 Implementation
- 9.4 ETL vs ELT
- 9.5 Auto Loader — cloudFiles Source for S3/ADLS
- 9.6 Watermarking for Late-Arriving Data
- 9.7 Idempotent & Fault-Tolerant Pipeline Design
- 9.8 Full Load vs Incremental Load Strategies
- 9.9 Data Quality Checks in Pipelines
- Module 10: Delta Live Tables (DLT) (7 lessons)
- Module 11: Databricks SQL (8 lessons)
- 11.1 SQL Warehouses — Serverless vs Classic
- 11.2 Running SQL Queries
- 11.3 Creating Views & Materialized Views
- 11.4 Query Optimization
- 11.5 Databricks SQL Dashboards & Visualisations
- 11.6 Query History & Query Profile Analysis
- 11.7 Databricks SQL Alerts
- 11.8 Connecting BI Tools — Power BI, Tableau to Databricks SQL
- Module 12: Spark Performance Optimization (11 lessons)
- 12.1 Partitioning Strategy
- 12.2 Repartition vs Coalesce – When to Use Each
- 12.3 Broadcast Joins – Use Cases
- 12.4 Caching and Persistence
- 12.5 Memory Management
- 12.6 Executor Memory, Driver Memory, and Cores – Proper Usage
- 12.7 Skew Handling — Salting Technique
- 12.8 Reading Spark UI — Jobs, Stages, Tasks
- 12.9 Spill to Disk — Causes & Fixes
- 12.10 File Size Optimization — Small File Problem
- 12.11 Predicate Pushdown & Column Pruning
- Module 13: Structured Streaming (9 lessons)
- 13.1 Introduction to Streaming Concepts
- 13.2 Batch vs Streaming Architecture Differences
- 13.3 Reading Streaming Data — Kafka, Files
- 13.4 Writing Streaming Data — Delta, Kafka
- 13.5 Trigger Modes — Once, Fixed Interval, Continuous
- 13.6 Stateful vs Stateless Streaming
- 13.7 Checkpointing & Fault Recovery
- 13.8 Streaming with Auto Loader
- 13.9 Practical: Real-Time Order Processing Pipeline
- Module 14: Databricks Workflow Orchestration (9 lessons)
- 14.1 Databricks Jobs & Scheduling
- 14.2 Multi-Task Workflows
- 14.3 Error Handling & Retries
- 14.4 Job Clusters vs All-Purpose Clusters in Jobs
- 14.5 Parameterised Jobs with Widgets
- 14.6 Task Dependencies — Sequential & Parallel
- 14.7 Email & Webhook Notifications on Job Failure
- 14.8 Monitoring Jobs via Job Run History
- 14.9 Integrating with Apache Airflow (Overview)
- Module 15: Unity Catalog & Data Governance (9 lessons)
- 15.1 What is Unity Catalog
- 15.2 Fundamental Data Governance Concepts
- 15.3 Access Control — Table Level & Column Level
- 15.4 Three-Level Namespace — Catalog → Schema → Table
- 15.5 Row-Level Security with Row Filters
- 15.6 Data Lineage Tracking
- 15.7 Tagging & Data Classification
- 15.8 Audit Logs in Unity Catalog
- 15.9 External Locations & Storage Credentials
- Module 16: Databricks with Cloud Platforms (7 lessons)
- 16.1 Databricks on AWS — S3 Integration, IAM Roles
- 16.2 Databricks on Azure — ADLS Integration, Service Principals
- 16.3 Databricks on GCP — Overview
- 16.4 AWS Glue vs Databricks — When to Use What
- 16.5 Azure Data Factory + Databricks Integration
- 16.6 Secrets Management — AWS Secrets Manager / Azure Key Vault
- 16.7 Mounting Cloud Storage in Databricks
- Module 17: Real-Time Data Engineering (7 lessons)
- 17.1 Kafka Integration
- 17.2 Kafka Architecture — Topics, Partitions, Consumer Groups
- 17.3 Apache Kafka vs Confluent Kafka
- 17.4 Producing & Consuming Messages from Databricks
- 17.5 Exactly-Once Semantics with Kafka + Delta
- 17.6 Practical: Real-Time Clickstream Analytics
- 17.7 Practical: Connecting CData REST API and NiFi to Databricks
- Module 18: CI/CD and Production Deployment (7 lessons)
- Module 19: End-to-End Data Engineering Projects (5 lessons)
- Module 20: Databricks Certification & Interview Preparation (7 lessons)
- 20.1 Practice Questions — Full Mock Tests
- 20.2 Databricks Certified Data Engineer Associate — Exam Overview
- 20.3 Interview Tips & Resume Preparation
- 20.4 Top 50 Databricks Interview Questions & Answers
- 20.5 Generative AI (GitHub Copilot) for Code Generation
- 20.6 Claude AI for Code Generation
- 20.7 LinkedIn Tips for Finding and Landing a Job
