A comprehensive data pipeline for a fictional music streaming service, built with AWS services, Apache Airflow, and modern data engineering practices.
| Technology | Purpose |
| --- | --- |
| AWS S3 | Object storage for the data lake |
| AWS Glue | Serverless ETL service |
| AWS Lambda | Serverless compute for data ingestion |
| Amazon Redshift | Data warehouse for analytics |
| Apache Spark | Distributed data processing |
| Apache Iceberg | Table format for data lakes |
| AWS Glue Data Quality | Data validation service |
| dbt | Data transformation tool |
| Apache Airflow | Workflow orchestration platform |
| AWS Step Functions | Serverless workflow service |
| AWS Glue Workflows | Managed ETL workflows |
| Terraform | Infrastructure provisioning |
| AWS CloudFormation | AWS resource templating |
| Docker | Container platform for services |
| Python | Primary programming language |
| SQL | Data query language |
| PySpark | Python API for Apache Spark |
| Git | Version control system |
| GitHub Actions | CI/CD automation |
| Apache Superset | Data visualization platform |
| Amazon CloudWatch | Monitoring and observability |
| Grafana | Metrics visualization |
**Extract Layer**

AWS Glue jobs extract data from API endpoints and RDS databases, storing the raw data in an S3 landing zone with date-based partitioning.
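As a rough sketch of the extraction step, a Glue Python job might land API responses like this. The bucket name, endpoint, and key layout are placeholders, not the project's actual values:

```python
import json
from datetime import datetime, timezone

import boto3
import requests

# Hypothetical names; the real values would come from Glue job arguments.
LANDING_BUCKET = "music-streaming-landing"
API_ENDPOINT = "https://api.example.com/v1/listen-events"


def extract_to_landing_zone() -> str:
    """Pull raw events from the API and land them in S3, partitioned by date."""
    response = requests.get(API_ENDPOINT, timeout=30)
    response.raise_for_status()
    events = response.json()

    # Date-based partitioning: s3://.../listen_events/year=.../month=.../day=.../
    now = datetime.now(timezone.utc)
    key = (
        f"listen_events/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"events_{now:%H%M%S}.json"
    )
    boto3.client("s3").put_object(
        Bucket=LANDING_BUCKET,
        Key=key,
        Body=json.dumps(events).encode("utf-8"),
    )
    return key
```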
**Transform Layer**

Glue ETL jobs transform the data, adding metadata and applying consistent formatting, and store the results in Apache Iceberg tables.
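The Iceberg write might look like the following PySpark sketch. The catalog, table, and column names (including the assumed `ts` timestamp field) are illustrative; in a Glue job the Iceberg catalog is wired up through job parameters rather than hand-built Spark config:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Catalog name is illustrative; Glue supplies the Iceberg catalog settings
# via --datalake-formats iceberg and related job parameters.
spark = (
    SparkSession.builder
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .getOrCreate()
)

raw = spark.read.json("s3://music-streaming-landing/listen_events/")

transformed = (
    raw.withColumn("ingested_at", F.current_timestamp())  # lineage metadata
       .withColumn("event_date", F.to_date("ts"))         # assumes an ISO-8601 'ts' field
)

# Append into an Iceberg table assumed to exist in the Glue catalog.
transformed.writeTo("glue_catalog.curated.listen_events").append()
```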
**Data Quality**

AWS Glue Data Quality validates the data against rule sets that check for completeness, uniqueness, and data type compliance.
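Rule sets are written in Glue's Data Quality Definition Language (DQDL). A hedged example of registering one with boto3, using placeholder table and column names to cover the three checks named above:

```python
import boto3

glue = boto3.client("glue")

# DQDL rules for completeness, uniqueness, and type compliance;
# the table and column names are placeholders, not the project's real ones.
ruleset = """
Rules = [
    IsComplete "user_id",
    IsUnique "session_id",
    ColumnDataType "event_date" = "DATE",
    Completeness "song_id" > 0.95
]
"""

glue.create_data_quality_ruleset(
    Name="listen_events_quality",
    Ruleset=ruleset,
    TargetTable={
        "TableName": "listen_events",
        "DatabaseName": "curated",
    },
)
```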
**Serving Layer**

Amazon Redshift hosts the dimensional model (star schema) created with dbt, serving as the data warehouse for analytics.
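One way to hit the serving layer from Python is the Redshift Data API; the cluster, database, user, and table names below are illustrative only:

```python
import boto3

client = boto3.client("redshift-data")

# Illustrative identifiers; the real ones would come from Terraform outputs.
resp = client.execute_statement(
    ClusterIdentifier="music-streaming-dwh",
    Database="analytics",
    DbUser="analyst",
    Sql="SELECT COUNT(*) FROM fact_sessions;",
)

# Returns a statement id; poll describe_statement / get_statement_result for rows.
print(resp["Id"])
```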
Apache Airflow orchestrates the pipeline through DAGs that define the entire data flow from extraction to analytics.
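A minimal sketch of what one of these DAGs could look like, assuming the Glue jobs described above and hypothetical job names:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

# Job names are illustrative; each operator triggers one Glue job
# from the extract / transform / data quality layers.
with DAG(
    dag_id="songs_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = GlueJobOperator(task_id="extract_songs", job_name="extract_songs_job")
    transform = GlueJobOperator(task_id="transform_songs", job_name="transform_songs_job")
    quality = GlueJobOperator(task_id="data_quality", job_name="songs_dq_job")

    extract >> transform >> quality
```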
**Songs Pipeline DAG** (diagram)

**API Pipeline DAG** (diagram)
The project uses dbt to implement a star schema model and analytical views for business intelligence.
**Star Schema Model**

The dimensional model consists of fact tables (sessions, purchases) and dimension tables (users, artists, songs, time) to support analytical queries.
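For illustration, a fact model in dbt might look roughly like this; the model name, column names, upstream staging model, and the dbt_utils dependency are all assumptions rather than the project's actual code:

```sql
-- models/marts/fact_sessions.sql (illustrative names throughout)
with events as (
    select * from {{ ref('stg_listen_events') }}
)

select
    {{ dbt_utils.generate_surrogate_key(['user_id', 'session_id']) }} as session_key,
    user_id,
    song_id,
    cast(event_ts as date) as date_key,
    count(*)        as events_in_session,
    sum(duration_s) as total_listen_seconds
from events
group by 1, 2, 3, 4
```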