
🚲 Bike Demand Analytics Platform

A data platform showcasing the modern data stack and the incremental value of adding new data sources. The project demonstrates a complete end-to-end pipeline that analyzes NYC bike-share demand and the impact of weather.

🎯 Overview

This project demonstrates a complete data pipeline that:

Part 1 - Single Source (Bike Trips)

  • Ingests NYC Citi Bike trip data from public APIs
  • Validates data quality with comprehensive checks
  • Transforms raw data into business-ready analytics
  • Provides interactive dashboards for demand analysis

Part 2 - Multi-Source Enhancement (+ Weather Data)

  • Adds weather data from Open-Meteo API
  • Enriches bike demand with weather context
  • Enables correlation and what-if analysis
  • Demonstrates incremental value of new data sources

Part 3 - Holiday Data Integration (+ Holiday Data) ✅ Spec 001 Completed

  • Integrates US public holiday data from the Nager.Date API
  • Adds NYC-specific local holidays (NYC Marathon, Puerto Rican Day Parade)
  • Classifies holidays as major/federal vs optional/local
  • Identifies working vs non-working days for demand analysis

Part 4 - Historical Holiday Analysis (🆕 NEW) ✅ Spec 002 Completed

  • Analyzes historical bike demand around holidays vs regular weekdays
  • 4 new dbt mart models (summary, by-station, by-hour, by-area)
  • Interactive dashboard with 6 comprehensive analysis sections
  • K-Means clustering for neighborhood-level demand visualization (10-50 adjustable clusters)
  • Statistical significance testing (t-test p-values)
  • Rebalancing recommendations for operations teams

Key Features

✅ Idempotent Data Ingestion - dlt ensures reliable, incremental loads
✅ Data Quality Gates - Great Expectations validates after data ingestion and transformation
✅ Modular Transformations - dbt provides testable, documented models
✅ End-to-End Orchestration - Airflow manages the complete workflow
✅ Interactive Analytics - Streamlit dashboards for self-service insights
✅ Intelligent Holiday Handling - Automatic merging of duplicate-date holidays, dynamic NYC event generation
✅ Modern Development - uv for fast dependency management, ruff for code quality

🏗️ Architecture

┌─────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│  Citi Bike API  │   │  Open-Meteo API  │   │  Nager.Date API  │
│   (Bike Trips)  │   │    (Weather)     │   │    (Holidays)    │
└────────┬────────┘   └────────┬─────────┘   └────────┬─────────┘
         │                     │                       │
         └─────────────────────┼───────────────────────┘
                               ▼
                  ┌────────────────────────────────────┐
                  │        dlt Ingestion Layer         │
                  │      (HTTP → DuckDB Raw Tables)    │
                  └────────┬───────────────────────────┘
                           │
                           ▼
                  ┌────────────────────────────────────┐
                  │   Great Expectations Validation    │
                  │    (Input data quality checks)     │
                  └────────┬───────────────────────────┘
                           │
                           ▼
                  ┌────────────────────────────────────┐
                  │      dbt Transformation Layer      │
                  │       Staging → Core → Marts       │
                  │   (+ NYC holiday generation)       │
                  └────────┬───────────────────────────┘
                           │
                           ▼
                  ┌────────────────────────────────────┐
                  │        DuckDB Warehouse            │
                  │  (Optimized Analytics Tables)      │
                  └────────┬───────────────────────────┘
                           │
                           ▼
                  ┌────────────────────────────────────┐
                  │   Great Expectations Validation    │
                  │    (Transformed Quality Checks)    │
                  └────────┬───────────────────────────┘
                           │
                    ┌──────┴──────┐
                    ▼             ▼
              ┌──────────┐  ┌──────────────┐
              │ Streamlit│  │    Jupyter   │
              │Dashboard │  │   Notebooks  │
              └──────────┘  └──────────────┘

            All orchestrated by Apache Airflow

🛠️ Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| Data Ingestion | dlt (Data Load Tool) | Extract and load data from APIs |
| Data Validation | Great Expectations | Data quality testing and profiling |
| Data Transformation | dbt (Data Build Tool) | SQL-based transformations and tests |
| Data Warehouse | DuckDB | Embedded analytics database |
| Orchestration | Apache Airflow | Workflow automation and scheduling |
| Analytics | Polars, Jupyter | Fast data analysis and exploration |
| Visualization | Streamlit, Plotly | Interactive dashboards |
| Dependency Management | uv | Fast Python package management |
| Code Quality | ruff | Linting and formatting |
| Testing | pytest | Unit and integration tests |

📁 Project Structure

weather-bike-demo/
├── airflow/
│   ├── dags/
│   │   └── bike_weather_dag.py      # Main orchestration DAG
│   └── airflow.cfg                  # Airflow configuration
├── dlt_pipeline/
│   ├── __init__.py
│   ├── bike.py                      # Bike data ingestion
│   ├── weather.py                   # Weather data ingestion
│   ├── holidays.py                  # Holiday data ingestion
│   └── .dlt/
│       └── config.toml              # dlt configuration
├── data_quality/
│   ├── gx/                          # Great Expectations directory
│   │   ├── great_expectations.yml   # GE configuration
│   │   ├── expectations/            # Data validation suites
│   │   ├── checkpoints/             # Validation checkpoints
│   │   └── uncommitted/             # Validation results (git-ignored)
│   ├── validate_bike_data.py        # Bike validation script
│   ├── validate_weather_data.py     # Weather validation script
│   └── validate_all.py              # Run all validations
├── dbt/
│   ├── dbt_project.yml              # dbt project config
│   ├── profiles.yml                 # Database connections
│   ├── packages.yml                 # dbt dependencies
│   └── models/
│       ├── staging/                 # Raw data cleaning
│       │   ├── stg_bike_trips.sql
│       │   ├── stg_weather.sql
│       │   └── stg_holidays.sql
│       ├── core/                    # Business logic
│       │   ├── dim_stations.sql     # 🆕 Enhanced with lat/lon + 10 geographic areas
│       │   └── fct_trips_daily.sql
│       └── marts/                   # Analytics-ready tables
│           ├── mart_demand_daily.sql
│           ├── mart_weather_effect.sql
│           ├── 🆕 mart_holiday_impact_summary.sql       # Citywide holiday impact (4 rows)
│           ├── 🆕 mart_holiday_impact_by_station.sql    # Station-level (8,370 rows)
│           ├── 🆕 mart_holiday_impact_by_hour.sql       # Hourly patterns (96 rows)
│           └── 🆕 mart_holiday_impact_by_area.sql       # Geographic areas (36 rows)
├── streamlit_app/
│   ├── Home.py                      # Main dashboard
│   └── pages/
│       ├── Weather.py               # Weather impact page
│       └── 🆕 Holiday_Impact.py     # Historical holiday analysis (6 sections)
├── notebooks/
│   └── polars_eda.ipynb             # Exploratory analysis
├── tests/
│   ├── test_bike_pipeline.py        # Bike ingestion tests
│   ├── test_weather_pipeline.py     # Weather ingestion tests
│   ├── test_holiday_pipeline.py     # Holiday ingestion tests
│   └── __init__.py
├── duckdb/
│   └── warehouse.duckdb             # DuckDB database (generated)
├── pyproject.toml                   # Python project config
├── ruff.toml                        # Code quality config
├── env.example                      # Environment variables template
└── README.md                        # This file

🚀 Setup Instructions

Prerequisites

  • Python 3.12 or higher
  • uv package manager
  • Git

Installation

  1. Clone the repository

git clone <repository-url>
cd data-platform-ai-native

  2. Install dependencies with uv

# uv will automatically create a virtual environment and install all dependencies
uv sync

  3. Set up environment variables

cp env.example .env
# Edit .env if you need custom configurations

  4. Verify installation

uv run ruff check .
uv run pytest tests/ -v

🏃 Running the Pipeline

Option 1: Manual Execution (Recommended for First Run)

Step 1: Ingest Bike Data

uv run python dlt_pipeline/bike.py

This downloads May-June 2024 NYC Citi Bike data and loads it into DuckDB.
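
dlt makes loads like this idempotent when the resource declares a primary key and a merge write disposition, so re-running the ingestion upserts rather than duplicates. A minimal sketch of that pattern (the resource, dataset, and column names are illustrative, not the project's exact code):

import dlt

@dlt.resource(primary_key="ride_id", write_disposition="merge")
def bike_trips():
    # Yield rows parsed from the monthly Citi Bike CSVs; "merge" plus
    # primary_key makes repeated runs upsert instead of duplicating.
    yield {"ride_id": "ABC123", "started_at": "2024-05-01T08:00:00"}

pipeline = dlt.pipeline(
    pipeline_name="bike",
    destination="duckdb",
    dataset_name="raw",  # illustrative dataset name
)
print(pipeline.run(bike_trips()))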

Step 2: Validate Bike Data

uv run python data_quality/validate_bike_data.py

This runs comprehensive data quality checks including:

  • Column nullability and uniqueness
  • Value ranges and accepted values
  • Data type validations
  • Row count checks

Results are saved to Data Docs for detailed inspection.
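
The same categories of checks can also be spot-checked directly against DuckDB. A standalone sketch that mirrors them (the raw.bike_trips table and ride_id column are assumptions; the project's actual suites live under data_quality/gx/):

import duckdb

con = duckdb.connect("duckdb/warehouse.duckdb", read_only=True)

# Nullability and uniqueness: every trip should have a unique ride_id
nulls, dupes = con.sql(
    "SELECT count(*) FILTER (WHERE ride_id IS NULL),"
    " count(*) - count(DISTINCT ride_id) FROM raw.bike_trips"
).fetchone()
assert nulls == 0 and dupes == 0, "ride_id must be unique and non-null"

# Row count check: the two demo months should yield a non-trivial load
assert con.sql("SELECT count(*) FROM raw.bike_trips").fetchone()[0] > 0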

Step 3: Run dbt Transformations (Bike Only)

cd dbt
uv run dbt deps    # Install dbt packages
uv run dbt build   # Run models and tests
cd ..

Step 4: View Part 1 Dashboard

uv run streamlit run streamlit_app/Home.py

Open http://localhost:8501 to see the bike demand dashboard.

Step 5: Ingest Weather Data

uv run python dlt_pipeline/weather.py

Step 6: Validate Weather Data

uv run python data_quality/validate_weather_data.py

Or validate all data sources at once:

uv run python data_quality/validate_all.py

Step 7: Re-run dbt with Weather

cd dbt
uv run dbt build
cd ..

Step 8: View Enhanced Dashboard

uv run streamlit run streamlit_app/Home.py

Navigate to the "Weather" page to see weather impact analysis.

Step 9: Ingest Holiday Data

uv run python dlt_pipeline/holidays.py

This ingests US public holidays from the Nager.Date API and applies intelligent merging:

  • Holidays on the same date are merged (e.g., "Columbus Day and Indigenous Peoples' Day")
  • Counties are merged and deduplicated
  • Holiday types are prioritized (Public > Federal > Optional)

Step 10: Re-run dbt with Holidays

cd dbt
uv run dbt build
cd ..

The dbt transformation automatically adds NYC-specific holidays:

  • NYC Marathon (First Sunday in November)
  • Puerto Rican Day Parade (Second Sunday in June)

And derives useful flags:

  • is_major: True for Public/Federal holidays
  • is_working_day: False for major holidays (inverse of is_major)

Step 11: View Full Dashboard with Holiday Insights

uv run streamlit run streamlit_app/Home.py

Now you can analyze bike demand patterns around holidays and compare working vs non-working days.

Option 2: Airflow Orchestration

Initialize Airflow in standalone mode

export AIRFLOW_HOME=$PWD/airflow
uv run airflow standalone

Trigger the DAG

uv run airflow dags trigger bike_weather_pipeline

Or use the Airflow UI at http://localhost:8080 (the admin credentials are printed in the standalone startup logs)

🎬 Demo Walkthrough

Act 1: Single Source Demo

Goal: Show value from bike trip data alone

  1. Show Data Ingestion

    • Run bike data ingestion
    • Point out dlt's idempotent loading (run twice to show no duplicates)
  2. Demonstrate Data Quality

    • Run Great Expectations validation: uv run python data_quality/validate_bike_data.py
    • Show validation reports in data_quality/gx/uncommitted/data_docs/local_site/
    • Highlight key expectations: nullability, uniqueness, value ranges
  3. Explore Transformations

    • Show dbt models structure
    • Run dbt docs generate && dbt docs serve to show lineage
    • Highlight tests and data quality checks
  4. View Dashboard

    • Open Streamlit Home page
    • Show daily demand trends
    • Display member vs casual breakdown
    • Highlight top stations

Key Message: "We have a working pipeline with one data source providing valuable insights."

Act 2: Add Weather Source

Goal: Demonstrate incremental value of adding weather data

  1. Add New Data Source

    • Run weather ingestion: uv run python dlt_pipeline/weather.py
    • Validate weather data: uv run python data_quality/validate_weather_data.py
    • Show Great Expectations catching any quality issues (e.g., tmax >= tmin)
  2. Show Enhanced Models

    • Display new dbt models (stg_weather, mart_weather_effect)
    • Show updated lineage graph with weather joins
  3. Demonstrate New Insights

    • Navigate to Weather page in Streamlit
    • Show correlation between weather and demand
    • Demonstrate "what-if" temperature analysis
    • Filter by rainy vs dry days
  4. Highlight Architecture Benefits

    • Same pipeline pattern for new sources
    • Validation at every stage
    • Tests ensure quality
    • Orchestration handles dependencies

Key Message: "Same architecture, new source, new insights. This is how we scale data products responsibly."

Act 3: Add Holiday Impact Analysis

Goal: Show how holiday context enhances demand forecasting

  1. Ingest Holiday Data

    • Run holiday ingestion: uv run python dlt_pipeline/holidays.py
    • Show intelligent merging: holidays on same date merged (e.g., "Columbus Day and Indigenous Peoples' Day")
    • Display retry logic handling network failures
  2. Dynamic Holiday Generation

    • Show dbt model automatically generating NYC-specific holidays
    • Explain date arithmetic for dynamic events (First Sunday in November, etc.)
    • Highlight deduplication logic (Public holidays prioritized over local events)
  3. Holiday Classification

    • Demonstrate derived flags:
      • is_major: Public/Federal holidays = true
      • is_working_day: Major holidays = false (non-working day)
    • Show how this enables working day vs holiday demand comparison
  4. Enhanced Analytics

    • Compare bike demand on holidays vs regular days
    • Identify major holidays with different demand patterns (Memorial Day, July 4th, etc.)
    • Show NYC-specific event impact (Marathon Sunday sees different station patterns)

Key Message: "Holiday context is critical for demand forecasting. Our pipeline automatically enriches the data with both national and local events, classified for business use."

📅 Holiday Data Integration

Overview

The holiday data integration pipeline enriches bike demand analytics with US public holidays and NYC-specific local events. This enables demand pattern analysis around holidays and improves forecasting accuracy.

Data Source: Nager.Date API

The pipeline uses the Nager.Date API to fetch official US public holidays. The API provides:

  • Federal holidays (nationwide observances)
  • State-specific holidays with county-level granularity
  • Holiday metadata (fixed vs floating dates, holiday types)
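
You can inspect the raw payload in a couple of lines; the v3 endpoint below is Nager.Date's public, key-free API:

import requests

# All 2024 US public holidays as a list of JSON objects
url = "https://date.nager.at/api/v3/PublicHolidays/2024/US"
holidays = requests.get(url, timeout=30).json()
print(len(holidays), holidays[0]["date"], holidays[0]["name"])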

Intelligent Holiday Merging

Some dates have multiple holidays (e.g., Columbus Day and Indigenous Peoples' Day on October 14th). The pipeline intelligently merges these:

Merge Strategy:

  1. Names: Concatenated with " and " (e.g., "Columbus Day and Indigenous Peoples' Day")
  2. Counties: Merged, deduplicated, and sorted (e.g., "US-AK,US-CA,US-NY,US-OR")
  3. Types: Prioritized as Public > Federal > Optional > Local
  4. Boolean flags: OR logic (is_global = true if ANY holiday is global)

Example:

# API returns 2 holidays for 2024-10-14:
Holiday 1: "Columbus Day" (Public, counties: US-NY, US-CA)
Holiday 2: "Indigenous Peoples' Day" (Public, counties: US-AK, US-OR)

# Pipeline merges to single record:
{
  "date": "2024-10-14",
  "holiday_name": "Columbus Day and Indigenous Peoples' Day",
  "counties": "US-AK,US-CA,US-NY,US-OR",
  "holiday_types": "Public"
}
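
A compact Python sketch of that merge strategy (the record shape follows the example above and is illustrative, not the pipeline's exact implementation):

TYPE_PRIORITY = {"Public": 0, "Federal": 1, "Optional": 2, "Local": 3}

def merge_holidays(holidays: list[dict]) -> dict:
    # Names joined with " and "; counties deduplicated and sorted;
    # the highest-priority type wins; boolean flags are OR-ed.
    counties = sorted({c for h in holidays for c in h.get("counties") or []})
    return {
        "date": holidays[0]["date"],
        "holiday_name": " and ".join(h["name"] for h in holidays),
        "counties": ",".join(counties),
        "holiday_types": min(
            (h["type"] for h in holidays),
            key=lambda t: TYPE_PRIORITY.get(t, 99),
        ),
        "is_global": any(h.get("is_global", False) for h in holidays),
    }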

NYC-Specific Local Holidays

The dbt staging model dynamically generates NYC-specific events using SQL date arithmetic:

NYC Marathon:

  • When: First Sunday in November
  • Calculation: November 1st + days until first Sunday
  • Impact: Affects bike demand near Central Park, finish line stations

Puerto Rican Day Parade:

  • When: Second Sunday in June
  • Calculation: June 1st + days until first Sunday + 7 days
  • Impact: Affects bike demand along Fifth Avenue parade route

These holidays are generated dynamically for all years present in the source data (no hardcoded dates).
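
The staging model does this arithmetic in SQL; the equivalent logic in Python, for illustration:

from datetime import date, timedelta

def nth_sunday(year: int, month: int, n: int) -> date:
    # Day 1 of the month, plus days until the first Sunday, plus (n - 1) weeks
    first = date(year, month, 1)
    days_until_sunday = (6 - first.weekday()) % 7  # Monday=0 ... Sunday=6
    return first + timedelta(days=days_until_sunday + 7 * (n - 1))

print(nth_sunday(2024, 11, 1))  # NYC Marathon: 2024-11-03
print(nth_sunday(2024, 6, 2))   # Puerto Rican Day Parade: 2024-06-09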

Holiday Classification

The pipeline derives two boolean flags for business use:

is_major:

  • True: Public or Federal holidays (e.g., July 4th, Memorial Day, Thanksgiving)
  • False: Optional or Local holidays (e.g., Groundhog Day, NYC Marathon)
  • Purpose: Identify holidays with widespread business closures

is_working_day:

  • False: Major holidays (most businesses closed, different commute patterns)
  • True: Other holidays (normal work schedules)
  • Purpose: Enable working day vs non-working day demand comparison

Classification Logic (in dbt/models/staging/stg_holidays.sql:128-140):

-- Derive is_major flag
case
    when holiday_types like '%Public%' then true
    when holiday_types like '%Federal%' then true
    else false
end as is_major,

-- Derive is_working_day flag (inverse of is_major)
case
    when holiday_types like '%Public%' then false
    when holiday_types like '%Federal%' then false
    else true
end as is_working_day

Deduplication Strategy

When the same date has both a national holiday and an NYC local event, the pipeline prioritizes national holidays:

Prioritization (in dbt/models/staging/stg_holidays.sql:100-115):

row_number() over (
    partition by date
    order by
        case
            when holiday_types like '%Public%' then 1
            when holiday_types like '%Federal%' then 2
            else 3  -- Local/Optional holidays ranked lowest
        end,
        holiday_name  -- Tie-breaker: alphabetical
) as row_num

Only the highest-priority holiday per date is kept in the final staging table.

Retry Logic and Reliability

The dlt pipeline includes exponential backoff retry logic for API failures:

Retry Configuration (in dlt_pipeline/holidays.py:51-81):

  • Max Retries: 3 attempts
  • Backoff Delays: 1s, 2s, 4s (exponential)
  • Failure Behavior: Pipeline fails after all retries exhausted (prevents partial data loads)

Example:

Attempt 1: ❌ Timeout
  → Wait 1 second
Attempt 2: ❌ Connection refused
  → Wait 2 seconds
Attempt 3: ✅ Success
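
A minimal sketch of that pattern (the helper name and signature are illustrative; see dlt_pipeline/holidays.py for the actual implementation):

import time
import requests

def fetch_with_retry(url: str, max_retries: int = 3) -> list[dict]:
    # Initial attempt plus up to three retries, backing off 1s, 2s, 4s;
    # re-raises after the last retry so the pipeline fails loudly
    # instead of loading partial data.
    for attempt in range(max_retries + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == max_retries:
                raise
            time.sleep(2 ** attempt)  # 1, 2, 4 seconds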

Running the Holiday Pipeline

Basic Usage:

# Ingest 2024 US holidays (default)
uv run python dlt_pipeline/holidays.py

Multi-Year Ingestion: Edit dlt_pipeline/holidays.py line 244:

# Ingest multiple years
result = run_holiday_pipeline([2024, 2025, 2026])

Verification:

# Check raw holiday data
cd dbt
uv run dbt run --select stg_holidays

# Query results
uv run duckdb duckdb/warehouse.duckdb "SELECT * FROM staging.stg_holidays ORDER BY date LIMIT 10"

Testing

Unit Tests (tests/test_holiday_pipeline.py):

  • Mock API responses for deterministic testing
  • Test retry logic (success after failures, failure after max retries)
  • Test merge logic (single holiday, multiple holidays, type prioritization)
  • Test field transformations (arrays to comma-separated strings)
  • Test metadata enrichment (timestamps, source years)
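
A sketch of the mocking pattern behind the retry tests (the fetch_holidays name and patch targets are assumptions for illustration; the real module's names may differ):

from unittest.mock import MagicMock, patch

import requests

from dlt_pipeline.holidays import fetch_holidays  # hypothetical name

def test_retry_succeeds_after_transient_failures():
    # Two timeouts followed by a successful (empty) response.
    ok = MagicMock()
    ok.json.return_value = []
    ok.raise_for_status.return_value = None
    with (
        patch("dlt_pipeline.holidays.requests.get",
              side_effect=[requests.Timeout(), requests.Timeout(), ok]),
        patch("time.sleep"),  # skip the real backoff delays in tests
    ):
        assert fetch_holidays(2024) == []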

Run Tests:

PYTHONPATH=. uv run pytest tests/test_holiday_pipeline.py -v

dbt Tests (dbt/models/staging/schema.yml):

  • Date column uniqueness (no duplicate dates)
  • Not null constraints on critical fields
  • Accepted values for country_code (US only)
  • Boolean flag validation (is_major, is_working_day not null)

Run dbt Tests:

cd dbt
uv run dbt test --select stg_holidays

💻 Development

Running Tests

# Run all tests
uv run pytest tests/ -v

# Run specific test file
uv run pytest tests/test_bike_pipeline.py -v

# Run with coverage
uv run pytest tests/ --cov=dlt_pipeline --cov-report=html

Code Quality

# Format code
uv run ruff format .

# Lint code
uv run ruff check .

# Auto-fix issues
uv run ruff check . --fix

Adding New Dependencies

# Add a new package
uv add package-name

# Add a dev dependency
uv add --dev package-name

# Update dependencies
uv sync

Jupyter Notebook Analysis

# Start Jupyter
uv run jupyter notebook

# Or use JupyterLab
uv run jupyter lab

# Open notebooks/polars_eda.ipynb

🧪 Testing

Unit Tests

  • Bike Pipeline Tests: Mock HTTP responses, test data parsing and transformations
  • Weather Pipeline Tests: Mock API calls, verify error handling
  • Holiday Pipeline Tests: Mock Nager.Date API, test intelligent merging, retry logic, field transformations
  • dbt Tests: Schema tests, data quality tests defined in YAML

Data Quality Tests

Great Expectations validates data at every stage. The data_quality directory contains:

  • Expectation Suites: Define data quality rules
    • bike_trips_suite: expectations for bike trip data
    • weather_suite: expectations for weather data
  • Checkpoints: Execute validation workflows
  • Validation Scripts: Standalone scripts for manual validation

To run validations manually:

# Validate bike trips data
uv run python data_quality/validate_bike_data.py

# Validate weather data
uv run python data_quality/validate_weather_data.py

# Validate all data sources
uv run python data_quality/validate_all.py

View detailed validation results in Data Docs:

# Open Data Docs in browser
open data_quality/gx/uncommitted/data_docs/local_site/index.html

🔧 Troubleshooting

Common Issues

Issue: dlt ingestion fails with "No module named 'dlt'"

# Solution: Ensure you're using uv run
uv run python dlt_pipeline/bike.py

Issue: DuckDB database locked

# Solution: Close all connections, then restart
# Make sure only one process accesses the database at a time

Issue: dbt cannot find profiles.yml

# Solution: Run dbt from the dbt directory or specify profiles dir
cd dbt && uv run dbt build --profiles-dir .

Issue: Streamlit shows "No data available"

# Solution: Ensure data pipeline has been run
uv run python dlt_pipeline/bike.py
cd dbt && uv run dbt build

Issue: Airflow DAG not appearing

# Solution: Check DAG file for syntax errors
uv run python airflow/dags/bike_weather_dag.py
# Refresh Airflow UI or restart scheduler

If that does not help, try adding these environment variables to your ~/.bashrc or ~/.zshrc:

export PYTHONFAULTHANDLER=true
export AIRFLOW__CORE__LOAD_EXAMPLES=False
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Performance Tips

  • Large Datasets: Adjust batch sizes in dlt pipeline configurations
  • dbt Performance: Use --threads flag for parallel execution
  • DuckDB Memory: DuckDB automatically manages memory, but for very large datasets consider using external Parquet files

📊 Data Sources

Citi Bike Trip Data

  • Source: NYC Citi Bike System Data (Amazon S3)
  • Format: Monthly CSV files (gzipped)
  • Coverage: May-June 2024 in demo
  • Update: Monthly
  • License: Public data

Weather Data

  • Source: Open-Meteo Historical Weather API
  • Format: JSON API
  • Coverage: May-June 2024 for NYC (40.73°N, 73.94°W)
  • Update: Daily historical data
  • License: Free for non-commercial use

Holiday Data

  • Source: Nager.Date Public Holiday API
  • Format: JSON API
  • Coverage: US federal and public holidays (2024+)
  • Special Features:
    • Intelligent date deduplication (merges holidays on same date)
    • County-level holiday tracking
    • Holiday type classification (Public, Federal, Optional, Local)
    • NYC-specific local events (Marathon, Puerto Rican Day Parade)
  • Update: Annual (holidays are relatively static)
  • License: Free public API

🤝 Contributing

This is a demo project, but contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Run tests and linting
  4. Submit a pull request

📝 License

MIT License - see LICENSE file for details

🙏 Acknowledgments

  • NYC Citi Bike for open data
  • Open-Meteo for weather API
  • dlt, dbt, Great Expectations, and Streamlit communities
