@dimitri-yatsenko dimitri-yatsenko commented Jan 8, 2026

Overview

This PR delivers the complete DataJoint 2.0 documentation, restructured following the Diátaxis framework for technical documentation. The documentation accompanies the DataJoint 2.0 release (datajoint-python).

Documentation Structure

Tutorials (Learning-oriented)

12+ executable Jupyter notebooks organized by complexity:

Core Tutorials:

| Tutorial | Topic |
|---|---|
| 01-getting-started | Connection, schema creation, first tables |
| 02-schema-design | Entity relationships, foreign keys, dependencies |
| 03-data-entry | Inserting data, transactions, immutability |
| 04-queries | Query algebra, restrictions, projections, joins |
| 05-computation | AutoPopulate, make methods, error handling |
| 06-object-storage | External storage, <object@>, large data |

Examples:

| Tutorial | Topic |
|---|---|
| University | Real-world university database example |
| Fractal Pipeline | Scientific pipeline with computation DAG |
| Blob Detection | Image processing pipeline |
| Hotel Reservations | Booking system with date ranges |
| Languages | Many-to-many relationships |

Domain-Specific:

| Tutorial | Topic |
|---|---|
| Calcium Imaging | Two-photon imaging pipeline |
| Electrophysiology | Neural recordings pipeline |
| Ephys with Object Storage | NpyCodec for large arrays |
| Allen CCF | Brain atlas integration |

Advanced:

| Tutorial | Topic |
|---|---|
| SQL Comparison | DataJoint vs SQL syntax |
| JSON Type | JSONB attribute usage |

All notebooks are tested with pytest --nbmake.

How-To Guides (Task-oriented)

20+ practical guides for common tasks:

  • Setup: installation, configure-database, configure-storage
  • Schema Design: define-tables, design-primary-keys, model-relationships, alter-tables
  • Data Operations: insert-data, fetch-results, delete-data, query-data
  • Computation: run-computations, distributed-computing, monitor-progress
  • Advanced: create-custom-codec, manage-large-data, use-object-storage, staged-insert, handle-errors
  • Project Management: manage-pipeline-project, backup-restore
  • Migration: Comprehensive AI-assisted migration guide from 0.x to 2.0

Concepts (Explanation-oriented)

Conceptual documentation organized by topic:

Foundations:

  • What is DataJoint
  • Data Pipelines
  • Data Integrity
  • Normalization
  • History

Schema Design:

  • Entity-Relationship Model
  • Schema Dimensions
  • Reading Diagrams

Query System:

  • Query Operators
  • Semantic Matching

Data Management:

  • Object-Augmented Schemas
  • Custom Codecs
  • What's New in 2.0
  • FAQ

Reference (Information-oriented)

Specifications (15 documents):

| Spec | Content |
|---|---|
| table-declaration | Table definition syntax, attribute types, core types |
| query-algebra | Query operators, semantic matching, SQL transpilation |
| data-manipulation | Insert, delete, update operations |
| autopopulate | Populate, make methods, job management |
| master-part | Part tables, integrity constraints |
| object-storage | External stores, hash/schema-addressed storage |
| virtual-schemas | spawn_missing_classes, make_classes |
| migration | 0.x to 2.0 migration phases |
| errors | Error types and handling |
| definition-syntax | Complete BNF grammar |
| operators | Query operator reference |
| configuration | Config hierarchy, environment variables |
| url-representation | Table URL format |

API Documentation:

  • Auto-generated from datajoint-python docstrings via mkdocstrings
  • Covers all public classes and methods

Additional Content

Elements:

  • DataJoint Elements overview with NIH U24 background
  • Links to individual Element repositories

Publications:

  • Comprehensive list of papers using DataJoint (2014-2025)
  • 50+ peer-reviewed publications

About:

  • Citation guidelines
  • History
  • License (CC BY 4.0 for docs, Apache 2.0 for code)

Key Features

Diátaxis Framework

All content is classified into exactly one category:

  • Tutorials: Learning-oriented, step-by-step
  • How-To: Task-oriented, goal-focused
  • Concepts: Understanding-oriented, explanatory
  • Reference: Information-oriented, accurate

DataJoint 2.0 API

All examples use the new 2.0 API:

  • Modern fetch API (no deprecated .fetch() method)
  • Core types (int32, float64, varchar, uuid, json)
  • Codec syntax (<blob>, <blob@store>, <npy@store>, <object@store>)
  • dj.Top for single-row lookups
  • staged_insert1 for direct object storage writes
  • Semantic matching with exclude_nonmatching parameter
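A minimal sketch of how the core types and codec syntax listed above might appear together in a table definition string. The attribute names, the store name `store`, and the `split_definition` helper are hypothetical and not part of DataJoint; the helper only illustrates how the `---` line separates primary-key attributes from secondary ones.

```python
# Hypothetical definition string; attribute names and the store name are
# illustrative only and do not come from the DataJoint documentation.
DEFINITION = """
recording_id : uuid            # surrogate key
---
n_channels   : int32           # core integer type
sampling_hz  : float64         # core float type
probe_label  : varchar(32)     # core string type
settings     : json            # core type, written without angle brackets
waveform     : <blob>          # codec, serialized into the table
raw_trace    : <npy@store>     # codec, written to an object store
"""


def split_definition(definition: str) -> tuple[list[str], list[str]]:
    """Return (primary, secondary) attribute names.

    Illustrative helper, not DataJoint's parser: it ignores foreign keys,
    default values, and index declarations.
    """
    primary: list[str] = []
    secondary: list[str] = []
    target = primary
    for line in definition.strip().splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("---"):
            target = secondary  # everything after the divider is secondary
            continue
        target.append(line.split(":", 1)[0].strip())
    return primary, secondary


primary, secondary = split_definition(DEFINITION)
print(primary)
print(secondary)
```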

Executable Examples

  • All tutorials are Jupyter notebooks
  • Tested with pytest --nbmake
  • Include expected outputs
  • Use Docker Compose for database setup

Migration Guide

Comprehensive 7-phase migration from 0.x to 2.0:

  1. Code migration (API changes)
  2. Type annotations (core types)
  3. Surrogate key annotations
  4. Lineage table creation
  5. Foreign key conversion
  6. External storage migration
  7. AdaptedTypes → Codecs

Includes AI-assisted migration prompts and safety checks.

Visual Documentation

  • ER diagrams generated with mermaid-cli
  • Pipeline diagrams for domain tutorials
  • Consistent diagram notation documented

Technical Changes

Dependencies Simplified

Removed unused packages from pip_requirements.txt:

  • mike (version provider removed)
  • mkdocs-redirects (not configured)
  • mkdocs-pymdownx-material-extras (not used)
  • nbconvert (transitive dependency)

Navigation Reorganized

  • Elements moved under Reference
  • Concepts organized into logical sections
  • Specs organized by topic (Schema, Queries, Data, Storage, etc.)

License Updated

  • Documentation: CC BY 4.0
  • Code examples: Apache 2.0 (consistent with datajoint-python)

Commits Summary

This PR includes 100+ commits covering:

  • Complete Diátaxis restructure
  • 15 specification documents
  • 12+ executable tutorials
  • 20+ how-to guides
  • Comprehensive concept explanations
  • API documentation pipeline
  • Migration guides
  • Visual diagram documentation
  • Publications list (50+ papers)
  • Elements with NIH U24 background

dimitri-yatsenko and others added 30 commits January 4, 2026 13:59
- Create new directory structure: explanation/, tutorials/, how-to/, reference/specs/, api/
- Add index pages for each section with content outlines
- Update mkdocs.yaml with new navigation (removed partnerships/publications)
- Add mkdocs-jupyter for notebook support
- Update README with comprehensive project description
- Add about/index.md and about/contributing.md
- Update license references to Apache 2.0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Migrated spec documents:
- primary-keys.md - Primary key rules in query operators
- semantic-matching.md - Attribute lineage and join compatibility
- type-system.md - Three-layer type architecture
- codec-api.md - Custom codec implementation
- fetch-api.md - Data retrieval methods
- autopopulate.md - Jobs 2.0 specification
- job-metadata.md - Hidden job tracking columns

Updated specs/index.md with proper categorization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Created explanation pages based on datajoint-book concepts:
- relational-workflow-model.md - Core paradigm, three approaches compared
- entity-integrity.md - Primary keys, three questions framework
- normalization.md - Workflow normalization principle
- query-algebra.md - Five operators with examples
- type-system.md - Three-layer architecture, codecs
- computation-model.md - AutoPopulate, Jobs 2.0

Updated explanation/index.md with grid card layout.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Added explanation/custom-codecs.md covering codec extensibility
- Updated TERMINOLOGY.md with codec extensibility terms
- Updated mkdocs.yaml navigation
- Updated explanation/index.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Added mkdocstrings, gen-files, literate-nav plugins
- Created scripts/gen_api_pages.py for auto-generating API docs
- Updated mkdocs.yaml with API generation configuration
- Created reference pages: configuration.md, definition-syntax.md, errors.md
- Updated api/index.md with module links
- Added pip requirements for doc generation

API docs are auto-generated from datajoint-python/src docstrings using
NumPy-style format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Archive elements/ (to be documented separately)
- Archive partnerships/ and projects/ (handled elsewhere)
- Archive support-events.md and additional-resources.md
- Remove redundant about/ files (about.md, contribute.md, datajoint-team.md)
- Update index.md to remove Elements reference
- Update nav to remove Elements section

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Comprehensive spec covering:
- Table tiers and class structure
- Definition string grammar
- Attribute types (core, string, temporal, codec)
- Default values and nullable attributes
- Foreign key references and options
- Index declarations
- Part tables
- Auto-populated tables
- Validation rules
- SQL generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Comprehensive spec covering all query operators:
- Restriction (& and -): condition types, semantic matching
- Projection (.proj): selection, renaming, computed attributes
- Join (*): functional dependencies, PK determination, left join
- Aggregation (.aggr): grouping, aggregate functions, HAVING
- Extension (.extend): left join with A→B requirement
- Union (+): combining entity sets, PK requirements
- Universal sets (dj.U): unique values, global aggregation

Also covers:
- Semantic matching rules and lineage
- Operator precedence
- Subquery generation rules
- Quick reference table

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Comprehensive spec covering insert, update1, and delete operations:

- Workflow normalization philosophy: insert/delete as primary ops
- Updates as surgical corrections (update1 only, by design)
- The recomputation pattern for data corrections

Insert operations:
- insert() with all parameters and input formats
- insert1() convenience method
- staged_insert1 for large objects (Zarr, HDF5)
- Handling duplicates, extra fields, auto-populated tables

Update operations:
- update1() requirements and constraints
- When to use vs when to delete/reinsert
- Why no bulk update (by design)

Delete operations:
- Cascade behavior to dependent tables
- Safe mode and transaction control
- Part table constraints
- delete_quick() for internal use

Also covers validation, transactions, error handling, best practices.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Restructured to present DataJoint 2.0 as the status quo:

- Starts with fundamentals: table types, make() method, key_source
- Explains populate() method and operating modes
- Describes per-table jobs system as native feature
- Covers priority, scheduling, distributed computing
- Migration from 1.x moved to brief section at end

Removed problem/solution framing that assumed 1.x knowledge.
Now readable as standalone 2.0 documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Added comprehensive coverage of:

- Key source calculation: automatic derivation from FK joins, custom key sources
- The populate process: execution flow, direct mode behavior, return values
- The make() method: basic pattern, requirements, tripartite make (generator and method-based)
- Transaction management: automatic transactions, atomicity, scope diagrams
- Part tables: computed results with parts, transaction behavior, cascading deletes
- Progress monitoring: progress() method, display_progress parameter
- Direct vs distributed mode comparison

Reorganized to present basic populate first, job reservation as an extension.
Tripartite make pattern documented with both generator and method approaches.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
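The key-source and populate flow described in the commit above can be sketched in plain Python. This is a conceptual model only: `pending_keys` and `populate` are hypothetical names, dictionaries stand in for tables, and here `make` returns the row instead of inserting it inside a transaction as a real `make()` method would.

```python
# Upstream entities and already-computed results (dicts stand in for tables).
sessions = [{"session_id": 1}, {"session_id": 2}, {"session_id": 3}]
results = {(1,): {"session_id": 1, "mean": 0.5}}  # session 1 is already done


def pending_keys(upstream, computed):
    """Key source minus what is already computed."""
    return [k for k in upstream if tuple(k.values()) not in computed]


def make(key):
    """Compute one result row for one key (simplified: returns the row)."""
    if key["session_id"] == 3:
        raise ValueError("no data for session 3")
    return {**key, "mean": key["session_id"] / 2}


def populate(upstream, computed, make_fn):
    """Call make_fn once per pending key; collect errors instead of stopping."""
    errors = []
    for key in pending_keys(upstream, computed):
        try:
            computed[tuple(key.values())] = make_fn(key)
        except ValueError as exc:
            errors.append((key, str(exc)))
    return errors


errors = populate(sessions, results, make)
print(sorted(results))
print(errors)
```

Running `populate` a second time would retry only session 3, because the other keys are no longer pending.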
Tutorials:
- 01-getting-started: Blob detection pipeline example
- 02-schema-design: Table tiers, keys, relationships, core types
- 03-data-entry: Insert, update, delete operations
- 04-queries: Restriction, projection, join, aggregation, fetch
- 05-computation: Computed tables, make(), populate()

Updates:
- Home page: Relational Workflow Model explanation
- Type system: Core types vs native types distinction
- Schema design: Master-part relationships, compositional integrity
- All tutorials use DataJoint 2.0 API (to_arrays, to_dicts, keys)
- Dates updated to January 2026

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Explain OAS: unified architecture for relational + object storage
- Clarify "object" terminology (data objects, not OOP)
- Emphasize that object storage is managed with same rigor as database
- List key OAS features: transparent access, lifecycle, deduplication
- Update Quick Start dates to 2026

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Remove replace=True example, add caveat about breaking immutability
- Introduce master-part with transactions for compositional integrity
- Explain auto-populated tables enforce transactions automatically
- Manual tables need explicit transactions for master-part inserts
- All session+trial inserts now use transactions
- Update best practices to emphasize transaction usage

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
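To illustrate why the commit above wraps master-part inserts in a transaction, here is a small sketch using Python's built-in `sqlite3` as a stand-in database. It does not use DataJoint's transaction API; the table names and the `insert_session` helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session (session_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE trial ("
    " session_id INTEGER REFERENCES session(session_id),"
    " trial_id INTEGER,"
    " PRIMARY KEY (session_id, trial_id))"
)


def insert_session(conn, session_id, trial_ids):
    """Insert a master row and all of its part rows atomically."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("INSERT INTO session VALUES (?)", (session_id,))
        conn.executemany(
            "INSERT INTO trial VALUES (?, ?)",
            [(session_id, t) for t in trial_ids],
        )


insert_session(conn, 1, [1, 2, 3])

try:
    insert_session(conn, 2, [1, 1])  # duplicate trial id violates the primary key
except sqlite3.IntegrityError:
    pass  # the whole insert, including the session row, was rolled back

n_sessions = conn.execute("SELECT COUNT(*) FROM session").fetchone()[0]
n_trials = conn.execute("SELECT COUNT(*) FROM trial").fetchone()[0]
print(n_sessions, n_trials)
```

Without the transaction, the failed second call would leave a session row with no trials, which is the kind of partial state compositional integrity is meant to rule out.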
Fixes:
- 02-schema-design: Add task_params=None for consistent field sets
- 03-data-entry: Fix to_arrays() usage for single column
- 05-computation: Cast numpy bool to Python bool for is_fast

All 5 tutorials now execute successfully with outputs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
API updates:
- Replace safemode parameter with prompt in delete()
- Remove download_path from fetch methods (use config.override instead)
- Update fetch-api spec with config-based download path

All tutorials re-executed and pass.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The fetch module was removed in modern-fetch-api merge.
Fetch methods are now on QueryExpression directly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
These terms are misnomers - they are restriction operations, not joins.
Replaced with:
- "Restriction by Query Expression"
- "restriction" / "anti-restriction"

Added reference to semantic matching spec for attribute matching.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Explain that semantic matching prevents accidental matches on
unrelated attributes that happen to share names.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
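A conceptual sketch of the idea in the commit above: attributes match only when they share both a name and an origin. The `Attr` class, the `lineage` strings, and the choice to raise `ValueError` on a conflict are assumptions for illustration and do not reproduce DataJoint's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str
    lineage: str  # hypothetical "table.attribute" tag naming where it was defined


def matching_attributes(left, right):
    """Return shared attribute names, refusing namesakes of different origin."""
    right_by_name = {a.name: a for a in right}
    matched = []
    for a in left:
        b = right_by_name.get(a.name)
        if b is None:
            continue  # not shared, so it plays no role in matching
        if a.lineage != b.lineage:
            raise ValueError(
                f"'{a.name}' has different lineage: {a.lineage} vs {b.lineage}"
            )
        matched.append(a.name)
    return matched


session = [Attr("subject_id", "subject.subject_id"), Attr("name", "session.name")]
scan = [Attr("subject_id", "subject.subject_id"), Attr("scan_id", "scan.scan_id")]
experimenter = [Attr("name", "experimenter.name")]

print(matching_attributes(session, scan))  # shared name and shared origin

try:
    matching_attributes(session, experimenter)  # same name, unrelated origin
    conflict = None
except ValueError as exc:
    conflict = str(exc)
print(conflict)
```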
- Replace keep_all_rows with exclude_nonmatching (inverted logic)
- Default behavior now keeps all rows (LEFT JOIN)
- Update query-algebra.md and primary-keys.md specs
- Expand queries tutorial with:
  - Join primary key determination via functional dependencies
  - Entity-to-entity aggregation concept
  - Extension operator (.extend())
  - Universal set (dj.U()) for ad-hoc groupings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Explain default behavior keeps all entities (even without matches)
- Show count(pk_attr) vs count(*) for correct zero counts
- Add exclude_nonmatching=True example for filtering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
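The difference between `count(*)` and `count(pk_attr)` mentioned in the commit above comes from SQL's LEFT JOIN semantics and can be reproduced with `sqlite3` from the standard library. The table and column names are made up for the example, and the queries are plain SQL, not DataJoint's generated SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE session (session_id INTEGER PRIMARY KEY);
    CREATE TABLE trial (
        session_id INTEGER,
        trial_id INTEGER,
        PRIMARY KEY (session_id, trial_id)
    );
    INSERT INTO session VALUES (1), (2);
    INSERT INTO trial VALUES (1, 1), (1, 2);   -- session 2 has no trials
    """
)

# LEFT JOIN keeps session 2; its single joined row has NULL trial columns.
left_rows = conn.execute(
    """
    SELECT s.session_id, COUNT(*) AS n_star, COUNT(t.trial_id) AS n_trials
    FROM session AS s
    LEFT JOIN trial AS t ON s.session_id = t.session_id
    GROUP BY s.session_id
    ORDER BY s.session_id
    """
).fetchall()

# An inner join drops sessions without matches entirely.
inner_rows = conn.execute(
    """
    SELECT s.session_id, COUNT(t.trial_id) AS n_trials
    FROM session AS s
    JOIN trial AS t ON s.session_id = t.session_id
    GROUP BY s.session_id
    ORDER BY s.session_id
    """
).fetchall()

print(left_rows)
print(inner_rows)
```

`COUNT(*)` reports 1 for session 2 because the NULL-padded row still counts as a row, while `COUNT(t.trial_id)` skips NULLs and reports 0. The inner-join result corresponds to the filtering behavior the commit associates with `exclude_nonmatching=True`.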
- Clarify that prompt default is determined by config['safemode']
- Not hardcoded to True or "interactive mode"
- Update best practices section accordingly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
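A minimal sketch of the default resolution described in the commit above, where an explicit argument wins and the configuration supplies the fallback. The `resolve_prompt` function and the plain dictionary standing in for the config object are hypothetical; only the `safemode` key and the `prompt` parameter name come from the commit message.

```python
config = {"safemode": True}  # stand-in for the DataJoint config object


def resolve_prompt(prompt=None, config=config):
    """Use the explicit argument if given, otherwise config['safemode']."""
    return config["safemode"] if prompt is None else bool(prompt)


print(resolve_prompt())                            # falls back to the config
print(resolve_prompt(False))                       # explicit argument wins
print(resolve_prompt(config={"safemode": False}))  # different config, no argument
```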
User-friendly reference covering all query operators:
- Restriction (&) and anti-restriction (-)
- Projection (.proj())
- Join (*)
- Extension (.extend())
- Aggregation (.aggr())
- Union (+)
- Universal set (dj.U())
- Operator precedence
- Semantic matching explanation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
entity-integrity.md:
- Fix surrogate key definition: used inside database, not exposed to users
- Replace auto_increment with UUID (no auto-increment in DataJoint)
- Update all examples to use core DataJoint types (uint32, float32, etc.)
- Use <blob> for blob storage type
- Use datetime(3) for millisecond, datetime(6) for microsecond precision

computation-model.md:
- Add three-part make model for long-running computations
- Explain make_fetch, make_compute, make_insert pattern
- Document re-fetch verification for referential integrity
- Explain when to use standard vs three-part make
- Fix int to uint32 in Segmentation example

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
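The three-part make pattern with re-fetch verification, as described in the commit above, might be modeled like this. The function names `make_fetch`, `make_compute`, and `make_insert` come from the commit message; their signatures here, the `three_part_make` driver, and `StaleInputError` are simplifications invented for the sketch, with dictionaries standing in for tables.

```python
class StaleInputError(RuntimeError):
    """Raised when inputs changed between the first fetch and the insert."""


def make_fetch(key, source):
    return tuple(source[key])  # immutable snapshot of the inputs


def make_compute(key, fetched):
    return sum(fetched) / len(fetched)  # stands in for a long computation


def make_insert(key, value, results):
    results[key] = value


def three_part_make(key, source, results, fetch, compute, insert):
    fetched = fetch(key, source)       # 1. read inputs, no transaction held
    computed = compute(key, fetched)   # 2. compute outside any transaction
    # 3. In a real database this step would run inside a transaction.
    if fetch(key, source) != fetched:  # re-fetch and compare before inserting
        raise StaleInputError(f"inputs for {key!r} changed during compute")
    insert(key, computed, results)


source = {"s1": [1.0, 2.0, 3.0], "s2": [4.0, 6.0]}
results = {}
three_part_make("s1", source, results, make_fetch, make_compute, make_insert)


def compute_while_input_changes(key, fetched):
    source[key].append(100.0)  # simulate an upstream change mid-computation
    return make_compute(key, fetched)


try:
    three_part_make(
        "s2", source, results, make_fetch, compute_while_input_changes, make_insert
    )
    stale_detected = False
except StaleInputError:
    stale_detected = True

print(results, stale_detected)
```

The point of the split is that the expensive step holds no transaction open, and the re-fetch guards against inserting a result computed from inputs that no longer exist in that form.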
- installation.md — Install DataJoint and set up environment
- configure-database.md — Database connection with secrets separation
- define-tables.md — Table definitions with core DataJoint types
- insert-data.md — Insert patterns including transactions
- query-data.md — Query operators quick reference
- fetch-results.md — Output methods and formats
- run-computations.md — populate() and three-part make

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Remove unnecessary int() and bool() wrappers around boolean values
now that datajoint-python properly handles np.bool_ types.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Update LICENSE from MIT to Apache 2.0 with copyright:
Copyright 2014-2026 DataJoint Inc. and contributors

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Tutorials:
- Add tutorial 06: Object Storage (externals, attachments, file stores)
- Add advanced tutorials: custom codecs, distributed computing, migration
- Fix distributed.ipynb multiprocessing demo (explain module requirement)
- Minor updates to tutorials 01-03 for consistency

How-to guides:
- Add 14 new task-oriented guides covering common operations
- Expand index with full guide listing

Explanation:
- Expand entity integrity section

Config:
- Update mkdocs.yaml navigation for new content
- Add new images for documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
installation.md:
- Change mysql-connector-python to pymysql
- Update Python requirement to 3.10+
- Add DataJoint.com as recommended managed service

define-tables.md:
- Add Schema creation explanation
- Separate core types from built-in codecs
- Add json as core type (no angle brackets)
- Document built-in codecs: blob, attach, object@store
- Move indexes to end of definition examples
- Clarify tables declared at @Schema decorator time
- Add schema.drop() and table.drop() for prototyping
- Use uint16 instead of int in examples

configure-database.md:
- Remove untested multiple connections section
- Add DataJoint.com tip

configure-storage.md:
- Add DataJoint.com tip for pre-configured storage

backup-restore.md:
- Add DataJoint.com tip for automatic backups

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
dimitri-yatsenko and others added 20 commits January 14, 2026 02:07
Updated mkdocs.yaml navigation:
- Changed 'Migrate from 0.x' to 'Migrate to 2.0'
- Points to new parallel schema migration guide (migrate-to-v20.md)

The new guide provides a safer migration approach:
- Zero production risk during testing
- Unlimited practice runs in _v20 schemas
- Easy rollback at every phase
- Side-by-side validation

Old in-place migration guide (migrate-from-0x.md) remains in repo
but is no longer linked in navigation.

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Major revision of migration guide to use standard git workflow:

**Key Changes:**
1. Git branch approach:
   - Pin DataJoint 0.14.6 on main branch
   - Create migrate-to-v2 branch for DataJoint 2.0
   - Use _v2 suffix for parallel schemas

2. Agentic code migration (~1 hour):
   - Detailed AI agent prompt for automated migration
   - Schema declarations, fetch API, type syntax
   - Defers external storage migration to Phase 2

3. Flexible data approach:
   - Option A: Fresh data for fast testing
   - Option B: Copy production data with pointer migration

4. Simpler cutover:
   - Merge branch when ready
   - Rename schemas or keep _v2 suffix
   - Standard git revert for rollback

**Advantages over previous plan:**
- Standard git workflow (familiar to developers)
- AI-assisted migration saves hours
- External storage deferred (optional)
- Easy rollback (git checkout main)
- No production risk during testing

Timeline: Small pipeline ~2 days, medium ~1 week (vs ~3-6 weeks before)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Added detailed requirements section (Python 3.10+, MySQL 8.0+, license change)
- Documented What's New in 2.0 (3-tier type system, codecs, unified stores)
- Organized into 4 clear phases with detailed timelines
- Phase I: Branch and code migration (~1-4 hours with AI assistance)
  - Pin legacy on main branch
  - Create pre/v2.0 migration branch
  - Configure DataJoint 2.0 and object storage
  - Convert table definitions with AI prompt
  - Convert query/insert code with AI prompt
  - Convert populate methods with AI prompt
- Phase II: Test with sample data (~1-2 days)
- Phase III: Migrate production data (~1-7 days, 3 options)
  - Option A: Copy and rename (recommended)
  - Option B: In-place migration
  - Option C: Gradual with legacy compatibility
- Phase IV: Adopt new features (ongoing)
- Emphasized key principles:
  - Production runs undisturbed through Phase II
  - Git branch workflow for safety
  - External storage deferred to Phase III
  - Agentic (AI-assisted) migration reduces time from weeks to hours
- Added comprehensive examples, troubleshooting, and cross-references
- Total timeline reduced from 3-6 weeks to 1-2 weeks
… III

Critical corrections based on user feedback:

Phase I changes:
- Convert ALL codecs including external storage (blob@, attach@, filepath@)
- Use TEST stores for development
- External storage CODE implemented in Phase I
- Only DATA migration deferred to Phase III

Phase II changes:
- Rename to "Test Compatibility and Equivalence"
- Add Step 5: Compare with Legacy Schema
- Emphasize side-by-side testing of legacy vs v2
- Validate that results are equivalent before touching production

Phase III changes:
- Emphasize this is DATA migration only (code complete from Phase I)
- Add Step 0: Configure Production Stores
- Clarify external storage metadata migration (UUID → JSON)
- No file copying needed (keep in place)

Key principle clarified throughout:
- Phase I: ALL code changes (using test stores)
- Phase II: Equivalence testing
- Phase III: Production data migration only
User feedback: avoid 'external' since it's all integrated

Terminology changes throughout migration guide:
- 'external storage' → 'in-store codecs' or 'in-store'
- 'External Storage Codecs' → 'In-Store Codecs'
- Storage column label: 'External' → 'In-store'
- Consistently use 'in-table' vs 'in-store' distinction

Key concept clarification:
- In-table: Data serialized into MySQL table (<blob>, <attach>)
- In-store: Data stored in object stores (<blob@>, <npy@>, <filepath@>)

Both are integrated into DataJoint - 'external' implied separation
which doesn't reflect the unified architecture of DataJoint 2.0
Schema-addressed storage corrections:
- <npy@> and <object@> are NEW in 2.0 (not migration targets)
- Updated codec table to show 'New in 2.0' for schema-addressed types
- Clarified these are adopted in Phase IV, not migrated in Phase I
- Updated AI agent prompt to distinguish legacy vs new codecs
- Removed suggestion to migrate to <npy@> in example
- Added optional Phase IV adoption example

Bullet list formatting fixes:
- Add blank lines before all bullet lists for proper markdown rendering
- Fixed 'Key principles', 'Timeline', 'End state', 'Prerequisites',
  'Options', 'Advantages', 'What this does' sections
- Ensures consistent rendering across markdown parsers
Critical corrections based on user feedback:

1. Codecs Section (What's New):
   - Split into 'Migration: Legacy → 2.0' and 'New in 2.0' sections
   - Clarified 0.14.x had IMPLICIT serialization (longblob auto-serialized)
   - 2.0 makes this EXPLICIT with <blob> codec
   - Added mediumblob → <blob> conversion (was missing)
   - Emphasized <npy@> and <object@> are NEW features, not migration targets

2. Phase I Step 4 (Configure Stores):
   - Added 'Skip this step if' for pipelines without legacy in-store
   - Only list LEGACY in-store formats (external-store, blob@store, etc.)
   - Removed <npy@> and <object@> from 'things to configure'
   - Added background explaining 0.14.x implicit vs 2.0 explicit codecs

3. AI Agent Prompt Updates:
   - Changed scope to 'Convert ALL legacy codecs' (not just 'all codecs')
   - Added explicit instruction: 'Do NOT add new 2.0 codecs'
   - Separated legacy in-store codecs from new 2.0 codecs
   - Added warning: 'IMPORTANT - Do NOT use these during migration'
   - Clarified these have NO legacy equivalent

4. Example Code:
   - Removed confusing RecordingEnhanced example with <npy@>
   - Added clear comment: 'Only convert existing legacy formats'
   - Noted Phase IV adoption is separate from migration

Key insight: 0.14.x did NOT have an explicit codec system. Types like
'longblob' automatically serialized Python objects. 2.0 makes this
explicit, but <npy@> and <object@> are entirely NEW capabilities.
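A rough sketch of the legacy-to-codec conversions named in the commits above, limited to the types they mention (`longblob`, `mediumblob`, `blob@store`, `attach@store`, `filepath@store`). The `convert_legacy_type` function is hypothetical and is not a DataJoint migration helper; consistent with the instruction not to add new 2.0 codecs, it leaves every other type untouched.

```python
import re

# In-table legacy types that implicitly serialized Python objects.
IN_TABLE = {"longblob": "<blob>", "mediumblob": "<blob>"}

# Legacy in-store types of the form kind@store_name.
IN_STORE = re.compile(r"^(blob|attach|filepath)@([\w-]+)$")


def convert_legacy_type(type_str: str) -> str:
    """Map a legacy type declaration to explicit codec syntax."""
    t = type_str.strip()
    if t in IN_TABLE:
        return IN_TABLE[t]
    match = IN_STORE.match(t)
    if match:
        kind, store = match.groups()
        return f"<{kind}@{store}>"
    return t  # not a legacy codec: leave unchanged, never introduce new codecs


for legacy in ["longblob", "mediumblob", "blob@raw", "attach@raw", "varchar(32)"]:
    print(f"{legacy:12} -> {convert_legacy_type(legacy)}")
```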
User correction: There was NO 'external-store' type in legacy DataJoint.

Legacy in-store types were:
- blob@store (hash-addressed blobs)
- attach@store (hash-addressed attachments)
- filepath@store (filepath references)

Changes:
- Removed 'external-store' from codec migration table
- Removed 'external-raw' from examples (used 'blob@raw' instead)
- Updated Step 4 to list only actual legacy types
- Fixed AI agent prompt to remove external-store conversion
- Updated example code to show correct 0.14.x syntax (blob@raw, not external-raw)
- Fixed Phase III migration helper calls
User feedback: No need to distinguish 0.13 vs 0.14 family versions.

Changes:
- Replaced all '0.14.x' references with 'pre-2.0'
- Replaced all '0.14.6' references with 'pre-2.0' or 'legacy'
- Updated pip install example to use 'datajoint<2.0.0' (valid version constraint)
- Kept 'legacy' where it reads better contextually

This simplifies the guide and avoids confusion about which
specific pre-2.0 version the user might be on.
…ime, timestamp)

Added detailed guidance for special core types to match migrate-from-0x.md:

Background Section (User-facing):
- Split type conversions into clear categories:
  - Integer and Float Types
  - String, Date, and Structured Types
  - Codecs
- Added table for string/date/structured types with notes
- Included json, uuid, enum, datetime, timestamp, tinyint(1)
- Added important notes explaining:
  - Datetime/Timestamp: UTC-only in 2.0, convert timestamp → datetime
  - JSON: New core type, optional adoption
  - Enum: Already a core type, no changes needed

AI Agent Prompt (Detailed Instructions):
- Organized core types into logical groups
- Added 'Core Types (String and Date)' section
- Added 'Core Types (Structured Data)' for json and uuid
- Added 'Special Cases' for tinyint(1) and timestamp
- Included detailed 'IMPORTANT' sections for:
  - Datetime and Timestamp (UTC-only, conversion from timestamp)
  - Enum Types (no changes required)
  - JSON Type (optional adoption for JSON-in-blob migrations)
- Provided examples for each special case

This matches the level of detail in migrate-from-0x.md and ensures
AI agents properly handle these types during migration.
- Add detailed bool vs uint8 guidance (tinyint(1) ambiguity)
- Emphasize UTC-only datetime standard in DataJoint 2.0
- Clarify timezones handled by frontend, not database
- Fix json/uuid status (both existed in pre-2.0)
- Expand AI agent prompt with specific examples
- Add conversion decision trees for ambiguous types
- Legacy supported bool and boolean types (MySQL stores as tinyint(1))
- Only explicit tinyint(1) declarations need review
- Distinguish between bool (already present) vs tinyint(1) (ambiguous)
- Update table to show bool/boolean as unchanged
- Clarify in AI prompt: only tinyint(1) needs user decision
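The review rule from the two commits above could be expressed as a small classifier. The `review_boolean_type` function and its return labels are hypothetical; the rule itself (keep `bool` and `boolean`, ask the user about explicit `tinyint(1)`) is taken from the commit messages.

```python
import re


def review_boolean_type(type_str: str) -> str:
    """Classify a legacy type declaration for the bool vs uint8 review."""
    t = re.sub(r"\s+", "", type_str.lower())
    if t in ("bool", "boolean"):
        return "keep"          # already supported before 2.0, no change needed
    if t == "tinyint(1)":
        return "ask-user"      # ambiguous: may be a flag (bool) or a small integer (uint8)
    return "not-boolean"       # outside the scope of this check


for declared in ["bool", "BOOLEAN", "tinyint(1)", "tinyint", "varchar(8)"]:
    print(f"{declared:12} {review_boolean_type(declared)}")
```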
…one handling

- Add utf8mb4/utf8mb4_bin as server-wide requirements in system requirements table
- Explain character encoding is infrastructure configuration (like timezones)
- Clarify timezones handled by 'application front-ends and client APIs', not just 'frontend'
- Emphasize 'database stores UTC' throughout
- Update all timezone references for consistency
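One way to read the "database stores UTC, clients handle timezones" principle above is sketched below with the standard library. The helper names are hypothetical, the fixed UTC-5 offset is chosen only for illustration, and storing naive UTC values is an assumption about how a client might prepare a `datetime` attribute.

```python
from datetime import datetime, timedelta, timezone


def to_utc_naive(local_dt: datetime) -> datetime:
    """Convert a timezone-aware datetime to naive UTC before storing it."""
    if local_dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return local_dt.astimezone(timezone.utc).replace(tzinfo=None)


def to_local(stored_utc: datetime, tz: timezone) -> datetime:
    """Convert a stored naive UTC value back to a display timezone."""
    return stored_utc.replace(tzinfo=timezone.utc).astimezone(tz)


client_tz = timezone(timedelta(hours=-5))  # fixed offset, illustration only
recorded = datetime(2026, 1, 14, 9, 30, tzinfo=client_tz)

stored = to_utc_naive(recorded)       # what would go into the database
shown = to_local(stored, client_tz)   # what a front-end would display

print(stored.isoformat())
print(shown.isoformat())
```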
…re types

- Remove text and time from core types list
- Add text and time as native types (discouraged)
- text: recommend varchar(n) migration, or keep as native
- time: no core equivalent, keep as native if needed
- Add 'Core vs Native Types' explanation in Important Notes
- Update AI agent prompt with native types guidance
- Clarify json is a core type (was incorrectly called 'native')
- Add warnings that native types will generate warnings in 2.0
…r time type

Timestamp changes:
- ASK USER about timezone convention (don't assume UTC)
- Provide specific questions about timezone and MySQL auto-update behavior
- Invite adoption of UTC throughout pipeline
- Add example conversation showing interactive approach
- Recommend adding data conversion script to Phase III if needed

Time type changes:
- Recommend migrating time → datetime (core type)
- Ask user if date is also relevant before recommending datetime
- Allow keeping time as native type if only time-of-day needed
- Update AI agent prompt with interactive approach for both types

This ensures users understand their timezone conventions and make
deliberate decisions about conversion rather than automatic assumptions.
Fixed multiple instances where bullet lists immediately followed
section headers without blank lines, which breaks markdown rendering.

Affected sections:
- Conversion rules (datetime/timestamp and bool)
- 'Only explicit tinyint(1) declarations need review because:'
- 'For text:' and 'For time:' native type guidance
- CONTEXT, SCOPE, VERIFICATION, REPORT sections in AI prompts
- CONVERSIONS NEEDED section
@ operator changes:
- OLD: table1 @ table2 → join(table2, semantic_check=False)
- NEW: table1 @ table2 → table1 * table2 (WITH semantic checks)
- IMPORTANT: @ bypassed semantic checks; * enables them by default
- If semantic checks fail, INVESTIGATE—may reveal schema/data errors
- Add guidance for .join(x, left=True) → .extend(x)

fetch API changes:
- Add: table.fetch1('KEY') → table.keys()
- Add: table.fetch('KEY', 'a', 'b') → table.to_arrays('a', 'b', include_key=True)
- Update all examples and patterns
- Update VERIFICATION and REPORT sections
- Fix validation script example to use keys()

Rationale: The @ operator was a special case that bypassed semantic
checks. DataJoint 2.0 enables semantic checks by default with *, which
helps users discover schema errors during migration.
fetch API additions:
- Add: fetch(..., format='frame') → to_pandas()
- Add pattern example for pandas DataFrame conversion

dj.U() pattern removal:
- OLD: dj.U('attr') * table → dj.U('attr') & table
- NEW: dj.U('attr') * table → table (no longer necessary)
- Updated all references: table, background, AI prompt, patterns, REPORT
- Pattern 8 renamed to 'Universal set (REMOVE)'

ERD deprecation:
- Add: dj.ERD(schema) → dj.Diagram(schema)
- ERD is deprecated in DataJoint 2.0
- Added to API comparison table and background section

Checklist updates:
- Add fetch(..., format='frame') check
- Add fetch1('KEY') check
- Add dj.U() * table removal check
- Add dj.ERD() conversion check
- Renumber pattern examples (was duplicate Pattern 5)
CORRECTION: Previous commit incorrectly stated dj.U() * table should be
removed entirely. This was wrong.

Correct understanding:
- dj.U('attr') & table → CORRECT pattern, remains unchanged
  Used to project specific attributes (e.g., all unique dates)
  Example: all_dates = dj.U('session_date') & Session

- dj.U('attr') * table → HACK pattern, needs refactoring
  Was used to magically change primary key of table
  Should be flagged and user asked to refactor

Changes:
- Add both patterns to API comparison table
- Split into separate 'Universal Set' section in background
- Update AI agent prompt to distinguish correct from hack
- Update PROCESS to 'identify as hack, ask user to refactor'
- Update VERIFICATION to check both patterns separately
- Update Pattern 10 to show both correct and hack examples
- Update REPORT to count both patterns separately
- Update commit message format
- Update Phase I checklist

This ensures users understand:
1. dj.U() & table is correct and should remain
2. dj.U() * table was a hack and needs attention
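The checklist items from the last few commits could be turned into a simple line scanner along these lines. The regular expressions, the `scan` function, and the sample code are illustrative assumptions and not part of the migration guide; the suggested replacements are the ones stated in the commit messages. Note that `dj.U(...) & table` is deliberately not flagged.

```python
import re

# (pattern, advice) pairs derived from the commit messages above.
LEGACY_PATTERNS = [
    (re.compile(r"\.fetch1\(\s*['\"]KEY['\"]\s*\)"), "replace with .keys()"),
    (re.compile(r"\.fetch\([^)]*format\s*=\s*['\"]frame['\"]"), "replace with .to_pandas()"),
    (re.compile(r"\bdj\.ERD\("), "replace with dj.Diagram(...)"),
    (re.compile(r"\bdj\.U\([^)]*\)\s*\*"), "dj.U(...) * table: ask the user to refactor"),
]


def scan(source: str):
    """Return (line_number, advice) for each legacy pattern found."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, advice in LEGACY_PATTERNS:
            if pattern.search(line):
                findings.append((lineno, advice))
    return findings


# Hypothetical pre-2.0 pipeline code to scan (never executed here).
SAMPLE = """\
key = (Session & 'session_id=1').fetch1('KEY')
df = Session.fetch(format='frame')
dj.ERD(schema)
all_dates = dj.U('session_date') & Session
by_date = dj.U('session_date') * Session
"""

findings = scan(SAMPLE)
for lineno, advice in findings:
    print(f"line {lineno}: {advice}")
```

A scanner like this only locates candidates; whether a flagged `dj.U(...) * table` expression needs restructuring is still a decision for the pipeline's author.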
@dimitri-yatsenko dimitri-yatsenko merged commit 0dc461f into main Jan 14, 2026
1 check passed

4 participants