diff --git a/AGENTS.md b/AGENTS.md index dbf89a033dff..c26bdfbc1d87 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -19,6 +19,25 @@ - Loader/home: Build outputs publish a "monitoring home"; the native loader boots the tracer from there. - Build system: Nuke coordinates .NET builds and CMake/vcpkg for native components. +## NuGet Package Architecture + +### Datadog.Trace Package +The `Datadog.Trace` NuGet package provides the **manual instrumentation API** for customers: +- **Contains**: `Datadog.Trace.Manual.dll` - Public API for manual instrumentation +- **Does NOT contain**: Auto-instrumentation code or native profiler binaries +- **Usage**: Reference in application code for manual tracing (e.g., `Tracer.Instance.StartActive()`) + +Auto-instrumentation comes from the tracer "monitoring home" deployed separately (via installers, MSI, container images, or specialized packages like `Datadog.AzureFunctions`). + +### Datadog.Trace.dll vs Datadog.Trace.Manual.dll +- `Datadog.Trace.dll` - The full managed tracer with all auto-instrumentation code, loaded by the native profiler into instrumented processes +- `Datadog.Trace.Manual.dll` - Lightweight manual instrumentation API packaged in the `Datadog.Trace` NuGet package for customer reference + +### Specialized Packages +- **Datadog.AzureFunctions**: Bundles `Datadog.Trace.dll` and native profiler for Azure Functions (see `docs/development/AzureFunctions.md`) +- **Datadog.Monitoring.Distribution**: MSI installer for Windows (IIS, Windows Services) +- Other serverless/platform-specific packages may bundle the full tracer similarly + ## Tracer Structure - `tracer/src/Datadog.Trace` β€” Core managed tracer library @@ -199,10 +218,14 @@ tracer/src/Datadog.Trace - **Setup**: Use Azure App Services Site Extension on Windows Premium/Elastic Premium/Dedicated plans; use `Datadog.AzureFunctions` NuGet package for Linux Consumption/Container Apps - **Tests**: `BuildAndRunWindowsAzureFunctionsTests` Nuke target; samples under `tracer/test/test-applications/azure-functions/` - **Dependencies**: `Datadog.AzureFunctions` β†’ `Datadog.Serverless.Compat` ([datadog-serverless-compat-dotnet](https://github.com/DataDog/datadog-serverless-compat-dotnet)) contains agent executable +- **External Repos**: [Azure Functions Host](https://github.com/Azure/azure-functions-host) and [.NET Worker](https://github.com/Azure/azure-functions-dotnet-worker) πŸ“– **Load when**: Working on Azure Functions instrumentation or debugging serverless issues - **`docs/development/AzureFunctions.md`** β€” In-process vs isolated worker models, instrumentation specifics, ASP.NET Core integration, GRPC context propagation, and debugging guide +πŸ“– **Load when**: Need detailed architectural understanding of Azure Functions internals +- **`docs/development/AzureFunctions-Architecture.md`** β€” Deep dive into Azure Functions Host and .NET Worker architecture, gRPC protocol, middleware model, distributed tracing integration, environment variables, and instrumentation hook points + πŸ“– **Load when**: Working on AWS Lambda or general serverless instrumentation - **`docs/development/Serverless.md`** β€” Serverless instrumentation patterns across cloud providers @@ -292,8 +315,10 @@ The tracer runs in-process with customer applications and must have minimal perf - `docs/development/AutomaticInstrumentation.md` β€” Creating integrations - `docs/development/DuckTyping.md` β€” Duck typing guide - `docs/development/AzureFunctions.md` β€” Azure Functions integration +- 
`docs/development/AzureFunctions-Architecture.md` β€” Azure Functions architecture deep dive - `docs/development/Serverless.md` β€” Serverless instrumentation - `docs/development/UpdatingTheSdk.md` β€” SDK updates +- `docs/development/QueryingDatadogAPIs.md` β€” Querying Datadog APIs for debugging (spans, logs) **CI & Testing:** - `docs/development/CI/TroubleshootingCIFailures.md` β€” Investigating build/test failures in Azure DevOps diff --git a/docs/development/AzureFunctions-Architecture.md b/docs/development/AzureFunctions-Architecture.md new file mode 100644 index 000000000000..332efeae2961 --- /dev/null +++ b/docs/development/AzureFunctions-Architecture.md @@ -0,0 +1,922 @@ +# Azure Functions Architecture Deep Dive + +This document provides detailed architectural information about Azure Functions Host and .NET Worker, focusing on aspects relevant to dd-trace-dotnet instrumentation and distributed tracing. + +**Related Documentation:** +- [Azure Functions Integration Guide](AzureFunctions.md) - Setup, testing, and instrumentation specifics for dd-trace-dotnet +- [AGENTS.md](../../AGENTS.md) - Repository structure and development guidelines + +**External Resources:** +- [Azure Functions Host Repository](https://github.com/Azure/azure-functions-host) +- [Azure Functions .NET Worker Repository](https://github.com/Azure/azure-functions-dotnet-worker) + +--- + +## Table of Contents + +1. [Architecture Overview](#architecture-overview) +2. [Azure Functions Host](#azure-functions-host) +3. [Azure Functions .NET Worker](#azure-functions-net-worker) +4. [gRPC Communication Protocol](#grpc-communication-protocol) +5. [Distributed Tracing Integration](#distributed-tracing-integration) +6. [Environment Variables and Configuration](#environment-variables-and-configuration) +7. [Instrumentation Hook Points](#instrumentation-hook-points) + +--- + +## Architecture Overview + +Azure Functions uses a **host-worker architecture** where: +- **Host**: Manages triggers, bindings, scaling, and routes invocations +- **Worker**: Executes user function code in an isolated process +- **Communication**: Bidirectional gRPC streaming between host and worker + +### Execution Models + +#### In-Process (.NET) +- Function code runs in the same process as the host +- Direct assembly loading and execution +- Tightly coupled to host runtime version +- Limited to .NET version supported by host + +#### Isolated Worker (.NET Isolated) +- Function code runs in a separate process +- gRPC communication between processes +- Independent .NET version from host +- Supports middleware and full DI control + +**dd-trace-dotnet supports both models** but uses different instrumentation strategies. + +--- + +## Azure Functions Host + +**Repository**: [azure-functions-host](https://github.com/Azure/azure-functions-host) +**Solution**: `WebJobs.Script.sln` + +### Core Components + +#### 1. ScriptHost +**Location**: `src/WebJobs.Script/Host/ScriptHost.cs` + +The ScriptHost is the central orchestrator that manages the function execution lifecycle. + +**Key Responsibilities**: +- Function metadata management and validation +- Worker channel initialization and lifecycle +- Extension and binding provider coordination +- Function descriptor generation +- Primary/secondary host state management + +**Initialization Flow**: +1. **PreInitialize()** (lines 429-473): + - Validates `FUNCTIONS_EXTENSION_VERSION` + - Validates `FUNCTIONS_WORKER_RUNTIME` + - Logs Application Insights and OpenTelemetry status + - Initializes file system directories + +2. 
**InitializeAsync()** (lines 277-329): + - Generates function metadata + - Resolves worker runtime + - Creates function descriptors + - Initializes function dispatcher + - Generates IL function wrappers + +3. **StartAsyncCore()** (lines 256-271): + - Logs initialization metrics + - Starts the JobHost + - Logs function errors + +#### 2. RpcFunctionInvocationDispatcher +**Location**: `src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs` + +Manages RPC-based worker processes and routes invocations. + +**Key Methods**: +- **InitializeAsync()** (lines 255-367): + - Resolves worker runtime from `FUNCTIONS_WORKER_RUNTIME` or metadata + - Validates worker configuration + - Starts worker processes + - Initializes worker channels + +- **InvokeAsync()** (lines 412-434): + - Gets initialized worker channels for function's language + - Uses load balancer to select channel + - Posts invocation to function's input buffer + +- **DisposeAndRestartWorkerChannel()** (lines 526-578): + - Shuts down failed worker + - Restarts if needed based on error threshold + +#### 3. GrpcWorkerChannel +**Location**: `src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs` + +Represents a single gRPC communication channel to a worker process. + +**Lifecycle**: +1. **Constructor** (lines 98-160): Sets up channels, metrics, event subscriptions +2. **StartWorkerProcessAsync()** (lines 373-383): Starts process and waits for init +3. **SendWorkerInitRequest()** (lines 422-453): Sends capabilities to worker +4. **WorkerInitResponse()** (lines 493-521): Receives worker capabilities +5. **SendFunctionLoadRequests()** (lines 606-640): Loads functions into worker +6. **SendInvocationRequest()** (lines 852-920): Sends invocation with trace context +7. **InvokeResponse()** (lines 1100-1152): Processes invocation result + +### Worker Process Management + +#### Process Configuration +**Location**: `src/WebJobs.Script/Workers/ProcessManagement/WorkerProcessCountOptions.cs` + +Default values: +- `ProcessCount`: 1 +- `MaxProcessCount`: 10 +- `ProcessStartupInterval`: 10 seconds +- `ProcessStartupTimeout`: 60 seconds +- `InitializationTimeout`: 10 seconds +- `EnvironmentReloadTimeout`: 30 seconds + +#### Process Pool Management +- Dynamic concurrency: Max 1 worker +- Regular mode: Up to `MaxProcessCount` workers +- Error threshold: 3 Γ— MaxProcessCount +- Restart interval: 5 minutes between restarts + +### Extension System + +#### Extension Loading +Extensions are loaded through: +1. **IScriptBindingProvider**: Binding providers registered via DI +2. **Extension Bundles**: Versioned extension packages via `IExtensionBundleManager` +3. 
**Function Descriptor Providers**: + - `DotNetFunctionDescriptorProvider`: In-proc .NET functions + - `RpcFunctionDescriptorProvider`: Out-of-proc language workers + - `HttpFunctionDescriptorProvider`: HTTP workers + - `CodelessFunctionDescriptorProvider`: Pre-compiled assemblies + +### gRPC Server + +#### FunctionRpcService +**Location**: `src/WebJobs.Script.Grpc/Server/FunctionRpcService.cs` + +Implements the gRPC service for bidirectional streaming: + +**EventStream()** (lines 33-101): +- Waits for `StartStream` message with worker ID +- Retrieves inbound/outbound channels from `ScriptEventManager` +- Spawns `PushFromOutboundToGrpc()` background task +- Loops reading from gRPC and writing to inbound channel + +**Architecture**: +- Uses `System.Threading.Channels` for internal message queuing +- `InboundGrpcEvent`: Messages from worker to host +- `OutboundGrpcEvent`: Messages from host to worker + +--- + +## Azure Functions .NET Worker + +**Repository**: [azure-functions-dotnet-worker](https://github.com/Azure/azure-functions-dotnet-worker) +**Main Solution**: `DotNetWorker.sln` + +### Core Components + +#### 1. FunctionsApplication +**Location**: `src/DotNetWorker.Core/FunctionsApplication.cs` + +Core orchestrator that manages function execution. + +**Key Responsibilities**: +- Maintains function definition registry (`_functionMap`) +- Creates `FunctionContext` for each invocation +- Orchestrates middleware pipeline execution +- Manages distributed tracing via `Activity` + +**Key Methods**: +- **CreateContext()** (line 43): Creates `FunctionContext` from invocation features +- **LoadFunction()** (line 53): Registers function definitions +- **InvokeFunctionAsync()** (lines 68-109): Executes middleware pipeline with tracing + - Parses W3C TraceContext from `context.TraceContext` + - Creates internal `Activity` if needed + - Starts OpenTelemetry Activity + - Executes middleware pipeline + - Handles exceptions and sets Activity status + +#### 2. GrpcWorker +**Location**: `src/DotNetWorker.Grpc/GrpcWorker.cs` + +Implements bidirectional gRPC streaming communication with host. + +**Message Types Handled**: +- `InvocationRequest` β†’ Delegates to `InvocationHandler` +- `FunctionLoadRequest` β†’ Loads function definitions +- `WorkerInitRequest` β†’ Reports capabilities and metadata +- `FunctionsMetadataRequest` β†’ Provides function indexing metadata +- `InvocationCancel` β†’ Cancellation token management +- `WorkerTerminate` β†’ Graceful shutdown + +**StartAsync()** (lines 54-58): +1. Initiates gRPC `EventStream` call +2. Sends `StartStream` message with worker ID +3. Starts writer task for responses +4. Starts reader task for requests + +#### 3. InvocationHandler +**Location**: `src/DotNetWorker.Grpc/Handlers/InvocationHandler.cs` + +Processes invocation requests and manages invocation lifecycle. + +**InvokeAsync() Flow** (lines 54-142): +1. Create `CancellationTokenSource` and track invocation +2. Create `GrpcFunctionInvocation` from request +3. Build `IInvocationFeatures` collection +4. Create `FunctionContext` via `FunctionsApplication` +5. Add `IFunctionBindingsFeature` with gRPC binding data +6. Invoke function via `_application.InvokeFunctionAsync(context)` +7. Serialize output bindings and return value +8. Clean up resources and dispose context + +### Middleware Model + +The worker uses an ASP.NET Core-style middleware pattern. 
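+
+The composition pattern can be illustrated with a small self-contained sketch (toy types only, not the worker's real interfaces; the actual interface and pipeline builder are shown in the sections below):
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+using System.Threading.Tasks;
+
+// Toy pipeline showing why registration order == execution order when the
+// component list is reverse-aggregated around a terminal delegate.
+internal delegate Task ExecutionDelegate(string context);
+
+internal static class PipelineSketch
+{
+    public static async Task Main()
+    {
+        var components = new List<Func<ExecutionDelegate, ExecutionDelegate>>
+        {
+            next => async ctx => { Console.WriteLine("A before"); await next(ctx); Console.WriteLine("A after"); },
+            next => async ctx => { Console.WriteLine("B before"); await next(ctx); Console.WriteLine("B after"); },
+        };
+
+        // Terminal delegate first, then wrap in reverse so the first-registered
+        // component ends up outermost (same shape as DefaultInvocationPipelineBuilder).
+        ExecutionDelegate pipeline = ctx => Task.CompletedTask;
+        pipeline = components.AsEnumerable().Reverse().Aggregate(pipeline, (p, d) => d(p));
+
+        await pipeline("invocation-1");
+        // Prints: A before, B before, B after, A after
+    }
+}
+```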
+ +#### Middleware Interface +**Location**: `src/DotNetWorker.Core/Pipeline/IFunctionsWorkerMiddleware.cs` + +```csharp +public interface IFunctionsWorkerMiddleware +{ + Task Invoke(FunctionContext context, FunctionExecutionDelegate next); +} +``` + +#### Pipeline Builder +**Location**: `src/DotNetWorker.Core/Pipeline/DefaultInvocationPipelineBuilder.cs` + +Middleware is built using reverse aggregation (lines 24-33): +```csharp +FunctionExecutionDelegate pipeline = context => Task.CompletedTask; +pipeline = _middlewareCollection.Reverse().Aggregate(pipeline, (p, d) => d(p)); +``` + +#### Default Middleware Order +**Location**: `src/DotNetWorker.Core/Hosting/WorkerMiddlewareWorkerApplicationBuilderExtensions.cs` + +1. **Custom Middleware** (registered first) +2. **OutputBindingsMiddleware** (line 35): Binds output data from function result +3. **FunctionExecutionMiddleware** (line 36): Delegates to `IFunctionExecutor.ExecuteAsync()` + +#### Middleware Registration +**Location**: `src/DotNetWorker.Core/Hosting/WorkerMiddlewareWorkerApplicationBuilderExtensions.cs` + +```csharp +// Register typed middleware +builder.UseMiddleware(); + +// Register inline middleware +builder.UseMiddleware(async (FunctionContext context, FunctionExecutionDelegate next) => { + // Pre-execution logic + await next(context); + // Post-execution logic +}); + +// Conditional middleware +builder.UseWhen(context => + context.FunctionDefinition.Name == "MyFunction"); +``` + +### Function Execution Pipeline + +#### FunctionExecutor +**Location**: `src/DotNetWorker.Core/Invocation/DefaultFunctionExecutor.cs` + +**ExecuteAsync Flow** (lines 25-50): +1. Get or create `IFunctionInvoker` from cache +2. Create function instance via `IFunctionActivator` +3. Bind input parameters via `IFunctionInputBindingFeature` +4. Invoke the method and store result + +#### FunctionActivator +**Location**: `src/DotNetWorker.Core/Invocation/DefaultFunctionActivator.cs` + +Uses `ActivatorUtilities.CreateInstance()` for constructor injection: +```csharp +return ActivatorUtilities.CreateInstance( + context.InstanceServices, + instanceType, + Array.Empty()); +``` + +### FunctionContext + +**Location**: `src/DotNetWorker.Core/Context/FunctionContext.cs` + +Abstract class providing access to invocation data. + +**Key Properties**: +- `InvocationId`: Unique identifier for invocation +- `FunctionId`: Stable identifier for function +- `TraceContext`: Distributed tracing context (TraceParent, TraceState) +- `BindingContext`: Access to binding data +- `RetryContext`: Retry information +- `InstanceServices`: Scoped service provider for DI +- `FunctionDefinition`: Function metadata +- `Items`: Request-scoped key/value storage +- `Features`: Extensible feature collection +- `CancellationToken`: Invocation cancellation token + +#### Accessing Bindings +**Extension Method**: `context.GetBindings()` + +Returns `IFunctionBindingsFeature`: +- `TriggerMetadata`: Read-only dictionary with trigger metadata +- `InputData`: Read-only dictionary with input binding data +- `OutputBindingData`: Mutable dictionary for output bindings +- `InvocationResult`: Function return value + +### Startup and Initialization + +#### StartupHook +**Location**: `src/DotNetWorker.Core/StartupHook.cs` + +Uses .NET's startup hook feature for early initialization. + +**Actions** (lines 26-72): +1. **Self-Removal**: Removes itself from `DOTNET_STARTUP_HOOKS` environment variable +2. **Debugger Wait**: If `FUNCTIONS_ENABLE_DEBUGGER_WAIT=true`, waits for debugger +3. 
**JSON Output**: If `FUNCTIONS_ENABLE_JSON_OUTPUT=true`, emits startup log + +#### Worker Initialization Sequence +1. `StartupHook.Initialize()` - Early initialization (before Main) +2. `Program.Main()` - User code creates `IHostBuilder` +3. `ConfigureFunctionsWorkerDefaults()` - Registers worker services +4. `AddFunctionsWorkerCore()` - Core service registration +5. `AddGrpc()` - gRPC services registration +6. `GrpcWorker.StartAsync()` - Establishes gRPC connection + +### Service Registration + +#### Core Services +**Location**: `src/DotNetWorker.Core/Hosting/ServiceCollectionExtensions.cs` (lines 36-118) + +**Request Handling**: +- `IFunctionsApplication` β†’ `FunctionsApplication` + +**Function Execution**: +- `IMethodInfoLocator` β†’ `DefaultMethodInfoLocator` +- `IFunctionInvokerFactory` β†’ `DefaultFunctionInvokerFactory` +- `IMethodInvokerFactory` β†’ `DefaultMethodInvokerFactory` +- `IFunctionActivator` β†’ `DefaultFunctionActivator` +- `IFunctionExecutor` β†’ `DefaultFunctionExecutor` + +**Context Management**: +- `IFunctionContextFactory` β†’ `DefaultFunctionContextFactory` +- `IInvocationFeaturesFactory` β†’ `DefaultInvocationFeaturesFactory` + +**Diagnostics**: +- `FunctionActivitySourceFactory` +- `ILoggerProvider` β†’ `WorkerLoggerProvider` + +#### gRPC Services +**Location**: `src/DotNetWorker.Grpc/GrpcServiceCollectionExtensions.cs` (lines 38-76) + +- `GrpcHostChannel`: Unbounded channel for output messages +- `GrpcFunctionsHostLogWriter`: Logging integration +- `IWorker` β†’ `GrpcWorker` +- `IInvocationHandler` β†’ `InvocationHandler` +- `IWorkerClientFactory` β†’ `GrpcWorkerClientFactory` + +--- + +## gRPC Communication Protocol + +**Protocol Definition**: `azure-functions-language-worker-protobuf/src/proto/FunctionRpc.proto` + +### Service Definition + +```protobuf +service FunctionRpc { + rpc EventStream (stream StreamingMessage) returns (stream StreamingMessage) {} +} +``` + +Bidirectional streaming with multiplexed message types. + +### Key Message Types + +#### StreamingMessage +**Lines 21-97** + +Wrapper for all bidirectional messages with `oneof content`: +- `StartStream`: Worker initiates connection +- `WorkerInitRequest/Response`: Initialization handshake +- `WorkerStatusRequest/Response`: Health checks +- `FunctionLoadRequest/Response`: Function loading +- `InvocationRequest/Response`: Function execution +- `InvocationCancel`: Cancel running invocation +- `RpcLog`: Worker logs +- `FunctionEnvironmentReloadRequest/Response`: Specialization +- `FunctionsMetadataRequest/FunctionMetadataResponse`: Worker indexing + +#### InvocationRequest +**Lines 379-397** + +```protobuf +message InvocationRequest { + string invocation_id = 1; + string function_id = 2; + repeated ParameterBinding input_data = 3; + map trigger_metadata = 4; + RpcTraceContext trace_context = 6; + RetryContext retry_context = 7; + // ... +} +``` + +#### InvocationResponse +**Lines 433-445** + +```protobuf +message InvocationResponse { + string invocation_id = 1; + repeated ParameterBinding output_data = 2; + TypedData return_value = 4; + StatusResult result = 3; + // ... 
+} +``` + +#### RpcTraceContext +**Lines 400-409** + +```protobuf +message RpcTraceContext { + string trace_parent = 1; // W3C traceparent header + string trace_state = 2; // W3C tracestate header + map attributes = 3; // Additional tags +} +``` + +#### TypedData +**Lines 457-473** + +Polymorphic data container supporting: +- `string`, `json`, `bytes`, `stream` +- `http`: `RpcHttp` with headers, body, cookies +- Numeric: `int`, `double` +- Collections: `bytes`, `string`, `double`, `sint64` +- `model_binding_data`: SDK-type bindings + +### Communication Flow + +#### Initialization +1. **Host starts worker process** with command-line arguments: + - `--functions-uri`: gRPC endpoint + - `--functions-worker-id`: Worker identifier + - `--functions-request-id`: Request identifier + +2. **Worker connects** to gRPC endpoint and sends `StartStream` message + +3. **Host sends `WorkerInitRequest`** with host capabilities: + - Shared memory support + - Function data cache + - V2 compatibility flags + +4. **Worker responds with `WorkerInitResponse`** with worker capabilities: + - Cancellation support + - Worker termination handling + - Raw HTTP body support + +5. **Host sends `FunctionLoadRequest`** for each function + +6. **Worker responds with `FunctionLoadResponse`** for each function + +#### Invocation +1. **Host sends `InvocationRequest`**: + - Function ID + - Input bindings + - Trigger metadata + - **Trace context** (W3C TraceContext) + +2. **Worker processes invocation**: + - Parses trace context + - Creates Activity + - Executes middleware pipeline + - Invokes function + +3. **Worker sends `InvocationResponse`**: + - Output bindings + - Return value + - Success/failure status + +#### Cancellation +1. **Host sends `InvocationCancel`** with invocation ID +2. **Worker cancels CancellationToken** for invocation +3. **Worker sends `InvocationResponse`** with `Cancelled` status + +--- + +## Distributed Tracing Integration + +### W3C Trace Context Support + +Both host and worker support W3C Trace Context propagation. + +#### Host Side + +**Location**: `src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs` + +**SendInvocationRequest()** (lines 852-920): +- Adds distributed tracing context to `InvocationRequest` +- Sets `trace_context.trace_parent` from `Activity.Current?.Id` +- Sets `trace_context.trace_state` from `Activity.Current?.TraceStateString` + +**AddAdditionalTraceContext()** (lines 1684-1728): +- Adds tags to `Activity.Current` from gRPC `RpcTraceContext.Attributes` +- Called when receiving `InvocationResponse` + +#### Worker Side + +**Location**: `src/DotNetWorker.Core/FunctionsApplication.cs` + +**InvokeFunctionAsync()** (lines 68-109): +1. **Parse W3C TraceContext** (lines 78-86): + ```csharp + if (ActivityContext.TryParse(context.TraceContext.TraceParent, + context.TraceContext.TraceState, + out var activityContext)) + { + var activity = new Activity("InboundRequest"); + activity.SetParentId(activityContext.TraceId, activityContext.SpanId, activityContext.TraceFlags); + activity.TraceStateString = activityContext.TraceState; + activity.Start(); + } + ``` + +2. **Start OpenTelemetry Activity** (line 92): + ```csharp + using var activity = _functionActivitySource.StartInvoke(context); + ``` + +3. **Execute middleware pipeline** (line 96): + ```csharp + await _functionExecutionDelegate(context); + ``` + +4. 
**Set Activity status on exception** (lines 98-105) + +### Activity Integration + +#### FunctionActivitySourceFactory +**Location**: `src/DotNetWorker.Core/Diagnostics/FunctionActivitySourceFactory.cs` + +**ActivitySource**: `"Microsoft.Azure.Functions.Worker"` version `"1.0.0.0"` + +**StartInvoke()** (lines 24-37): +- Creates Activity with `ActivityKind.Server` +- Adds tags: `az.schema_url`, `faas.execution` (invocation ID) +- Uses OpenTelemetry 1.17.0 schema + +### Application Insights Integration + +The host logs Application Insights status during initialization. + +**Location**: `src/WebJobs.Script/Host/ScriptHost.cs` (PreInitialize) + +Checks for: +- `APPLICATIONINSIGHTS_CONNECTION_STRING` +- `APPINSIGHTS_INSTRUMENTATIONKEY` +- `APPLICATIONINSIGHTS_ENABLE_AGENT` + +### OpenTelemetry Support + +The host logs OpenTelemetry status during initialization. + +**Location**: `src/WebJobs.Script/Host/ScriptHost.cs` (PreInitialize) + +Checks for: +- `OTEL_EXPORTER_OTLP_ENDPOINT` + +--- + +## Environment Variables and Configuration + +### Host Environment Variables + +**Location**: `src/WebJobs.Script/Environment/EnvironmentSettingNames.cs` + +#### Core Runtime Settings +- `FUNCTIONS_EXTENSION_VERSION`: Functions runtime version (e.g., "~4") +- `FUNCTIONS_WORKER_RUNTIME`: Language runtime (node, python, dotnet-isolated) +- `FUNCTIONS_WORKER_RUNTIME_VERSION`: Specific runtime version +- `AzureWebJobsScriptRoot`: Function app root path + +#### Worker Process Configuration +- `FUNCTIONS_WORKER_PROCESS_COUNT`: Number of worker processes +- `FUNCTIONS_WORKER_SHARED_MEMORY_DATA_TRANSFER_ENABLED`: Enable shared memory +- `FUNCTIONS_UNIX_SHARED_MEMORY_DIRECTORIES`: Custom shared memory directories +- `FUNCTIONS_WORKER_DYNAMIC_CONCURRENCY_ENABLED`: Enable dynamic concurrency + +#### Placeholder/Specialization +- `WEBSITE_PLACEHOLDER_MODE`: Running in placeholder mode +- `WEBSITE_USE_PLACEHOLDER_DOTNETISOLATED`: Use .NET isolated placeholder +- `INITIALIZED_FROM_PLACEHOLDER`: Specialized from placeholder +- `WEBSITE_CONTAINER_READY`: Container ready for specialization + +#### Azure App Service +- `WEBSITE_SITE_NAME`: Site name +- `WEBSITE_HOSTNAME`: Hostname +- `WEBSITE_INSTANCE_ID`: Instance ID +- `WEBSITE_SKU`: SKU (Dynamic, ElasticPremium, etc.) 
+- `WEBSITE_OWNER_NAME`: Subscription/resource group info + +#### Telemetry +- `APPLICATIONINSIGHTS_CONNECTION_STRING`: App Insights connection string +- `APPINSIGHTS_INSTRUMENTATIONKEY`: Legacy instrumentation key +- `APPLICATIONINSIGHTS_ENABLE_AGENT`: Enable App Insights agent +- `OTEL_EXPORTER_OTLP_ENDPOINT`: OpenTelemetry endpoint + +#### Storage +- `AzureWebJobsStorage`: Storage account connection string +- `WEBSITE_CONTENTAZUREFILECONNECTIONSTRING`: Azure Files connection +- `WEBSITE_CONTENTSHARE`: Content share name + +#### Run from Package +- `WEBSITE_RUN_FROM_PACKAGE`: Run from package URL or path +- `WEBSITE_RUN_FROM_PACKAGE_BLOB_MI_RESOURCE_ID`: Managed identity for blob + +### Worker Environment Variables + +**Location**: `src/DotNetWorker.Core/StartupHook.cs` and `src/DotNetWorker.Grpc/GrpcServiceCollectionExtensions.cs` + +#### Startup Hook Variables +- `DOTNET_STARTUP_HOOKS`: Automatically managed by worker +- `FUNCTIONS_ENABLE_DEBUGGER_WAIT`: Wait for debugger attach at startup +- `FUNCTIONS_ENABLE_JSON_OUTPUT`: Emit JSON startup logs + +#### Worker Configuration Variables +- `AZURE_FUNCTIONS_*`: Host configuration (automatically loaded as config prefix) +- `Functions:Worker:HostEndpoint`: gRPC endpoint URI +- `Functions:Worker:WorkerId`: Worker identifier +- `Functions:Worker:RequestId`: Initial request identifier +- `Functions:Worker:GrpcMaxMessageLength`: Max message size + +#### Application Directory Variables +- `FUNCTIONS_WORKER_DIRECTORY`: Worker directory (fallback) +- `FUNCTIONS_APPLICATION_DIRECTORY`: Application directory (preferred) + +#### Native Host Integration +- `AZURE_FUNCTIONS_NATIVE_HOST`: AppContext data for native host integration + +### Configuration Precedence + +**Host Configuration Hierarchy** (highest to lowest): +1. Environment Variables +2. Application Settings (exposed as environment variables) +3. Host.json +4. Worker Config Files (worker.config.json) +5. Platform Defaults + +**Worker Configuration Hierarchy** (highest to lowest): +1. Command Line Arguments (with switch mappings) +2. Environment Variables (all) +3. AZURE_FUNCTIONS_ Prefixed Environment Variables +4. Default Values (in code) + +### Host.json Settings + +Key host.json sections affecting workers: + +#### languageWorkers +```json +{ + "languageWorkers": { + "workersDirectory": "workers" + } +} +``` + +#### Worker-specific sections +```json +{ + "node": { + "maxWorkerProcessCount": 5 + }, + "dotnet-isolated": { + "maxWorkerProcessCount": 3 + } +} +``` + +#### functionTimeout +Maximum execution time (affects worker behavior) + +#### extensions +Extension-specific configuration: +- Worker concurrency settings +- Shared memory configuration +- Worker indexing flags + +#### logging +Log levels affect `RpcLog` verbosity + +--- + +## Instrumentation Hook Points + +### For Host-Side Instrumentation + +#### 1. Worker Process Startup +**Location**: `src/WebJobs.Script/Workers/ProcessManagement/WorkerProcess.cs` + +**Hook Point**: Environment variables can be injected before `Process.Start()` + +**Use Case**: Inject tracer configuration for worker process + +#### 2. gRPC Message Interception +**Location**: `src/WebJobs.Script.Grpc/Server/FunctionRpcService.cs` + +**Hook Point**: `EventStream()` sees all bidirectional messages + +**Use Case**: Inspect or modify trace context in `InvocationRequest`/`InvocationResponse` + +#### 3. 
Invocation Context +**Location**: `src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs` + +**Hook Points**: +- `SendInvocationRequest()` (line 892-893): Inject/extract trace context +- `InvokeResponse()` (line 1100-1152): Process response with trace context + +**Use Case**: Add tags to `Activity.Current` from invocation metadata + +#### 4. Distributed Tracing +**Location**: `src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs` + +**Hook Point**: `AddAdditionalTraceContext()` (lines 1684-1728) + +**Use Case**: Add tags to Activity.Current from RpcTraceContext + +### For Worker-Side Instrumentation + +#### 1. Middleware (Primary Hook Point) +**Location**: User code via `IFunctionsWorkerApplicationBuilder` + +**Hook Point**: Register middleware in `Program.cs` + +**Use Case**: Pre/post function execution logic, distributed tracing, error handling + +**Example**: +```csharp +public class DatadogTracingMiddleware : IFunctionsWorkerMiddleware +{ + public async Task Invoke(FunctionContext context, FunctionExecutionDelegate next) + { + // Extract trace context + var traceParent = context.TraceContext.TraceParent; + var traceState = context.TraceContext.TraceState; + + // Create span + using (var scope = /* create Datadog scope */) + { + try + { + await next(context); + } + catch (Exception ex) + { + // Report exception to span + throw; + } + } + } +} +``` + +#### 2. IInvocationFeatures - Feature Injection +**Location**: `src/DotNetWorker.Core/Context/Features/IInvocationFeatures.cs` + +**Hook Point**: Add custom features via `context.Features.Set(instance)` + +**Use Case**: Attach custom tracing data to invocation + +#### 3. FunctionContext.Items - Request State +**Location**: `src/DotNetWorker.Core/Context/FunctionContext.cs` + +**Hook Point**: Store custom data in `context.Items` dictionary + +**Use Case**: Share tracing state across middleware + +#### 4. Activity and ActivitySource Integration +**Location**: `src/DotNetWorker.Core/Diagnostics/FunctionActivitySourceFactory.cs` + +**Hook Point**: Subscribe to `ActivitySource` with `ActivityListener` + +**Use Case**: Intercept OpenTelemetry Activities for custom processing + +**ActivitySource Name**: `"Microsoft.Azure.Functions.Worker"` + +#### 5. IFunctionActivator - Instance Creation +**Location**: `src/DotNetWorker.Core/Invocation/IFunctionActivator.cs` + +**Hook Point**: Replace default activator to inject custom logic + +**Use Case**: Inject profiling or tracing logic during instance creation + +#### 6. StartupHook - Early Initialization +**Location**: `src/DotNetWorker.Core/StartupHook.cs` + +**Hook Point**: Add custom startup hook via `DOTNET_STARTUP_HOOKS` + +**Use Case**: Initialize tracer before worker starts + +### Recommended Instrumentation Strategy for dd-trace-dotnet + +#### In-Process Model +1. Use CallTarget instrumentation for Azure Functions SDK types +2. Hook into JobHost invocation pipeline +3. Instrument binding-specific types (HTTP, Queue, etc.) + +#### Isolated Worker Model +1. **Primary**: Use middleware for function invocation tracing +2. **Secondary**: Subscribe to ActivitySource for OpenTelemetry integration +3. **Advanced**: Use automatic instrumentation for gRPC client calls +4. **Context**: Extract W3C TraceContext from `FunctionContext.TraceContext` +5. 
**Propagation**: Inject trace headers into output bindings
+
+### Key Integration Points for Datadog
+
+#### Trace Context Extraction
+**Worker Side**: `FunctionContext.TraceContext.TraceParent` and `TraceContext.TraceState`
+
+Extract using W3C Trace Context format:
+- TraceParent: `00-<trace-id>-<parent-id>-<trace-flags>`
+- TraceState: Comma-separated list of vendor-specific state (`key=value` pairs)
+
+#### Activity Integration
+**Worker Side**: `FunctionsApplication.InvokeFunctionAsync()` already creates Activity
+
+Datadog should:
+1. Subscribe to ActivitySource `"Microsoft.Azure.Functions.Worker"`
+2. Extract trace/span IDs from `Activity.Current`
+3. Propagate Datadog-specific headers in `Activity.Baggage` or custom tags
+
+#### Span Attributes
+Recommended span attributes:
+- `faas.trigger`: Trigger type (http, queue, timer, etc.)
+- `faas.execution`: Invocation ID from `context.InvocationId`
+- `faas.name`: Function name from `context.FunctionDefinition.Name`
+- `cloud.provider`: "azure"
+- `cloud.platform`: "azure_functions"
+- `cloud.region`: From environment variables
+
+#### Error Handling
+Capture exceptions from:
+1. **Middleware**: Catch exceptions in middleware
+2. **Activity**: Set `Activity.Status` to `Error`
+3. **Context**: Check `context.GetBindings().InvocationResult` for errors
+
+---
+
+## Additional Resources
+
+### Source Code Locations
+
+#### Host Repository
+- **ScriptHost**: `src/WebJobs.Script/Host/ScriptHost.cs`
+- **RpcFunctionInvocationDispatcher**: `src/WebJobs.Script/Workers/Rpc/FunctionRegistration/RpcFunctionInvocationDispatcher.cs`
+- **GrpcWorkerChannel**: `src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs`
+- **FunctionRpcService**: `src/WebJobs.Script.Grpc/Server/FunctionRpcService.cs`
+- **Environment Settings**: `src/WebJobs.Script/Environment/EnvironmentSettingNames.cs`
+- **RpcWorkerConstants**: `src/WebJobs.Script/Workers/Rpc/RpcWorkerConstants.cs`
+
+#### Worker Repository
+- **FunctionsApplication**: `src/DotNetWorker.Core/FunctionsApplication.cs`
+- **GrpcWorker**: `src/DotNetWorker.Grpc/GrpcWorker.cs`
+- **InvocationHandler**: `src/DotNetWorker.Grpc/Handlers/InvocationHandler.cs`
+- **FunctionContext**: `src/DotNetWorker.Core/Context/FunctionContext.cs`
+- **Middleware Extensions**: `src/DotNetWorker.Core/Hosting/WorkerMiddlewareWorkerApplicationBuilderExtensions.cs`
+- **FunctionActivitySourceFactory**: `src/DotNetWorker.Core/Diagnostics/FunctionActivitySourceFactory.cs`
+- **StartupHook**: `src/DotNetWorker.Core/StartupHook.cs`
+
+#### Protocol Definitions
+- **FunctionRpc.proto**: `azure-functions-language-worker-protobuf/src/proto/FunctionRpc.proto` (in both repos)
+
+### External Documentation
+
+- [Azure Functions Overview](https://docs.microsoft.com/azure/azure-functions/)
+- [Azure Functions Isolated Worker Guide](https://learn.microsoft.com/azure/azure-functions/dotnet-isolated-process-guide)
+- [W3C Trace Context](https://www.w3.org/TR/trace-context/)
+- [OpenTelemetry Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/)
+- [gRPC Protocol](https://grpc.io/docs/)
+
+### Related dd-trace-dotnet Documentation
+
+- [AzureFunctions.md](AzureFunctions.md) - Integration guide for dd-trace-dotnet
+- [AutomaticInstrumentation.md](AutomaticInstrumentation.md) - Creating integrations
+- [DuckTyping.md](DuckTyping.md) - Duck typing patterns for third-party types
+- [AGENTS.md](../../AGENTS.md) - Repository structure
+
+---
+
+**Document Version**: 1.0
+**Last Updated**: 2025-01-09
+**Maintainer**: Lucas Pimentel (@lucaspimentel)
diff --git a/docs/development/AzureFunctions.md
b/docs/development/AzureFunctions.md index aeef3c9bec1c..74307386a5f0 100644 --- a/docs/development/AzureFunctions.md +++ b/docs/development/AzureFunctions.md @@ -1,10 +1,40 @@ ο»Ώ# Azure Functions Integration +This document describes how dd-trace-dotnet integrates with Azure Functions for distributed tracing. + +**Related Documentation:** +- [Azure Functions Architecture Deep Dive](AzureFunctions-Architecture.md) - Detailed architectural information about Azure Functions Host and .NET Worker + Azure functions operates in one of two ways: - In-process - Isolated (out of process) +## NuGet Package Architecture + +### Datadog.AzureFunctions Package + +The `Datadog.AzureFunctions` NuGet package is the primary instrumentation package for Azure Functions. It contains: + +- **Datadog.Trace.dll** - The full managed tracer with all auto-instrumentation code +- **Native profiler binaries** - Platform-specific profiler libraries (e.g., `Datadog.Trace.ClrProfiler.Native.so` for Linux) +- **Monitoring home directory structure** - Deployed to `/home/site/wwwroot/datadog/` (Linux) or equivalent on Windows + +When profiling is enabled globally (via `CORECLR_ENABLE_PROFILING=1`), the native profiler loads `Datadog.Trace.dll` into **both the host process (`func.exe` or `Microsoft.Azure.WebJobs.Script.WebHost`) and the worker process**. This allows instrumentation code to run in both processes, which is essential for: +- Instrumenting host process methods (e.g., `GrpcMessageConversionExtensions.ToRpcHttp()`) +- Instrumenting worker process methods (e.g., `FunctionExecutionMiddleware`) +- Ensuring correct trace context propagation across the process boundary + +### Datadog.Trace Package + +The `Datadog.Trace` NuGet package is different from `Datadog.AzureFunctions`: + +- **Contains only** `Datadog.Trace.Manual.dll` - The manual instrumentation API +- **Does NOT contain** auto-instrumentation code or native profiler binaries +- Used by application code for manual instrumentation (e.g., `Tracer.Instance.StartActive()`) + +In Azure Functions scenarios, the worker application references `Datadog.Trace` for manual instrumentation, while `Datadog.AzureFunctions` provides the auto-instrumentation. The version of `Datadog.Trace` referenced by the application doesn't need to match `Datadog.AzureFunctions` exactly, as the auto-instrumentation comes entirely from `Datadog.Trace.dll` bundled in `Datadog.AzureFunctions`. + ## In-process Azure Functions integration In-process Azure functions were the default approach in early versions of the framework. We currently support v3 and v4 in-process Azure functions. In this model, the "host" process loads the customer's app as a class library. @@ -17,6 +47,8 @@ We also instrument the `FunctionExecutor`. This provides all the details about t Isolated functions are the only supported model for Azure Functions going forward. In this model, instead of the customer's app being a class library, its a real ASP.NET Core application. The host `func.exe` starts the customer app as a sub process, and sets up a GRPC channel between the two apps. The `func.exe` host acts as a proxy for requests to the customer's app. +For detailed information about the isolated worker architecture, gRPC protocol, and middleware model, see [Azure Functions Architecture Deep Dive](AzureFunctions-Architecture.md). + `func.exe` sets up an in-process Azure Function for every function in the customer's app. 
Each of the functions in `func.exe` are simple calls that proxy the request to the customer app, and then return the response. When an HTTP request is received by `func.exe`, it runs the in-process function as normal. As part of the in-process function execution, it creates a GRPC message (by serializing the incoming HTTP requests to a GRPC message), and forwards the request over GRPC to the customer app. The customer's app runs the _real_ Azure function, and returns the response back over GRPC, where it is deserialized and turned into an HTTP response. @@ -81,19 +113,76 @@ gantt ### Isolated Azure Functions with ASP.NET Core Integration -Isolated Azure Functions also supports an [ASP.NET Core Integration](https://learn.microsoft.com/en-us/azure/azure-functions/dotnet-isolated-process-guide?tabs=hostbuilder%2Cwindows#aspnet-core-integration) that operates differently. +Isolated Azure Functions also supports an [ASP.NET Core Integration](https://learn.microsoft.com/en-us/azure/azure-functions/dotnet-isolated-process-guide?tabs=hostbuilder%2Cwindows#aspnet-core-integration) that operates differently from the standard gRPC-based communication. + +#### HTTP Proxying Architecture + +When ASP.NET Core integration is enabled, the `func.exe` host uses **HTTP proxying** instead of sending all request details over gRPC: -When operating in this mode the `func.exe` will *proxy* the HTTP Triggers with an HTTP Request instead of sending it as a gRPC message. -This means that we basically have two separate modes of operation here as typically the context is stored in the gRPC message, but the message is empty when proxying. +1. **Request Reception**: `func.exe` receives the incoming HTTP request +2. **Capability Detection**: Host checks if worker advertised the `HttpUri` capability +3. **Minimal gRPC Message**: `GrpcMessageConversionExtensions.ToRpcHttp()` returns an **empty** or minimal gRPC message (see [GrpcMessageConversionExtensions.cs:123-125](https://github.com/Azure/azure-functions-host/blob/dev/src/WebJobs.Script.Grpc/MessageExtensions/GrpcMessageConversionExtensions.cs#L123-L125)) +4. **HTTP Forwarding**: Host uses [YARP (Yet Another Reverse Proxy)](https://microsoft.github.io/reverse-proxy/) to forward the original HTTP request to the worker process via `DefaultHttpProxyService.StartForwarding()` ([source](https://github.com/Azure/azure-functions-host/blob/dev/src/WebJobs.Script/Http/DefaultHttpProxyService.cs#L82-L108)) +5. **Worker Processing**: Worker receives the HTTP request and processes the function +6. **Response Proxying**: Worker's HTTP response is proxied back through the host to the client -Our workaround to support this at the moment is to create our own gRPC type that is empty and inject the context in there. -We could in the future udpate this to extract the context out of the HTTP headers that we do get from this as well. 
+**Key Components:** +- `DefaultHttpProxyService` - Uses YARP's `IHttpForwarder` with `SocketsHttpHandler` for HTTP proxying +- `RetryProxyHandler` - Wraps `SocketsHttpHandler` to add retry logic +- `ScriptInvocationRequestTransformer` - YARP transformer to add correlation headers (e.g., invocation ID) -Whether a Functions app uses this new mode is subtle: +**Code Flow:** +``` +Client β†’ func.exe (host) β†’ DefaultHttpProxyService β†’ SocketsHttpHandler β†’ HTTP β†’ Worker ASP.NET Core app +``` + +#### Detection + +Whether a Functions app uses ASP.NET Core integration: + +- Project uses `ConfigureFunctionsWebApplication()` instead of `ConfigureFunctionsWorkerDefaults()` +- Project references [Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore](https://www.nuget.org/packages/Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore/) package +- Worker advertises `HttpUri` capability to the host during initialization + +#### Trace Context Propagation + +In ASP.NET Core integration mode: +- The gRPC message is **empty** (or contains only route parameters if `RequiresRouteParameters` capability is set) +- Trace context flows via **HTTP headers** from host to worker +- Our HTTP client instrumentation (intercepting `SocketsHttpHandler`): + - Creates a span for the hostβ†’worker HTTP call + - Injects trace context headers (`x-datadog-*`) into the proxied HTTP request +- Worker's ASP.NET Core instrumentation extracts context from HTTP headers -`ConfigureFunctionsWebApplication()` is `ConfigureFunctionsWorkerDefaults()` +#### Distributed Tracing Architecture -and the project will have a reference to [Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore](https://www.nuget.org/packages/Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore/) package. +**Important**: Host and worker processes send spans **independently** to the Datadog Agent: + +1. **Host process** (PID 27) serializes and sends its own spans +2. **Worker process** (PID 56) serializes and sends its own spans +3. **Datadog backend** stitches them into a distributed trace using trace IDs and parent span IDs + +Each process tags its spans with `aas.function.process: host` or `aas.function.process: worker` during serialization (see `SpanMessagePackFormatter.cs:742-754`). This tag is determined by `EnvironmentHelpers.IsRunningInAzureFunctionsHost()`, which checks the command line for `--workerId` or `--functions-worker-id` flags: +- **No flags** β†’ Host process β†’ `aas.function.process: host` +- **Has flags** β†’ Worker process β†’ `aas.function.process: worker` + +**Example distributed trace structure**: + +``` +Trace ID: 68e948220000000047fef7bad8bb854e + +Host spans (serialized by PID 27, tagged as "host"): +β”œβ”€ aspnet_core.request (s_id: 8ec7..., p_id: null) [ROOT] +β”œβ”€ azure_functions.invoke (s_id: 10c8..., p_id: 8ec7...) [Child of root] +└─ http.request (s_id: 2ac3..., p_id: 10c8...) [HTTP call to worker] + +Worker spans (serialized by PID 56, tagged as "worker"): +└─ aspnet_core.request (s_id: 9ddf..., p_id: 2ac3...) [Child of host's http.request] + └─ azure_functions.invoke (s_id: 114d..., p_id: 9ddf...) [Child of worker's aspnet_core] + └─ ... (additional spans) +``` + +**Key insight**: When troubleshooting traces in Datadog, a single distributed trace will contain spans from both processes with different `aas.function.process` tag values. This is expected behavior. 
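+
+As a rough illustration of the host/worker split described above, the tag value can be derived from the process command line. This is a hedged sketch only: the real logic lives in `EnvironmentHelpers.IsRunningInAzureFunctionsHost()`, and the exact argument handling here is an assumption for illustration.
+
+```csharp
+using System;
+using System.Linq;
+
+internal static class ProcessRoleSketch
+{
+    // Returns the value used for the aas.function.process tag.
+    public static string GetProcessTag()
+    {
+        // Worker processes are launched by the host with a worker-id argument;
+        // the host process has neither flag on its command line.
+        var args = Environment.GetCommandLineArgs();
+        var isWorker = args.Any(a =>
+            a.StartsWith("--workerId", StringComparison.OrdinalIgnoreCase) ||
+            a.StartsWith("--functions-worker-id", StringComparison.OrdinalIgnoreCase));
+
+        return isWorker ? "worker" : "host";
+    }
+}
+```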
## Local Development @@ -125,6 +214,564 @@ This script: See `Get-Help .\tracer\tools\Build-AzureFunctionsNuget.ps1 -Full` for complete documentation. +### Accessing Tracer Logs + +When troubleshooting Azure Functions instrumentation, you may need to access tracer logs from both the host and worker processes. There are two methods to access these logs: + +#### Method 1: Datadog Logs API + +Query logs that have been sent to Datadog via direct log submission: + +```bash +curl -s -X POST "https://api.datadoghq.com/api/v2/logs/events/search" \ + -H "DD-API-KEY: " \ + -H "DD-APPLICATION-KEY: " \ + -H "Content-Type: application/json" \ + -d '{ + "filter": { + "query": "service: env: DD-TRACE-DOTNET", + "from": "now-15m", + "to": "now" + }, + "sort": "timestamp", + "page": { + "limit": 100 + } + }' | jq -r '.data[] | .attributes | "\(.timestamp) - \(.message[:200])"' +``` + +**Limitations:** +- Requires `DD_LOGS_DIRECT_SUBMISSION_INTEGRATIONS=ILogger` for worker process logs +- Requires `DD_LOGS_DIRECT_SUBMISSION_AZURE_FUNCTIONS_HOST_ENABLED=true` for host process logs +- Host process logs may not be sent to Datadog if direct submission for the host is not enabled or configured correctly +- Log ingestion may be delayed by several seconds + +#### Method 2: Download Logs from Azure (Recommended for Debugging) + +For more complete and immediate access to tracer logs, download logs directly from Azure: + +```bash +# Download logs to a local zip file +az webapp log download --name --resource-group --log-file D:/temp/azure-functions-logs.zip + +# Extract the logs +cd D:/temp +unzip -q -o azure-functions-logs.zip + +# View tracer logs +ls -la LogFiles/datadog/ +``` + +**Tracer log file naming convention:** +- **Worker process**: `dotnet-tracer-managed-dotnet-.log` + - Example: `dotnet-tracer-managed-dotnet-57.log` +- **Host process**: `dotnet-tracer-managed-Microsoft.Azure.WebJobs.Script.WebHost-.log` + - Example: `dotnet-tracer-managed-Microsoft.Azure.WebJobs.Script.WebHost-27.log` + - Log files may be numbered (e.g., `*_001.log`, `*_002.log`) when they reach a certain size + +**Viewing logs:** +```bash +# View end of host log +tail -100 LogFiles/datadog/dotnet-tracer-managed-Microsoft.Azure.WebJobs.Script.WebHost-27.log + +# Search for specific integration activity +grep -i "ToRpcHttp\|FunctionInvocationMiddleware\|HttpTest" LogFiles/datadog/dotnet-tracer-managed-Microsoft.Azure.WebJobs.Script.WebHost-27.log + +# View spans being created +grep "Span started\|Span closed" LogFiles/datadog/dotnet-tracer-managed-Microsoft.Azure.WebJobs.Script.WebHost-27.log +``` + +**Advantages of Azure logs:** +- Contains logs from both host and worker processes, even if direct log submission is not configured +- No delay - logs are immediately available +- More complete - includes all debug logs written to file +- Shows detailed span information including trace IDs, span IDs, parent IDs, and tags + +## Known Issues + +See the `docs/development/investigations/` directory for detailed documentation of specific issues and ongoing investigations: + +- **[APMSVLS-58: AsyncLocal Context Flow Issue](investigations/APMSVLS-58-AsyncLocal-Context-Flow.md)** - Span parenting issue with ASP.NET Core integration in isolated Azure Functions + +## Troubleshooting with Azure Logs + +When troubleshooting Azure Functions instrumentation issues, understanding how to effectively use tracer logs is critical. This section covers best practices for working with logs from deployed Azure Functions. 
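+
+As a quick first step before the detailed workflow below, it helps to see which tracer log files are currently being written. A minimal sketch, assuming logs were downloaded and extracted to the `LogFiles/datadog/` layout shown earlier:
+
+```bash
+# Most recently modified tracer logs usually belong to the currently running
+# host and worker processes; older files are leftovers from previous PIDs.
+ls -lt LogFiles/datadog/dotnet-tracer-managed-*.log | head -10
+```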
+ +### Understanding Log File Behavior + +#### Log File Persistence + +**Critical**: Azure Functions log files are **append-only** and **not cleared on deployment**: + +- Existing log files continue to grow indefinitely +- New log entries are appended to the end +- Old log entries from previous versions remain at the beginning +- A single log file may contain entries spanning days or weeks + +**Implication**: You **cannot** rely on file position (head/tail) to find recent logs. **Always filter by timestamp**. + +#### Host vs Worker Processes + +Azure Functions in **isolated mode** consists of two separate processes: + +1. **Host Process** (`Microsoft.Azure.WebJobs.Script.WebHost` or `func.exe`) + - Manages function triggers, scaling, and HTTP proxying + - Log file pattern: `dotnet-tracer-managed-Microsoft.Azure.WebJobs.Script.WebHost-{pid}.log` + - May have numbered files when size limit reached: `*_001.log`, `*_002.log`, etc. + +2. **Worker Process** (`dotnet`) + - Executes your function code + - Log file pattern: `dotnet-tracer-managed-dotnet-{pid}.log` + - Multiple worker processes may exist simultaneously (different PIDs) + +#### Worker Process Lifecycle + +**Important**: Worker processes restart on deployment, but they may **reuse the same PID** (likely due to running in Linux containers): + +- The log file PID doesn't change, but the process restarts +- Log entries from multiple process lifetimes appear in the same file +- Old entries (from before restart) remain at the beginning of the file +- New entries (from after restart) are appended to the end +- A single log file may contain entries spanning multiple deployments + +**Verifying when the worker last initialized**: +```bash +# Find ALL initialization timestamps (shows multiple restarts) +grep "Assembly metadata" worker.log + +# Find the MOST RECENT initialization (current worker process) +grep "Assembly metadata" worker.log | tail -1 + +# Check the version in recent log entries +grep "TracerVersion" worker.log | tail -10 +``` + +Example output showing multiple initializations in the same file: +``` +2025-10-03 10:15:23.456 +00:00 [INF] Assembly metadata: { AssemblyVersion: "3.27.0.0", ...} ← Old deployment +2025-10-10 14:32:18.123 +00:00 [INF] Assembly metadata: { AssemblyVersion: "3.28.0.0", ...} ← After restart +2025-10-10 17:45:02.789 +00:00 [INF] Assembly metadata: { AssemblyVersion: "3.29.0.0", ...} ← Current (most recent) +``` + +**Why this matters**: When filtering logs by timestamp, you must ensure you're looking at entries from the **current process lifetime** (after the most recent initialization), not from previous deployments that remain in the same file. + +### Downloading Logs from Azure + +Use the Azure CLI to download all log files at once: + +```bash +# Download logs to a local zip file +az functionapp log download \ + --name lucasp-premium-linux-isolated \ + --resource-group lucasp \ + --log-path D:/temp/azure-logs.zip + +# Extract the logs +unzip D:/temp/azure-logs.zip -d D:/temp/LogFiles + +# List tracer logs +ls -lh D:/temp/LogFiles/datadog/ +``` + +**Tip**: Download logs regularly when troubleshooting to capture different test executions. 
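+
+Because log files are append-only and may span several deployments (see "Worker Process Lifecycle" above), it is easy to read entries from a previous process lifetime by accident. A small helper sketch, assuming the "Assembly metadata" initialization line format shown above, that prints only the entries written after the most recent worker initialization:
+
+```bash
+#!/usr/bin/env bash
+# Usage: ./current-lifetime.sh LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log
+log_file="$1"
+
+# Line number of the most recent initialization entry in the file.
+start_line=$(grep -n "Assembly metadata" "$log_file" | tail -1 | cut -d: -f1)
+start_line=${start_line:-1}   # fall back to the whole file if no init line is found
+
+# Print everything from that line onward (current process lifetime only).
+tail -n "+${start_line}" "$log_file"
+```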
+ +### Finding Relevant Log Entries + +#### Step 1: Identify the Execution Timestamp + +Before analyzing logs, determine when your test execution occurred: + +**Option A - Note the time when triggering**: +```bash +# Trigger and note UTC time +curl https://.azurewebsites.net/api/HttpTest +# Timestamp: 2025-10-10 17:53:39 UTC +``` + +**Option B - Search host logs for the trigger**: +```bash +grep "Executing 'Functions" D:/temp/LogFiles/datadog/dotnet-tracer-managed-Microsoft.Azure.WebJobs.Script.WebHost-*.log +``` + +Example output: +``` +2025-10-10 17:53:39.123 [INF] Executing 'Functions.HttpTest' +``` + +#### Step 2: Extract Trace ID from Host Logs + +Find the trace ID created by the host process: + +```bash +grep "2025-10-10 17:53:39" D:/temp/LogFiles/datadog/dotnet-tracer-managed-Microsoft.Azure.WebJobs.Script.WebHost-*.log | grep "Span started" +``` + +Example output: +``` +2025-10-10 17:53:39.123 +00:00 [DBG] Span started: [s_id: abc123def456, p_id: null, t_id: 68e948220000000047fef7bad8bb854e] +``` + +The trace ID is `68e948220000000047fef7bad8bb854e`. + +#### Step 3: Identify Active Worker Process + +List worker log files and check which ones have recent activity: + +```bash +# List all worker logs with sizes +ls -lh D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-*.log + +# Find which PID was active at your timestamp +grep "2025-10-10 17:53:" D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-*.log | head -1 +``` + +Example output showing PID 56 was active: +``` +D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log:2025-10-10 17:53:38.194 ... +``` + +#### Step 4: Filter Worker Logs by Timestamp + +**Always filter by timestamp** - never use head/tail on downloaded log files: + +```bash +# Get all logs in the minute of interest +grep "2025-10-10 17:53:" D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log + +# Get logs in a specific second range (17:53:38-39) +grep "2025-10-10 17:53:3[89]" D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log + +# Search for specific messages in that timeframe +grep "2025-10-10 17:53:3[89]" D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log | grep "Azure Functions span creation" +``` + +**Why timestamp filtering is essential**: +- ❌ `tail -100 worker.log` may show logs from weeks ago +- ❌ `head -100 worker.log` definitely shows old initialization logs +- βœ… `grep "2025-10-10 17:53:" worker.log` shows only relevant entries + +#### Step 5: Verify Tracer Version + +Always verify the worker loaded the expected tracer version before investigating behavior: + +```bash +# Check when worker initialized and which version +grep "Assembly metadata" D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log | head -1 + +# Verify version in recent log entries +grep "2025-10-10 17:53:" D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log | grep "TracerVersion" | head -1 +``` + +If the version is outdated, restart the Function App and re-test. + +### Common Troubleshooting Scenarios + +#### Scenario 1: Separate Traces (Parenting Issue) + +**Symptom**: Host and worker create spans in different traces instead of a single unified trace. + +**Investigation**: + +1. Get the trace ID from host logs at the execution timestamp (see Step 2 above) +2. Search worker logs for the same trace ID: + ```bash + grep "68e948220000000047fef7bad8bb854e" D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log + ``` +3. If not found, the worker created a separate trace +4. 
Look for worker spans with `p_id: null` (root spans) instead of parent IDs matching host spans + +**Understanding span relationships**: + +When analyzing span logs, look for these fields in "Span started" messages: +- **s_id** (span ID): Unique identifier for this span +- **p_id** (parent ID): Span ID of the parent span (`null` = root span) +- **t_id** (trace ID): Trace this span belongs to + +**Healthy trace structure** (host + worker in same trace): +``` +Host: + [s_id: A, p_id: null, t_id: 68e9...] ← Root span (host) + +Worker: + [s_id: B, p_id: A, t_id: 68e9...] ← Child of A (same trace) + [s_id: C, p_id: B, t_id: 68e9...] ← Child of B (same trace) +``` + +**Broken trace structure** (separate traces): +``` +Host: + [s_id: A, p_id: null, t_id: 68e9...] ← Host trace + +Worker: + [s_id: X, p_id: null, t_id: 68e0...] ← Separate worker trace! (different t_id) +``` + +**Example investigation**: +```bash +# Find all spans created during execution +grep "2025-10-10 17:53:3[89]" host.log | grep "Span started" +grep "2025-10-10 17:53:3[89]" worker.log | grep "Span started" + +# Compare trace IDs - they should match +``` + +#### Scenario 2: Missing Debug Logs + +**Symptom**: Added debug logging but it doesn't appear in logs. + +**Possible causes and solutions**: + +1. **Deployment hasn't completed yet** - Worker restarted but still loading old version + ```bash + # Check most recent worker initialization + grep "Assembly metadata" worker.log | tail -1 + # Wait a minute and check again after deployment completes + ``` + +2. **Wrong log level** - Debug logs are filtered out + ```bash + # Check log level configuration + grep "DD_TRACE_LOG_LEVEL" host.log worker.log + # Should see: DD_TRACE_LOG_LEVEL=Debug + ``` + +3. **Looking at wrong timeframe** - Logs exist but searching wrong time + ```bash + # Use broader time range + grep "2025-10-10 17:5[0-9]:" worker.log | grep "your debug message" + ``` + +4. **Downloaded logs before execution** - Re-download after triggering + ```bash + # Trigger function first + curl https://.azurewebsites.net/api/HttpTest + # Wait a few seconds, then download + az functionapp log download --name --resource-group --log-path logs.zip + ``` + +#### Scenario 3: Wrong Tracer Version After Deployment + +**Symptom**: Deployed new code but behavior hasn't changed, or seeing old tracer version. + +**Investigation**: + +```bash +# Check ALL worker initializations (shows restart history) +grep "Assembly metadata" worker.log + +# Check the MOST RECENT initialization (current process) +grep "Assembly metadata" worker.log | tail -1 + +# Example output showing worker restarted but loaded old version +2025-10-10 17:45:02.789 +00:00 [INF] Assembly metadata: { AssemblyVersion: "3.27.0.0", ...} +``` + +If the most recent initialization shows an older version than expected, the deployment may not have completed properly. + +**Verification**: +```bash +# Check version in recent logs +grep "2025-10-10 17:53:" worker.log | grep "TracerVersion" | head -1 + +# Expected: TracerVersion: "3.29.0.0" (your new version) +# Actual: TracerVersion: "3.27.0.0" (old version still loaded) +``` + +**Possible causes**: +1. Deployment is still in progress - wait a few minutes and check again +2. Deployment failed - check Azure Portal deployment logs +3. Build artifact didn't include updated tracer - verify build output +4. NuGet cache issue - clear local NuGet cache and rebuild + +**Solutions**: +```bash +# 1. 
Wait and verify deployment completed +az functionapp deployment list --name --resource-group --query "[0]" + +# 2. Manually restart the Function App +az functionapp restart --name --resource-group + +# 3. Clear local NuGet cache and rebuild +dotnet nuget locals all --clear +dotnet clean +dotnet build +``` + +### Best Practices + +#### 1. Always Use Timestamp Filtering + +❌ **Don't do this**: +```bash +tail -100 worker.log # May show old logs from weeks ago +head -100 worker.log # Definitely shows old initialization logs +``` + +βœ… **Do this**: +```bash +grep "2025-10-10 17:53:" worker.log # Specific minute +grep "2025-10-10 17:53:3[89]" worker.log # Specific seconds +grep "2025-10-10 17:5[0-9]:" worker.log # Entire hour +``` + +#### 2. Correlate Host and Worker Logs by Trace ID + +Trace IDs are the key to understanding distributed traces: + +1. Get trace ID from host logs +2. Search worker logs for the same trace ID +3. Verify spans have correct parent-child relationships (matching `p_id` and `s_id`) + +#### 3. Verify Tracer Version Before Troubleshooting + +Before investigating unexpected behavior: + +1. Check worker log initialization timestamp +2. Verify `TracerVersion` in recent log entries matches expected version +3. If version is old, restart Function App and re-test + +#### 4. Use Context in Grep + +When searching for specific log messages, include context lines: + +```bash +# Show 2 lines before and after matches +grep -B 2 -A 2 "Azure Functions span creation" worker.log + +# Show 5 lines after to see span details +grep -A 5 "Span started" worker.log +``` + +#### 5. Track Multiple Executions Separately + +When testing repeatedly, save logs for each execution: + +```bash +# Execution 1 at 17:48 +grep "2025-10-10 17:48:" worker.log > exec1.log + +# Execution 2 at 17:53 +grep "2025-10-10 17:53:" worker.log > exec2.log + +# Compare behavior +diff exec1.log exec2.log +``` + +#### 6. Download Logs After Each Test + +After triggering a function for testing: + +```bash +# 1. Trigger the function +curl https://.azurewebsites.net/api/HttpTest + +# 2. Wait a few seconds for logs to be written +sleep 5 + +# 3. Download logs with timestamp in filename +az functionapp log download \ + --name \ + --resource-group \ + --log-path D:/temp/logs-$(date +%H%M%S).zip +``` + +This creates a snapshot of logs for each test execution. 
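+
+#### Optional: A Helper for Timestamp Filtering
+
+The practices above can be combined into a small shell helper. The sketch below is illustrative only: the `dd_logs_at` function name is ours, and the default directory simply mirrors the download location used throughout this guide.
+
+```bash
+# Hypothetical helper (not shipped with the repo): grep every tracer log file for a
+# timestamp prefix so host and worker activity can be compared in one pass.
+dd_logs_at() {
+  local ts="$1"                               # e.g. "2025-10-10 17:53:" or "2025-10-10 17:53:3[89]"
+  local dir="${2:-D:/temp/LogFiles/datadog}"  # assumed default; adjust to your download location
+  grep -H -- "$ts" "$dir"/dotnet-tracer-managed-*.log
+}
+
+# Usage: all span events from host and worker during the execution window
+dd_logs_at "2025-10-10 17:53:3[89]" | grep -E "Span (started|closed)"
+```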
+ +### Complete Investigation Workflow Example + +Here's an end-to-end workflow for investigating a span parenting issue: + +```bash +# === Step 1: Trigger and capture timestamp === +echo "Triggering at $(date -u +%Y-%m-%d\ %H:%M:%S) UTC" +curl https://lucasp-premium-linux-isolated.azurewebsites.net/api/HttpTest +# Output: Triggering at 2025-10-10 17:53:39 UTC + +sleep 5 # Wait for logs to be written + +# === Step 2: Download logs === +az functionapp log download \ + --name lucasp-premium-linux-isolated \ + --resource-group lucasp \ + --log-path D:/temp/logs-175339.zip + +unzip -q D:/temp/logs-175339.zip -d D:/temp/LogFiles + +# === Step 3: Find host trace ID === +grep "2025-10-10 17:53:39" \ + D:/temp/LogFiles/datadog/dotnet-tracer-managed-Microsoft.Azure.WebJobs.Script.WebHost-*.log \ + | grep "Span started" \ + | grep "azure_functions.invoke" + +# Output: t_id: 68e948220000000047fef7bad8bb854e + +# === Step 4: Identify active worker === +grep "2025-10-10 17:53:" \ + D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-*.log \ + | head -1 + +# Output: dotnet-56.log has activity + +# === Step 5: Verify worker tracer version === +grep "TracerVersion" \ + D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log \ + | tail -1 + +# Output: TracerVersion: "3.29.0.0" βœ“ Correct version + +# === Step 6: Check if worker spans are in same trace === +grep "68e948220000000047fef7bad8bb854e" \ + D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log + +# If no results, worker created separate trace - investigate further + +# === Step 7: Find all worker spans at this timestamp === +grep "2025-10-10 17:53:3[89]" \ + D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log \ + | grep "Span started" + +# Look for spans with p_id: null (root spans in separate traces) + +# === Step 8: Look for debug messages === +grep "2025-10-10 17:53:3[89]" \ + D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log \ + | grep "Azure Functions span creation" + +# Check ActiveScope and InternalActiveScope values + +# === Step 9: Examine all span relationships === +grep "2025-10-10 17:53:3[89]" \ + D:/temp/LogFiles/datadog/dotnet-tracer-managed-dotnet-56.log \ + | grep -E "Span (started|closed)" \ + | awk -F'[\\[\\]]' '{print $2}' # Extract s_id, p_id, t_id + +# Verify parent-child relationships form a proper tree +``` + +### Advanced: Parsing Span Relationships + +To visualize span relationships from logs: + +```bash +# Extract span info from host and worker logs +grep "2025-10-10 17:53:3[89]" host.log worker.log \ + | grep "Span started" \ + | sed -E 's/.*\[s_id: ([^,]+), p_id: ([^,]+), t_id: ([^\]]+)\].*/\3 | \1 | \2/' \ + | sort + +# Output format: trace_id | span_id | parent_id +# Use this to build a mental model of the trace tree +``` + +Example output: +``` +68e9482200... | abc123... | null ← Root (host) +68e9482200... | def456... | abc123... ← Child of root (host) +68e0012345... | xyz789... | null ← Separate root (worker) ❌ +``` + +This clearly shows the worker created a separate trace. 
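+
+### Detecting Orphaned Spans
+
+To go one step beyond eyeballing the extracted table, the sketch below flags spans whose parent was never captured. It reuses the `host.log`/`worker.log` names and timestamp window from the example above and is only a sketch, not repository tooling.
+
+```bash
+# List spans whose p_id is neither "null" nor the s_id of another captured span.
+# Such orphans usually mean the parent lives in another process or trace, or the
+# time window was too narrow.
+grep "2025-10-10 17:53:3[89]" host.log worker.log \
+  | grep "Span started" \
+  | sed -E 's/.*\[s_id: ([^,]+), p_id: ([^,]+), t_id: ([^]]+)\].*/\1 \2 \3/' \
+  | awk '{ sid[$1] = 1; pid[NR] = $2; row[NR] = $0 }
+         END { for (i = 1; i <= NR; i++)
+                 if (pid[i] != "null" && !(pid[i] in sid))
+                   print "orphan (parent not captured): " row[i] }'
+```
+
+A worker root span (`p_id: null`) sitting on its own trace ID is the classic symptom described in Scenario 1 above.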
+ ## Debugging To debug Azure Functions locally ensure that you have the following: diff --git a/docs/development/QueryingDatadogAPIs.md b/docs/development/QueryingDatadogAPIs.md new file mode 100644 index 000000000000..ce0dd159a65f --- /dev/null +++ b/docs/development/QueryingDatadogAPIs.md @@ -0,0 +1,421 @@ +# Querying Datadog APIs for Debugging + +When troubleshooting tracer issues, you can query spans and logs directly from the Datadog API to verify instrumentation behavior, span relationships, and context propagation. + +## Spans Search API + +### Search for spans with specific criteria + +```bash +curl -s -X POST https://api.datadoghq.com/api/v2/spans/events/search \ + -H "DD-API-KEY: " \ + -H "DD-APPLICATION-KEY: " \ + -H "Content-Type: application/json" \ + -d '{ + "data": { + "attributes": { + "filter": { + "query": "service: env:", + "from": "now-15m", + "to": "now" + }, + "sort": "-timestamp", + "page": { + "limit": 20 + } + }, + "type": "search_request" + } + }' | jq -r '.data[] | .attributes | "Name: \(.operation_name)\nResource: \(.resource_name)\nSpan ID: \(.span_id)\nParent ID: \(.parent_id)\nTrace ID: \(.trace_id)\n---"' +``` + +### Get all spans in a specific trace to verify parent-child relationships + +```bash +curl -s -X POST https://api.datadoghq.com/api/v2/spans/events/search \ + -H "DD-API-KEY: " \ + -H "DD-APPLICATION-KEY: " \ + -H "Content-Type: application/json" \ + -d '{ + "data": { + "attributes": { + "filter": { + "query": "trace_id:", + "from": "now-1h", + "to": "now" + }, + "sort": "-timestamp", + "page": { + "limit": 50 + } + }, + "type": "search_request" + } + }' | jq -r '.data[] | .attributes | "Name: \(.operation_name)\nResource: \(.resource_name)\nSpan ID: \(.span_id)\nParent ID: \(.parent_id)\n---"' +``` + +## Important Notes + +- The request body must be wrapped in `{"data": {"attributes": {...}, "type": "search_request"}}` structure +- API keys can be obtained from https://app.datadoghq.com/organization-settings/api-keys +- Full API documentation: https://docs.datadoghq.com/api/latest/spans/ +- **Shell variable substitution issue**: If using environment variables like `${DD_API_KEY}` fails with "Unauthorized" errors, try using the key values directly in the curl command instead of shell variable substitution. Some shells may have issues with variable expansion in curl headers. 
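+
+If a request fails with an authentication error, it can help to rule out the keys themselves before debugging query syntax. A minimal check (assuming `DD_API_KEY` is exported; this endpoint validates only the API key, not the application key):
+
+```bash
+# Quick sanity check of the API key before digging into request structure.
+curl -s -X GET "https://api.datadoghq.com/api/v1/validate" \
+  -H "DD-API-KEY: $DD_API_KEY" | jq
+# Expected: { "valid": true }
+```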
+ +## Common Use Cases + +- **Verifying span parenting**: Check that `parent_id` matches the expected `span_id` +- **Debugging distributed tracing context propagation**: Verify trace context flows correctly across process boundaries +- **Validating span tags and attributes**: Ensure custom tags and integration-specific attributes are set correctly +- **Investigating timing and duration issues**: Compare span durations and timestamps to identify bottlenecks + +## Query Syntax + +The `query` field supports the Datadog query syntax: + +- `service:` - Filter by service name +- `env:` - Filter by environment +- `resource_name:` - Filter by resource name (supports wildcards: `*HttpTest*`) +- `trace_id:` - Get all spans in a specific trace +- `operation_name:` - Filter by operation name +- Combine with boolean operators: `service:my-service AND env:prod` +- Exclude with `-`: `service:my-service -resource_name:*ping*` + +## Response Structure + +The Spans API returns data in the following structure: + +```json +{ + "data": [ + { + "id": "...", + "type": "spans", + "attributes": { + "operation_name": "azure_functions.invoke", + "resource_name": "GET /api/endpoint", + "span_id": "4208934728885019856", + "parent_id": "0", + "trace_id": "690507fc00000000b882bcd2bdac6b9e", + "start_timestamp": 1761937404333164200, + "end_timestamp": 1761937404533785600, + "status": "ok", + "service": "service-name", + "env": "environment", + "tags": ["tag1:value1", "tag2:value2"], + "custom": { + "duration": 200621200, + "aas": { + "function": { + "process": "host", + "name": "HttpTest" + } + } + } + } + } + ] +} +``` + +Key fields: +- `attributes.span_id` and `attributes.parent_id` - Direct span relationship fields (not in `custom`) +- `attributes.operation_name` and `attributes.resource_name` - At root of `attributes` +- `attributes.custom.*` - Custom tags and metadata (including nested objects like `aas.function.process`) +- `attributes.tags[]` - Array of tag strings in `key:value` format + +## Example: Debugging Span Parenting + +When investigating span parenting issues (e.g., verifying that child spans are correctly linked to parent spans): + +1. Trigger the operation you want to debug +2. Query for recent spans to get a trace ID: + ```bash + curl -s -X POST https://api.datadoghq.com/api/v2/spans/events/search \ + -H "DD-API-KEY: " \ + -H "DD-APPLICATION-KEY: " \ + -H "Content-Type: application/json" \ + -d '{ + "data": { + "attributes": { + "filter": { + "query": "service:my-service resource_name:*MyFunction*", + "from": "now-5m", + "to": "now" + }, + "sort": "-timestamp", + "page": {"limit": 5} + }, + "type": "search_request" + } + }' | jq -r '.data[0].attributes.trace_id' + ``` + +3. Get all spans in that trace: + ```bash + curl -s -X POST https://api.datadoghq.com/api/v2/spans/events/search \ + -H "DD-API-KEY: " \ + -H "DD-APPLICATION-KEY: " \ + -H "Content-Type: application/json" \ + -d '{ + "data": { + "attributes": { + "filter": { + "query": "trace_id:", + "from": "now-1h", + "to": "now" + }, + "sort": "-timestamp", + "page": {"limit": 50} + }, + "type": "search_request" + } + }' | jq -r '.data[] | .attributes | "Name: \(.operation_name)\nResource: \(.resource_name)\nSpan ID: \(.span_id)\nParent ID: \(.parent_id)\nDuration: \(.custom.duration // 0)\n---"' + ``` + +4. Verify the parent-child relationships by checking that each span's `parent_id` matches another span's `span_id` in the trace. 
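+
+As a follow-up to step 4, the check can be automated with `jq`. This is a sketch under the assumption that the raw response from step 3 (run without the final `jq` formatting) was saved to a file; `trace.json` is just a name chosen here:
+
+```bash
+# Flag spans whose parent_id is neither "0" (root) nor the span_id of another span
+# in the same response. No output means every parent is present in the trace.
+jq -r '[.data[].attributes.span_id] as $ids
+       | .data[].attributes
+       | select(.parent_id != "0")
+       | select([.parent_id] | inside($ids) | not)
+       | "orphan: \(.operation_name) (span_id=\(.span_id), parent_id=\(.parent_id))"' trace.json
+```
+
+Keep the page limit from step 3 in mind: a span whose parent fell outside the returned page will also show up here.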
+ +## PowerShell Helper Script + +The repository includes a PowerShell script that simplifies querying traces from the Datadog API. + +### Get-DatadogTrace.ps1 + +**Location**: `tracer/tools/Get-DatadogTrace.ps1` + +This script retrieves all spans for a given trace ID and displays them in various formats. + +**Prerequisites:** +- Set environment variables: `DD_API_KEY` and `DD_APPLICATION_KEY` +- API keys can be obtained from https://app.datadoghq.com/organization-settings/api-keys + +**Basic usage:** +```powershell +.\tracer\tools\Get-DatadogTrace.ps1 -TraceId "690507fc00000000b882bcd2bdac6b9e" +``` + +**Parameters:** +- `-TraceId` (required) - The 128-bit trace ID in hex format +- `-TimeRange` (optional) - How far back to search (default: "2h"). Examples: "15m", "1h", "2h", "1d" +- `-OutputFormat` (optional) - Output format: "table" (default), "json", or "hierarchy" + +**Output formats:** + +1. **Table** (default) - Formatted table with key span information: + ```powershell + .\tracer\tools\Get-DatadogTrace.ps1 -TraceId "690507fc00000000b882bcd2bdac6b9e" + ``` + Shows: Operation, Resource, Span ID, Parent ID, Process, Duration + +2. **Hierarchy** - Tree view showing parent-child relationships: + ```powershell + .\tracer\tools\Get-DatadogTrace.ps1 -TraceId "690507fc00000000b882bcd2bdac6b9e" -OutputFormat hierarchy + ``` + Useful for visualizing span nesting and verifying proper parent-child relationships + +3. **JSON** - Raw JSON output for further processing: + ```powershell + .\tracer\tools\Get-DatadogTrace.ps1 -TraceId "690507fc00000000b882bcd2bdac6b9e" -OutputFormat json + ``` + +**Example workflow:** + +1. Get a recent trace ID from your service: + ```bash + curl -s -X POST https://api.datadoghq.com/api/v2/spans/events/search \ + -H "DD-API-KEY: $DD_API_KEY" \ + -H "DD-APPLICATION-KEY: $DD_APPLICATION_KEY" \ + -H "Content-Type: application/json" \ + -d '{"data":{"attributes":{"filter":{"query":"service:my-service","from":"now-5m","to":"now"},"sort":"-timestamp","page":{"limit":1}},"type":"search_request"}}' \ + | jq -r '.data[0].attributes.trace_id' + ``` + +2. Analyze the full trace with the PowerShell script: + ```powershell + .\tracer\tools\Get-DatadogTrace.ps1 -TraceId "" -OutputFormat hierarchy + ``` + +**Note**: The script uses `Invoke-RestMethod` which properly handles authentication headers, avoiding the shell variable substitution issues that can occur with curl. + +## Logs Search API + +### Search for logs with specific criteria (curl) + +```bash +curl -s -X POST https://api.datadoghq.com/api/v2/logs/events/search \ + -H "DD-API-KEY: " \ + -H "DD-APPLICATION-KEY: " \ + -H "Content-Type: application/json" \ + -d '{ + "filter": { + "query": "service: env:", + "from": "now-15m", + "to": "now" + }, + "sort": "timestamp", + "page": { + "limit": 100 + } + }' | jq -r '.data[] | .attributes | "\(.timestamp) [\(.status)] \(.message)"' +``` + +**Note:** Unlike the Spans API, the Logs API does not require the `{"data": {"attributes": {...}, "type": "search_request"}}` wrapper structure. The request body is directly the filter/sort/page configuration. + +## PowerShell Helper Script - Logs + +The repository includes a PowerShell script that simplifies querying logs from the Datadog API. + +### Get-DatadogLogs.ps1 + +**Location**: `tracer/tools/Get-DatadogLogs.ps1` + +This script retrieves logs matching a query and displays them in various formats. 
+ +**Prerequisites:** +- Set environment variables: `DD_API_KEY` and `DD_APPLICATION_KEY` +- API keys can be obtained from https://app.datadoghq.com/organization-settings/api-keys + +**Basic usage:** +```powershell +.\tracer\tools\Get-DatadogLogs.ps1 -Query "service:my-service error" +``` + +**Parameters:** +- `-Query` (required) - Log query using Datadog query syntax (e.g., "service:my-service error") +- `-TimeRange` (optional) - How far back to search (default: "1h"). Examples: "15m", "1h", "2h", "1d" +- `-Limit` (optional) - Maximum number of log entries to return (default: 50, max: 1000) +- `-OutputFormat` (optional) - Output format: "table" (default), "json", or "raw" + +**Output formats:** + +1. **Table** (default) - Formatted table with key log information: + ```powershell + .\tracer\tools\Get-DatadogLogs.ps1 -Query "service:my-service error" + ``` + Shows: Timestamp, Status, Service, Host, Message (truncated to 100 chars) + +2. **JSON** - Raw JSON output for further processing: + ```powershell + .\tracer\tools\Get-DatadogLogs.ps1 -Query "service:my-service" -OutputFormat json + ``` + +3. **Raw** - Simple timestamp and message format: + ```powershell + .\tracer\tools\Get-DatadogLogs.ps1 -Query "service:my-service" -OutputFormat raw + ``` + +**Example workflows:** + +1. Search for recent errors in a service: + ```powershell + .\tracer\tools\Get-DatadogLogs.ps1 -Query "service:my-service error" -TimeRange "1h" -Limit 20 + ``` + +2. Find tracer debug logs from Azure Functions: + ```powershell + .\tracer\tools\Get-DatadogLogs.ps1 -Query "service:lucasp-premium-linux-isolated AspNetCoreDiagnosticObserver" -TimeRange "30m" + ``` + +3. Export logs as JSON for processing: + ```powershell + .\tracer\tools\Get-DatadogLogs.ps1 -Query "service:my-service DD-TRACE-DOTNET" -OutputFormat json | ConvertFrom-Json | ConvertTo-Json -Depth 10 + ``` + +**Note**: The script uses `Invoke-RestMethod` which properly handles authentication headers, avoiding the shell variable substitution issues that can occur with curl. 
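+
+### Correlating Logs with a Trace ID
+
+When you already have a trace ID (for example from `Get-DatadogTrace.ps1`), the same Logs API can pull the log events emitted during that trace. This is a sketch that assumes the tracer's log/trace correlation has injected a `dd.trace_id` attribute into the log events; the exact attribute name and ID format can vary with tracer configuration, so verify against a real log event if the query returns nothing.
+
+```bash
+# Sketch: fetch logs tagged with a specific trace ID (attribute name is an assumption).
+curl -s -X POST https://api.datadoghq.com/api/v2/logs/events/search \
+  -H "DD-API-KEY: $DD_API_KEY" \
+  -H "DD-APPLICATION-KEY: $DD_APPLICATION_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "filter": {
+      "query": "@dd.trace_id:690507fc00000000b882bcd2bdac6b9e",
+      "from": "now-1h",
+      "to": "now"
+    },
+    "sort": "timestamp",
+    "page": { "limit": 100 }
+  }' | jq -r '.data[] | .attributes | "\(.timestamp) [\(.status)] \(.message)"'
+```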
+ +### Search for tracer debug logs + +```bash +curl -s -X POST https://api.datadoghq.com/api/v2/logs/events/search \ + -H "DD-API-KEY: " \ + -H "DD-APPLICATION-KEY: " \ + -H "Content-Type: application/json" \ + -d '{ + "filter": { + "query": "env: DD-TRACE-DOTNET", + "from": "now-15m", + "to": "now" + }, + "sort": "timestamp", + "page": { + "limit": 100 + } + }' | jq -r '.data[] | .attributes | "\(.timestamp) - \(.message[:150])"' +``` + +### Search for logs with specific attributes + +You can filter by attributes using the `@` prefix: + +```bash +curl -s -X POST https://api.datadoghq.com/api/v2/logs/events/search \ + -H "DD-API-KEY: " \ + -H "DD-APPLICATION-KEY: " \ + -H "Content-Type: application/json" \ + -d '{ + "filter": { + "query": "env: @Category:Host.Function.Console", + "from": "now-15m", + "to": "now" + }, + "sort": "timestamp", + "page": { + "limit": 50 + } + }' | jq -r '.data[] | .attributes | "\(.timestamp) - \(.attributes.Category) - \(.message[:100])"' +``` + +### Use absolute timestamps for precise time ranges + +When debugging specific function invocations, use absolute timestamps: + +```bash +curl -s -X POST https://api.datadoghq.com/api/v2/logs/events/search \ + -H "DD-API-KEY: " \ + -H "DD-APPLICATION-KEY: " \ + -H "Content-Type: application/json" \ + -d '{ + "filter": { + "query": "env: my-search-term", + "from": "2025-10-08T21:01:40", + "to": "2025-10-08T21:02:10" + }, + "sort": "timestamp", + "page": { + "limit": 100 + } + }' | jq -r '.data[] | .attributes | "\(.timestamp) - \(.message)"' +``` + +## Azure Functions Logging Configuration + +### Enabling Direct Log Submission + +For Azure Functions, tracer logs can be sent directly to Datadog via direct log submission: + +**Worker Process Logs:** +- Set `DD_LOGS_DIRECT_SUBMISSION_INTEGRATIONS=ILogger` to enable direct log submission from the worker process +- Worker logs will appear with `service:` and may include `Category:Host.Function.Console` + +**Host Process Logs:** +- Set `DD_LOGS_DIRECT_SUBMISSION_AZURE_FUNCTIONS_HOST_ENABLED=true` to enable direct log submission from the Azure Functions host process +- Host process logs contain tracer diagnostics from the host (e.g., `GrpcMessageConversionExtensions.ToRpcHttp`) +- Requires tracer version 3.29.0 or later + +**Debug Logging:** +- Set `DD_TRACE_DEBUG=true` to enable debug-level logging from the tracer +- Debug logs will include `DD-TRACE-DOTNET` prefix in the message +- Both `Log.Debug()` and `Log.Information()` calls will appear in Datadog + +### Common Log Categories + +Azure Functions logs may be tagged with different categories: + +- `Host.Function.Console` - Console output from worker process functions +- `Microsoft.Azure.WebJobs.Script.WebHost.Middleware.*` - Host middleware logs +- `Function.` - Function-specific execution logs +- `Microsoft.AspNetCore.*` - ASP.NET Core framework logs (when using ASP.NET Core integration) + +## API Documentation + +- **Spans API**: https://docs.datadoghq.com/api/latest/spans/ +- **Logs API**: https://docs.datadoghq.com/api/latest/logs/ diff --git a/tracer/tools/Get-DatadogLogs.ps1 b/tracer/tools/Get-DatadogLogs.ps1 new file mode 100644 index 000000000000..776f2d888444 --- /dev/null +++ b/tracer/tools/Get-DatadogLogs.ps1 @@ -0,0 +1,149 @@ +<# +.SYNOPSIS +Retrieves logs from the Datadog Logs API. + +.DESCRIPTION +Queries the Datadog Logs API to retrieve logs matching a specific query. +Requires DD_API_KEY and DD_APPLICATION_KEY environment variables to be set. 
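+Note: the Logs API takes the filter/sort/page object directly, without the data/attributes wrapper used by the Spans API.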
+ +.PARAMETER Query +The log query to search for (e.g., "service:my-service error") + +.PARAMETER TimeRange +How far back to search for logs. Defaults to "1h" (1 hour). +Examples: "15m", "1h", "2h", "1d" + +.PARAMETER Limit +Maximum number of log entries to return (default: 50, max: 1000) + +.PARAMETER OutputFormat +Output format: "table" (default), "json", or "raw" + +.EXAMPLE +.\Get-DatadogLogs.ps1 -Query "service:lucasp-premium-linux-isolated AspNetCoreDiagnosticObserver" + +.EXAMPLE +.\Get-DatadogLogs.ps1 -Query "service:my-app error" -TimeRange "2h" -Limit 100 + +.EXAMPLE +.\Get-DatadogLogs.ps1 -Query "service:my-app" -OutputFormat json | ConvertFrom-Json | ConvertTo-Json -Depth 10 +#> + +[CmdletBinding()] +param( + [Parameter(Mandatory = $true, Position = 0)] + [string]$Query, + + [Parameter(Mandatory = $false)] + [string]$TimeRange = "1h", + + [Parameter(Mandatory = $false)] + [int]$Limit = 50, + + [Parameter(Mandatory = $false)] + [ValidateSet("table", "json", "raw")] + [string]$OutputFormat = "table" +) + +# Check for required environment variables +$apiKey = $env:DD_API_KEY +$appKey = $env:DD_APPLICATION_KEY + +if ([string]::IsNullOrEmpty($apiKey)) { + Write-Error "DD_API_KEY environment variable is not set. Please set it before running this script." + exit 1 +} + +if ([string]::IsNullOrEmpty($appKey)) { + Write-Error "DD_APPLICATION_KEY environment variable is not set. Please set it before running this script." + exit 1 +} + +# Build the request +$uri = "https://api.datadoghq.com/api/v2/logs/events/search" +$headers = @{ + "DD-API-KEY" = $apiKey + "DD-APPLICATION-KEY" = $appKey + "Content-Type" = "application/json" +} + +$body = @{ + filter = @{ + query = $Query + from = "now-$TimeRange" + to = "now" + } + sort = "timestamp" + page = @{ + limit = $Limit + } +} | ConvertTo-Json -Depth 10 + +Write-Verbose "Querying Datadog Logs API with query: $Query" +Write-Verbose "Time range: now-$TimeRange to now" +Write-Verbose "Limit: $Limit" + +try { + $response = Invoke-RestMethod -Uri $uri -Method Post -Headers $headers -Body $body -ErrorAction Stop + + if ($null -eq $response.data -or $response.data.Count -eq 0) { + Write-Warning "No logs found for query: $Query" + Write-Warning "Make sure the query is correct and logs exist within the time range (now-$TimeRange to now)" + exit 0 + } + + $logs = $response.data | ForEach-Object { + $attrs = $_.attributes + + [PSCustomObject]@{ + Timestamp = $attrs.timestamp + Status = $attrs.status + Service = $attrs.service + Message = $attrs.message + Host = $attrs.host + Tags = $attrs.tags -join ", " + } + } + + switch ($OutputFormat) { + "table" { + Write-Host "`nFound $($logs.Count) log entries:" -ForegroundColor Green + Write-Host "Query: $Query`n" -ForegroundColor Cyan + + $logs | Format-Table -Property @( + @{Label = "Timestamp"; Expression = { + $dt = [DateTime]::Parse($_.Timestamp) + $dt.ToString("yyyy-MM-dd HH:mm:ss.fff") + }; Width = 23 } + @{Label = "Status"; Expression = { $_.Status }; Width = 8 } + @{Label = "Service"; Expression = { $_.Service }; Width = 30 } + @{Label = "Host"; Expression = { $_.Host }; Width = 20 } + @{Label = "Message"; Expression = { + if ($_.Message.Length -gt 100) { + $_.Message.Substring(0, 97) + "..." 
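+                        # Truncate to 97 chars + "..." so the value fits the 100-character Message column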
+ } else { + $_.Message + } + }; Width = 100 } + ) -AutoSize -Wrap + } + "json" { + $logs | ConvertTo-Json -Depth 10 + } + "raw" { + $logs | ForEach-Object { + $dt = [DateTime]::Parse($_.Timestamp) + $ts = $dt.ToString("yyyy-MM-dd HH:mm:ss.fff") + Write-Host "$ts [$($_.Status)] $($_.Service) - $($_.Message)" + } + } + } + +} catch { + Write-Error "Failed to query Datadog Logs API: $_" + Write-Error $_.Exception.Message + if ($_.ErrorDetails.Message) { + Write-Error "API Response: $($_.ErrorDetails.Message)" + } + exit 1 +} diff --git a/tracer/tools/Get-DatadogTrace.ps1 b/tracer/tools/Get-DatadogTrace.ps1 new file mode 100644 index 000000000000..83e2e9ec2081 --- /dev/null +++ b/tracer/tools/Get-DatadogTrace.ps1 @@ -0,0 +1,189 @@ +<# +.SYNOPSIS +Retrieves all spans for a given trace ID from the Datadog API. + +.DESCRIPTION +Queries the Datadog Spans API to retrieve all spans belonging to a specific trace ID. +Requires DD_API_KEY and DD_APPLICATION_KEY environment variables to be set. + +.PARAMETER TraceId +The 128-bit trace ID to query (hex string, e.g., "690507fc00000000b882bcd2bdac6b9e") + +.PARAMETER TimeRange +How far back to search for the trace. Defaults to "2h" (2 hours). +Examples: "15m", "1h", "2h", "1d" + +.PARAMETER OutputFormat +Output format: "table" (default), "json", or "hierarchy" + +.EXAMPLE +.\Get-DatadogTrace.ps1 -TraceId "690507fc00000000b882bcd2bdac6b9e" + +.EXAMPLE +.\Get-DatadogTrace.ps1 -TraceId "690507fc00000000b882bcd2bdac6b9e" -OutputFormat hierarchy + +.EXAMPLE +.\Get-DatadogTrace.ps1 -TraceId "690507fc00000000b882bcd2bdac6b9e" -OutputFormat json | ConvertFrom-Json | ConvertTo-Json -Depth 10 +#> + +[CmdletBinding()] +param( + [Parameter(Mandatory = $true, Position = 0)] + [string]$TraceId, + + [Parameter(Mandatory = $false)] + [string]$TimeRange = "2h", + + [Parameter(Mandatory = $false)] + [ValidateSet("table", "json", "hierarchy")] + [string]$OutputFormat = "table" +) + +# Check for required environment variables +$apiKey = $env:DD_API_KEY +$appKey = $env:DD_APPLICATION_KEY + +if ([string]::IsNullOrEmpty($apiKey)) { + Write-Error "DD_API_KEY environment variable is not set. Please set it before running this script." + exit 1 +} + +if ([string]::IsNullOrEmpty($appKey)) { + Write-Error "DD_APPLICATION_KEY environment variable is not set. Please set it before running this script." 
+ exit 1 +} + +# Build the request +$uri = "https://api.datadoghq.com/api/v2/spans/events/search" +$headers = @{ + "DD-API-KEY" = $apiKey + "DD-APPLICATION-KEY" = $appKey + "Content-Type" = "application/json" +} + +$body = @{ + data = @{ + attributes = @{ + filter = @{ + query = "trace_id:$TraceId" + from = "now-$TimeRange" + to = "now" + } + sort = "start_timestamp" + page = @{ + limit = 100 + } + } + type = "search_request" + } +} | ConvertTo-Json -Depth 10 + +Write-Verbose "Querying Datadog API for trace ID: $TraceId" +Write-Verbose "Time range: now-$TimeRange to now" + +try { + $response = Invoke-RestMethod -Uri $uri -Method Post -Headers $headers -Body $body -ErrorAction Stop + + if ($null -eq $response.data -or $response.data.Count -eq 0) { + Write-Warning "No spans found for trace ID: $TraceId" + Write-Warning "Make sure the trace ID is correct and the trace is within the time range (now-$TimeRange to now)" + exit 0 + } + + $spans = $response.data | ForEach-Object { + $attrs = $_.attributes + + # Convert duration (nanoseconds to milliseconds) + $durationMs = if ($attrs.custom.duration) { + [math]::Round([double]$attrs.custom.duration / 1000000.0, 2) + } else { + 0 + } + + [PSCustomObject]@{ + SpanId = $attrs.span_id + ParentId = $attrs.parent_id + OperationName = $attrs.operation_name + ResourceName = $attrs.resource_name + Status = $attrs.status + Duration = $durationMs + Process = $attrs.custom.aas.function.process + Service = $attrs.service + } + } + + switch ($OutputFormat) { + "table" { + Write-Host "`nFound $($spans.Count) spans in trace:" -ForegroundColor Green + Write-Host "Trace ID: $TraceId`n" -ForegroundColor Cyan + + $spans | Format-Table -Property @( + @{Label = "Operation"; Expression = { $_.OperationName }; Width = 25 } + @{Label = "Resource"; Expression = { $_.ResourceName }; Width = 30 } + @{Label = "Span ID"; Expression = { $_.SpanId }; Width = 20 } + @{Label = "Parent ID"; Expression = { $_.ParentId }; Width = 20 } + @{Label = "Process"; Expression = { $_.Process }; Width = 8 } + @{Label = "Duration (ms)"; Expression = { $_.Duration }; Width = 12 } + ) -AutoSize + } + "json" { + $spans | ConvertTo-Json -Depth 10 + } + "hierarchy" { + Write-Host "`nFound $($spans.Count) spans in trace:" -ForegroundColor Green + Write-Host "Trace ID: $TraceId`n" -ForegroundColor Cyan + + # Build hierarchy + $spanMap = @{} + foreach ($span in $spans) { + $spanMap[$span.SpanId] = $span + } + + function Show-SpanHierarchy { + param( + [PSCustomObject]$Span, + [int]$Indent = 0, + [hashtable]$SpanMap, + [PSCustomObject[]]$AllSpans + ) + + $prefix = " " * $Indent + $arrow = if ($Indent -gt 0) { "└─ " } else { "" } + + $processTag = if ($Span.Process) { " [$($Span.Process)]" } else { "" } + Write-Host "$prefix$arrow$($Span.OperationName)$processTag" -ForegroundColor Cyan + Write-Host "$prefix Resource: $($Span.ResourceName)" -ForegroundColor Gray + Write-Host "$prefix Span ID: $($Span.SpanId), Parent: $($Span.ParentId)" -ForegroundColor DarkGray + Write-Host "$prefix Duration: $($Span.Duration) ms`n" -ForegroundColor DarkGray + + # Find children + $children = $AllSpans | Where-Object { $_.ParentId -eq $Span.SpanId } + foreach ($child in $children) { + Show-SpanHierarchy -Span $child -Indent ($Indent + 1) -SpanMap $SpanMap -AllSpans $AllSpans + } + } + + # Find root span(s) (parent_id = 0) + $rootSpans = $spans | Where-Object { $_.ParentId -eq "0" } + + if ($rootSpans.Count -eq 0) { + Write-Warning "No root span found (parent_id = 0). 
Showing all spans:" + $spans | ForEach-Object { + Show-SpanHierarchy -Span $_ -Indent 0 -SpanMap $spanMap -AllSpans $spans + } + } else { + foreach ($root in $rootSpans) { + Show-SpanHierarchy -Span $root -Indent 0 -SpanMap $spanMap -AllSpans $spans + } + } + } + } + +} catch { + Write-Error "Failed to query Datadog API: $_" + Write-Error $_.Exception.Message + if ($_.ErrorDetails.Message) { + Write-Error "API Response: $($_.ErrorDetails.Message)" + } + exit 1 +}