[triage] upstream#7380: LdSvmTrainer slow behaviour when loading from a remote database #40

@github-actions

Description

Upstream: dotnet#7380
Status: COMPLETE
Classification: bug-report
Confidence: 0.65
Reproduced: ⏭️ Skipped (requires remote database with large dataset)
Area: Core (Microsoft.ML.StandardTrainers / LdSvm)
Investigated at: 2026-03-07


Triage Summary

Category: Bug Report
Reasoning: The user reports that LdSvmTrainer generates sustained network/SQL traffic and is ~100x slower than LightGBM/FastForest, despite correctly using mlContext.Data.Cache(train) and setting Cache=true in LdSvmTrainer.Options. The trainer's Cache option (which defaults to true) is supposed to load all examples into memory before training, so ongoing SQL reads are unexpected. An ML.NET contributor has actively investigated across multiple comments but has not yet explained or resolved the discrepancy.

Summary: With hundreds of columns and millions of rows loaded from a remote PostgreSQL database, LdSvmTrainer runs ~100x slower and uses significantly less memory than LightGBM/FastForest on the same pipeline. The user correctly re-assigns the cached data (train = mlContext.Data.Cache(train)) and sets Cache=true in trainer options, yet the database connection shows ongoing high network traffic rather than a single initial read followed by in-memory training. This suggests the caching mechanism is not preventing repeated reads from the database loader.

Suggested Labels: bug, needs-info

Source Code Analysis

The LdSvmTrainer has two caching layers:

  1. User-level cache via mlContext.Data.Cache(data) — caches raw loaded data before transforms
  2. Trainer-level cache via the Cache=true option — implemented in the CachedData class in LdSvmTrainer.cs (src/Microsoft.ML.StandardTrainers/LdSvm/LdSvmTrainer.cs, lines 275-276 and 501-556)

The CachedData constructor reads all transformed examples into a LabelFeatures[] array in a single FloatLabelCursor pass. Once loaded, all 1000 iterations sample from the in-memory array, which should prevent repeated DB reads.
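
In outline, the trainer-level caching pattern looks like the following sketch. This is illustrative only: the real CachedData class is internal to Microsoft.ML.StandardTrainers, and the names and shapes below are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of the trainer-level caching pattern described above.
// The real CachedData class is internal; these names are hypothetical.
sealed class CachedExamples
{
    private readonly (float Label, float[] Features)[] _examples;

    public CachedExamples(IEnumerable<(float Label, float[] Features)> cursor)
    {
        // One pass over the cursor: every example is materialized in memory
        // before any training iteration runs.
        _examples = cursor.ToArray();
    }

    // Subsequent iterations sample only from the in-memory array, so no
    // further reads against the original loader should occur.
    public (float Label, float[] Features) Sample(Random rng)
        => _examples[rng.Next(_examples.Length)];
}
```

If the trainer really follows this pattern, the sustained SQL traffic the user observes would have to come from somewhere upstream of the cursor pass, which is what the root-cause candidates below explore.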

Possible root causes:

  • The per-column NormalizeMinMax in a loop (one transform appended per column) may cause N separate passes through data to fit the pipeline — but this should still use the cached source data, not the DB
  • A potential bug in how DatabaseLoader interacts with DataCache, where the cache is not being correctly hit for repeated enumerations
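
For context, the per-column normalization pattern referenced above likely resembles this reconstruction (column names and variable names are placeholders, not taken from the issue):

```csharp
using Microsoft.ML;

// Reconstructed sketch of the per-column normalization loop described above.
// Each appended NormalizeMinMax estimator needs its own pass over the data
// to fit min/max statistics, i.e. N fitting passes for N columns.
var mlContext = new MLContext();
string[] numericColumns = { /* hundreds of column names */ };

IEstimator<ITransformer> pipeline = null;
foreach (var column in numericColumns)
{
    var norm = mlContext.Transforms.NormalizeMinMax(column);
    if (pipeline == null)
        pipeline = norm;
    else
        pipeline = pipeline.Append(norm);
}
// Even with N fitting passes, each pass should hit the cached IDataView,
// not the database loader.
```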

Reproduction Notes

Reproduction requires:

  • A PostgreSQL (or similar) database with millions of rows and hundreds of columns
  • Network monitoring of the database connection during pipeline.Fit(); the issue manifests as sustained high SQL traffic rather than a single initial burst

The key reproduction steps from the user's code:

  1. Load via DatabaseLoader + NpgsqlFactory
  2. Cache with train = mlContext.Data.Cache(train)
  3. Build a pipeline with per-column MinMax normalization, type conversion, OneHot encoding, concatenation
  4. Set Cache=true and NumberOfIterations=1000 in LdSvmTrainer.Options
  5. Compare network traffic vs. LightGBM on same pipeline
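
Assembled from the steps above, the reported setup is roughly the following. The connection string, ModelInput type, and column names are placeholders, not details from the issue:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using Npgsql;

// Sketch of the reported repro, assembled from steps 1-5 above.
// ModelInput, connectionString, and featureColumns are placeholders.
var mlContext = new MLContext();

// Step 1: load via DatabaseLoader + NpgsqlFactory.
var loader = mlContext.Data.CreateDatabaseLoader<ModelInput>();
var dbSource = new DatabaseSource(NpgsqlFactory.Instance,
    connectionString, "SELECT * FROM training_data");
IDataView train = loader.Load(dbSource);

// Step 2: user-level cache; should yield one initial read from the DB.
train = mlContext.Data.Cache(train);

// Step 4: trainer-level cache (defaults to true) plus 1000 iterations.
var options = new LdSvmTrainer.Options
{
    NumberOfIterations = 1000,
    Cache = true
};

// Step 3 (abbreviated): transforms, then the trainer.
var pipeline = mlContext.Transforms.Concatenate("Features", featureColumns)
    .Append(mlContext.BinaryClassification.Trainers.LdSvm(options));

// Expected: a single burst of SQL traffic here, then in-memory training.
// Observed (per the issue): sustained high traffic throughout Fit().
var model = pipeline.Fit(train);
```

Step 5 then repeats the same pipeline with a LightGBM trainer in place of LdSvm and compares network traffic during Fit().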

Generated by Triage Single Issue

Metadata

Labels: ai-investigation (AI-investigated issue), bug (Something isn't working), triage (Triage tracking issue)
