|
| 1 | +--- |
| 2 | +number: 201 |
| 3 | +title: Accurate Lines of Code Calculation |
| 4 | +category: foundation |
| 5 | +priority: high |
| 6 | +status: draft |
| 7 | +dependencies: [] |
| 8 | +created: 2025-12-11 |
| 9 | +--- |
| 10 | + |
| 11 | +# Specification 201: Accurate Lines of Code Calculation |
| 12 | + |
| 13 | +**Category**: foundation |
| 14 | +**Priority**: high |
| 15 | +**Status**: draft |
| 16 | +**Dependencies**: None |
| 17 | + |
| 18 | +## Context |
| 19 | + |
| 20 | +Debtmap's Lines of Code (LOC) calculation has several issues that affect the accuracy of debt density metrics: |
| 21 | + |
| 22 | +### Issue 1: Total LOC Only Includes Files With Debt Items |
| 23 | + |
| 24 | +**Current Behavior** (`src/priority/mod.rs:886-937`): |
| 25 | +```rust |
| 26 | +for item in &self.items { |
| 27 | + if let Some(line_count) = item.file_line_count { |
| 28 | + unique_files.entry(item.location.file.clone()).or_insert(line_count); |
| 29 | + } |
| 30 | +} |
| 31 | +let total_lines_of_code: usize = unique_files.values().sum(); |
| 32 | +``` |
| 33 | + |
| 34 | +**Problem**: Only files that have at least one identified debt item are included in the total LOC count. Files without issues are excluded, significantly understating codebase size. |
| 35 | + |
| 36 | +**Evidence**: Self-analysis shows 153K reported LOC vs 247K actual physical lines (~62% of codebase missing from count). |
| 37 | + |
| 38 | +**Impact**: Debt density is artificially inflated because the denominator (total LOC) excludes clean files. |
| 39 | + |
| 40 | +### Issue 2: Multi-line Comment Detection is Incomplete |
| 41 | + |
| 42 | +**Current Behavior** (`src/metrics/loc_counter.rs:235-253`): |
| 43 | +```rust |
| 44 | +fn is_comment_line(trimmed_line: &str) -> bool { |
| 45 | + trimmed_line.starts_with("//") |
| 46 | + || trimmed_line.starts_with("/*") |
| 47 | + || (trimmed_line.starts_with('*') && !trimmed_line.starts_with("*/")) |
| 48 | + || trimmed_line.starts_with('#') // Python |
| 49 | +} |
| 50 | +``` |
| 51 | + |
| 52 | +**Problem**: No state tracking for multi-line `/* ... */` blocks. Lines in the middle of block comments that don't start with `*` are miscounted as code. |
| 53 | + |
| 54 | +**Example miscounted as code**: |
| 55 | +```rust |
| 56 | +/* |
| 57 | +This line starts with text, not * |
| 58 | +so it's incorrectly counted as code |
| 59 | +*/ |
| 60 | +``` |
| 61 | + |
| 62 | +### Issue 3: Rust Attributes Miscounted as Python Comments |
| 63 | + |
| 64 | +**Current Behavior**: `#[derive(...)]` and other Rust attributes match the Python comment pattern (`trimmed_line.starts_with('#')`). |
| 65 | + |
| 66 | +**Problem**: Rust attributes should be counted as code, not comments. |
| 67 | + |
| 68 | +### Issue 4: Function Length Includes All Physical Lines |
| 69 | + |
| 70 | +**Current Behavior** (`src/analyzers/rust_complexity_calculation.rs:181-193`): |
| 71 | +```rust |
| 72 | +pub fn count_function_lines(item_fn: &syn::ItemFn) -> usize { |
| 73 | + let span = item_fn.span(); |
| 74 | + end_line - start_line + 1 |
| 75 | +} |
| 76 | +``` |
| 77 | + |
| 78 | +**Problem**: This counts all physical lines including blanks and comments, which is inconsistent with common SLOC (Source Lines of Code) definitions. Not necessarily wrong, but should be documented. |
| 79 | + |
| 80 | +## Objective |
| 81 | + |
| 82 | +Fix LOC calculation to: |
| 83 | + |
| 84 | +1. **Include all analyzed files in total LOC** - not just files with debt items |
| 85 | +2. **Correctly handle multi-line comments** - track comment block state |
| 86 | +3. **Distinguish Rust attributes from Python comments** - language-aware comment detection |
| 87 | +4. **Document the LOC methodology** - clarify what "LOC" means in debtmap context |
| 88 | + |
| 89 | +## Requirements |
| 90 | + |
| 91 | +### Functional Requirements |
| 92 | + |
| 93 | +1. **FR-1: Count All Analyzed Files** |
| 94 | + - Track all files discovered during analysis, not just those with debt items |
| 95 | + - Pass the complete file list to LOC calculation |
| 96 | + - Use the existing `LocCounter` for consistent counting |
| 97 | + |
| 98 | +2. **FR-2: Track Multi-line Comment State** |
| 99 | + - Implement state machine for block comment tracking |
| 100 | + - Handle `/* ... */` blocks that span multiple lines |
| 101 | + - Handle nested block comments if applicable (Rust supports `/* /* */ */`) |
| 102 | + - Correctly identify lines within comment blocks as comments |
| 103 | + |
| 104 | +3. **FR-3: Language-Aware Comment Detection** |
| 105 | + - Accept file extension or language hint in `count_content` |
| 106 | + - For Rust files: Don't treat `#[...]` as comments |
| 107 | + - For Python files: Continue treating `#` as comments |
| 108 | + - For JS/TS files: Handle `//` and `/* */` appropriately |
| 109 | + |
| 110 | +4. **FR-4: Maintain Backward Compatibility** |
| 111 | + - `physical_lines` should still represent raw line count |
| 112 | + - `code_lines` should be accurate SLOC (excluding comments and blanks) |
| 113 | + - Existing APIs should not break |
| 114 | + |
| 115 | +### Non-Functional Requirements |
| 116 | + |
| 117 | +1. **NFR-1: Performance** |
| 118 | + - File counting should remain O(n) where n is file size |
| 119 | + - No additional file I/O beyond what's already done |
| 120 | + - Cache file line counts during analysis phase |
| 121 | + |
| 122 | +2. **NFR-2: Accuracy** |
| 123 | + - Total LOC should match `find . -name "*.rs" | xargs wc -l` for physical lines |
| 124 | + - Comment detection should be >95% accurate for standard code patterns |
| 125 | + |
| 126 | +3. **NFR-3: Testability** |
| 127 | + - All comment detection logic must have unit tests |
| 128 | + - Multi-line comment edge cases must be covered |
| 129 | + |
| 130 | +## Acceptance Criteria |
| 131 | + |
| 132 | +- [ ] Running `debtmap analyze src/` reports total LOC that matches `wc -l` output (±5% for physical lines) |
| 133 | +- [ ] Multi-line block comments are correctly detected as comments, not code |
| 134 | +- [ ] Rust attributes `#[derive(...)]` are counted as code, not comments |
| 135 | +- [ ] Files with no debt items still contribute to total LOC count |
| 136 | +- [ ] `LocCount.code_lines + comment_lines + blank_lines == physical_lines` invariant holds |
| 137 | +- [ ] All existing tests continue to pass |
| 138 | +- [ ] New unit tests cover multi-line comment detection |
| 139 | +- [ ] Debt density calculation uses accurate total LOC denominator |
| 140 | + |
| 141 | +## Technical Details |
| 142 | + |
| 143 | +### Implementation Approach |
| 144 | + |
| 145 | +#### Phase 1: Fix Total LOC Calculation |
| 146 | + |
| 147 | +1. Modify analysis pipeline to collect all discovered file paths |
| 148 | +2. Pass complete file list to `calculate_total_impact` |
| 149 | +3. Count LOC for all files, not just those in `self.items` |
| 150 | + |
| 151 | +**Location**: `src/priority/mod.rs` - add `analyzed_files: Vec<PathBuf>` field |
| 152 | + |
| 153 | +```rust |
| 154 | +// In calculate_total_impact: |
| 155 | +let mut unique_files: HashMap<PathBuf, usize> = HashMap::new(); |
| 156 | + |
| 157 | +// First, count all analyzed files |
| 158 | +for file_path in &self.analyzed_files { |
| 159 | + if let Ok(count) = loc_counter.count_file(file_path) { |
| 160 | + unique_files.insert(file_path.clone(), count.physical_lines); |
| 161 | + } |
| 162 | +} |
| 163 | + |
| 164 | +// Items can still override with cached counts (for consistency) |
| 165 | +for item in &self.items { |
| 166 | + if let Some(line_count) = item.file_line_count { |
| 167 | + unique_files.insert(item.location.file.clone(), line_count); |
| 168 | + } |
| 169 | +} |
| 170 | +``` |
| 171 | + |
| 172 | +#### Phase 2: Multi-line Comment State Tracking |
| 173 | + |
| 174 | +Modify `LocCounter::count_content` to track block comment state: |
| 175 | + |
| 176 | +```rust |
| 177 | +pub fn count_content(&self, content: &str, language: Option<Language>) -> LocCount { |
| 178 | + let mut in_block_comment = false; |
| 179 | + let mut block_comment_depth = 0; // For nested comments (Rust) |
| 180 | + |
| 181 | + for line in content.lines() { |
| 182 | + let trimmed = line.trim(); |
| 183 | + |
| 184 | + // Process multi-line comment boundaries |
| 185 | + let (is_comment, new_in_block, new_depth) = |
| 186 | + classify_line(trimmed, in_block_comment, block_comment_depth, language); |
| 187 | + |
| 188 | + in_block_comment = new_in_block; |
| 189 | + block_comment_depth = new_depth; |
| 190 | + |
| 191 | + // Categorize line |
| 192 | + if trimmed.is_empty() { |
| 193 | + blank_lines += 1; |
| 194 | + } else if is_comment { |
| 195 | + comment_lines += 1; |
| 196 | + } else { |
| 197 | + code_lines += 1; |
| 198 | + } |
| 199 | + } |
| 200 | +} |
| 201 | +``` |
| 202 | + |
| 203 | +#### Phase 3: Language-Aware Comment Detection |
| 204 | + |
| 205 | +Add language parameter to counting functions: |
| 206 | + |
| 207 | +```rust |
| 208 | +#[derive(Clone, Copy, Debug)] |
| 209 | +pub enum Language { |
| 210 | + Rust, |
| 211 | + Python, |
| 212 | + JavaScript, |
| 213 | + TypeScript, |
| 214 | + Unknown, |
| 215 | +} |
| 216 | + |
| 217 | +fn classify_line( |
| 218 | + trimmed: &str, |
| 219 | + in_block: bool, |
| 220 | + depth: usize, |
| 221 | + language: Option<Language> |
| 222 | +) -> (bool, bool, usize) { |
| 223 | + // ... language-specific logic |
| 224 | +} |
| 225 | +``` |
| 226 | + |
| 227 | +For Rust specifically: |
| 228 | +- `//` starts single-line comment |
| 229 | +- `/*` starts block comment (nestable) |
| 230 | +- `*/` ends block comment |
| 231 | +- `#[...]` is an attribute (code, not comment) |
| 232 | +- `#![...]` is an inner attribute (code, not comment) |
| 233 | + |
| 234 | +### Architecture Changes |
| 235 | + |
| 236 | +1. Add `analyzed_files: Vec<PathBuf>` to `UnifiedAnalysis` struct |
| 237 | +2. Populate `analyzed_files` during file discovery phase |
| 238 | +3. Add `Language` parameter to `LocCounter::count_content` |
| 239 | +4. Add state tracking fields for block comment detection |
| 240 | + |
| 241 | +### Data Structures |
| 242 | + |
| 243 | +```rust |
| 244 | +// Extended LocCountingConfig |
| 245 | +pub struct LocCountingConfig { |
| 246 | + pub include_tests: bool, |
| 247 | + pub include_generated: bool, |
| 248 | + pub count_comments: bool, |
| 249 | + pub count_blanks: bool, |
| 250 | + pub exclude_patterns: Vec<String>, |
| 251 | + pub language_detection: bool, // NEW: auto-detect from extension |
| 252 | +} |
| 253 | + |
| 254 | +// Comment tracking state (internal) |
| 255 | +struct CommentState { |
| 256 | + in_block_comment: bool, |
| 257 | + block_depth: usize, // For Rust nested comments |
| 258 | +} |
| 259 | +``` |
| 260 | + |
| 261 | +### APIs and Interfaces |
| 262 | + |
| 263 | +```rust |
| 264 | +// Extended method signature |
| 265 | +impl LocCounter { |
| 266 | + /// Count lines with optional language hint |
| 267 | + pub fn count_content_with_language( |
| 268 | + &self, |
| 269 | + content: &str, |
| 270 | + language: Option<Language> |
| 271 | + ) -> LocCount; |
| 272 | + |
| 273 | + /// Count file with auto-detected language |
| 274 | + pub fn count_file(&self, path: &Path) -> Result<LocCount, io::Error> { |
| 275 | + let language = Language::from_extension(path.extension()); |
| 276 | + let content = fs::read_to_string(path)?; |
| 277 | + Ok(self.count_content_with_language(&content, Some(language))) |
| 278 | + } |
| 279 | +} |
| 280 | +``` |
| 281 | + |
| 282 | +## Dependencies |
| 283 | + |
| 284 | +- **Prerequisites**: None |
| 285 | +- **Affected Components**: |
| 286 | + - `src/metrics/loc_counter.rs` - Core counting logic |
| 287 | + - `src/priority/mod.rs` - Total LOC aggregation |
| 288 | + - `src/priority/scoring/construction.rs` - File line count caching |
| 289 | + - `src/analyzers/rust.rs` - File discovery |
| 290 | +- **External Dependencies**: None |
| 291 | + |
| 292 | +## Testing Strategy |
| 293 | + |
| 294 | +### Unit Tests |
| 295 | + |
| 296 | +1. **Multi-line Comment Tests** |
| 297 | + ```rust |
| 298 | + #[test] |
| 299 | + fn test_multiline_block_comment() { |
| 300 | + let code = "/* comment\nstill comment\nend */\ncode"; |
| 301 | + let count = counter.count_content_with_language(code, Some(Language::Rust)); |
| 302 | + assert_eq!(count.comment_lines, 3); |
| 303 | + assert_eq!(count.code_lines, 1); |
| 304 | + } |
| 305 | + |
| 306 | + #[test] |
| 307 | + fn test_nested_block_comments_rust() { |
| 308 | + let code = "/* outer /* inner */ still outer */\ncode"; |
| 309 | + // ... |
| 310 | + } |
| 311 | + ``` |
| 312 | + |
| 313 | +2. **Rust Attribute Tests** |
| 314 | + ```rust |
| 315 | + #[test] |
| 316 | + fn test_rust_attributes_are_code() { |
| 317 | + let code = "#[derive(Debug)]\nstruct Foo;"; |
| 318 | + let count = counter.count_content_with_language(code, Some(Language::Rust)); |
| 319 | + assert_eq!(count.code_lines, 2); |
| 320 | + assert_eq!(count.comment_lines, 0); |
| 321 | + } |
| 322 | + ``` |
| 323 | + |
| 324 | +3. **Total LOC Invariant Tests** |
| 325 | + ```rust |
| 326 | + #[test] |
| 327 | + fn test_loc_invariant() { |
| 328 | + let count = counter.count_content(code); |
| 329 | + assert_eq!( |
| 330 | + count.physical_lines, |
| 331 | + count.code_lines + count.comment_lines + count.blank_lines |
| 332 | + ); |
| 333 | + } |
| 334 | + ``` |
| 335 | + |
| 336 | +### Integration Tests |
| 337 | + |
| 338 | +1. Compare `debtmap analyze .` total LOC with `wc -l` output |
| 339 | +2. Verify debt density changes appropriately with fix |
| 340 | +3. Test on codebases with varying comment styles |
| 341 | + |
| 342 | +### Performance Tests |
| 343 | + |
| 344 | +1. Ensure no regression in analysis time |
| 345 | +2. Profile LOC counting on large files |
| 346 | + |
| 347 | +## Documentation Requirements |
| 348 | + |
| 349 | +- **Code Documentation**: Document the LOC counting methodology in `loc_counter.rs` |
| 350 | +- **User Documentation**: Update any user-facing docs that mention LOC |
| 351 | +- **Architecture Updates**: None required |
| 352 | + |
| 353 | +## Implementation Notes |
| 354 | + |
| 355 | +1. **Backward Compatibility**: The `count_content` method without language parameter should remain functional, defaulting to conservative behavior. |
| 356 | + |
| 357 | +2. **Edge Cases to Handle**: |
| 358 | + - String literals containing comment markers: `"/* not a comment */"` |
| 359 | + - Raw strings in Rust: `r#"/* also not a comment */"#` |
| 360 | + - Heredocs in shell scripts embedded in code |
| 361 | + - Doc comments (`///` and `//!` in Rust) - these ARE comments |
| 362 | + |
| 363 | +3. **String Literal Detection**: Full string literal detection requires parsing. For a simpler solution, accept some inaccuracy for code containing comment markers inside strings. Document this limitation. |
| 364 | + |
| 365 | +4. **Performance**: Reading files for LOC that have already been parsed is wasteful. Consider extracting line counts from the AST during parsing phase when available. |
| 366 | + |
| 367 | +## Migration and Compatibility |
| 368 | + |
| 369 | +- **Breaking Changes**: None - all changes are additive or fix bugs |
| 370 | +- **Migration**: No migration required |
| 371 | +- **Compatibility**: Existing JSON output format unchanged, but `total_loc` values will increase (more accurate) |
| 372 | + |
| 373 | +## Success Metrics |
| 374 | + |
| 375 | +1. **Accuracy**: Total LOC matches `wc -l` within 5% |
| 376 | +2. **Consistency**: `code_lines + comment_lines + blank_lines == physical_lines` always holds |
| 377 | +3. **No Regression**: All existing tests pass |
| 378 | +4. **Meaningful Density**: Debt density reflects actual codebase size |
0 commit comments