Skip to content

Commit 903f1fd

Browse files
committed
add: spec 201 - accurate LOC calculation
Addresses several issues with Lines of Code calculation: 1. Total LOC only counts files with debt items (missing ~38% of codebase) 2. Multi-line block comments not tracked with state machine 3. Rust #[attributes] miscounted as Python comments 4. Function length methodology undocumented Fixes will ensure debt density uses accurate denominator.
1 parent 1d26e22 commit 903f1fd

File tree

1 file changed

+378
-0
lines changed

1 file changed

+378
-0
lines changed
Lines changed: 378 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,378 @@
1+
---
2+
number: 201
3+
title: Accurate Lines of Code Calculation
4+
category: foundation
5+
priority: high
6+
status: draft
7+
dependencies: []
8+
created: 2025-12-11
9+
---
10+
11+
# Specification 201: Accurate Lines of Code Calculation
12+
13+
**Category**: foundation
14+
**Priority**: high
15+
**Status**: draft
16+
**Dependencies**: None
17+
18+
## Context
19+
20+
Debtmap's Lines of Code (LOC) calculation has several issues that affect the accuracy of debt density metrics:
21+
22+
### Issue 1: Total LOC Only Includes Files With Debt Items
23+
24+
**Current Behavior** (`src/priority/mod.rs:886-937`):
25+
```rust
26+
for item in &self.items {
27+
if let Some(line_count) = item.file_line_count {
28+
unique_files.entry(item.location.file.clone()).or_insert(line_count);
29+
}
30+
}
31+
let total_lines_of_code: usize = unique_files.values().sum();
32+
```
33+
34+
**Problem**: Only files that have at least one identified debt item are included in the total LOC count. Files without issues are excluded, significantly understating codebase size.
35+
36+
**Evidence**: Self-analysis shows 153K reported LOC vs 247K actual physical lines (~62% of codebase missing from count).
37+
38+
**Impact**: Debt density is artificially inflated because the denominator (total LOC) excludes clean files.
39+
40+
### Issue 2: Multi-line Comment Detection is Incomplete
41+
42+
**Current Behavior** (`src/metrics/loc_counter.rs:235-253`):
43+
```rust
44+
fn is_comment_line(trimmed_line: &str) -> bool {
45+
trimmed_line.starts_with("//")
46+
|| trimmed_line.starts_with("/*")
47+
|| (trimmed_line.starts_with('*') && !trimmed_line.starts_with("*/"))
48+
|| trimmed_line.starts_with('#') // Python
49+
}
50+
```
51+
52+
**Problem**: No state tracking for multi-line `/* ... */` blocks. Lines in the middle of block comments that don't start with `*` are miscounted as code.
53+
54+
**Example miscounted as code**:
55+
```rust
56+
/*
57+
This line starts with text, not *
58+
so it's incorrectly counted as code
59+
*/
60+
```
61+
62+
### Issue 3: Rust Attributes Miscounted as Python Comments
63+
64+
**Current Behavior**: `#[derive(...)]` and other Rust attributes match the Python comment pattern (`trimmed_line.starts_with('#')`).
65+
66+
**Problem**: Rust attributes should be counted as code, not comments.
67+
68+
### Issue 4: Function Length Includes All Physical Lines
69+
70+
**Current Behavior** (`src/analyzers/rust_complexity_calculation.rs:181-193`):
71+
```rust
72+
pub fn count_function_lines(item_fn: &syn::ItemFn) -> usize {
73+
let span = item_fn.span();
74+
end_line - start_line + 1
75+
}
76+
```
77+
78+
**Problem**: This counts all physical lines including blanks and comments, which is inconsistent with common SLOC (Source Lines of Code) definitions. Not necessarily wrong, but should be documented.
79+
80+
## Objective
81+
82+
Fix LOC calculation to:
83+
84+
1. **Include all analyzed files in total LOC** - not just files with debt items
85+
2. **Correctly handle multi-line comments** - track comment block state
86+
3. **Distinguish Rust attributes from Python comments** - language-aware comment detection
87+
4. **Document the LOC methodology** - clarify what "LOC" means in debtmap context
88+
89+
## Requirements
90+
91+
### Functional Requirements
92+
93+
1. **FR-1: Count All Analyzed Files**
94+
- Track all files discovered during analysis, not just those with debt items
95+
- Pass the complete file list to LOC calculation
96+
- Use the existing `LocCounter` for consistent counting
97+
98+
2. **FR-2: Track Multi-line Comment State**
99+
- Implement state machine for block comment tracking
100+
- Handle `/* ... */` blocks that span multiple lines
101+
- Handle nested block comments if applicable (Rust supports `/* /* */ */`)
102+
- Correctly identify lines within comment blocks as comments
103+
104+
3. **FR-3: Language-Aware Comment Detection**
105+
- Accept file extension or language hint in `count_content`
106+
- For Rust files: Don't treat `#[...]` as comments
107+
- For Python files: Continue treating `#` as comments
108+
- For JS/TS files: Handle `//` and `/* */` appropriately
109+
110+
4. **FR-4: Maintain Backward Compatibility**
111+
- `physical_lines` should still represent raw line count
112+
- `code_lines` should be accurate SLOC (excluding comments and blanks)
113+
- Existing APIs should not break
114+
115+
### Non-Functional Requirements
116+
117+
1. **NFR-1: Performance**
118+
- File counting should remain O(n) where n is file size
119+
- No additional file I/O beyond what's already done
120+
- Cache file line counts during analysis phase
121+
122+
2. **NFR-2: Accuracy**
123+
- Total LOC should match `find . -name "*.rs" | xargs wc -l` for physical lines
124+
- Comment detection should be >95% accurate for standard code patterns
125+
126+
3. **NFR-3: Testability**
127+
- All comment detection logic must have unit tests
128+
- Multi-line comment edge cases must be covered
129+
130+
## Acceptance Criteria
131+
132+
- [ ] Running `debtmap analyze src/` reports total LOC that matches `wc -l` output (±5% for physical lines)
133+
- [ ] Multi-line block comments are correctly detected as comments, not code
134+
- [ ] Rust attributes `#[derive(...)]` are counted as code, not comments
135+
- [ ] Files with no debt items still contribute to total LOC count
136+
- [ ] `LocCount.code_lines + comment_lines + blank_lines == physical_lines` invariant holds
137+
- [ ] All existing tests continue to pass
138+
- [ ] New unit tests cover multi-line comment detection
139+
- [ ] Debt density calculation uses accurate total LOC denominator
140+
141+
## Technical Details
142+
143+
### Implementation Approach
144+
145+
#### Phase 1: Fix Total LOC Calculation
146+
147+
1. Modify analysis pipeline to collect all discovered file paths
148+
2. Pass complete file list to `calculate_total_impact`
149+
3. Count LOC for all files, not just those in `self.items`
150+
151+
**Location**: `src/priority/mod.rs` - add `analyzed_files: Vec<PathBuf>` field
152+
153+
```rust
154+
// In calculate_total_impact:
155+
let mut unique_files: HashMap<PathBuf, usize> = HashMap::new();
156+
157+
// First, count all analyzed files
158+
for file_path in &self.analyzed_files {
159+
if let Ok(count) = loc_counter.count_file(file_path) {
160+
unique_files.insert(file_path.clone(), count.physical_lines);
161+
}
162+
}
163+
164+
// Items can still override with cached counts (for consistency)
165+
for item in &self.items {
166+
if let Some(line_count) = item.file_line_count {
167+
unique_files.insert(item.location.file.clone(), line_count);
168+
}
169+
}
170+
```
171+
172+
#### Phase 2: Multi-line Comment State Tracking
173+
174+
Modify `LocCounter::count_content` to track block comment state:
175+
176+
```rust
177+
pub fn count_content(&self, content: &str, language: Option<Language>) -> LocCount {
178+
let mut in_block_comment = false;
179+
let mut block_comment_depth = 0; // For nested comments (Rust)
180+
181+
for line in content.lines() {
182+
let trimmed = line.trim();
183+
184+
// Process multi-line comment boundaries
185+
let (is_comment, new_in_block, new_depth) =
186+
classify_line(trimmed, in_block_comment, block_comment_depth, language);
187+
188+
in_block_comment = new_in_block;
189+
block_comment_depth = new_depth;
190+
191+
// Categorize line
192+
if trimmed.is_empty() {
193+
blank_lines += 1;
194+
} else if is_comment {
195+
comment_lines += 1;
196+
} else {
197+
code_lines += 1;
198+
}
199+
}
200+
}
201+
```
202+
203+
#### Phase 3: Language-Aware Comment Detection
204+
205+
Add language parameter to counting functions:
206+
207+
```rust
208+
#[derive(Clone, Copy, Debug)]
209+
pub enum Language {
210+
Rust,
211+
Python,
212+
JavaScript,
213+
TypeScript,
214+
Unknown,
215+
}
216+
217+
fn classify_line(
218+
trimmed: &str,
219+
in_block: bool,
220+
depth: usize,
221+
language: Option<Language>
222+
) -> (bool, bool, usize) {
223+
// ... language-specific logic
224+
}
225+
```
226+
227+
For Rust specifically:
228+
- `//` starts single-line comment
229+
- `/*` starts block comment (nestable)
230+
- `*/` ends block comment
231+
- `#[...]` is an attribute (code, not comment)
232+
- `#![...]` is an inner attribute (code, not comment)
233+
234+
### Architecture Changes
235+
236+
1. Add `analyzed_files: Vec<PathBuf>` to `UnifiedAnalysis` struct
237+
2. Populate `analyzed_files` during file discovery phase
238+
3. Add `Language` parameter to `LocCounter::count_content`
239+
4. Add state tracking fields for block comment detection
240+
241+
### Data Structures
242+
243+
```rust
244+
// Extended LocCountingConfig
245+
pub struct LocCountingConfig {
246+
pub include_tests: bool,
247+
pub include_generated: bool,
248+
pub count_comments: bool,
249+
pub count_blanks: bool,
250+
pub exclude_patterns: Vec<String>,
251+
pub language_detection: bool, // NEW: auto-detect from extension
252+
}
253+
254+
// Comment tracking state (internal)
255+
struct CommentState {
256+
in_block_comment: bool,
257+
block_depth: usize, // For Rust nested comments
258+
}
259+
```
260+
261+
### APIs and Interfaces
262+
263+
```rust
264+
// Extended method signature
265+
impl LocCounter {
266+
/// Count lines with optional language hint
267+
pub fn count_content_with_language(
268+
&self,
269+
content: &str,
270+
language: Option<Language>
271+
) -> LocCount;
272+
273+
/// Count file with auto-detected language
274+
pub fn count_file(&self, path: &Path) -> Result<LocCount, io::Error> {
275+
let language = Language::from_extension(path.extension());
276+
let content = fs::read_to_string(path)?;
277+
Ok(self.count_content_with_language(&content, Some(language)))
278+
}
279+
}
280+
```
281+
282+
## Dependencies
283+
284+
- **Prerequisites**: None
285+
- **Affected Components**:
286+
- `src/metrics/loc_counter.rs` - Core counting logic
287+
- `src/priority/mod.rs` - Total LOC aggregation
288+
- `src/priority/scoring/construction.rs` - File line count caching
289+
- `src/analyzers/rust.rs` - File discovery
290+
- **External Dependencies**: None
291+
292+
## Testing Strategy
293+
294+
### Unit Tests
295+
296+
1. **Multi-line Comment Tests**
297+
```rust
298+
#[test]
299+
fn test_multiline_block_comment() {
300+
let code = "/* comment\nstill comment\nend */\ncode";
301+
let count = counter.count_content_with_language(code, Some(Language::Rust));
302+
assert_eq!(count.comment_lines, 3);
303+
assert_eq!(count.code_lines, 1);
304+
}
305+
306+
#[test]
307+
fn test_nested_block_comments_rust() {
308+
let code = "/* outer /* inner */ still outer */\ncode";
309+
// ...
310+
}
311+
```
312+
313+
2. **Rust Attribute Tests**
314+
```rust
315+
#[test]
316+
fn test_rust_attributes_are_code() {
317+
let code = "#[derive(Debug)]\nstruct Foo;";
318+
let count = counter.count_content_with_language(code, Some(Language::Rust));
319+
assert_eq!(count.code_lines, 2);
320+
assert_eq!(count.comment_lines, 0);
321+
}
322+
```
323+
324+
3. **Total LOC Invariant Tests**
325+
```rust
326+
#[test]
327+
fn test_loc_invariant() {
328+
let count = counter.count_content(code);
329+
assert_eq!(
330+
count.physical_lines,
331+
count.code_lines + count.comment_lines + count.blank_lines
332+
);
333+
}
334+
```
335+
336+
### Integration Tests
337+
338+
1. Compare `debtmap analyze .` total LOC with `wc -l` output
339+
2. Verify debt density changes appropriately with fix
340+
3. Test on codebases with varying comment styles
341+
342+
### Performance Tests
343+
344+
1. Ensure no regression in analysis time
345+
2. Profile LOC counting on large files
346+
347+
## Documentation Requirements
348+
349+
- **Code Documentation**: Document the LOC counting methodology in `loc_counter.rs`
350+
- **User Documentation**: Update any user-facing docs that mention LOC
351+
- **Architecture Updates**: None required
352+
353+
## Implementation Notes
354+
355+
1. **Backward Compatibility**: The `count_content` method without language parameter should remain functional, defaulting to conservative behavior.
356+
357+
2. **Edge Cases to Handle**:
358+
- String literals containing comment markers: `"/* not a comment */"`
359+
- Raw strings in Rust: `r#"/* also not a comment */"#`
360+
- Heredocs in shell scripts embedded in code
361+
- Doc comments (`///` and `//!` in Rust) - these ARE comments
362+
363+
3. **String Literal Detection**: Full string literal detection requires parsing. For a simpler solution, accept some inaccuracy for code containing comment markers inside strings. Document this limitation.
364+
365+
4. **Performance**: Reading files for LOC that have already been parsed is wasteful. Consider extracting line counts from the AST during parsing phase when available.
366+
367+
## Migration and Compatibility
368+
369+
- **Breaking Changes**: None - all changes are additive or fix bugs
370+
- **Migration**: No migration required
371+
- **Compatibility**: Existing JSON output format unchanged, but `total_loc` values will increase (more accurate)
372+
373+
## Success Metrics
374+
375+
1. **Accuracy**: Total LOC matches `wc -l` within 5%
376+
2. **Consistency**: `code_lines + comment_lines + blank_lines == physical_lines` always holds
377+
3. **No Regression**: All existing tests pass
378+
4. **Meaningful Density**: Debt density reflects actual codebase size

0 commit comments

Comments
 (0)