Attribute Similarity Analyzer

A Python tool for analyzing and identifying similar product attributes in spreadsheets, designed to help standardize attribute names by detecting subtle differences in spelling, capitalization, and formatting.

Features

Detects similarities between attribute names using fuzzy string matching
Identifies specific differences between attributes (case, punctuation, spacing)
Generates detailed Excel reports with:
- Similarity percentages
- Highlighted differences
- Action columns for standardization decisions
- Automatic file versioning
Interactive user interface for setting similarity thresholds
Detailed difference highlighting and comments
Excel output with frozen headers and formatted cells

Requirements

Python 3.10+
Dependencies (specified in requirements.txt):
- pandas
- openpyxl
- thefuzz
- python-Levenshtein
- streamlit
- plotly

Installation

Create a virtual environment:

python3.10 -m venv venv

Activate the virtual environment:

source venv/bin/activate  # On Unix/macOS

Install dependencies:

pip install -r requirements.txt

Usage

Option 1: Streamlit Web Interface (Recommended)

Run the Streamlit app:

streamlit run app.py

Open your web browser to the displayed URL (typically http://localhost:8501)
Use the web interface to:
- Upload your Excel file
- Set the similarity threshold
- View and download results
- Explore data visualizations

Option 2: Command Line Interface

Place your Excel file containing attributes in the project directory
Run the script:

python attribute_analyzer.py

Enter the minimum similarity threshold when prompted (0-100)
Review the generated Excel file with the prefix "similarity_"

Output Format

The generated Excel file contains the following columns:

Pair ID: Unique identifier for each pair of similar attributes
Attribute: The attribute names being compared
Similarity %: Percentage indicating how similar the attributes are
Differences: Explicit description of what differs between the attributes
Proposed Catsy Key: (For user input) Standardized key name
Merge or Keep Separate?: (For user input) Decision column
Pair ID to merge with: (For user input) For indicating related pairs
NOTES: (For user input) Additional comments or observations

Architecture

graph TD
    User[User] --> |Accesses| WebUI[Streamlit Web Interface]
    User --> |Or Runs| CLI[Command Line Interface]
    
    WebUI --> |Runs| App[app.py]
    CLI --> |Runs| Script[attribute_analyzer.py]
    
    subgraph Frontend
        WebUI --> |Provides| Upload[File Upload]
        WebUI --> |Sets| WebThreshold[Similarity Threshold]
        WebUI --> |Shows| Viz[Data Visualizations]
        WebUI --> |Downloads| Results[Results]
    end
    
    subgraph Backend
        App --> |Imports| Analyzer[attribute_analyzer.py]
        Script --> |Prompts| Threshold[Enter Similarity Threshold]
        
        Analyzer --> |Contains| FindSimilar[find_similar_attributes]
        FindSimilar --> |Uses| Normalize[normalize_text]
        FindSimilar --> |Uses| CaseCheck[are_case_variants]
        
        FindSimilar --> |Returns| Groups[Similar Groups]
        Groups --> |Passed to| Print[print_similar_groups]
        Groups --> |Passed to| Export[export_to_excel]
        
        Export --> |Uses| GetFilename[get_unique_filename]
        Export --> |Uses| FindDiff[find_differences]
    end
    
    subgraph Excel Generation
        Export --> |Creates| Headers[Create Headers]
        Export --> |Formats| Styles[Apply Styles]
        Export --> |Writes| Data[Write Data]
        Export --> |Adds| Comments[Add Comments]
        Export --> |Sets| Format[Set Number Format]
        Export --> |Freezes| Panes[Freeze Panes]
    end
    
    Export --> |Saves| Output[Excel Output File]
    Output --> |Reviews| User

Function Descriptions

normalize_text: Standardizes text for comparison by handling case, spaces, and punctuation
are_case_variants: Checks if two strings differ only in capitalization
find_differences: Identifies specific differences between two strings
find_similar_attributes: Core function that finds similar attributes using fuzzy matching
print_similar_groups: Displays results in console with formatted output
get_unique_filename: Handles file naming with automatic versioning
export_to_excel: Creates formatted Excel output with all findings

Output File Naming

The script automatically handles file naming to prevent overwrites:

First output: similarity_[original_filename].xlsx
Subsequent runs: similarity_[original_filename]_1.xlsx, _2.xlsx, etc.

Notes

Similarity threshold is adjustable (0-100%)
Excel comments show specific differences when hovering over cells
Headers are frozen for easy navigation
Alternating row colors for better readability
Percentage formatting shows whole numbers
Yellow-highlighted action columns for user input

Contributing

Feel free to submit issues, fork the repository, and create pull requests for any improvements.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.devcontainer		.devcontainer
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
app.py		app.py
arch_diagram--string_similarity_analyzer--2024-12-10.png		arch_diagram--string_similarity_analyzer--2024-12-10.png
attribute_analyzer.py		attribute_analyzer.py
daemar-full-attribute-list-for-analysis.xlsx		daemar-full-attribute-list-for-analysis.xlsx
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Attribute Similarity Analyzer

Features

Requirements

Installation

Usage

Option 1: Streamlit Web Interface (Recommended)

Option 2: Command Line Interface

Output Format

Architecture

Function Descriptions

Output File Naming

Notes

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

arkaigrowth/string-similarity-analyzer

Folders and files

Latest commit

History

Repository files navigation

Attribute Similarity Analyzer

Features

Requirements

Installation

Usage

Option 1: Streamlit Web Interface (Recommended)

Option 2: Command Line Interface

Output Format

Architecture

Function Descriptions

Output File Naming

Notes

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages