Jayasurya003/RedactPro

THE REDACTION

Author: Guttapati Jayasurya Reddy
MAIL: guttapati.j@ufl.edu

Project Description

This project censors various types of sensitive information in text files. The redacted (censored) information includes:

  • Names (e.g., "Christian")
  • Organizations (e.g., "IPL")
  • Dates (e.g., "01/01/2022" or "January 31, 2003")
  • Phone numbers (e.g., "(123) 456-7890")
  • Addresses and locations (e.g., "SW ARCHER ROAD", "Florida")
  • User-specified concepts (e.g., "house")

The code uses the SpaCy NLP library with the English core web medium model (en_core_web_md) for entity recognition and semantic similarity analysis. It also uses regular-expression pattern matching for redactions that entity recognition does not cover, such as phone numbers.

Features

  • Customizable Redactions: Users can specify which types of information they want to redact.
  • Advanced Concept-Based Redaction: Uses semantic similarity to redact sentences related to specified concepts.
  • Batch File Processing: Processes multiple files based on input patterns and outputs redacted versions with a .censored extension.
  • Detailed Statistics: Provides statistics on the number of redactions made for each category.
  • Structure Preservation: Maintains the original structure of the document, including line breaks and spacing.

Installation and Setup

Prerequisites

  • Python 3.12
  • SpaCy library with the English core web medium model (en_core_web_md)
  • NLTK library
  • NumPy library

Function Descriptions

redact_names(doc)

  • Purpose: Redacts names and organizations from the given document.

  • Process:

    • Iterates through named entities in the document.
    • Replaces entities labeled as "PERSON" or "ORG" with █ characters.
  • Output: Returns the text with names and organizations redacted.
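The entity-based replacement described above can be sketched without loading a model. This is a minimal illustration, not the project's actual code: the helper name `redact_spans` and the hard-coded spans are hypothetical, and preserving whitespace inside an entity is an assumption. With SpaCy, the spans would come from `doc.ents` via `ent.start_char`, `ent.end_char`, and `ent.label_`.

```python
BLOCK = "\u2588"  # the █ character used for redaction


def redact_spans(text, spans, labels):
    """Replace every span whose label is in `labels` with █ characters.

    `spans` is an iterable of (start_char, end_char, label) tuples.
    Returns the redacted text and the number of spans redacted.
    """
    chars = list(text)
    count = 0
    for start, end, label in spans:
        if label in labels:
            for i in range(start, end):
                # Assumption: whitespace is kept so spacing and line breaks survive.
                if not chars[i].isspace():
                    chars[i] = BLOCK
            count += 1
    return "".join(chars), count


text = "Christian works at IPL."
# Hypothetical entity spans, in the shape an NER model might report them.
spans = [(0, 9, "PERSON"), (19, 22, "ORG")]
redacted, count = redact_spans(text, spans, {"PERSON", "ORG"})
print(redacted)
```

Passing a different label set, such as `{"DATE"}` or `{"LOC", "GPE"}`, would cover the date and address cases with the same helper.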

redact_dates(doc)

  • Purpose: Redacts dates from the given document.

  • Process:

    • Iterates through named entities in the document.
    • Replaces entities labeled as "DATE" with █ characters.
  • Output: Returns the text with dates redacted.

redact_phones(doc)

  • Purpose: Redacts phone numbers from the given document.

  • Process:

    • Uses a regular expression to identify phone number patterns.
    • Replaces matched phone numbers with █ characters.
  • Output: Returns the text with phone numbers redacted.
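A regular-expression approach along these lines might look as follows. The pattern is an assumption chosen to match formats such as "(123) 456-7890" and "123-456-7890"; the project's actual expression is not shown in this README, and the function name `redact_phone_numbers` is hypothetical.

```python
import re

BLOCK = "\u2588"  # the █ character used for redaction

# Assumed pattern: optional parentheses around a 3-digit area code, then
# groups of 3 and 4 digits, each optionally separated by space, dot, or hyphen.
# The lookarounds stop the pattern from matching inside longer digit runs.
PHONE_RE = re.compile(r"(?<!\d)\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)")


def redact_phone_numbers(text):
    """Return (redacted_text, number_of_matches)."""
    # Replace each match with a run of █ of the same length.
    return PHONE_RE.subn(lambda m: BLOCK * len(m.group()), text)


sample = "Call (123) 456-7890 or 352-555-0100 today."
redacted, count = redact_phone_numbers(sample)
print(redacted)
```

Replacing each match with a run of the same length keeps the text aligned with the original, which fits the structure-preservation goal listed under Features.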

redact_address(doc)

  • Purpose: Redacts addresses and locations from the given document.

  • Process:

    • Iterates through named entities in the document.
    • Replaces entities labeled as "LOC" or "GPE" with █ characters.
  • Output: Returns the text with addresses and locations redacted.

redact_concept(doc, concept, similarity_threshold=0.7)

  • Purpose: Redacts sentences that are semantically similar to a given concept.
  • Input:
    • The document to process (doc).
    • The concept to redact (concept).
    • An optional similarity threshold (default 0.7).
  • Process:
    • Computes the vector representation of the concept.
    • Splits the document into lines and sentences.
    • Computes sentence vectors and their similarities to the concept vector.
    • Redacts entire sentences that exceed the similarity threshold or contain the concept.
  • Output: Returns the text with concept-related sentences redacted.
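The thresholding step can be illustrated with plain cosine similarity. This is a sketch under stated assumptions: the two-dimensional vectors below are toy values standing in for model embeddings, the function names are hypothetical, and matching the concept by case-insensitive substring is a guess at what "contain the concept" means.

```python
import math

BLOCK = "\u2588"  # the █ character used for redaction


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def redact_by_concept(sentences, vectors, concept, concept_vec, threshold=0.7):
    """Redact sentences that exceed the threshold or mention the concept."""
    out, count = [], 0
    for sentence, vec in zip(sentences, vectors):
        similar = cosine(vec, concept_vec) > threshold
        mentions = concept.lower() in sentence.lower()  # assumed substring match
        if similar or mentions:
            # Keep whitespace so the sentence layout is preserved.
            out.append("".join(c if c.isspace() else BLOCK for c in sentence))
            count += 1
        else:
            out.append(sentence)
    return out, count


sentences = ["We bought a house.", "Their home is large.", "The meeting is at noon."]
vectors = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.9]]  # toy embeddings, not model output
result, count = redact_by_concept(sentences, vectors, "house", [1.0, 0.0])
print(result)
```

Raising the threshold makes the similarity test stricter, which trades missed related sentences against over-redaction.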

process_file(file_path, args)

  • Purpose: Processes a single file, applying all specified redactions.
  • Input:
    • File path of the document to process.
    • Command-line arguments specifying which redactions to apply.
  • Process:
    • Reads the file content.
    • Applies each specified redaction (names, dates, phones, addresses, concepts).
    • Keeps track of redaction statistics.
    • Writes the redacted content to a new file with a ".censored" extension.
  • Output: Returns a dictionary of redaction statistics for the file.
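The output-writing step could be sketched as below. The helper names are hypothetical, and appending ".censored" to the full file name (giving, for example, notes.txt.censored) is an assumption; the README only states that the new file has a ".censored" extension.

```python
import os
import tempfile


def censored_path(file_path, output_dir):
    """Build the output path: the input file name plus a .censored suffix."""
    return os.path.join(output_dir, os.path.basename(file_path) + ".censored")


def write_censored(file_path, output_dir, redacted_text):
    """Write the redacted text into output_dir and return the path written."""
    os.makedirs(output_dir, exist_ok=True)
    out_path = censored_path(file_path, output_dir)
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(redacted_text)
    return out_path


# Demonstration in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    out_path = write_censored(
        "docs/notes.txt", os.path.join(tmp, "files"), "\u2588\u2588\u2588 called.\n"
    )
    written_name = os.path.basename(out_path)
    with open(out_path, encoding="utf-8") as handle:
        written_text = handle.read()
print(written_name)
```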

write_stats(stats, output)

  • Purpose: Writes redaction statistics to the specified output.
  • Input:
    • A dictionary of redaction statistics.
    • Output specification (stderr, stdout, or a file path).
  • Process:
    • Formats the statistics into a readable string.
    • Writes the formatted statistics to the specified output.
  • Output: None (writes to specified output).
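One way to route the statistics to stderr, stdout, or a file is sketched here. The shape of the statistics dictionary (file name mapped to per-category counts) and the names `format_stats` and `emit_stats` are assumptions made for illustration.

```python
import sys


def format_stats(stats):
    """Format {file_name: {category: count}} into a readable report."""
    lines = ["Redaction Statistics:"]
    for file_name, counts in stats.items():
        lines.append(f"File: {file_name}")
        for category, n in counts.items():
            lines.append(f"{category}: {n}")
    return "\n".join(lines) + "\n"


def emit_stats(stats, output):
    """Write the report to stderr, stdout, or the file path given in `output`."""
    text = format_stats(stats)
    if output == "stderr":
        sys.stderr.write(text)
    elif output == "stdout":
        sys.stdout.write(text)
    else:
        with open(output, "w", encoding="utf-8") as handle:
            handle.write(text)


# Hypothetical counts for a single processed file.
stats = {"notes.txt": {"Names": 2, "Dates": 1, "Phones": 1, "Addresses": 0, "Concepts": 3}}
report = format_stats(stats)
emit_stats(stats, "stderr")
```

Separating formatting from writing keeps the report easy to test without capturing a stream.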

main()

  • Purpose: Main function to run the redaction process.
  • Process:
    • Parses command-line arguments.
    • Processes each input file according to the specified glob pattern.
    • Collects statistics for all processed files.
    • Writes overall statistics to the specified output.
  • Output: None (orchestrates the entire redaction process).
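The argument parsing and glob expansion might be wired up roughly as follows. The flag names come from the usage command in this README; everything else (which flags are required, whether --input and --concept may be repeated, the default for --stats) is an assumption.

```python
import argparse
import glob


def build_parser():
    """Build a parser for the flags shown in the usage command."""
    parser = argparse.ArgumentParser(description="Redact sensitive text.")
    # Assumption: --input and --concept may be given more than once.
    parser.add_argument("--input", action="append", required=True,
                        help="glob pattern for input files")
    parser.add_argument("--names", action="store_true", help="redact names")
    parser.add_argument("--dates", action="store_true", help="redact dates")
    parser.add_argument("--phones", action="store_true", help="redact phone numbers")
    parser.add_argument("--address", action="store_true", help="redact addresses")
    parser.add_argument("--concept", action="append", default=[],
                        help="concept whose related sentences are redacted")
    parser.add_argument("--output", required=True, help="output directory")
    parser.add_argument("--stats", default="stderr",
                        help="stderr, stdout, or a file path")
    return parser


def expand_inputs(patterns):
    """Expand each glob pattern into a sorted list of matching paths."""
    files = []
    for pattern in patterns:
        files.extend(sorted(glob.glob(pattern)))
    return files


args = build_parser().parse_args([
    "--input", "*.txt", "--names", "--dates", "--phones", "--address",
    "--concept", "house", "--output", "files/", "--stats", "stderr",
])
print(args.input, args.concept, expand_inputs(args.input))
```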

Step-by-Step Instructions

Installation and Usage

  1. Install Dependencies: Make sure pipenv is installed, then install the project dependencies:
    pip3 install pipenv
    pipenv install
    
  2. Activate the Pipenv Shell: Enter the virtual environment using:
    pipenv shell
             
  3. Run the Redactor Script: Execute the following command to redact sensitive information from text files, adjusting the flags as needed:
    pipenv run python redactor.py --input '*.txt' \
              --names --dates --phones --address \
              --concept 'house' \
              --output 'files/' \
              --stats stderr
    The flags are:
      • --input '*.txt': Specifies the pattern that matches input files (here, all .txt files in the directory).
      • --names: Redacts names and organizations.
      • --dates: Redacts dates.
      • --phones: Redacts phone numbers.
      • --address: Redacts addresses and locations.
      • --concept 'house': Redacts sentences related to the specified concept (here, 'house').
      • --output 'files/': Sets the directory where redacted files are saved.
      • --stats stderr: Writes statistics to standard error. Replace with stdout or a file path if preferred.


  4. Run pytest: Use the following command:
    pipenv run python -m pytest

Redaction Statistics:

File: files/stats.txt
Names: X
Dates: Y
Phones: Z
Addresses: W
Concepts: V

Bugs and Assumptions

Known Bugs

  1. Partial Word Redaction: The code may redact parts of words that resemble the concept input, leading to unintended censorship.

  2. Over-redaction in Concept Matching: The semantic similarity threshold might cause over-redaction, censoring sentences that are only tangentially related to the specified concept.

  3. Inconsistent Date Format Recognition: Some uncommon date formats might not be recognized and redacted properly.

  4. Performance Issues with Large Files: Processing very large text files may lead to high memory usage and slow performance.

Assumptions

  1. Named Entity Recognition Accuracy: The code relies on SpaCy's named entity recognition, which may not be 100% accurate, especially for uncommon names or organizations.

  2. Concept Similarity Threshold: A fixed similarity threshold (0.7) is used for concept redaction, which may not be optimal for all use cases or concepts.

  3. Phone Number Format: The phone number redaction assumes a specific format (e.g., "(123) 456-7890"). Other formats may not be recognized.

  4. Address Recognition: The code assumes that addresses are recognized by SpaCy as "LOC" or "GPE" entities, which may not cover all types of address formats.

  5. Preservation of Original Structure: The code tries to maintain the structure of the input, but the result may not be perfect in all cases, especially with complex formatting.
