Author: Guttapati Jayasurya Reddy
MAIL: guttapati.j@ufl.edu
This project censors various types of sensitive information in text files. The redacted (censored) information includes:
- Names (e.g., "Christian")
- Organizations (e.g., "IPL")
- Dates (e.g., "01/01/2022" or "January 31, 2003")
- Phone numbers (e.g., "(123) 456-7890")
- Addresses and locations (e.g., "SW ARCHER ROAD", "Florida")
- User-specified concepts (e.g., "house")
The code uses the SpaCy NLP library with the English core web medium model (`en_core_web_md`) for entity recognition and semantic similarity analysis. It also uses regular-expression pattern matching for redactions such as phone numbers.
- Customizable Redactions: Users can specify which types of information they want to redact.
- Advanced Concept-Based Redaction: Uses semantic similarity to redact sentences related to specified concepts.
- Batch File Processing: Processes multiple files based on input patterns and outputs redacted versions with a `.censored` extension.
- Detailed Statistics: Provides statistics on the number of redactions made for each category.
- Structure Preservation: Maintains the original structure of the document, including line breaks and spacing.
- Python 3.12
- SpaCy library with the English core web medium model (`en_core_web_md`)
- NLTK library
- NumPy library
- Purpose: Redacts names and organizations from the given document.
- Process:
  - Iterates through named entities in the document.
  - Replaces entities labeled as "PERSON" or "ORG" with █ characters.
- Output: Returns the text with names and organizations redacted.
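
The entity-based process above can be sketched as follows. This is a minimal illustration, not the project's actual code: `redact_spans` is a hypothetical helper, and the entity spans are written out by hand as `(start_char, end_char, label)` tuples so that the example runs without a model. With SpaCy, the same tuples would come from `ent.start_char`, `ent.end_char`, and `ent.label_` for each `ent` in `doc.ents`.

```python
BLOCK = "\u2588"  # the █ character used for redaction


def redact_spans(text, spans, labels):
    """Replace every span whose label is in `labels` with █ characters.

    `spans` is an iterable of (start_char, end_char, label) tuples.
    Each character is replaced one-for-one, so the text length is unchanged.
    Returns the redacted text and the number of spans redacted.
    """
    chars = list(text)
    count = 0
    for start, end, label in spans:
        if label in labels:
            chars[start:end] = BLOCK * (end - start)
            count += 1
    return "".join(chars), count


text = "Christian plays in the IPL."
# Hand-written spans standing in for SpaCy's doc.ents (an assumption of this sketch).
spans = [(0, 9, "PERSON"), (23, 26, "ORG")]
redacted, count = redact_spans(text, spans, {"PERSON", "ORG"})
print(redacted)
```

Passing a different label set, such as `{"DATE"}` or `{"LOC", "GPE"}`, would cover the date and address redactions in the same way.
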
- Purpose: Redacts dates from the given document.
- Process:
  - Iterates through named entities in the document.
  - Replaces entities labeled as "DATE" with █ characters.
- Output: Returns the text with dates redacted.
- Purpose: Redacts phone numbers from the given document.
- Process:
  - Uses a regular expression to identify phone number patterns.
  - Replaces matched phone numbers with █ characters.
- Output: Returns the text with phone numbers redacted.
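
A regular-expression approach along these lines might look like the sketch below. The pattern and the name `redact_phones` are assumptions made for illustration; the project's real pattern may accept more or fewer formats.

```python
import re

# Assumed pattern for this sketch: an area code either in parentheses or
# followed by a separator, then 3 digits, a separator, and 4 digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)")


def redact_phones(text):
    """Replace each matched phone number with █ characters of the same length."""
    return PHONE_RE.subn(lambda m: "\u2588" * len(m.group()), text)


redacted, count = redact_phones("Call (123) 456-7890 or 352-555-0100 today.")
print(redacted)
```

`re.subn` returns both the new text and the number of substitutions, which is convenient for the per-category statistics.
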
- Purpose: Redacts addresses and locations from the given document.
- Process:
  - Iterates through named entities in the document.
  - Replaces entities labeled as "LOC" or "GPE" with █ characters.
- Output: Returns the text with addresses and locations redacted.
- Purpose: Redacts sentences that are semantically similar to a given concept.
- Input:
  - An optional similarity threshold (default 0.7).
- Process:
  - Computes the vector representation of the concept.
  - Splits the document into lines and sentences.
  - Computes sentence vectors and their similarities to the concept vector.
  - Redacts entire sentences that exceed the similarity threshold or contain the concept.
- Output: Returns the text with concept-related sentences redacted.
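
The steps above can be sketched without a model by passing the embedding in as a function. Everything named here is hypothetical: `redact_concept`, the naive regex sentence splitter, and the two-dimensional `toy_embed` vectors are stand-ins chosen so the example runs on its own. With SpaCy, the `embed` argument could be something like `lambda s: nlp(s).vector`.

```python
import math
import re

BLOCK = "\u2588"


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def redact_concept(text, concept, embed, threshold=0.7):
    """Redact sentences that contain `concept` or exceed the similarity threshold."""
    concept_vec = embed(concept)
    count = 0
    out_lines = []
    for line in text.split("\n"):  # work line by line so line breaks are kept
        # Naive splitter (assumption): each piece keeps its trailing whitespace.
        parts = re.findall(r"[^.!?]*[.!?]+\s*|[^.!?]+", line)
        for i, sent in enumerate(parts):
            similar = cosine(embed(sent), concept_vec) > threshold
            if similar or concept.lower() in sent.lower():
                # Replace only non-whitespace characters so spacing survives.
                parts[i] = re.sub(r"\S", BLOCK, sent)
                count += 1
        out_lines.append("".join(parts))
    return "\n".join(out_lines), count


# Toy word vectors, invented for this example only.
WORD_VECS = {"house": (1.0, 0.0), "home": (0.9, 0.1), "lunch": (0.0, 1.0)}


def toy_embed(text):
    """Sum the toy vectors of the known words in `text`."""
    vec = [0.0, 0.0]
    for word in re.findall(r"[a-z]+", text.lower()):
        wx, wy = WORD_VECS.get(word, (0.0, 0.0))
        vec[0] += wx
        vec[1] += wy
    return vec


text = "We went home early.  Then we ate lunch.\nThe house is old."
redacted, count = redact_concept(text, "house", toy_embed)
print(redacted)
```

The substring check `concept.lower() in sent.lower()` also matches the concept inside longer words, which is one way partial-word matches can lead to unintended redaction.
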
- Purpose: Processes a single file, applying all specified redactions.
- Input:
  - File path of the document to process.
  - Command-line arguments specifying which redactions to apply.
- Process:
  - Reads the file content.
  - Applies each specified redaction (names, dates, phones, addresses, concepts).
  - Keeps track of redaction statistics.
  - Writes the redacted content to a new file with a ".censored" extension.
- Output: Returns a dictionary of redaction statistics for the file.
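
A simplified version of this per-file flow is sketched below. It assumes each redaction is a function returning `(new_text, count)` and that the output name is the input name with `.censored` appended; `process_file` as written here and the `redact_digits` stand-in are illustrative, not the project's code.

```python
import re
import tempfile
from pathlib import Path


def process_file(path, redactors, output_dir):
    """Apply each redactor to one file and write `<name>.censored` to output_dir.

    `redactors` maps a category name to a function text -> (new_text, count).
    Returns a dictionary of redaction counts per category.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    stats = {}
    for category, redact in redactors.items():
        text, stats[category] = redact(text)
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / (path.name + ".censored")).write_text(text, encoding="utf-8")
    return stats


def redact_digits(text):
    """Stand-in redactor for the demo: blanks out every digit."""
    return re.subn(r"\d", "\u2588", text)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "note.txt"
    src.write_text("Room 42\nFloor 7\n", encoding="utf-8")
    stats = process_file(src, {"Digits": redact_digits}, Path(tmp) / "out")
    censored = (Path(tmp) / "out" / "note.txt.censored").read_text(encoding="utf-8")
print(stats)
```
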
- Purpose: Writes redaction statistics to the specified output.
- Input:
  - A dictionary of redaction statistics.
  - Output specification (stderr, stdout, or a file path).
- Process:
  - Formats the statistics into a readable string.
  - Writes the formatted statistics to the specified output.
- Output: None (writes to specified output).
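
One possible shape for this step is shown below. The names `format_stats` and `write_stats` and the `Category: count` line format are assumptions for the sketch, as is the choice to overwrite rather than append when the target is a file.

```python
import sys
import tempfile
from pathlib import Path


def format_stats(stats):
    """Render one 'Category: count' line per entry."""
    return "".join(f"{category}: {count}\n" for category, count in stats.items())


def write_stats(stats, target):
    """Write the report to 'stderr', 'stdout', or the file path in `target`."""
    report = format_stats(stats)
    if target == "stderr":
        sys.stderr.write(report)
    elif target == "stdout":
        sys.stdout.write(report)
    else:
        # Any other value is treated as a file path (overwritten, by assumption).
        Path(target).write_text(report, encoding="utf-8")


stats = {"Names": 2, "Dates": 1, "Phones": 0}
write_stats(stats, "stderr")
with tempfile.TemporaryDirectory() as tmp:
    stats_path = Path(tmp) / "stats.txt"
    write_stats(stats, stats_path)
    saved = stats_path.read_text(encoding="utf-8")
```
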
- Purpose: Main function to run the redaction process.
- Process:
  - Parses command-line arguments.
  - Processes each input file according to the specified glob pattern.
  - Collects statistics for all processed files.
  - Writes overall statistics to the specified output.
- Output: None (orchestrates the entire redaction process).
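
The orchestration could be wired up roughly as follows, using the flag names from the usage command in this README. Whether each flag is required, the default for `--stats`, and the helper names `build_parser` and `run` are assumptions; the per-file work is passed in as a callable so the sketch stays self-contained.

```python
import argparse
import glob
import tempfile
from pathlib import Path


def build_parser():
    """Build a parser for the flags shown in the usage command."""
    parser = argparse.ArgumentParser(description="Redact sensitive information.")
    parser.add_argument("--input", required=True, help="glob pattern for input files")
    parser.add_argument("--names", action="store_true")
    parser.add_argument("--dates", action="store_true")
    parser.add_argument("--phones", action="store_true")
    parser.add_argument("--address", action="store_true")
    parser.add_argument("--concept", help="concept whose sentences are redacted")
    parser.add_argument("--output", required=True, help="directory for redacted files")
    parser.add_argument("--stats", default="stderr", help="stderr, stdout, or a file path")
    return parser


def run(argv, process):
    """Parse argv, call process(path, args) for each matched file, and sum the stats."""
    args = build_parser().parse_args(argv)
    totals = {}
    for path in sorted(glob.glob(args.input)):
        for category, count in process(path, args).items():
            totals[category] = totals.get(category, 0) + count
    return args, totals


with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.txt", "b.txt", "c.md"):
        (Path(tmp) / name).write_text("sample", encoding="utf-8")
    argv = ["--input", str(Path(tmp) / "*.txt"), "--names", "--output", "files/"]
    # Fake per-file processor for the demo: reports one name per file.
    args, totals = run(argv, lambda path, args: {"Names": 1})
print(totals)
```
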
1. **Install Dependencies**\
First we have to make sure that `pipenv` is installed on our system:\
`pip3 install pipenv`\
`pipenv install`
2. **Activate the Pipenv Shell**\
Enter the virtual environment using:\
`pipenv shell`
3. **Run the Redactor Script**\
Execute the following command to redact sensitive information from text files. We can customize the flags as we need:\
`pipenv run python redactor.py --input '*.txt' --names --dates --phones --address --concept 'house' --output 'files/' --stats stderr`
`--input '*.txt'`: Specifies the pattern to match input files (in this case, all .txt files in the directory).\
`--names`: Redacts names and organizations.\
`--dates`: Redacts date information.\
`--phones`: Redacts phone numbers.\
`--address`: Redacts addresses.\
`--concept 'house'`: Redacts sentences related to the specified concept (e.g., 'house').\
`--output 'files/'`: Sets the directory where redacted files will be saved.\
`--stats stderr`: Outputs statistics to standard error. Replace with `stdout` or a file path if preferred.
4. **Run pytest**\
Use the following command:
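```bash
pipenv run python -m pytest
```
Statistics are reported in the following format, where X, Y, Z, W, and V are the redaction counts for each category:\
File: files/stats.txt\
Names: X\
Dates: Y\
Phones: Z\
Addresses: W\
Concepts: V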
- Partial Word Redaction: The code may redact parts of words that are similar to the concept input, leading to unintended censorship.
- Over-redaction in Concept Matching: The semantic similarity threshold might cause over-redaction, censoring sentences that are only tangentially related to the specified concept.
- Inconsistent Date Format Recognition: Some uncommon date formats might not be recognized and redacted properly.
- Performance Issues with Large Files: Processing very large text files may lead to high memory usage and slow performance.
- Named Entity Recognition Accuracy: The code relies on SpaCy's named entity recognition, which may not be 100% accurate, especially for uncommon names or organizations.
- Concept Similarity Threshold: A fixed similarity threshold (0.7) is used for concept redaction, which may not be optimal for all use cases or concepts.
- Phone Number Format: The phone number redaction assumes a specific format (e.g., "(123) 456-7890"). Other formats may not be recognized.
- Address Recognition: The code assumes that addresses are recognized by SpaCy as "LOC" or "GPE" entities, which may not cover all types of address formats.
- Preservation of Original Structure: The code tries to maintain the input structure, but the result may not be perfect in all cases, especially with complex formatting.