THE REDACTION

Author: Guttapati Jayasurya Reddy
MAIL: guttapati.j@ufl.edu

Project Description

This project censors several types of sensitive information in text files. The redacted (censored) information includes:

Names (e.g., "Christian")
Organizations (e.g., "IPL")
Dates (e.g., "01/01/2022" or "January 31, 2003")
Phone numbers (e.g., "(123) 456-7890")
Addresses and locations (e.g., "SW ARCHER ROAD", "Florida")
User-specified concepts (e.g., "house")

The code uses the SpaCy NLP library with the English core web medium model (en_core_web_md) for entity recognition and semantic similarity analysis. It also uses regular-expression pattern matching for redactions that entity recognition does not cover, such as phone numbers.

Features

Customizable Redactions: Users can specify which types of information they want to redact.
Advanced Concept-Based Redaction: Uses semantic similarity to redact sentences related to specified concepts.
Batch File Processing: Processes multiple files based on input patterns and outputs redacted versions with a .censored extension.
Detailed Statistics: Provides statistics on the number of redactions made for each category.
Structure Preservation: Maintains the original structure of the document, including line breaks and spacing.

Installation and Setup

Prerequisites

Python 3.12
SpaCy library with the English core web medium model (en_core_web_md)
NLTK library
NumPy library

Function Descriptions

`redact_names(doc)`

Purpose: Redacts names and organizations from the given document.
Process:
- Iterates through named entities in the document.
- Replaces entities labeled as "PERSON" or "ORG" with █ characters.
Output: Returns the text with names and organizations redacted.
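
The replacement step above can be sketched as follows. To stay runnable without a model download, this sketch takes entities as hypothetical `(start_char, end_char, label)` tuples; with SpaCy these values would come from `ent.start_char`, `ent.end_char`, and `ent.label_` for each `ent` in `doc.ents`. The helper name `redact_spans` is an assumption for illustration, not the function in `redactor.py`. The same helper would cover dates and locations by passing a different label set.

```python
BLOCK = "\u2588"  # the █ character used for redaction


def redact_spans(text, spans, labels):
    """Replace every span whose label is in `labels` with █ characters.

    `spans` is a list of (start_char, end_char, label) tuples.
    Returns the redacted text and the number of spans replaced.
    """
    chars = list(text)
    count = 0
    for start, end, label in spans:
        if label not in labels:
            continue
        for i in range(start, end):
            # Leave newlines alone so the line structure is preserved.
            if chars[i] != "\n":
                chars[i] = BLOCK
        count += 1
    return "".join(chars), count


text = "Christian works at IPL."
spans = [(0, 9, "PERSON"), (19, 22, "ORG")]  # hand-written stand-ins for doc.ents
redacted, count = redact_spans(text, spans, {"PERSON", "ORG"})
print(redacted)
```

Replacing characters one for one keeps the redacted text the same length as the input, which is one simple way to preserve spacing.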

`redact_dates(doc)`

Purpose: Redacts dates from the given document.
Process:
- Iterates through named entities in the document.
- Replaces entities labeled as "DATE" with █ characters.
Output: Returns the text with dates redacted.

`redact_phones(doc)`

Purpose: Redacts phone numbers from the given document.
Process:
- Uses a regular expression to identify phone number patterns.
- Replaces matched phone numbers with █ characters.
Output: Returns the text with phone numbers redacted.
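
A minimal sketch of regex-based phone redaction. The pattern below is an assumption that covers "(123) 456-7890" and a few close variants; the actual expression in `redactor.py` is not shown here and may differ. The function name `redact_phone_text` is hypothetical, and it works on a plain string rather than a SpaCy document.

```python
import re

# Assumed pattern: a 3-digit area code (optionally in parentheses), then
# 3 digits and 4 digits, each optionally separated by space, dot, or hyphen.
PHONE_RE = re.compile(r"(?:\(\d{3}\)|\b\d{3})[\s.-]?\d{3}[\s.-]?\d{4}\b")


def redact_phone_text(text):
    """Return (redacted_text, number_of_matches)."""
    # Replace each match with a run of █ of the same length.
    return PHONE_RE.subn(lambda m: "\u2588" * len(m.group()), text)


redacted, count = redact_phone_text("Call (123) 456-7890 or 123-456-7890.")
print(redacted, count)
```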

`redact_address(doc)`

Purpose: Redacts addresses and locations from the given document.
Process:
- Iterates through named entities in the document.
- Replaces entities labeled as "LOC" or "GPE" with █ characters.
Output: Returns the text with addresses and locations redacted.

`redact_concept(doc, concept, similarity_threshold=0.7)`

Purpose: Redacts sentences that are semantically similar to a given concept.
Input:
- The document to redact.
- The concept to match (e.g., "house").
- An optional similarity threshold (default 0.7).
Process:
- Computes the vector representation of the concept.
- Splits the document into lines and sentences.
- Computes sentence vectors and their similarities to the concept vector.
- Redacts entire sentences that exceed the similarity threshold or contain the concept.
Output: Returns the text with concept-related sentences redacted.
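
The steps above can be sketched with cosine similarity. This is an illustration under stated assumptions, not the code in `redactor.py`: the toy two-dimensional vectors in `TOY_VECTORS` stand in for the en_core_web_md embeddings, a sentence vector is taken as the average of its known word vectors, and sentences are split with a simple regular expression. All names here are hypothetical.

```python
import math
import re

# Toy 2-D word vectors standing in for real embeddings (assumption).
TOY_VECTORS = {
    "house": (1.0, 0.1),
    "home": (0.9, 0.2),
    "roof": (0.8, 0.3),
    "lunch": (0.1, 1.0),
    "pizza": (0.2, 0.9),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    return dot / norm if norm else 0.0


def sentence_vector(sentence):
    """Average the vectors of known words; None if no word is known."""
    words = re.findall(r"[a-z]+", sentence.lower())
    vecs = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vecs:
        return None
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def redact_concept_text(text, concept, similarity_threshold=0.7):
    concept_vec = TOY_VECTORS[concept]
    out_lines = []
    for line in text.split("\n"):  # process line by line to keep line breaks
        # Even indices are sentences, odd indices are the whitespace between them.
        parts = re.split(r"(?<=[.!?])(\s+)", line)
        for i in range(0, len(parts), 2):
            sent = parts[i]
            vec = sentence_vector(sent)
            similar = vec is not None and cosine(vec, concept_vec) > similarity_threshold
            if similar or concept in sent.lower():
                parts[i] = "\u2588" * len(sent)
        out_lines.append("".join(parts))
    return "\n".join(out_lines)


out = redact_concept_text("My home has a new roof. We ate pizza.\nThe house is big.", "house")
print(out)
```

Note that the `concept in sent.lower()` check is a plain substring test, so it also fires when the concept appears inside a longer word.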

`process_file(file_path, args)`

Purpose: Processes a single file, applying all specified redactions.
Input:
- File path of the document to process.
- Command-line arguments specifying which redactions to apply.
Process:
- Reads the file content.
- Applies each specified redaction (names, dates, phones, addresses, concepts).
- Keeps track of redaction statistics.
- Writes the redacted content to a new file with a ".censored" extension.
Output: Returns a dictionary of redaction statistics for the file.
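
A rough sketch of the read, redact, write flow described above. Everything here is an assumption for illustration: the name `process_text_file`, the `redactors` mapping (used instead of the parsed `args`), and the choice to name the output `<original file name>.censored` inside the output directory.

```python
import os
import tempfile


def process_text_file(file_path, output_dir, redactors):
    """Apply each redactor in turn and write the result to output_dir.

    `redactors` maps a category name to a callable that takes text and
    returns (redacted_text, count). Returns the per-category counts.
    """
    with open(file_path, encoding="utf-8") as f:
        text = f.read()

    stats = {"file": file_path}
    for category, redact in redactors.items():
        text, count = redact(text)
        stats[category] = count

    os.makedirs(output_dir, exist_ok=True)
    out_name = os.path.basename(file_path) + ".censored"
    with open(os.path.join(output_dir, out_name), "w", encoding="utf-8") as f:
        f.write(text)
    return stats


def redact_word_secret(text):
    # Hypothetical stand-in redactor: blanks out the literal word "secret".
    return text.replace("secret", "\u2588" * 6), text.count("secret")


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "note.txt")
    with open(src, "w", encoding="utf-8") as f:
        f.write("a secret plan\n")
    stats = process_text_file(src, os.path.join(tmp, "files"), {"demo": redact_word_secret})
    with open(os.path.join(tmp, "files", "note.txt.censored"), encoding="utf-8") as f:
        censored = f.read()

print(stats["demo"], censored)
```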

`write_stats(stats, output)`

Purpose: Writes redaction statistics to the specified output.
Input:
- A dictionary of redaction statistics.
- Output specification (stderr, stdout, or a file path).
Process:
- Formats the statistics into a readable string.
- Writes the formatted statistics to the specified output.
Output: None (writes to specified output).
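
One way the output dispatch could look, as a sketch. The report layout, the helper names `format_stats` and `write_stats_sketch`, and the decision to overwrite rather than append to a stats file are all assumptions; the real `write_stats` may format and write differently.

```python
import sys


def format_stats(stats):
    """Turn a {category: count} dictionary into one 'Category: N' line each."""
    return "".join(f"{category}: {count}\n" for category, count in stats.items())


def write_stats_sketch(stats, output):
    """Write the formatted stats to stderr, stdout, or a file path."""
    report = format_stats(stats)
    if output == "stderr":
        sys.stderr.write(report)
    elif output == "stdout":
        sys.stdout.write(report)
    else:
        # Any other value is treated as a file path (overwritten each run).
        with open(output, "w", encoding="utf-8") as f:
            f.write(report)


write_stats_sketch({"Names": 2, "Dates": 1}, "stdout")
```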

`main()`

Purpose: Main function to run the redaction process.
Process:
- Parses command-line arguments.
- Processes each input file according to the specified glob pattern.
- Collects statistics for all processed files.
- Writes overall statistics to the specified output.
Output: None (orchestrates the entire redaction process).
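
A sketch of the argument parsing and file matching, using the flags that appear in this README's usage command. Whether `--input` and `--concept` can be repeated, and which flags are required, are assumptions made for the sketch; `build_parser` is a hypothetical name.

```python
import argparse
import glob


def build_parser():
    parser = argparse.ArgumentParser(description="Redact sensitive text.")
    # Assumption: --input and --concept may be given more than once.
    parser.add_argument("--input", action="append", required=True,
                        help="glob pattern for input files")
    parser.add_argument("--names", action="store_true")
    parser.add_argument("--dates", action="store_true")
    parser.add_argument("--phones", action="store_true")
    parser.add_argument("--address", action="store_true")
    parser.add_argument("--concept", action="append", default=[])
    parser.add_argument("--output", required=True,
                        help="directory for the .censored files")
    parser.add_argument("--stats", default="stderr",
                        help="stderr, stdout, or a file path")
    return parser


args = build_parser().parse_args(
    ["--input", "*.txt", "--names", "--concept", "house",
     "--output", "files/", "--stats", "stderr"]
)
# Expand every pattern into the list of files to process.
matched = [path for pattern in args.input for path in glob.glob(pattern)]
print(args.names, args.concept, len(matched))
```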

Step-by-Step Instructions

Installation and Usage

Install Dependencies: Make sure pipenv is installed, then install the project dependencies:
```
pip3 install pipenv
pipenv install
```
Activate the Pipenv Shell: Enter the virtual environment using:
```
pipenv shell
```

Run the Redactor Script: Execute the following command to redact sensitive information from text files. Customize the flags as needed:

```
pipenv run python redactor.py --input '*.txt' \
          --names --dates --phones --address \
          --concept 'house' \
          --output 'files/' \
          --stats stderr
```

`--input '*.txt'`: Specifies the glob pattern that selects input files (here, all .txt files in the current directory).\
`--names`: Redacts names and organizations.\
`--dates`: Redacts dates.\
`--phones`: Redacts phone numbers.\
`--address`: Redacts addresses and locations.\
`--concept 'house'`: Redacts sentences that contain or are semantically similar to the specified concept (here, 'house').\
`--output 'files/'`: Sets the directory where redacted files are saved.\
`--stats stderr`: Writes statistics to standard error. Replace with `stdout` or a file path if preferred.


Run pytest: Use the following command:
```bash
pipenv run python -m pytest
```

Redaction Statistics:

File: files/stats.txt
Names: X
Dates: Y
Phones: Z
Addresses: W
Concepts: V

Bugs and Assumptions

Known Bugs

Partial Word Redaction: The code may redact parts of words that match the concept input, leading to unintended censorship.
Over-redaction in Concept Matching: The semantic similarity threshold might cause over-redaction, censoring sentences that are only tangentially related to the specified concept.
Inconsistent Date Format Recognition: Some uncommon date formats might not be recognized and redacted properly.
Performance Issues with Large Files: Processing very large text files may lead to high memory usage and slow performance.

Assumptions

Named Entity Recognition Accuracy: The code relies on SpaCy's named entity recognition, which may not be 100% accurate, especially for uncommon names or organizations.
Concept Similarity Threshold: A fixed similarity threshold (0.7) is used for concept redaction, which may not be optimal for all use cases or concepts.
Phone Number Format: The phone number redaction assumes a specific format (e.g., "(123) 456-7890"). Other formats may not be recognized.
Address Recognition: The code assumes that addresses are recognized by SpaCy as "LOC" or "GPE" entities, which may not cover all types of address formats.
Preservation of Original Structure: The code tries to maintain the input structure, but this may not be perfect in all cases, especially with complex formatting.
