raw-txt-snippet-creator

Current version: v08alpha
Buzzword search with an "AND" option within a set distance. It works like an embedder, but with plain txt search only!
It's like opening a text editor, searching for a keyword, and finding X hits. The snippet extractor then cuts out a section around each hit and saves it. The total extracted text is never larger than the original text, because overlapping sections are merged!
-> All sizes are given in characters and percent
-> Keep in mind that 5000 characters are ~1200 tokens (approx. one book page)
-> If you search for the phrase "blue care", note that it will not be found if there is a line break between the two words.
Best in combination with my PDF Parser: https://github.com/kalle07/parsing

EXE on Hugging Face or under Releases (right side):
https://huggingface.co/kalle07/raw-txt-snippet-creator

Hints

  • Tested on Windows only!
  • Only txt files; tested with 2 MB (one large book), which takes ~10-20 sec
  • Choose one txt file or a whole folder
  • Type one or more buzzwords; with AND (second search field), both terms must occur within the "distance" option of each other
  • Snippet size and distance are both in characters (5000 chars ~one book page, ~1400 tokens)
  • Every match found is cut out as a snippet (as a share of the snippet size: 0.3 before and 0.7 after the keyword)
  • All overlapping snippets are merged
  • Two search options: "usual exact + wildcard" and "fuzzy-search"
    (Wildcard search: if the text contains the word “friendship” and you search for “friend”, it will not be found. Use “friend*” instead. "?" stands for exactly one character, as usual.)
    (Fuzzy search is sometimes useful, but it doesn't work with punctuation such as in IP addresses. It can handle * and ? in some cases. I would not set the percentage below 80.)
  • All snippets are appended and saved in JSON format together with their found positions, in one file for wildcard and one for fuzzy search (this file is overwritten with every search)
    (you can check the positions e.g. in Notepad++)
  • The first line also shows the total number of characters and the estimated token count
  • Output files are always overwritten when you click “Search” again
  • Now you can easily copy and paste the result into your chat
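
To make the wildcard and AND-within-distance hints more concrete, here is a minimal Python sketch. It is not taken from the tool: the helper names (`wildcard_to_regex`, `find_hits`, `hits_with_and`), the whole-word matching, and the case-insensitive comparison are all assumptions chosen to reproduce the behaviour described in the hints.

```python
import re


def wildcard_to_regex(term):
    """Translate a buzzword with * and ? into a whole-word regex.

    Assumed semantics: * = any run of word characters, ? = exactly one.
    Whole-word matching is why "friend" does not hit "friendship".
    """
    parts = []
    for ch in term:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    # Case-insensitive matching is an assumption, not confirmed behaviour.
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


def find_hits(text, term):
    """Start positions (in characters) of every match of term."""
    return [m.start() for m in wildcard_to_regex(term).finditer(text)]


def hits_with_and(text, first, second, distance):
    """Keep hits of `first` that have a hit of `second` within `distance` chars."""
    second_hits = find_hits(text, second)
    return [p for p in find_hits(text, first)
            if any(abs(p - q) <= distance for q in second_hits)]


if __name__ == "__main__":
    text = "Their friendship grew. A friend in need is a friend indeed."
    print(find_hits(text, "friend"))    # whole word only
    print(find_hits(text, "friend*"))   # also matches "friendship"
    print(hits_with_and(text, "friend*", "grew", 15))
```

The distance here is measured between the start positions of the two hits; the tool may measure it differently.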

I am not responsible for any errors or crashes on your system. If you use it, you take full responsibility!
