Releases: cyclone-github/spider
v0.9.1
What's Changed
- added -agent flag #8 by @cyclone-github in #10
- chore(deps): enable daily Dependabot for Go modules by @cyclone-github in #11
- ci: build/test Dependabot PRs by @cyclone-github in #12
- chore(deps): bump github.com/PuerkitoBio/goquery from 1.10.3 to 1.11.0 in the minor-and-patch group by @dependabot[bot] in #13
Full Changelog: v0.9.0...v0.9.1
Checksums:
f0ee44dd5d21b57bdbbe01116bc93a7eae868f8a spider.bin
dbf122decd8956543b58f1dabdf7c4628af5a3a9 spider_arm64.bin
54b3c8d7230b285b1f7d9d6213a92f9e29ec96b6 spider.exe
Jotti Antivirus Scan Results:
https://virusscan.jotti.org/en-US/filescanjob/d5k39hb77j,yo37fkvt9n,u1lz16ok9p
v0.9.0
Spider
Changelog:
- v0.9.0 by @cyclone-github in #7
- added flag "-url-match" to only crawl URLs containing a specified keyword; #6
- added notice to user if no URLs are crawled when using "-crawl 1 -url-match"
- exit early if zero URLs were crawled (no processing or file output)
- use custom User-Agent "Spider/0.9.0 (+https://github.com/cyclone-github/spider)"
- removed clearScreen function and its imports
- fixed crawl-depth calculation logic
- restricted link collection to .html, .htm, .txt, and extension-less paths (see the sketch below)
- upgraded dependencies and bumped Go version to v1.24.3
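For illustration, here is a minimal sketch of the link-collection filter described above: keep .html, .htm, .txt, and extension-less paths, and skip everything else. The function name and sample paths are hypothetical; this is not the actual spider source.

package main

import (
    "fmt"
    "path"
    "strings"
)

// crawlableLink reports whether a link should be collected for crawling:
// allow .html, .htm, .txt, and extension-less paths; reject all other
// extensions (query strings are ignored for brevity in this sketch).
func crawlableLink(link string) bool {
    switch strings.ToLower(path.Ext(link)) {
    case "", ".html", ".htm", ".txt":
        return true
    default:
        return false
    }
}

func main() {
    for _, l := range []string{"/forum/index.html", "/docs/readme.txt", "/img/logo.png", "/about"} {
        fmt.Println(l, crawlableLink(l))
    }
}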
Spider: URL Mode
spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -url-match wordlist -o forum.hashpwn.net_spider.txt
----------------------
| Cyclone's URL Spider |
----------------------
Crawling URL: https://forum.hashpwn.net
Base domain: forum.hashpwn.net
Crawl depth: 2
ngram len: 1-3
Crawl delay: 20ms (increase this to avoid rate limiting)
Timeout: 1 sec
URLs crawled: 2
Processing... [====================] 100.00%
Unique words: 475
Unique ngrams: 1977
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: forum.hashpwn.net_spider.txt
RAM used: 0.02 GB
Runtime: 2.283s
Spider: File Mode
spider -file kjv_bible.txt -sort -ngram 1-3
----------------------
| Cyclone's URL Spider |
----------------------
Reading file: kjv_bible.txt
ngram len: 1-3
Processing... [====================] 100.00%
Unique words: 35412
Unique ngrams: 877394
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: kjv_bible_spider.txt
RAM used: 0.13 GB
Runtime: 1.359s
Wordlist & ngram creation tool that crawls a given URL or processes a local file to create wordlists and/or ngrams (depending on the flags given).
Usage Instructions:
- To create a simple wordlist from a specified URL (saves a deduplicated wordlist to url_spider.txt):
spider -url 'https://github.com/cyclone-github'
- To set a crawl depth of 2 and create ngrams of length 1-5, use flags "-crawl 2" and "-ngram 1-5"
spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5
- To set a custom output file, use flag "-o filename"
spider -url 'https://github.com/cyclone-github' -o wordlist.txt
- To set a delay to avoid being rate-limited, use flag "-delay n" where n is the time in milliseconds
spider -url 'https://github.com/cyclone-github' -delay 100
- To set a URL timeout, use flag "-timeout n" where n is the time in seconds
spider -url 'https://github.com/cyclone-github' -timeout 2
- To create ngrams of length 1-3 and sort output by frequency, use flags "-ngram 1-3" and "-sort" (see the ngram sketch after the help output below)
spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort
- To filter crawled URLs by keyword "foobar"
spider -url 'https://github.com/cyclone-github' -url-match foobar
- To process a local text file, create ngrams len 1-3 and sort output by frequency
spider -file foobar.txt -ngram 1-3 -sort
- Run "spider -help" to see a list of all options:
spider -help
-crawl int
Depth of links to crawl (default 1)
-cyclone
Display coded message
-delay int
Delay in ms between each URL lookup to avoid rate limiting (default 10)
-file string
Path to a local file to scrape
-url-match string
Only crawl URLs containing this keyword (case-insensitive)
-ngram string
Lengths of n-grams (e.g., "1-3" for 1, 2, and 3-length n-grams). (default "1")
-o string
Output file for the n-grams
-sort
Sort output by frequency
-timeout int
Timeout for URL crawling in seconds (default 1)
-url string
URL of the website to scrape
-version
Display version
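For reference, a minimal sketch of how the "-ngram 1-3" option described above could assemble 1- to 3-word ngrams from tokenized text. The ngrams helper below is hypothetical and not the actual spider implementation.

package main

import (
    "fmt"
    "strings"
)

// ngrams returns every n-gram of length minLen..maxLen words from tokens,
// joined with single spaces.
func ngrams(tokens []string, minLen, maxLen int) []string {
    var out []string
    for n := minLen; n <= maxLen; n++ {
        for i := 0; i+n <= len(tokens); i++ {
            out = append(out, strings.Join(tokens[i:i+n], " "))
        }
    }
    return out
}

func main() {
    tokens := strings.Fields("in the beginning God created the heaven and the earth")
    for _, g := range ngrams(tokens, 1, 3) {
        fmt.Println(g)
    }
}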
Compile from source:
- If you want the latest features, compiling from source is the best option since the release version may run several revisions behind the source code.
- This assumes you have Go and Git installed
git clone https://github.com/cyclone-github/spider.git  # clone repo
cd spider                      # enter project directory
go mod init spider             # initialize Go module (skips if go.mod exists)
go mod tidy                    # download dependencies
go build -ldflags="-s -w" .    # compile binary in current directory
go install -ldflags="-s -w" .  # compile binary and install to $GOPATH
- Compile from source code how-to:
Changelog: https://github.com/cyclone-github/spider/blob/main/CHANGELOG.md
Mentions:
- Go Package Documentation: https://pkg.go.dev/github.com/cyclone-github/spider
- Softpedia: https://www.softpedia.com/get/Internet/Other-Internet-Related/Cyclone-s-URL-Spider.shtml
Antivirus False Positives:
- Several antivirus programs on VirusTotal incorrectly flag compiled Go binaries as malicious (false positives). This issue primarily affects the Windows executable, but is not limited to it. If this concerns you, I recommend carefully reviewing the source code and then compiling the binary yourself.
- Uploading your compiled binaries to https://virustotal.com and leaving an upvote or a comment would be helpful as well.
v0.8.0
Spider
Changelog:
- v0.8.0 by @cyclone-github in #5
- added flag "-file" to allow creating ngrams from a local plaintext file (ex: foobar.txt)
- added flag "-timeout" for -url mode
- added flag "-sort" which sorts output by frequency
- fixed several small bugs
- https://github.com/cyclone-github/spider/blob/main/CHANGELOG.md
You can also use the -file and -sort flags to frequency-sort and dedup wordlists that contain duplicates (see the sketch after the notes below).
ex: spider -file foobar.txt -sort
This optimizes wordlists by sorting them by probability, with the most frequently occurring words listed at the top.
Keep in mind when using -file and -sort:
- This only applies to wordlists that contain duplicates
- Sorting large wordlists is RAM intensive
- This feature is beta, so results may vary
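As a rough sketch of what frequency-sorted dedup looks like, the program below counts duplicate lines read from stdin and prints each unique line once, most frequent first. It is illustrative only, assuming one word per line, and is not the spider source.

package main

import (
    "bufio"
    "fmt"
    "os"
    "sort"
)

func main() {
    counts := make(map[string]int)
    var order []string // unique words in first-seen order
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        w := scanner.Text()
        if counts[w] == 0 {
            order = append(order, w)
        }
        counts[w]++
    }
    // sort unique words by descending frequency (stable, so ties keep first-seen order)
    sort.SliceStable(order, func(i, j int) bool {
        return counts[order[i]] > counts[order[j]]
    })
    out := bufio.NewWriter(os.Stdout) // buffered writer for large wordlists
    defer out.Flush()
    for _, w := range order {
        fmt.Fprintln(out, w)
    }
}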
Spider is a web crawler and wordlist/ngram generator written in Go that crawls specified URLs or processes local files to produce frequency-sorted wordlists and ngrams. Users can customize crawl depth, output files, frequency sorting, and ngram options, making it ideal for scraping websites to create targeted wordlists for tools like hashcat or John the Ripper. Spider combines the web-scraping capabilities of CeWL with ngram generation, and since Spider is written in Go, it requires no additional libraries to download or install.
Spider just works.
Spider: URL Mode
spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -o forum.hashpwn.net_spider.txt
----------------------
| Cyclone's URL Spider |
----------------------
Crawling URL: https://forum.hashpwn.net
Base domain: forum.hashpwn.net
Crawl depth: 2
ngram len: 1-3
Crawl delay: 20ms (increase this to avoid rate limiting)
Timeout: 1 sec
URLs crawled: 56
Processing... [====================] 100.00%
Unique words: 3164
Unique ngrams: 17313
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: forum.hashpwn.net_spider.txt
RAM used: 0.03 GB
Runtime: 8.634s
Spider: File Mode
spider -file kjv_bible.txt -sort -ngram 1-3
----------------------
| Cyclone's URL Spider |
----------------------
Reading file: kjv_bible.txt
ngram len: 1-3
Processing... [====================] 100.00%
Unique words: 35412
Unique ngrams: 877394
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: kjv_bible_spider.txt
RAM used: 0.13 GB
Runtime: 1.359s
Wordlist & ngram creation tool that crawls a given URL or processes a local file to create wordlists and/or ngrams (depending on the flags given).
Usage Instructions:
- To create a simple wordlist from a specified URL (saves a deduplicated wordlist to url_spider.txt):
spider -url 'https://github.com/cyclone-github'
- To set a crawl depth of 2 and create ngrams of length 1-5, use flags "-crawl 2" and "-ngram 1-5"
spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5
- To set a custom output file, use flag "-o filename"
spider -url 'https://github.com/cyclone-github' -o wordlist.txt
- To set a delay to avoid being rate-limited, use flag "-delay n" where n is the time in milliseconds
spider -url 'https://github.com/cyclone-github' -delay 100
- To set a URL timeout, use flag "-timeout n" where n is the time in seconds
spider -url 'https://github.com/cyclone-github' -timeout 2
- To create ngrams of length 1-3 and sort output by frequency, use flags "-ngram 1-3" and "-sort"
spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort
- To process a local text file, create ngrams len 1-3 and sort output by frequency
spider -file foobar.txt -ngram 1-3 -sort
- Run "spider -help" to see a list of all options
4c80bc2f26e9ebd9445bac46315868dde8ba38374db4ef9c770c066ccc43a091 spider_amd64.bin
9ca7048f7b18ca3502fe84b1a2654a6d0ab23ca4a54996d90223a62f1bf4ca23 spider_arm64.bin
49bfab2856bfc95d8744e89ebacbe17f69ed04287f499640cea3564115931d34 spider_arm.bin
c4a6aa4de95ed3522f3a2e731eefabda55060a971d72094059b01ee118c1cff7 spider_amd64.exe
Jotti Antivirus Scan Results
Antivirus False Positives:
- Several antivirus programs on VirusTotal incorrectly flag compiled Go binaries as malicious (false positives). This issue primarily affects the Windows executable, but is not limited to it. If this concerns you, I recommend carefully reviewing the source code and then compiling the binary yourself.
- Uploading your compiled binaries to https://virustotal.com and leaving an upvote or a comment would be helpful as well.
v0.7.1
Changelog since v0.6.2:
v0.7.0 26edf33
- added feature to allow crawling specific file extensions (html, htm, txt)
- added check to keep the crawler from crawling offsite URLs (see the sketch after this list)
- added flag "-delay" to avoid rate limiting (-delay 100 == 100ms delay between URL requests)
- added write buffer for better performance on large files
- increased crawl depth from 5 to 100 (not recommended, but enabled for edge cases)
- fixed out-of-bounds slice bug when crawling URLs containing null characters
- fixed bug when attempting to crawl deeper than the number of available URLs
- fixed crawl depth calculation
- optimized code; runs 2.8x faster than v0.6.x in bench testing
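A minimal sketch of the offsite check mentioned in the list above: resolve each link against the base URL and only follow it if the host matches the base domain. The sameSite helper is hypothetical, not the actual spider code.

package main

import (
    "fmt"
    "net/url"
    "strings"
)

// sameSite reports whether link resolves to the same host as base;
// relative links resolve against base and therefore stay on-site.
func sameSite(base *url.URL, link string) bool {
    u, err := base.Parse(link)
    if err != nil {
        return false
    }
    return strings.EqualFold(u.Hostname(), base.Hostname())
}

func main() {
    base, _ := url.Parse("https://forum.hashpwn.net")
    for _, l := range []string{"/faq", "https://forum.hashpwn.net/threads", "https://example.com/offsite"} {
        fmt.Println(l, sameSite(base, l))
    }
}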
v0.7.1 81a5439
- added progress bars to word / ngrams processing & file writing operations
- added RAM usage monitoring (see the sketch below)
- optimized order of operations for faster processing with less RAM
TO-DO: refactor code
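The "RAM used" figure shown in spider's output could be read in Go via runtime.ReadMemStats; the snippet below is a minimal illustration of that idea, not the actual spider code.

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m) // snapshot of the Go runtime's memory statistics
    fmt.Printf("RAM used: %.2f GB\n", float64(m.Alloc)/1024/1024/1024)
}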
045bca70d0f8be6326c9bae5c4f412f0af183f2859b5ac30f4e6efdfe06316bd spider_amd64.bin
8b0525a46a6aca19256e1326338a59e58585933558e85319d68cb0c609c500b2 spider_amd64-darwin
9671739d795c8913659c8169827124ba78725aef3205579d688d058571a9c96b spider_amd64.exe
50093e85868b77f40e5ece131597e9bbcda646fb2a80970a4d6791e7292a8f01 spider_arm64.bin
0ce2d7b5b232f82c3fe3fd5ca45659e7746400db4ac7842e29757fc226f26d76 spider_armhf.bin
Jotti Antivirus Scan Results
https://virusscan.jotti.org/en-US/filescanjob/fhv8de86sm,jrqzgdwd2b,sc18q9y8uj,gmi0zwcqs2,kjjl0g74m9
v0.6.2
version 0.6.2:
- fixed scraping logic & ngram creation bugs
- switched from gocolly to goquery for web scraping (see the sketch below)
- removed dups from word / ngram output
- can be compiled for linux (amd64 & armhf / arm64 for raspberry pi), windows & mac
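For context on the goquery switch noted above, here is a minimal link-extraction sketch using goquery; the target URL is only an example and this is not the spider source.

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("https://forum.hashpwn.net")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // iterate over every <a> tag and print its href, if present
    doc.Find("a").Each(func(_ int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            fmt.Println(href)
        }
    })
}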
