Releases: cyclone-github/spider
v0.9.1
What's Changed
- added -agent flag #8 by @cyclone-github in #10
- chore(deps): enable daily Dependabot for Go modules by @cyclone-github in #11
- ci: build/test Dependabot PRs by @cyclone-github in #12
- chore(deps): bump github.com/PuerkitoBio/goquery from 1.10.3 to 1.11.0 in the minor-and-patch group by @dependabot[bot] in #13
Full Changelog: v0.9.0...v0.9.1
Checksums:
f0ee44dd5d21b57bdbbe01116bc93a7eae868f8a spider.bin
dbf122decd8956543b58f1dabdf7c4628af5a3a9 spider_arm64.bin
54b3c8d7230b285b1f7d9d6213a92f9e29ec96b6 spider.exe
Jotti Antivirus Scan Results:
https://virusscan.jotti.org/en-US/filescanjob/d5k39hb77j,yo37fkvt9n,u1lz16ok9p
v0.9.0
Spider
Changelog:
- v0.9.0 by @cyclone-github in #7
- added flag "-url-match" to only crawl URLs containing a specified keyword; #6
- added notice to user if no URLs are crawled when using "-crawl 1 -url-match"
- exit early if zero URLs were crawled (no processing or file output)
- use custom User-Agent "Spider/0.9.0 (+https://github.com/cyclone-github/spider)"
- removed clearScreen function and its imports
- fixed crawl-depth calculation logic
- restricted link collection to .html, .htm, .txt, and extension-less paths (see the sketch below)
- upgraded dependencies and bumped Go version to v1.24.3
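For illustration, here is a minimal sketch of the link-collection filter described above: keep .html, .htm, .txt, and extension-less paths, and skip everything else. The function name and sample paths are hypothetical; this is not the actual spider source.

package main

import (
    "fmt"
    "path"
    "strings"
)

// crawlableLink reports whether a link should be collected for crawling:
// allow .html, .htm, .txt, and extension-less paths; reject all other
// extensions (query strings are ignored for brevity in this sketch).
func crawlableLink(link string) bool {
    switch strings.ToLower(path.Ext(link)) {
    case "", ".html", ".htm", ".txt":
        return true
    default:
        return false
    }
}

func main() {
    for _, l := range []string{"/forum/index.html", "/docs/readme.txt", "/img/logo.png", "/about"} {
        fmt.Println(l, crawlableLink(l))
    }
}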
Spider: URL Mode
spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -url-match wordlist -o forum.hashpwn.net_spider.txt
----------------------
| Cyclone's URL Spider |
----------------------
Crawling URL: https://forum.hashpwn.net
Base domain: forum.hashpwn.net
Crawl depth: 2
ngram len: 1-3
Crawl delay: 20ms (increase this to avoid rate limiting)
Timeout: 1 sec
URLs crawled: 2
Processing... [====================] 100.00%
Unique words: 475
Unique ngrams: 1977
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: forum.hashpwn.net_spider.txt
RAM used: 0.02 GB
Runtime: 2.283s
Spider: File Mode
spider -file kjv_bible.txt -sort -ngram 1-3
----------------------
| Cyclone's URL Spider |
----------------------
Reading file: kjv_bible.txt
ngram len: 1-3
Processing... [====================] 100.00%
Unique words: 35412
Unique ngrams: 877394
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: kjv_bible_spider.txt
RAM used: 0.13 GB
Runtime: 1.359s
Wordlist & ngram creation tool that crawls a given URL or processes a local file to create wordlists and/or ngrams (depending on the flags given).
Usage Instructions:
- To create a simple wordlist from a specified URL (saves a deduplicated wordlist to url_spider.txt):
spider -url 'https://github.com/cyclone-github'
- To set a crawl depth of 2 and create ngrams of length 1-5, use flags "-crawl 2" and "-ngram 1-5"
spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5
- To set a custom output file, use flag "-o filename"
spider -url 'https://github.com/cyclone-github' -o wordlist.txt
- To set a delay to avoid being rate-limited, use flag "-delay n" where n is the time in milliseconds
spider -url 'https://github.com/cyclone-github' -delay 100
- To set a URL timeout, use flag "-timeout n" where n is the time in seconds
spider -url 'https://github.com/cyclone-github' -timeout 2
- To create ngrams of length 1-3 and sort output by frequency, use flags "-ngram 1-3" and "-sort" (see the ngram sketch after the help output below)
spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort
- To filter crawled URLs by keyword "foobar"
spider -url 'https://github.com/cyclone-github' -url-match foobar
- To process a local text file, create ngrams len 1-3 and sort output by frequency
spider -file foobar.txt -ngram 1-3 -sort
- Run "spider -help" to see a list of all options:
spider -help
-crawl int
Depth of links to crawl (default 1)
-cyclone
Display coded message
-delay int
Delay in ms between each URL lookup to avoid rate limiting (default 10)
-file string
Path to a local file to scrape
-url-match string
Only crawl URLs containing this keyword (case-insensitive)
-ngram string
Lengths of n-grams (e.g., "1-3" for 1, 2, and 3-length n-grams). (default "1")
-o string
Output file for the n-grams
-sort
Sort output by frequency
-timeout int
Timeout for URL crawling in seconds (default 1)
-url string
URL of the website to scrape
-version
Display version
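For reference, a minimal sketch of how the "-ngram 1-3" option described above could assemble 1- to 3-word ngrams from tokenized text. The ngrams helper below is hypothetical and not the actual spider implementation.

package main

import (
    "fmt"
    "strings"
)

// ngrams returns every n-gram of length minLen..maxLen words from tokens,
// joined with single spaces.
func ngrams(tokens []string, minLen, maxLen int) []string {
    var out []string
    for n := minLen; n <= maxLen; n++ {
        for i := 0; i+n <= len(tokens); i++ {
            out = append(out, strings.Join(tokens[i:i+n], " "))
        }
    }
    return out
}

func main() {
    tokens := strings.Fields("in the beginning God created the heaven and the earth")
    for _, g := range ngrams(tokens, 1, 3) {
        fmt.Println(g)
    }
}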
Compile from source:
- If you want the latest features, compiling from source is the best option since the release version may run several revisions behind the source code.
- This assumes you have Go and Git installed
git clone https://github.com/cyclone-github/spider.git  # clone repo
cd spider                      # enter project directory
go mod init spider             # initialize Go module (skips if go.mod exists)
go mod tidy                    # download dependencies
go build -ldflags="-s -w" .    # compile binary in current directory
go install -ldflags="-s -w" .  # compile binary and install to $GOPATH
- Compile from source code how-to:
Changelog: https://github.com/cyclone-github/spider/blob/main/CHANGELOG.md
Mentions:
- Go Package Documentation: https://pkg.go.dev/github.com/cyclone-github/spider
- Softpedia: https://www.softpedia.com/get/Internet/Other-Internet-Related/Cyclone-s-URL-Spider.shtml
Antivirus False Positives:
- Several antivirus programs on VirusTotal incorrectly flag compiled Go binaries as malicious (false positives). This issue primarily affects the Windows executable, but is not limited to it. If this concerns you, I recommend carefully reviewing the source code and then compiling the binary yourself.
- Uploading your compiled binaries to https://virustotal.com and leaving an upvote or a comment would be helpful as well.
v0.8.0
Spider
Changelog:
- v0.8.0 by @cyclone-github in #5
- added flag "-file" to allow creating ngrams from a local plaintext file (ex: foobar.txt)
- added flag "-timeout" for -url mode
- added flag "-sort" which sorts output by frequency
- fixed several small bugs
- https://github.com/cyclone-github/spider/blob/main/CHANGELOG.md
You can also use the -file and -sort flags to frequency-sort and dedup wordlists that contain duplicates (see the sketch after the notes below).
ex: spider -file foobar.txt -sort
This optimizes wordlists by sorting them by probability, with the most frequently occurring words listed at the top.
Keep in mind when using -file and -sort:
- This only applies to wordlists that contain duplicates
- Sorting large wordlists is RAM intensive
- This feature is beta, so results may vary
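As a rough sketch of what frequency-sorted dedup looks like, the program below counts duplicate lines read from stdin and prints each unique line once, most frequent first. It is illustrative only, assuming one word per line, and is not the spider source.

package main

import (
    "bufio"
    "fmt"
    "os"
    "sort"
)

func main() {
    counts := make(map[string]int)
    var order []string // unique words in first-seen order
    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        w := scanner.Text()
        if counts[w] == 0 {
            order = append(order, w)
        }
        counts[w]++
    }
    // sort unique words by descending frequency (stable, so ties keep first-seen order)
    sort.SliceStable(order, func(i, j int) bool {
        return counts[order[i]] > counts[order[j]]
    })
    out := bufio.NewWriter(os.Stdout) // buffered writer for large wordlists
    defer out.Flush()
    for _, w := range order {
        fmt.Fprintln(out, w)
    }
}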
Spider is a web crawler and wordlist/ngram generator written in Go that crawls specified URLs or processes local files to produce frequency-sorted wordlists and ngrams. Users can customize crawl depth, output files, frequency sorting, and ngram options, making it ideal for scraping websites to create targeted wordlists for tools like hashcat or John the Ripper. Spider combines the web-scraping capabilities of CeWL with ngram generation, and since Spider is written in Go, it requires no additional libraries to download or install.
Spider just works.
Spider: URL Mode
spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -o forum.hashpwn.net_spider.txt
----------------------
| Cyclone's URL Spider |
----------------------
Crawling URL: https://forum.hashpwn.net
Base domain: forum.hashpwn.net
Crawl depth: 2
ngram len: 1-3
Crawl delay: 20ms (increase this to avoid rate limiting)
Timeout: 1 sec
URLs crawled: 56
Processing... [====================] 100.00%
Unique words: 3164
Unique ngrams: 17313
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: forum.hashpwn.net_spider.txt
RAM used: 0.03 GB
Runtime: 8.634s
Spider: File Mode
spider -file kjv_bible.txt -sort -ngram 1-3
----------------------
| Cyclone's URL Spider |
----------------------
Reading file: kjv_bible.txt
ngram len: 1-3
Processing... [====================] 100.00%
Unique words: 35412
Unique ngrams: 877394
Sorting n-grams by frequency...
Writing... [====================] 100.00%
Output file: kjv_bible_spider.txt
RAM used: 0.13 GB
Runtime: 1.359s
Wordlist & ngram creation tool that crawls a given URL or processes a local file to create wordlists and/or ngrams (depending on the flags given).
Usage Instructions:
- To create a simple wordlist from a specified URL (saves a deduplicated wordlist to url_spider.txt):
spider -url 'https://github.com/cyclone-github'
- To set a crawl depth of 2 and create ngrams of length 1-5, use flags "-crawl 2" and "-ngram 1-5"
spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5
- To set a custom output file, use flag "-o filename"
spider -url 'https://github.com/cyclone-github' -o wordlist.txt
- To set a delay to avoid being rate-limited, use flag "-delay n" where n is the time in milliseconds
spider -url 'https://github.com/cyclone-github' -delay 100
- To set a URL timeout, use flag "-timeout n" where n is the time in seconds
spider -url 'https://github.com/cyclone-github' -timeout 2
- To create ngrams of length 1-3 and sort output by frequency, use flags "-ngram 1-3" and "-sort"
spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort
- To process a local text file, create ngrams len 1-3 and sort output by frequency
spider -file foobar.txt -ngram 1-3 -sort
- Run "spider -help" to see a list of all options
4c80bc2f26e9ebd9445bac46315868dde8ba38374db4ef9c770c066ccc43a091 spider_amd64.bin
9ca7048f7b18ca3502fe84b1a2654a6d0ab23ca4a54996d90223a62f1bf4ca23 spider_arm64.bin
49bfab2856bfc95d8744e89ebacbe17f69ed04287f499640cea3564115931d34 spider_arm.bin
c4a6aa4de95ed3522f3a2e731eefabda55060a971d72094059b01ee118c1cff7 spider_amd64.exe
Jotti Antivirus Scan Results
Antivirus False Positives:
- Several antivirus programs on VirusTotal incorrectly flag compiled Go binaries as malicious (false positives). This issue primarily affects the Windows executable, but is not limited to it. If this concerns you, I recommend carefully reviewing the source code and then compiling the binary yourself.
- Uploading your compiled binaries to https://virustotal.com and leaving an upvote or a comment would be helpful as well.
v0.7.1
Changelog since v0.6.2:
v0.7.0 26edf33
- added feature to allow crawling specific file extensions (html, htm, txt)
- added check to keep the crawler from crawling offsite URLs (see the sketch after this list)
- added flag "-delay" to avoid rate limiting (-delay 100 == 100ms delay between URL requests)
- added write buffer for better performance on large files
- increased crawl depth from 5 to 100 (not recommended, but enabled for edge cases)
- fixed out-of-bounds slice bug when crawling URLs containing null characters
- fixed bug when attempting to crawl deeper than the number of available URLs
- fixed crawl depth calculation
- optimized code; runs 2.8x faster than v0.6.x in bench testing
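A minimal sketch of the offsite check mentioned in the list above: resolve each link against the base URL and only follow it if the host matches the base domain. The sameSite helper is hypothetical, not the actual spider code.

package main

import (
    "fmt"
    "net/url"
    "strings"
)

// sameSite reports whether link resolves to the same host as base;
// relative links resolve against base and therefore stay on-site.
func sameSite(base *url.URL, link string) bool {
    u, err := base.Parse(link)
    if err != nil {
        return false
    }
    return strings.EqualFold(u.Hostname(), base.Hostname())
}

func main() {
    base, _ := url.Parse("https://forum.hashpwn.net")
    for _, l := range []string{"/faq", "https://forum.hashpwn.net/threads", "https://example.com/offsite"} {
        fmt.Println(l, sameSite(base, l))
    }
}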
v0.7.1 81a5439
- added progress bars to word / ngrams processing & file writing operations
- added RAM usage monitoring (see the sketch below)
- optimized order of operations for faster processing with less RAM
TO-DO: refactor code
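The "RAM used" figure shown in spider's output could be read in Go via runtime.ReadMemStats; the snippet below is a minimal illustration of that idea, not the actual spider code.

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m) // snapshot of the Go runtime's memory statistics
    fmt.Printf("RAM used: %.2f GB\n", float64(m.Alloc)/1024/1024/1024)
}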
045bca70d0f8be6326c9bae5c4f412f0af183f2859b5ac30f4e6efdfe06316bd spider_amd64.bin
8b0525a46a6aca19256e1326338a59e58585933558e85319d68cb0c609c500b2 spider_amd64-darwin
9671739d795c8913659c8169827124ba78725aef3205579d688d058571a9c96b spider_amd64.exe
50093e85868b77f40e5ece131597e9bbcda646fb2a80970a4d6791e7292a8f01 spider_arm64.bin
0ce2d7b5b232f82c3fe3fd5ca45659e7746400db4ac7842e29757fc226f26d76 spider_armhf.bin
Jotti Antivirus Scan Results
https://virusscan.jotti.org/en-US/filescanjob/fhv8de86sm,jrqzgdwd2b,sc18q9y8uj,gmi0zwcqs2,kjjl0g74m9
v0.6.2
version 0.6.2:
- fixed scraping logic & ngram creation bugs
- switched from gocolly to goquery for web scraping (see the sketch below)
- removed dups from word / ngram output
- can be compiled for linux (amd64 & armhf / arm64 for raspberry pi), windows & mac
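For context on the goquery switch noted above, here is a minimal link-extraction sketch using goquery; the target URL is only an example and this is not the spider source.

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    resp, err := http.Get("https://forum.hashpwn.net")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // iterate over every <a> tag and print its href, if present
    doc.Find("a").Each(func(_ int, s *goquery.Selection) {
        if href, ok := s.Attr("href"); ok {
            fmt.Println(href)
        }
    })
}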
