Releases: cyclone-github/spider

v0.9.1

18 Nov 19:38
60db37f

What's Changed

Full Changelog: v0.9.0...v0.9.1

Checksum:

f0ee44dd5d21b57bdbbe01116bc93a7eae868f8a  spider.bin
dbf122decd8956543b58f1dabdf7c4628af5a3a9  spider_arm64.bin
54b3c8d7230b285b1f7d9d6213a92f9e29ec96b6  spider.exe
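
These are SHA-1 digests, so downloads can be verified with sha1sum; a minimal sketch, assuming the binaries and a spider.sha1 file containing the three checksum lines above sit in the current directory:

sha1sum -c spider.sha1   # verify all three files at once
sha1sum spider.bin       # or hash one file and compare by eye

Note the v0.8.0 and v0.7.1 checksums further down are SHA-256; use sha256sum -c there instead.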

Jotti Antivirus Scan Results:
https://virusscan.jotti.org/en-US/filescanjob/d5k39hb77j,yo37fkvt9n,u1lz16ok9p

v0.9.0

13 May 13:52
159eed5

Spider

Changelog:

  • v0.9.0 by @cyclone-github in #7
    • added flag "-url-match" to only crawl URLs containing a specified keyword (#6)
    • added notice to user if no URLs are crawled when using "-crawl 1 -url-match"
    • exit early if zero URLs were crawled (no processing or file output)
    • use custom User-Agent "Spider/0.9.0 (+https://github.com/cyclone-github/spider)" (see the log example after this list)
    • removed clearScreen function and its imports
    • fixed crawl-depth calculation logic
    • restricted link collection to .html, .htm, .txt and extension-less paths
    • upgraded dependencies and bumped Go version to v1.24.3
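
Since the crawler now identifies itself with that User-Agent, site operators can spot or filter its requests; a hypothetical example, assuming an nginx-style access log path:

grep 'Spider/0.9.0' /var/log/nginx/access.log   # list requests made by Spider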

Spider: URL Mode

spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -url-match wordlist -o forum.hashpwn.net_spider.txt
 ---------------------- 
| Cyclone's URL Spider |
 ---------------------- 

Crawling URL:   https://forum.hashpwn.net
Base domain:    forum.hashpwn.net
Crawl depth:    2
ngram len:      1-3
Crawl delay:    20ms (increase this to avoid rate limiting)
Timeout:        1 sec
URLs crawled:   2
Processing...   [====================] 100.00%
Unique words:   475
Unique ngrams:  1977
Sorting n-grams by frequency...
Writing...      [====================] 100.00%
Output file:    forum.hashpwn.net_spider.txt
RAM used:       0.02 GB
Runtime:        2.283s

Spider: File Mode

spider -file kjv_bible.txt -sort -ngram 1-3
 ---------------------- 
| Cyclone's URL Spider |
 ---------------------- 

Reading file:   kjv_bible.txt
ngram len:      1-3
Processing...   [====================] 100.00%
Unique words:   35412
Unique ngrams:  877394
Sorting n-grams by frequency...
Writing...      [====================] 100.00%
Output file:    kjv_bible_spider.txt
RAM used:       0.13 GB
Runtime:        1.359s

Spider is a wordlist & ngram creation tool that crawls a given URL or processes a local file to create wordlists and/or ngrams, depending on the flags given.

Usage Instructions:

  • To create a simple wordlist from a specified URL (saves a deduplicated wordlist to url_spider.txt):
    • spider -url 'https://github.com/cyclone-github'
  • To set a crawl depth of 2 and create ngrams of len 1-5, use flags "-crawl 2" and "-ngram 1-5":
    • spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5
  • To set a custom output file, use flag "-o filename":
    • spider -url 'https://github.com/cyclone-github' -o wordlist.txt
  • To add a delay between requests and avoid being rate-limited, use flag "-delay n", where n is the delay in milliseconds:
    • spider -url 'https://github.com/cyclone-github' -delay 100
  • To set a URL timeout, use flag "-timeout n", where n is the timeout in seconds:
    • spider -url 'https://github.com/cyclone-github' -timeout 2
  • To create ngrams of len 1-3 and sort the output by frequency, use flags "-ngram 1-3" and "-sort":
    • spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort
  • To only crawl URLs containing the keyword "foobar", use flag "-url-match foobar":
    • spider -url 'https://github.com/cyclone-github' -url-match foobar
  • To process a local text file, create ngrams of len 1-3, and sort the output by frequency:
    • spider -file foobar.txt -ngram 1-3 -sort
  • Run spider -help to see a list of all options

spider -help

  -crawl int
        Depth of links to crawl (default 1)
  -cyclone
        Display coded message
  -delay int
        Delay in ms between each URL lookup to avoid rate limiting (default 10)
  -file string
        Path to a local file to scrape
  -url-match string
        Only crawl URLs containing this keyword (case-insensitive)
  -ngram string
        Lengths of n-grams (e.g., "1-3" for 1, 2, and 3-length n-grams). (default "1")
  -o string
        Output file for the n-grams
  -sort
        Sort output by frequency
  -timeout int
        Timeout for URL crawling in seconds (default 1)
  -url string
        URL of the website to scrape
  -version
        Display version
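
For illustration only (not program output): with -ngram 1-3, the phrase "the quick brown" would yield six unique ngrams, assuming space-delimited words:

the
quick
brown
the quick
quick brown
the quick brown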

Compile from source:

  • If you want the latest features, compiling from source is the best option since the release version may run several revisions behind the source code.
  • This assumes you have Go and Git installed
    • git clone https://github.com/cyclone-github/spider.git # clone repo
    • cd spider # enter project directory
    • go mod init spider # initialize Go module (skip if go.mod already exists)
    • go mod tidy # download dependencies
    • go build -ldflags="-s -w" . # compile binary in current directory
    • go install -ldflags="-s -w" . # compile binary and install to $GOPATH/bin
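
To reproduce the per-platform release binaries, Go's built-in cross-compilation can be used; a minimal sketch, assuming artifact names matching the v0.9.1 assets (GOOS/GOARCH are standard Go build variables):

GOOS=linux GOARCH=amd64 go build -ldflags="-s -w" -o spider.bin .        # Linux x86-64
GOOS=linux GOARCH=arm64 go build -ldflags="-s -w" -o spider_arm64.bin .  # Linux ARM64
GOOS=windows GOARCH=amd64 go build -ldflags="-s -w" -o spider.exe .      # Windows x86-64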

Antivirus False Positives:

  • Several antivirus programs on VirusTotal incorrectly flag compiled Go binaries as malicious; these are false positives. The issue primarily affects the Windows executable, but is not limited to it. If this concerns you, carefully review the source code, then compile the binary yourself.
  • Uploading your compiled binaries to https://virustotal.com and leaving an up-vote or a comment would also be helpful.

v0.8.0

21 Apr 18:33
7eaad95

Spider

Changelog:

You can also use the -file and -sort flags to frequency-sort and dedup wordlists that contain duplicates.
ex: spider -file foobar.txt -sort
This optimizes wordlists by sorting them by probability, with the most frequently occurring words listed at the top (a rough Unix approximation follows the list below).

Keep in mind when using -file and -sort:

  1. This only applies to wordlists that contain dups
  2. Sorting large wordlists is RAM-intensive
  3. This feature is beta, so results may vary
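
As a rough illustration of what frequency sorting does (not Spider's actual implementation), coreutils can approximate it on a small wordlist:

# count dups, order by count (descending), strip the counts; assumes one word per line
sort foobar.txt | uniq -c | sort -rn | awk '{print $2}' > foobar_sorted.txt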

Spider is a web crawler and wordlist/ngram generator written in Go that crawls specified URLs or processes local files to produce frequency-sorted wordlists and ngrams. Users can customize crawl depth, output files, frequency sorting, and ngram options, making it ideal for web scraping to create targeted wordlists for tools like hashcat or John the Ripper. Spider combines the web-scraping capabilities of CeWL with ngram generation, and since it is written in Go, it requires no additional libraries to download or install.
Spider just works.
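
For example, a Spider wordlist can feed straight into hashcat (hypothetical hash file; -m 0 selects MD5 and -a 0 a straight wordlist attack):

hashcat -m 0 -a 0 hashes.txt forum.hashpwn.net_spider.txt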

Spider: URL Mode

spider -url 'https://forum.hashpwn.net' -crawl 2 -delay 20 -sort -ngram 1-3 -timeout 1 -o forum.hashpwn.net_spider.txt
 ---------------------- 
| Cyclone's URL Spider |
 ---------------------- 

Crawling URL:   https://forum.hashpwn.net
Base domain:    forum.hashpwn.net
Crawl depth:    2
ngram len:      1-3
Crawl delay:    20ms (increase this to avoid rate limiting)
Timeout:        1 sec
URLs crawled:   56
Processing...   [====================] 100.00%
Unique words:   3164
Unique ngrams:  17313
Sorting n-grams by frequency...
Writing...      [====================] 100.00%
Output file:    forum.hashpwn.net_spider.txt
RAM used:       0.03 GB
Runtime:        8.634s

Spider: File Mode

spider -file kjv_bible.txt -sort -ngram 1-3
 ---------------------- 
| Cyclone's URL Spider |
 ---------------------- 

Reading file:   kjv_bible.txt
ngram len:      1-3
Processing...   [====================] 100.00%
Unique words:   35412
Unique ngrams:  877394
Sorting n-grams by frequency...
Writing...      [====================] 100.00%
Output file:    kjv_bible_spider.txt
RAM used:       0.13 GB
Runtime:        1.359s

Spider is a wordlist & ngram creation tool that crawls a given URL or processes a local file to create wordlists and/or ngrams, depending on the flags given.

Usage Instructions:

  • To create a simple wordlist from a specified URL (saves a deduplicated wordlist to url_spider.txt):
    • spider -url 'https://github.com/cyclone-github'
  • To set a crawl depth of 2 and create ngrams of len 1-5, use flags "-crawl 2" and "-ngram 1-5":
    • spider -url 'https://github.com/cyclone-github' -crawl 2 -ngram 1-5
  • To set a custom output file, use flag "-o filename":
    • spider -url 'https://github.com/cyclone-github' -o wordlist.txt
  • To add a delay between requests and avoid being rate-limited, use flag "-delay n", where n is the delay in milliseconds:
    • spider -url 'https://github.com/cyclone-github' -delay 100
  • To set a URL timeout, use flag "-timeout n", where n is the timeout in seconds:
    • spider -url 'https://github.com/cyclone-github' -timeout 2
  • To create ngrams of len 1-3 and sort the output by frequency, use flags "-ngram 1-3" and "-sort":
    • spider -url 'https://github.com/cyclone-github' -ngram 1-3 -sort
  • To process a local text file, create ngrams of len 1-3, and sort the output by frequency:
    • spider -file foobar.txt -ngram 1-3 -sort
  • Run spider -help to see a list of all options

Checksum:

4c80bc2f26e9ebd9445bac46315868dde8ba38374db4ef9c770c066ccc43a091  spider_amd64.bin
9ca7048f7b18ca3502fe84b1a2654a6d0ab23ca4a54996d90223a62f1bf4ca23  spider_arm64.bin
49bfab2856bfc95d8744e89ebacbe17f69ed04287f499640cea3564115931d34  spider_arm.bin
c4a6aa4de95ed3522f3a2e731eefabda55060a971d72094059b01ee118c1cff7  spider_amd64.exe

Jotti Antivirus Scan Results

Antivirus False Positives:

  • Several antivirus programs on VirusTotal incorrectly flag compiled Go binaries as malicious; these are false positives. The issue primarily affects the Windows executable, but is not limited to it. If this concerns you, carefully review the source code, then compile the binary yourself.
  • Uploading your compiled binaries to https://virustotal.com and leaving an up-vote or a comment would also be helpful.

v0.7.1

01 Jan 22:12
81a5439

Changelog since v0.6.2:

v0.7.0 26edf33

  • added feature to allow crawling specific file extensions (html, htm, txt)
  • added check to keep crawler from crawling offsite URLs
  • added flag "-delay" to avoid rate limiting (-delay 100 == 100ms delay between URL requests)
  • added write buffer for better performance on large files
  • increased crawl depth from 5 to 100 (not recommended, but enabled for edge cases)
  • fixed out-of-bounds slice bug when crawling URLs containing NUL characters
  • fixed bug when attempting to crawl deeper than available URLs to crawl
  • fixed crawl depth calculation
  • optimized code, which runs 2.8x faster vs v0.6.x in bench testing

v0.7.1 81a5439

  • added progress bars to word / ngrams processing & file writing operations
  • added RAM usage monitoring
  • optimized order of operations for faster processing with less RAM
  • TO-DO: refactor code

Checksum:

045bca70d0f8be6326c9bae5c4f412f0af183f2859b5ac30f4e6efdfe06316bd  spider_amd64.bin
8b0525a46a6aca19256e1326338a59e58585933558e85319d68cb0c609c500b2  spider_amd64-darwin
9671739d795c8913659c8169827124ba78725aef3205579d688d058571a9c96b  spider_amd64.exe
50093e85868b77f40e5ece131597e9bbcda646fb2a80970a4d6791e7292a8f01  spider_arm64.bin
0ce2d7b5b232f82c3fe3fd5ca45659e7746400db4ac7842e29757fc226f26d76  spider_armhf.bin

Jotti Antivirus Scan Results

https://virusscan.jotti.org/en-US/filescanjob/fhv8de86sm,jrqzgdwd2b,sc18q9y8uj,gmi0zwcqs2,kjjl0g74m9

v0.6.2

15 Aug 03:16
3eaa1d9

Version 0.6.2:

  • fixed scraping logic & ngram creation bugs
  • switched from gocolly to goquery for web scraping
  • removed dups from word / ngram output
  • can be compiled for Linux (amd64 & armhf / arm64 for Raspberry Pi), Windows & Mac