This is a multithreaded scraping script for Pastebin. It scrapes the main site for new pastes, downloads their raw content and processes them according to a user-defined output format.
Fun.
The usual dance.
pip install -r requirements.txt
Define all required specs in settings.ini. Should you decide to go with a database output, make sure the respective connector is installed. At the moment, MySQL (via pymysql) and SQLite (via the sqlite3 module built into Python 3) are supported.
Also note that the file output creates a subdirectory named output and dumps every paste into it as a separate file.
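To illustrate the file output described above, here is a minimal sketch of writing one paste per file into an output subdirectory. The function name write_paste, the .txt extension and the use of the paste ID as the file name are assumptions for illustration, not necessarily what the script does.

```python
from pathlib import Path


def write_paste(paste_id: str, content: str, base_dir: str = ".") -> Path:
    """Write one paste to <base_dir>/output/<paste_id>.txt and return its path.

    Hypothetical helper: the real script may name its files differently.
    """
    out_dir = Path(base_dir) / "output"
    # Create the subdirectory on first use; do nothing if it already exists.
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{paste_id}.txt"
    target.write_text(content, encoding="utf-8")
    return target
```

Calling write_paste("abc123", "some text") would leave a file output/abc123.txt in the current directory.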
ini is a highly underrated file format. Here are some definitions of what the settings parameters actually do.
General settings:
PasteLimit: Stop after having scraped n pastes. Set to 0 for indefinite scraping.
PBLink: URL to Pastebin or another equivalent site.
DownloadWorkers: Number of workers that download the raw paste content and further process it.
NewPasteCheckInterval: Time to wait before checking the main site for new pastes again.
IPBlockedWaitTime: Time to wait until checking the main site again after the scraper's IP has been blocked.
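How DownloadWorkers and PasteLimit might interact can be sketched with a thread pool fed from a queue. This is an illustration under assumptions, not the script's actual code: run_workers, its parameters and the download callable are hypothetical names, and a real scraper would fetch paste IDs from the site instead of taking a list.

```python
import queue
import threading


def run_workers(paste_ids, download, num_workers=2, paste_limit=0):
    """Process paste IDs with num_workers threads.

    Stops queueing after paste_limit items (0 means no limit).
    Returns a list of (paste_id, content) tuples in completion order.
    """
    todo = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            paste_id = todo.get()
            if paste_id is None:  # sentinel: no more work for this thread
                break
            content = download(paste_id)
            with lock:  # protect the shared result list
                results.append((paste_id, content))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    for count, paste_id in enumerate(paste_ids, start=1):
        todo.put(paste_id)
        if paste_limit and count >= paste_limit:
            break

    for _ in threads:  # one sentinel per worker
        todo.put(None)
    for t in threads:
        t.join()
    return results
```

The sentinel values let every worker exit cleanly once the queue is drained, so the function returns only after all queued pastes are processed.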
Logging settings:
RotationLog: Location of the log file that contains debug output.
MaxRotationSize: Size in bytes before another log file is created.
RotationBackupCount: Maximum number of log files to keep.
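The three logging parameters map naturally onto Python's standard RotatingFileHandler. The sketch below assumes that mapping; build_logger, the logger name and the log format are hypothetical and may differ from what the script does.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(rotation_log, max_rotation_size, rotation_backup_count):
    """Return a debug logger that rotates its file by size.

    rotation_log          -> path of the log file
    max_rotation_size     -> size in bytes before a new file is started
    rotation_backup_count -> number of old log files to keep
    """
    logger = logging.getLogger("pastebin_scraper_example")  # hypothetical name
    logger.setLevel(logging.DEBUG)
    handler = RotatingFileHandler(
        rotation_log,
        maxBytes=max_rotation_size,
        backupCount=rotation_backup_count,
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With build_logger("scraper.log", 1048576, 5), the handler would start a new file once scraper.log reaches about 1 MiB and keep at most five older files.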
Stdout output settings:
Enable: Enable formatted stdout output of paste data.
ContentDisplayLimit: Maximum number of characters to show before content is cut off (0 to display all).
ShowName: Display the paste name.
ShowLang: Display the paste language.
ShowLink: Display the complete paste link.
ShowData: Display the raw paste content.
DataEncoding: Encoding of the raw paste data.
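As a rough sketch of how the stdout options could combine, the function below decodes the raw data, cuts it off at the display limit and only includes the fields that are switched on. The name format_paste, the field labels, the "..." cut-off marker and the assumption that raw data arrives as bytes are all illustrative choices.

```python
def format_paste(name, lang, link, data, content_display_limit=0,
                 show_name=True, show_lang=True, show_link=True,
                 show_data=True, data_encoding="utf-8"):
    """Build the text printed for one paste (hypothetical layout)."""
    lines = []
    if show_name:
        lines.append(f"Name: {name}")
    if show_lang:
        lines.append(f"Lang: {lang}")
    if show_link:
        lines.append(f"Link: {link}")
    if show_data:
        # Undecodable bytes are replaced instead of raising an error.
        text = data.decode(data_encoding, errors="replace")
        # A limit of 0 means "show everything".
        if content_display_limit > 0 and len(text) > content_display_limit:
            text = text[:content_display_limit] + "..."
        lines.append(text)
    return "\n".join(lines)
```

Printing the return value of format_paste(...) with content_display_limit=100 would show at most the first 100 characters of the paste body.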
MySQL output settings:
Enable: Enable MySQL output.
TableName: Main table name to insert data into.
Host: MySQL server host.
Port: MySQL server port.
Username: MySQL server user.
Password: User password.
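For orientation, here is one way the MySQL settings could be turned into a pymysql connection. The section name MySQL and the helper names are assumptions; no database is selected because no such setting is listed above, so a real setup would need to handle that separately.

```python
import configparser


def mysql_connect_kwargs(section):
    """Map the settings of one ini section to pymysql.connect() arguments."""
    return {
        "host": section["Host"],
        "port": section.getint("Port"),  # ini values are strings by default
        "user": section["Username"],
        "password": section["Password"],
    }


def connect(section):
    """Open a connection; requires the pymysql package to be installed."""
    # Imported lazily so pymysql is only needed when MySQL output is enabled.
    import pymysql
    return pymysql.connect(**mysql_connect_kwargs(section))


# Hypothetical usage with an in-memory ini fragment:
config = configparser.ConfigParser()
config.read_string(
    "[MySQL]\n"
    "Enable = yes\n"
    "Host = localhost\n"
    "Port = 3306\n"
    "Username = scraper\n"
    "Password = secret\n"
)
kwargs = mysql_connect_kwargs(config["MySQL"])
```

Keeping the argument mapping separate from the actual connect call makes the configuration handling checkable without a running MySQL server.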
SQLite output settings:
Enable: Enable SQLite output.
Filename: Filename the db should be saved as (usually ends with .db).
TableName: Main table name to insert data into.
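A minimal sketch of SQLite output using the built-in sqlite3 module follows. The table schema (name, lang, link, data) and the function names are assumptions made for illustration; the script's real schema may differ.

```python
import sqlite3


def _check_table_name(table_name):
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table_name.isidentifier():
        raise ValueError(f"invalid table name: {table_name!r}")


def open_db(filename, table_name):
    """Open (or create) the database file and make sure the table exists."""
    _check_table_name(table_name)
    conn = sqlite3.connect(filename)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table_name}" ('
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "name TEXT, lang TEXT, link TEXT, data TEXT)"
    )
    conn.commit()
    return conn


def insert_paste(conn, table_name, name, lang, link, data):
    """Insert one paste; the values are passed as bound parameters."""
    _check_table_name(table_name)
    conn.execute(
        f'INSERT INTO "{table_name}" (name, lang, link, data) '
        "VALUES (?, ?, ?, ?)",
        (name, lang, link, data),
    )
    conn.commit()
```

Passing ":memory:" as the filename gives a throwaway database, which is handy for trying the sketch without creating a .db file.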
If you use this thing for some cool data analysis or even research, let me know if I can help!
Inspiration for this scraper was taken from here.