Conversation
|
@dpol1 Since you are already familiar with our codebase, feel free to provide feedback on this PR :) |
core/src/main/java/org/apache/stormcrawler/bolt/JSoupParserBolt.java
Outdated
Show resolved
Hide resolved
1. URLUtil.toURL(String) utility method
- Added URLUtil.toURL(String) that wraps new URI(url).toURL() and converts URISyntaxException into MalformedURLException —
callers no longer need to handle two exception types
- Fixed getHostSegments(String) to use toURL() and removed URISyntaxException from its signature
2. Pre-sanitization in BasicURLNormalizer.sanitizeForURI()
- Encodes common illegal URI characters before parsing: | → %7C, \ → %5C, space → %20, { → %7B, } → %7D
- Converts non-standard %uXXXX percent encoding to proper UTF-8 percent encoding (e.g., %u011f → %C4%9F)
- Applied at the start of filter() so all subsequent URI parsing succeeds
|
@rzo1 thanks for tagging me, appreciate that - the changes LGTM! However I was going through the documentation and noticed:
Might be worth a follow-up issue, or documenting it as a known behavior change. What do you think? |
|
Yes - that is the downside of the URL deprecation. A follow-up issue documenting this as a known behavior change seems the right call, with a note in the upgrade/migration guide pointing out that pre-existing URLs in the status index may need to be re-normalized (or that a sanitization pass on the seeds (store) may be advisable before upgrading). An alternative to handle that, could be to move the sanitization (from the normalizer) to |
indeed
Agreed, very clever issue handling - I hadn't thought of that. |
For all changes
Is there a issue associated with this PR? Is it referenced in the commit message?
Does your PR title start with
#XXXXwhereXXXXis the issue number you are trying to resolve?Has your PR been rebased against the latest commit within the target branch (typically main)?
Is your initial contribution a single, squashed commit?
Is the code properly formatted with
mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?For code changes
mvn clean verify?