Skip to content

#1597 - Java 20 deprecates all constructors of the class java.net.URL#1834

Open
rzo1 wants to merge 1 commit intomainfrom
1597
Open

#1597 - Java 20 deprecates all constructors of the class java.net.URL#1834
rzo1 wants to merge 1 commit intomainfrom
1597

Conversation

@rzo1
Copy link
Contributor

@rzo1 rzo1 commented Mar 20, 2026

For all changes

  • Is there a issue associated with this PR? Is it referenced in the commit message?

  • Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

  • Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

For code changes

  • Have you ensured that the full suite of tests is executed via mvn clean verify?
  • Have you written or updated unit tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE file, including the main LICENSE file?
  • If applicable, have you updated the NOTICE file, including the main NOTICE file?

@rzo1 rzo1 added this to the 3.5.2 milestone Mar 20, 2026
@rzo1 rzo1 requested review from jnioche, mvolikas and sigee March 20, 2026 19:44
@rzo1
Copy link
Contributor Author

rzo1 commented Mar 21, 2026

@dpol1 Since you are already familiar with our codebase, feel free to provide feedback on this PR :)

  1. URLUtil.toURL(String) utility method

  - Added URLUtil.toURL(String) that wraps new URI(url).toURL() and converts URISyntaxException into MalformedURLException —
  callers no longer need to handle two exception types
  - Fixed getHostSegments(String) to use toURL() and removed URISyntaxException from its signature

  2. Pre-sanitization in BasicURLNormalizer.sanitizeForURI()

  - Encodes common illegal URI characters before parsing: | → %7C, \ → %5C, space → %20, { → %7B, } → %7D
  - Converts non-standard %uXXXX percent encoding to proper UTF-8 percent encoding (e.g., %u011f → %C4%9F)
  - Applied at the start of filter() so all subsequent URI parsing succeeds
@dpol1
Copy link
Contributor

dpol1 commented Mar 21, 2026

@rzo1 thanks for tagging me, appreciate that - the changes LGTM!

However I was going through the documentation and noticed:
URLUtil.toURL() uses new URI(string).toURL() which is strict RFC 3986,
while the old new URL(string) was more permissive with characters like pipes (|) or unencoded spaces.

sanitizeForURI handles this correctly inside BasicURLNormalizer. However, my concern is about legacy data already present in the status store, or URLs injected directly into the backend (e.g. Solr/OpenSearch via external scripts) that bypass URLFilterBolt entirely. Since FetcherBolt reads them raw, something like http://example.com?q=a|b would have been fetched fine before and would now silently fail as ERROR.

Might be worth a follow-up issue, or documenting it as a known behavior change. What do you think?
Not a blocker for this PR though! Just a silent functional regression.

@rzo1
Copy link
Contributor Author

rzo1 commented Mar 21, 2026

Yes - that is the downside of the URL deprecation.

A follow-up issue documenting this as a known behavior change seems the right call, with a note in the upgrade/migration guide pointing out that pre-existing URLs in the status index may need to be re-normalized (or that a sanitization pass on the seeds (store) may be advisable before upgrading).

An alternative to handle that, could be to move the sanitization (from the normalizer) to URLUtil#toURL(...).

@dpol1
Copy link
Contributor

dpol1 commented Mar 21, 2026

Yes - that is the downside of the URL deprecation.

indeed

A follow-up issue documenting this as a known behavior change seems the right call, with a note in the upgrade/migration guide pointing out that pre-existing URLs in the status index may need to be re-normalized (or that a sanitization pass on the seeds (store) may be advisable before upgrading).

An alternative to handle that, could be to move the sanitization (from the normalizer) to URLUtil#toURL(...).

Agreed, very clever issue handling - I hadn't thought of that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants