@jtojnar jtojnar commented Mar 3, 2025

Backports #102, #104

jtojnar added 4 commits March 3, 2025 23:35
`parse_url($this->url, \PHP_URL_HOST)` returns `null` for a local filesystem path.
Casting it to `string` produces an empty regular expression,
which would match any link when computing link density.
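
A minimal sketch of the failure mode and one way to guard against it. The helper name `hostPattern` and the way the pattern is wrapped in delimiters are assumptions for illustration, not the library's actual code; the sketch avoids type hints so it stays valid on PHP 5.6.

```php
<?php
// Hypothetical helper (not from the library): build a host-matching pattern,
// or return null when the URL has no host, e.g. a local filesystem path.
function hostPattern($url)
{
    $host = parse_url($url, PHP_URL_HOST);
    if ($host === null || $host === false) {
        // No host: the caller should skip domain-based matching entirely.
        return null;
    }

    return '/' . preg_quote($host, '/') . '/i';
}

// A path-only "URL" has no host component.
var_dump(parse_url('/tmp/article.html', PHP_URL_HOST));

// An empty pattern matches every subject, so every link would look internal.
var_dump(preg_match('//', 'https://other.example/page'));

// With the guard, local files yield no pattern at all.
var_dump(hostPattern('/tmp/article.html'));
var_dump(hostPattern('https://example.com/a'));
```

Returning `null` instead of an empty pattern forces the caller to handle the "no domain" case explicitly rather than silently matching everything.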

(cherry picked from commit c7208f6)

This also fixes a warning, since 1.x passes `null` directly to `preg_replace` instead of explicitly casting it to `string`.
Also capitalize it properly.

(cherry picked from commit 90869d8)
It is unused since 8ab7d76.

(cherry picked from commit 541fab3)
`DOMDocument::loadHTML` parses HTML documents as ISO-8859-1 if there is no `meta[charset]` tag. This means that UTF-8-encoded HTML fragments, such as those coming from the JSON-LD `articleBody` field, would be parsed with the wrong encoding.
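
The encoding behavior can be observed with a small comparison. This is only a sketch: `fragmentText` is a hypothetical helper, and the file is assumed to be saved as UTF-8 so the string literal contains UTF-8 bytes.

```php
<?php
// Hypothetical helper for illustration: parse HTML and return its text content.
function fragmentText($html)
{
    $previous = libxml_use_internal_errors(true); // collect parser warnings silently
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->documentElement->textContent;
}

$fragment = '<p>Déjà vu</p>'; // UTF-8 bytes, no charset declaration

// Without a declaration the parser does not assume UTF-8.
$undeclared = fragmentText($fragment);

// With a declaration the same bytes are decoded as UTF-8.
$declared = fragmentText('<meta charset="utf-8">' . $fragment);

var_dump($undeclared === 'Déjà vu');
var_dump($declared === 'Déjà vu');
```

Prefixing the tag is acceptable here only because the input is a bare fragment with no `html` element of its own.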

In f14428e, we tried to resolve it by putting a `meta[charset]` tag at the start of the HTML fragment. Unfortunately, it turns out that this causes the parser to auto-insert an `html` element, losing the attributes of the original `html` tag.

Let’s try to insert the `meta[charset]` tag into the proper place in the HTML document.
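
One possible way to place the tag, shown as a hedged sketch rather than the code from efbbc86: `withCharsetMeta` is a hypothetical name, and the regular expressions assume simple opening tags whose attributes contain no `>` character.

```php
<?php
// Hypothetical sketch: insert a charset declaration without displacing <html>.
function withCharsetMeta($html)
{
    $meta = '<meta charset="utf-8">';

    // Prefer an existing <head>: put the declaration right after its opening tag.
    $result = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $html, 1, $count);
    if ($count > 0) {
        return $result;
    }

    // No <head>: create one right after the opening <html> tag,
    // so the original <html> element and its attributes stay in place.
    $result = preg_replace('/<html\b[^>]*>/i', '$0<head>' . $meta . '</head>', $html, 1, $count);
    if ($count > 0) {
        return $result;
    }

    // Bare fragment: there is no <html> tag to preserve, so prefixing is fine.
    return $meta . $html;
}

$html = '<html lang="fr"><body><p>Déjà vu</p></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(withCharsetMeta($html));
libxml_clear_errors();

var_dump($doc->documentElement->getAttribute('lang'));
```

The `\b` after `head` keeps the first pattern from matching `<header>`. A real implementation would also need to handle comments, doctypes, and unusual attribute values, which this sketch ignores.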

We do not need to use the same trick with `JSLikeHTMLElement::__set`.
That expects smaller HTML fragments, not `html` documents, so creating `html` and `head` elements will not be a problem.

(cherry picked from commit efbbc86)

Had to strip type hints since we still target PHP 5.6.
@jtojnar jtojnar force-pushed the backports-local-no-domain branch from ac4bb9f to f1c6297 on March 4, 2025 00:59
@jtojnar jtojnar changed the title from "[1.x] Do not set domainRegExp for local files" to "[1.x] Backport parser_url + html[lang] fixes" on Mar 4, 2025
@j0k3r j0k3r merged commit 109a226 into j0k3r:1.x Mar 4, 2025
15 checks passed
@jtojnar jtojnar deleted the backports-local-no-domain branch March 4, 2025 09:25
2 participants