[TYPO3-core] RFC: Bug #13972: cropHTML uses faulty reg exp for HTML entities
Jochen Rau
jochen.rau at typoplanet.de
Thu Apr 15 21:25:13 CEST 2010
Hi Jigal.
On 15.04.10 20:30, Jigal van Hemert wrote:
> Ralf Hettinger wrote:
>> as one character. The search pattern as used in the current preg_match
>> currently always crops after the first semicolon and won't recognize
> such entities reliably.
>
> Sorry, but this pattern isn't correct either.
> - valid entities can be longer than 7 characters (e.g. &thetasym;) [1]
When I implemented the first version of cropHTML, I read the spec linked
above, too. I made a list of entities but must have overlooked the only
entity name in the list with a length > 7 ;-)
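
The length claim is easy to check. The following is only a quick sketch in
Python (not the PHP used in TYPO3) and assumes that the standard library
table html.entities.name2codepoint mirrors the HTML 4 entity list in [1]:

```python
from html.entities import name2codepoint

# name2codepoint maps HTML 4 entity names (without "&" and ";") to code points.
# Collect every name longer than 7 characters and the overall maximum length.
long_names = sorted(name for name in name2codepoint if len(name) > 7)
longest = max(len(name) for name in name2codepoint)

print(long_names, longest)
```

If that assumption holds, the list of long names shows which entries a
{2,7} quantifier would miss.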
> - not everything between & and ; is a valid entity
That's true. But the alternative is to build a list of entity names (at
least the ones specified in [1]) and make a preg_match only with these
ones. And anyone can define new entities, which is allowed in every
SGML-compliant language. What then?
The possible impact of a mismatched RegEx is very low: A string like
"&you;" in "Just me&you; can you imagine?" will be falsely detected as
an entity and counted as a single char. The cropped string will be at
most 7 chars longer than expected (assuming that you don't write "Just
me&you;me&you;me&you;me&you; can you imagine?" ;-) ). There is no risk
of cropping in the middle of an entity.
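
To illustrate the effect, here is a minimal sketch in Python. It is not the
cropHTML code; the pattern and the helper names visible_length and crop are
made up for this example, and only the idea of counting an entity-like token
as one char is taken from the discussion:

```python
import re

# Hypothetical pattern: "&", then 2 to 8 name characters, then ";".
ENTITY = re.compile(r"&[#A-Za-z0-9]{2,8};")

def visible_length(text: str) -> int:
    """Count chars, treating each entity-like token as a single char."""
    return len(ENTITY.sub("x", text))

def crop(text: str, limit: int) -> str:
    """Keep the first `limit` counted chars without splitting a token."""
    out = []
    count = 0
    pos = 0
    while pos < len(text) and count < limit:
        match = ENTITY.match(text, pos)
        # An entity-like token is copied whole; anything else char by char.
        token = match.group(0) if match else text[pos]
        out.append(token)
        pos += len(token)
        count += 1
    return "".join(out)

sample = "Just me&you; can you imagine?"
print(len(sample), visible_length(sample))
print(crop(sample, 8))
```

In this sketch "&you;" is counted as one char, so cropping to 8 chars returns
a string that is longer than 8 bytes, but a real entity such as &amp; is never
cut in half.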
IMO the proposed solution (with a corrected range from 2 to 8) is the
best tradeoff between performance and the risk of not cropping exactly
at the given position.
BTW, if you are interested in the original thread, take some coffee and
search for "[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened
tags and counts chars correctly" (starting at 2008-04-03). It's funny to
read but it will take its time ;-)
Regards
Jochen
> [1] http://www.w3.org/TR/html4/sgml/entities.html