[TYPO3-core] RFC: Bug #13972: cropHTML uses faulty reg exp for HTML entities

Thu Apr 15 21:25:13 CEST 2010

Hi Jigal.

On 15.04.10 20:30, Jigal van Hemert wrote:
> Ralf Hettinger wrote:
>> as one character. The search pattern as used in the current preg_match
>> currently always crops after the first semicolon and won't recognize
>> such entities reliably.
>
> Sorry, but this pattern isn't correct either.
> - valid entities can be longer than 7 characters (e.g. &thetasym;) [1]

When I implemented the first version of cropHTML, I read the spec linked 
in [1], too. I made a list of entities but must have overlooked the only 
entity name in the list with a length > 7 ;-)

> - not everything between & and ; is a valid entity

That's true. But the alternative is to build a list of entity names (at 
least the ones specified in [1]) and run preg_match against only those. 
But anyone can define new entities, which is allowed in every 
SGML-compliant language. What's next?
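
To make the whitelist alternative concrete, here is a minimal sketch. It 
is written in Python rather than the PHP of cropHTML, and it assumes 
that Python's html.entities.name2codepoint table (the HTML 4 entity 
names) stands in for the list in [1]. The function name is hypothetical.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names -> code points

# Anything that merely looks like a named entity: '&', a name, ';'.
CANDIDATE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def known_entities(text):
    """Return the candidates whose name is in the HTML 4 entity list."""
    return [
        m.group(0)
        for m in CANDIDATE.finditer(text)
        if m.group(1) in name2codepoint
    ]
```

With this, "&you;" is rejected while "&amp;" and "&thetasym;" are 
accepted, but any entity declared outside the fixed list would be 
rejected as well.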

The possible impact of a mismatched RegEx is very low: A string like 
"&you;" in "Just me&you; can you imagine?" will be falsely detected as 
an entity and counted as a single char. The cropped string will be at 
most 7 chars longer than expected (assuming that you don't write "Just 
me&you;me&you;me&you;me&you; can you imagine?" ;-) ). There is no risk 
of cropping in the middle of an entity.

IMO the proposed solution (with a corrected range from 2 to 8) is the 
best tradeoff between performance and the risk of not cropping exactly 
at the given position.
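
A rough illustration of that tradeoff, again in Python rather than PHP. 
The pattern below is an assumption (an ampersand, 2 to 8 name 
characters, a semicolon) and not the pattern from the actual patch; 
crop() is a hypothetical helper, not cropHTML itself.

```python
import re

# Assumed loose pattern: '&', 2 to 8 name characters, ';'.
# Illustration only; not the expression used in the patch.
ENTITY = re.compile(r"&[A-Za-z0-9]{2,8};")

def crop(text, limit):
    """Keep `limit` visible chars, counting each entity-like match as one."""
    out, count, pos = [], 0, 0
    while pos < len(text) and count < limit:
        m = ENTITY.match(text, pos)
        # Take the whole match if there is one, else a single character.
        token = m.group(0) if m else text[pos]
        out.append(token)
        pos += len(token)
        count += 1
    return "".join(out)
```

Cropping "a&amp;b" to 2 chars keeps "a&amp;" intact. Cropping "Just 
me&you; can you imagine?" to 8 chars keeps "Just me&you;", which is a 
few chars longer than asked for but never cuts through anything that 
looks like an entity.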

BTW, if you are interested in the original thread, grab some coffee and 
search for "[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened 
tags and counts chars correctly" (starting at 2008-04-03). It's funny to 
read but it will take its time ;-)

Regards
Jochen

> [1] http://www.w3.org/TR/html4/sgml/entities.html