[TYPO3-core] RFC: Bug #13972: cropHTML uses faulty reg exp for HTML entities

Jigal van Hemert jigal at xs4all.nl
Thu Apr 15 22:23:39 CEST 2010


Jochen Rau wrote:
> On 15.04.10 20:30, Jigal van Hemert wrote:
>> - valid entities can be longer than 7 characters (e.g. &thetasym;) [1]
> As I implemented the first version of cropHTML, I read the spec linked 
> above, too. I made a list of entities but must have overlooked the only 
> entity name in the list having a length > 7 ;-)

Seems I had a bright moment spotting that one in the list :-D

>> - not everything between & and ; is a valid entity
> That's true. But the alternative is to build a list of entity names (at 
> least the ones specified in [1]) and run preg_match against only those. 
> But anyone can add new entities, which is allowed in every 
> SGML-compliant language. What's next?
(...)
> BTW If you are interested in the original thread, take some coffee and 
> crawl for "[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened 
> tags and counts chars correctly" (starting at 2008-04-03). It's funny to 
> read but it will take its time ;-)

I poured myself a new diet Coke and briefly scanned the thread.

First there was resistance to using html_entity_decode(), and later it 
was used to determine the length of a string with entities.

I couldn't find in the thread whether it would be much slower to just 
traverse the html_entity_decode()d string alongside the original one 
and, whenever the characters at the pointer position differ, skip past 
the next ';' in the original.
The code already relies on that function to determine the length. This 
way you end up with a cropped version of the original string, and all 
entities supported by this function are recognized.

normal chars &agrave;vry &euml; end.
.............x       ....x     .....
normal chars à>>>>>>>vry ë>>>>> end.
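
The parallel walk pictured above could look roughly like the sketch 
below. It is written in Python rather than PHP, with html.unescape() 
standing in for html_entity_decode(); the function name 
crop_keeping_entities is made up for illustration. It assumes every 
entity ends with ';' and decodes to exactly one character. One deviation 
from the plain "characters differ" test: &amp; decodes to '&', which 
compares equal to the '&' in the original, so the sketch decodes the 
candidate up to the next ';' instead of only comparing characters.

```python
import html


def crop_keeping_entities(original: str, max_chars: int) -> str:
    """Crop to at most max_chars visible characters, keeping entities whole.

    Assumption: every entity is ';'-terminated and decodes to one character.
    """
    decoded = html.unescape(original)
    i = 0  # pointer into the original (entity-encoded) string
    j = 0  # pointer into the decoded string = visible characters so far
    while j < max_chars and i < len(original) and j < len(decoded):
        # For '&', look for the next ';' (end stays 0 if there is none).
        end = original.find(";", i) + 1 if original[i] == "&" else 0
        if end and html.unescape(original[i:end]) == decoded[j]:
            i = end  # the whole entity counts as one visible character
        elif original[i] == decoded[j]:
            i += 1  # plain character, identical in both strings
        else:
            raise ValueError(f"unsupported entity form at offset {i}")
        j += 1
    return original[:i]
```

With the example string above, cropping to 14 characters gives 
"normal chars &agrave;" and cropping to 19 gives 
"normal chars &agrave;vry &euml;", so an entity is never cut in half.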

-- 
Jigal van Hemert
skype:jigal.van.hemert
msn: jigal at xs4all.nl
http://twitter.com/jigalvh

