[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly

Sat Apr 5 14:53:44 CEST 2008

Jochen Rau schrieb:
> Hello Martin,
> 
>> [...] you cannot simply change the behaviour of crop as it might me be 
>> used on plain text. So this will have to be stdWrap.cropHTML (or 
>> stdWrap.cropXML) which makes at new feature. Which makes it TYPO3_4-2 
>> only, that is if you get the ok from the release managers. RC1 is 
>> already out!
> 
> You can still crop plain text with the patched stdWrap.crop. So I don't 
> think we need a new feature like stdWrap.cropXML. The patch is just a 
> bugfix IMO.

How come? If my content was "blah <foobar> whatever" the patch will not 
count the content of <foobar> and will add </foobar>, so this is a 
change of the behaviour.

>> Not terribly thrilling I'm afraid. It's simple enough to use either 
>> strpos() or preg_match() * to walk through the html without having to 
>> split the entire content.
> I didn't intend to thrill you ;-).

My apologize for my wording. Fortunately you seem to have a good humour.

 > Before I posted the patch I tested
> three solutions:
> 1) crawling through the content only with strpos() and preg_match()
> 2) utilizing a stack and a recursive function
> 3) splitting the content using preg_split and crawling over the array
> 
> IMO the last solution is the best because the code is fast, readable and 
> (relatively) easy to maintain and debug (the code blocks that split, 
> crop and close tags are clearly separated).

I'm amazed that splitting the whole content is faster, but the reason is 
  probably the size of the content. I agree that is is unlike that you 
crop 500Kb of text, so it will maybe not matter much in real world 
situations.

As for readability: tastes vary ;-) I was rather confused when I saw the 
  multiple for-loop that process the text/arrays.

BTW, can you not build up a simple stack of open tags, so you easily 
close the tags without having to parse the text again.

>> A more serious problem is that you use html_entity_decode() which 
>> supports only a subset of the charset TYPO3 supports.
> 
> The most important charsets are supported by html_entity_decode() (like 
> utf-8, ISO-8859-1, ISO-8859-15, GB2312 or EUC-JP. It has also a fallback 
> to ISO-8859-1 and throws a warning if a charset is not supported.

Sorry, but this is not enough. The code must work for all charsets.

> Furthermore it is not recommended (see PHP-doc) to use entities if the 
> charset is multibyte.

Where? And with multibyte you mean utf-8? There are other multi-byte 
encodings that are not of the Unicode charset. Also you cannot assume 
the content passed to the function contains no entities just becasue it 
is "not recommended".

>> I also do not understand why you have a list of tags to split on. It's 
>> much easier and forward compatible to work on all tags that come along.
> 
> You are right. I first intended to restrict the parsing to a subset of 
> tags. It's not necessary anymore. The new regex (see below) should do 
> that now.

Thanx.

> 
>> Thanx for trying, but there is IMHO much room for improvement.
> 
> Thank you, too. I hope I'm on the right way ;-)
> 
>> * eg
>> preg_match('/<([^ >]+)([^>]*)>/', $html, $match, PREG_OFFSET_CAPTURE)
> 
> Your suggested (and my former) regex don't take all valid HTML-tags into 
> account like the following example shows:
> 
> <img title="next >" src="next.gif">   // valid HTML

Hm, I'm used to use entities here but, you're right the HTML 4.01 spec 
only advises HTML authors to use &gt; within attributes.

I fear that I am not alone with that. At least TYPO3 generated content 
will have all attributes values processed by htmlspecialchars().

Nitpicking: does the code cope with attributes enclosed in single quotes 
or unenclosed content like <img src='image.gif'> or <img src=image.gif>? 
They are valid too (though only as HTML and not as XHTML).

Masi