[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly

Jochen Rau j.rau at web.de
Sat Apr 5 17:18:43 CEST 2008


Hello Martin,

>> You can still crop plain text with the patched stdWrap.crop. So I 
>> don't think we need a new feature like stdWrap.cropXML. The patch is 
>> just a bugfix IMO.
> 
> How come? If my content was "blah <foobar> whatever" the patch will not 
> count the content of <foobar> and will add </foobar>, so this is a 
> change of the behaviour.

A bugfix should always change the behaviour ;-). But I agree with you
that <foobar> should not be recognized as a tag. Maybe I should
introduce a list of tags like

$tags =
   'a|b|blockquote|body|div|em|font|form|h1|h2|h3|h4|h5|h6|i|li|map|ol|'
   . 'option|p|pre|sub|sup|select|span|strong|table|td|textarea|tr|u|ul|'
   . 'br|hr|img|input|area|link'; ;-)

The new patch should solve that problem.
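For illustration, here is a minimal Python sketch of such a whitelist check (the patch itself is PHP; `TAG_RE` and `is_known_tag` are hypothetical names, not from the patch):

```python
import re

# Only strings in this whitelist are treated as real HTML tags;
# anything else (like "<foobar>") would be counted as plain text.
TAGS = ('a|b|blockquote|body|div|em|font|form|h1|h2|h3|h4|h5|h6|i|li|map|ol|'
        'option|p|pre|sub|sup|select|span|strong|table|td|textarea|tr|u|ul|'
        'br|hr|img|input|area|link')
TAG_RE = re.compile(r'</?(?:%s)\b[^>]*>' % TAGS, re.IGNORECASE)

def is_known_tag(token):
    """Return True only for tokens that match the tag whitelist."""
    return bool(TAG_RE.fullmatch(token))

print(is_known_tag('<foobar>'))       # False
print(is_known_tag('<img src="x">'))  # True
```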

I can't reproduce the closing </foobar>. Could you please give me a
hint how you produced it?

BTW this happens only if the content isn't parsed by parseFunc, which
converts all <> in non-HTML and non-TYPO3 tags into &lt; and &gt;. The
only solution within this patch is to run the content through parseFunc
after cropping it.

>> Before I posted the patch I tested
>> three solutions:
>> 1) crawling through the content only with strpos() and preg_match()
>> 2) utilizing a stack and a recursive function
>> 3) splitting the content using preg_split and crawling over the array
>>
>> IMO the last solution is the best because the code is fast, readable 
>> and (relatively) easy to maintain and debug (the code blocks that 
>> split, crop and close tags are clearly separated).
> 
> I'm amazed that splitting the whole content is faster, but the reason is
> probably the size of the content. I agree that it is unlikely that you
> crop 500Kb of text, so it will maybe not matter much in real world
> situations.

The algorithm crawls backwards through the cropped text, which is
probably much shorter than 500Kb.
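To illustrate solution 3, here is a minimal Python sketch of the split-and-crawl idea (the actual patch is PHP using preg_split(); `crop_html` is a hypothetical name, and entity handling, backward cropping and other edge cases are left out):

```python
import re

def crop_html(content, length, suffix='...'):
    """Crop to `length` visible characters, then close any open tags."""
    # Split into alternating text and tag chunks, keeping the tags.
    chunks = re.split(r'(<[^>]+>)', content)
    out, remaining = [], length
    for chunk in chunks:
        if chunk.startswith('<'):
            out.append(chunk)              # tags are kept but not counted
        else:
            out.append(chunk[:remaining])
            remaining -= len(chunk[:remaining])
            if remaining == 0:
                break
    # Crawl over the kept chunks and close still-open tags in reverse order.
    open_tags = []
    for closing, name in re.findall(r'<(/?)(\w+)[^>]*>', ''.join(out)):
        if closing:
            if open_tags and open_tags[-1] == name:
                open_tags.pop()
        elif name not in ('br', 'hr', 'img', 'input', 'area', 'link'):
            open_tags.append(name)
    return ''.join(out) + suffix + ''.join('</%s>' % t
                                           for t in reversed(open_tags))

print(crop_html('<b>Hello <i>world</i></b>', 8))
# <b>Hello <i>wo...</i></b>
```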

> As for readability: tastes vary ;-) I was rather confused when I saw the
> multiple for-loops that process the text/arrays.

I know what you mean. I tried to read t3lib_cs ;-)

> BTW, can you not build up a simple stack of open tags, so you can
> easily close the tags without having to parse the text again?

Hm. That was my first idea, too. But I ran into trouble while cropping
backwards: I lost the attributes of the opening tags. So I have to crawl
through the content array to fetch the appropriate opening tag.

>>> A more serious problem is that you use html_entity_decode() which 
>>> supports only a subset of the charset TYPO3 supports.
>>
>> The most important charsets are supported by html_entity_decode()
>> (like utf-8, ISO-8859-1, ISO-8859-15, GB2312 or EUC-JP). It also has
>> a fallback to ISO-8859-1 and throws a warning if a charset is not
>> supported.
> 
> Sorry, but this is not enough. The code must work for all charsets.

Ok, let's assume you use a scary charset. What happens is that the
entities are not decoded and are thus counted as a bunch of chars. Since
the content itself is not affected by the decoding, there will only be a
few more chars cropped.

IMO the last possible solution to the charset problem is to drop the
html_entity_decode() call from my patch.
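A quick Python illustration of that effect (the patch uses PHP's html_entity_decode(); Python's html.unescape stands in for it here): if entities stay undecoded, each one is counted as several chars, so slightly more gets cropped than the editor expects.

```python
from html import unescape

text = 'Caf&eacute; au lait'

# With decoding, the entity counts as one visible character.
print(len(unescape(text)))  # 12 -- "Café au lait"

# Without it, "&eacute;" is counted as eight chars.
print(len(text))            # 19
```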

>> Furthermore it is not recommended (see PHP-doc) to use entities if the 
>> charset is multibyte.
> 
> Where?

Throughout the HTML code.

> And with multibyte you mean utf-8? There are other multi-byte
> encodings that are not of the Unicode charset. Also you cannot assume
> the content passed to the function contains no entities just because it
> is "not recommended".

I meant multi-byte in general. To be precise: it's not recommended to
use entities if the charset comprises the char represented by the
entity. Entities in attributes are not a problem for the patched
stdWrap.crop, but may cause strange FE output in general if an old
browser is used (see below).

>> Your suggested (and my former) regex don't take all valid HTML-tags 
>> into account like the following example shows:
>>
>> <img title="next >" src="next.gif">   // valid HTML
> 
> Hm, I'm used to use entities here but, you're right the HTML 4.01 spec 
> only advises HTML authors to use &gt; within attributes.
> 
> I fear that I am not alone with that. At least TYPO3 generated content 
> will have all attributes values processed by htmlspecialchars().

Ok, I read the specs once again. While &gt; is allowed in attribute
values, &uuml; for example is NOT. It should be &amp;uuml;, because only
the "special five" (&amp; &quot; &apos; &lt; &gt;) are allowed (see
http://www.w3.org/TR/xhtml1/#C_12).

The values of attributes should therefore not be processed by
htmlspecialchars() if the char can be represented in the used charset.
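A Python sketch of that point (html.escape behaves like PHP's htmlspecialchars() here): only the special characters become entities, while a char like "ö" stays literal as long as the page charset (e.g. utf-8) covers it.

```python
from html import escape

# Only >, " (and &, <, ') are escaped; the umlaut passes through.
print(escape('Größe > 5 "Zoll"'))  # Größe &gt; 5 &quot;Zoll&quot;
```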

I believe that most of the editors typing in a title for an image are
not aware of entities. So it should work for all cases.

Anyhow, the tag <img title="next >" src="next.gif"> is not processed
correctly because of a bug in parseFunc. I am trying to fix that bug at
the moment.

> Nitpicking: does the code cope with attributes enclosed in single quotes 
> or unenclosed content like <img src='image.gif'> or <img src=image.gif>? 
> They are valid too (though only as HTML and not as XHTML).

Yes, it does. This part of the main regex handles these cases:

[...]
(?:
   \".*?\"      # double quoted attribute values
   |
   '.*?'        # single quoted attribute values
   |
   [^'\">\s]+   # an unquoted value, ended by a quote, > or whitespace
)
[...]
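The same alternation, translated to a Python sketch and applied to the three tag variants from your question (the original fragment is PHP/PCRE; `attr` is just a name for this demo):

```python
import re

attr = r"""(?:
    \".*?\"      # double quoted attribute values
    |
    '.*?'        # single quoted attribute values
    |
    [^'\">\s]+   # an unquoted value, ended by a quote, > or whitespace
)"""

for tag in ('<img src="image.gif">',
            "<img src='image.gif'>",
            '<img src=image.gif>'):
    match = re.search('src=(%s)' % attr, tag, re.VERBOSE)
    print(match.group(1))
# "image.gif"
# 'image.gif'
# image.gif
```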

("Every nit picked is a bug fixed before it occurs." (anonymous) ;-))

Greetings
Jochen
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bug_7984.diff
Url: http://lists.netfielders.de/pipermail/typo3-team-core/attachments/20080405/48d25800/attachment-0001.txt 

