[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly

Sat Apr 5 19:41:59 CEST 2008

Jochen Rau wrote:
> Hello Martin,
> 
>>> You can still crop plain text with the patched stdWrap.crop. So I 
>>> don't think we need a new feature like stdWrap.cropXML. The patch is 
>>> just a bugfix IMO.
>>
>> How come? If my content was "blah <foobar> whatever" the patch will 
>> not count the content of <foobar> and will add </foobar>, so this is a 
>> change of the behaviour.
> 
> A bugfix should always change the behaviour ;-). But I agree with you 
> <foobar> should not be recognized as a tag.

That wasn't my point. My point was that I want to crop my example string 
as plain text! The cropping simply shouldn't care about any < characters.

That's why we need either a new stdWrap function or a new argument to 
stdWrap.crop, e.g. stdWrap.crop.html, that enables "html-mode":

stdWrap.crop = <content>
stdWrap.crop.html = 1
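
To make the difference between the two modes concrete, here is a rough 
sketch in Python (not the actual TYPO3 code). The function names 
crop_plain and crop_html are made up for illustration, the tag pattern is 
deliberately simplistic, and the html variant only skips tags while 
counting; it does not close open tags.

```python
import re

# Simplistic tag pattern, for illustration only.
TAG = re.compile(r"<[^>]*>")


def crop_plain(text, limit, suffix="..."):
    """Plain-text mode: every character counts, including '<' and '>'."""
    if len(text) <= limit:
        return text
    return text[:limit] + suffix


def crop_html(text, limit, suffix="..."):
    """Hypothetical html mode: characters inside tags are not counted."""
    out, count, pos = [], 0, 0
    for match in TAG.finditer(text):
        chunk = text[pos:match.start()]
        if count + len(chunk) > limit:
            # The limit falls inside this text chunk: cut here and stop.
            out.append(chunk[:limit - count] + suffix)
            return "".join(out)
        out.append(chunk)
        count += len(chunk)
        out.append(match.group(0))  # tags are copied but not counted
        pos = match.end()
    chunk = text[pos:]
    if count + len(chunk) > limit:
        out.append(chunk[:limit - count] + suffix)
    else:
        out.append(chunk)
    return "".join(out)


sample = "blah <foobar> whatever"
print(crop_plain(sample, 10))
print(crop_html(sample, 10))
```

With my example string, the plain variant cuts in the middle of 
"<foobar>", which is exactly what I want for plain text, while the html 
variant keeps the tag intact and counts only the surrounding text.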

> I can't reproduce the closing </foobar>. Could you please give me a 
> hint how you produced that?

OK, I didn't actually try it, but your description suggested auto-closing 
of tags. So in theory foobar should have been closed.

>> As for readability: tastes vary ;-) I was rather confused when I saw 
>> the multiple for-loops that process the text/arrays.
> 
> I know what you mean. I tried to read t3lib_cs ;-)

I hope you like my parts better than Kasper's ;-)

>>>> A more serious problem is that you use html_entity_decode() which 
>>>> supports only a subset of the charset TYPO3 supports.
>>>
>>> The most important charsets are supported by html_entity_decode() 
>>> (like utf-8, ISO-8859-1, ISO-8859-15, GB2312 or EUC-JP). It also 
>>> falls back to ISO-8859-1 and throws a warning if a charset is not 
>>> supported.
>>
>> Sorry, but this is not enough. The code must work for all charsets.
> 
> OK, let's assume you use a scary charset. What happens is that the 
> entities are not decoded and thus counted as a bunch of chars. Since the 
> content is not affected by the decoding, only a few more chars will be 
> cropped.

Hopefully not within the entity.
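
One way to avoid cutting inside an entity without decoding anything is to 
count each entity as a single character. The following is only a sketch 
in Python, with a made-up function name and an entity pattern that I 
assume is good enough for illustration. It works on Python strings, so it 
says nothing about byte-level handling of multi-byte charsets in PHP.

```python
import re

# Named, decimal and hexadecimal entities; an assumption for illustration.
ENTITY = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def crop_entity_safe(text, limit):
    """Crop to `limit` characters, counting each entity as one character."""
    count, pos = 0, 0
    while pos < len(text) and count < limit:
        match = ENTITY.match(text, pos)
        # Step over a whole entity at once, otherwise over one character.
        pos = match.end() if match else pos + 1
        count += 1
    return text[:pos]


sample = "f&uuml;r alle"
print(sample[:4])                   # naive crop cuts the entity in half
print(crop_entity_safe(sample, 2))  # the entity stays intact
```

Since nothing is decoded, the result does not depend on which charsets 
html_entity_decode() supports.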

> IMO the last possible solution to solve the charset problem is to drop 
> the html_entity_decode() in my patch.
> 
>>> Furthermore it is not recommended (see PHP-doc) to use entities if 
>>> the charset is multibyte.
>>
>> Where?
> 
> In the whole HTML-code.

I meant: where has this advice been given?

> 
>> And with multibyte you mean utf-8? There are other multi-byte 
>> encodings that are not of the Unicode charset. Also you cannot assume 
>> the content passed to the function contains no entities just because 
>> it is "not recommended".
> 
> I meant multi-byte. To be precise: It's not recommended to use 
> entities if the charset comprises the char represented by the 
> entity.

That amounts to "you don't need to use entities if your charset contains 
the desired characters".

> Ok, I read the specs once again. While &gt; is allowed in attribute 
> values &uuml; for example is NOT. This should be &amp;uuml; because only 
> the "special five" (&amp; &quot; &apos; &lt; &gt;) are allowed (see 
> http://www.w3.org/TR/xhtml1/#C_12).

You misinterpret this. This is about a single & in an attribute, e.g. 
within a URL. Of course entities are allowed in attributes.

> I believe that most of the editors typing in a title for an image are 
> not aware of entities.

They should not need to write any markup in a title!!!

> Anyhow the tag <img title="next >" src="next.gif"> is not correctly 
> processed because of a bug in the parseFunc. I am trying to fix that 
> bug at the moment.

Let me guess, the code doesn't like the > within the attribute.
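
To illustrate what I am guessing at (this is not the parseFunc code, 
which I have not checked): a tag pattern that simply stops at the first > 
ends the tag too early, while a pattern that skips over quoted attribute 
values does not. Both patterns below are my own assumptions, written in 
Python for brevity.

```python
import re

# Naive pattern: the tag ends at the first '>' it sees.
NAIVE_TAG = re.compile(r"<[^>]*>")

# Quote-aware pattern: quoted attribute values may contain '>'.
QUOTE_AWARE_TAG = re.compile(r"""<(?:"[^"]*"|'[^']*'|[^'">])*>""")

tag = '<img title="next >" src="next.gif">'

naive = NAIVE_TAG.match(tag).group(0)
aware = QUOTE_AWARE_TAG.match(tag).group(0)

print(naive)  # stops inside the title attribute
print(aware)  # matches the complete tag
```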

>> Nitpicking: does the code cope with attributes enclosed in single 
>> quotes or unenclosed content like <img src='image.gif'> or <img 
>> src=image.gif>? They are valid too (though only as HTML and not as 
>> XHTML).
> 
> Yes, it does. This part of the main regex is capable of this:
> 
> [...]
> (?:
>   \".*?\" # double quoted attribute values
>   |
>   '.*?'    # single quoted attribute values
>   |
>   [^'\">\s]+ # a string without " or ' followed by a space
> )
> [...]

Cool.
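
For the record, here is a quick check of that alternation against my 
three examples. It is a Python sketch, not the regex from the patch: the 
attribute-name part and the helper name attributes() are my own 
additions, only the three value alternatives are taken from the fragment 
quoted above.

```python
import re

ATTRIBUTE = re.compile(r"""
    ([\w-]+) \s* = \s*     # attribute name and equals sign (my assumption)
    (
        ".*?"              # double-quoted attribute value
      |
        '.*?'              # single-quoted attribute value
      |
        [^'">\s]+          # unquoted value, ends at whitespace or >
    )
""", re.VERBOSE)


def attributes(tag):
    """Return a dict of attribute names to values with quotes stripped."""
    return {name: value.strip("'\"") for name, value in ATTRIBUTE.findall(tag)}


for tag in ('<img src="image.gif">', "<img src='image.gif'>", "<img src=image.gif>"):
    print(attributes(tag))
```

All three spellings give the same src value, and a > inside a quoted 
value is kept as part of the value.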

Masi