[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly

Sun Apr 6 02:28:57 CEST 2008

Hello Martin,

>>>> You can still crop plain text with the patched stdWrap.crop. So I 
>>>> don't think we need a new feature like stdWrap.cropXML. The patch is 
>>>> just a bugfix IMO.
>>>
>>> How come? If my content was "blah <foobar> whatever" the patch will 
>>> not count the content of <foobar> and will add </foobar>, so this is 
>>> a change of the behaviour.
>>
>> A bugfix should always change the behaviour ;-). But I agree with you 
>> <foobar> should not be recognized as a tag.
> 
> That wasn't my point. My point was that I want to crop my example string 
> as plain text! The cropping simply shouldn't care about any < characters.
> 
> That's why we either need a new stdWrap function or a new argument to 
> stdWrap.crop, eg stdWrap.crop.html that enables "html-mode":

There are two possible scenarios:
1) "blah <foobar> whatever" is parsed with parseFunc. Then it will be
counted as "blah <foobar> whatever". The delivered HTML code will be
"blah &lt;foobar&gt; whatever", shown as "blah <foobar> whatever".
That's what is expected.
2) "blah <foobar> whatever" is cropped as plain text. Then it will be
counted as "blah <foobar> whatever" (leaving aside the
entities discussion), delivered as "blah <foobar> whatever" and shown as
"blah  whatever" (Firefox 2.0.0.13) or "blah whatever" (Safari 3.1), both
ignoring the unknown tag "<foobar>". That's also what is expected.
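
The two scenarios can be sketched in a few lines of Python. This is not
TYPO3 code: html.escape() merely stands in for the entity encoding done
on the parseFunc path, which is an assumption made for illustration.

```python
import html

text = "blah <foobar> whatever"

# Scenario 1: the text is entity-encoded before delivery, so the browser
# displays the literal characters "<foobar>".
# (html.escape is a stand-in for the real encoding step, not TYPO3 code.)
delivered_encoded = html.escape(text, quote=False)

# Scenario 2: the text is delivered as-is; a browser treats "<foobar>"
# as an unknown tag and does not display it.
delivered_plain = text

# In both scenarios the crop counts the characters of the same source string.
counted_chars = len(text)

print(counted_chars)
print(delivered_encoded)
print(delivered_plain)
```

In both cases the counting is done on the same source string; only the
delivered markup differs.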

Can we hopefully close this part of the discussion? What do you think?

>>> As for readability: tastes vary ;-) I was rather confused when I saw 
>>> the  multiple for-loop that process the text/arrays.
>>
>> I know what you mean. I tried to read t3lib_cs ;-)
> 
> I hope you like my parts better than Kasper's ;-)

Most of all I like the method utf8_substr() for its cascading return
statements ;-)

>>>>> A more serious problem is that you use html_entity_decode() which 
>>>>> supports only a subset of the charset TYPO3 supports.
>>>>
>>>> The most important charsets are supported by html_entity_decode() 
>>>> (like utf-8, ISO-8859-1, ISO-8859-15, GB2312 or EUC-JP). It also has 
>>>> a fallback to ISO-8859-1 and throws a warning if a charset is not 
>>>> supported.
>>>
>>> Sorry, but this is not enough. The code must work for all charsets.
>>
>> Ok, let's assume you use a scary charset. What happens is that the 
>> entities are not decoded and thus counted as a bunch of chars. Since 
>> the content is not affected by the decoding, only a few more chars 
>> will be cropped.
> 
> Hopefully not within the entity.

Ok, you got me - another bug in the original stdWrap.crop that is still
not fixed. A future patch should handle this, too.
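
One way such a future patch could avoid cutting inside an entity is
sketched below in Python. It is neither TYPO3 code nor part of the patch:
the function name crop_outside_entities and the entity pattern are
hypothetical. Note that it still counts an undecoded entity as several
chars; it only makes sure the cut never lands inside one.

```python
import re

# Matches a named or numeric character reference such as &uuml; or &#252;
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def crop_outside_entities(text: str, limit: int) -> str:
    """Crop text to at most `limit` chars without splitting an entity."""
    cut = min(limit, len(text))
    for match in ENTITY.finditer(text):
        if match.start() < cut < match.end():
            # The cut would land inside this entity: back off to its start.
            cut = match.start()
            break
    return text[:cut]


print(crop_outside_entities("K&uuml;che", 4))
print(crop_outside_entities("K&uuml;che", 7))
```

A naive crop of "K&uuml;che" at 4 chars would leave the broken fragment
"K&uu" behind; the sketch backs off to "K" instead.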

>> IMO the last possible solution to solve the charset problem is to drop 
>> the html_entity_decode() in my patch.

Another solution is to implement an API function which handles
charset2entity and entity2charset conversions for all charsets allowed
in TYPO3 (see http://bugs.typo3.org/view.php?id=12 )

>>>> Furthermore it is not recommended (see PHP-doc) to use entities if 
>>>> the charset is multibyte.
>>>
>>> Where?
>>
>> In the whole HTML-code.
> 
> I meant, where has this advice been given?

Well, I read a bunch of documentation. I'm still searching in my brain
........

The most helpful documentation I found on this point is:
http://www.w3.org/TR/2000/REC-xml-20001006#entproc
http://www.w3.org/TR/2000/REC-xml-20001006#inliteral

>>> And with multibyte you mean utf-8? There are other multi-byte 
>>> encodings that are not of the Unicode charset. Also you cannot assume 
>>> the content passed to the function contains no entities just because 
>>> it is "not recommended".
>>
>> I meant multi-byte. To be precise: It's not recommended to use 
>> entities if the charset comprises the char represented by the entity.
> 
> That amounts to "you don't need to use entities if your charset contains 
> the desired characters".

Yes.

>> Ok, I read the specs once again. While &gt; is allowed in attribute 
>> values &uuml; for example is NOT. This should be &amp;uuml; because 
>> only the "special five" (&amp; &quot; &apos; &lt; &gt;) are allowed 
>> (see http://www.w3.org/TR/xhtml1/#C_12).
> 
> You misinterpret this. This is about single & in an attribute, eg within 
> a URL. Of course entities are allowed in attributes.

Well, you're right. But if you want to be safe you should only use the 
"special five". See the article
http://www.phpwact.org/php/i18n/charsets#common_problem_areas_with_utf-8

>> I believe that most of the editors typing in a title for an image are 
>> not aware of entities.
> 
> They should not need to write any markup in a title!!!

I agree. They should be able to enter ">" instead of "&gt;" for
instance. So that's what I voted for. This is not possible until we
have fixed the newly encountered bug described below.

>> Anyhow the tag <img title="next >" src="next.gif"> is not correctly 
>> processed because of a bug in the parseFunc. I am trying to fix that 
>> bug at the moment.
> 
> Let me guess, the code doesn't like the > within the attribute.

You're right. It seems to be a combination of an issue in parseFunc and
t3lib_parsehtml::HTMLcleaner called by HTMLparser_TSbridge() in
tslib_content. I'm on the prowl for that bug ;-)
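
The general failure mode guessed at above can be reproduced with a naive
tag pattern. This is only an illustration in Python, not the actual
parseFunc or HTMLcleaner code, and the real cause may well differ:

```python
import re

tag = '<img title="next >" src="next.gif">'

# A naive pattern treats the first ">" as the end of the tag, even though
# that ">" sits inside a quoted attribute value.
naive_match = re.match(r"<[^>]*>", tag).group(0)

print(naive_match)
```

The naive match stops at the ">" inside the title attribute, so the rest
of the tag is left over as if it were text.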

>>> Nitpicking: does the code cope with attributes enclosed in single 
>>> quotes or unenclosed content like <img src='image.gif'> or <img 
>>> src=image.gif>? They are valid too (though only as HTML and not as 
>>> XHTML).

>> Yes, it does. This part of the main regex is capable of this:
>>
>> [...]
>> (?:
>>   \".*?\" # double quoted attribute values
>>   |
>>   '.*?'    # single quoted attribute values
>>   |
\s]+">
>>   [^'\">\s]+ # unquoted attribute values (no quotes, > or whitespace)
>> )
>> [...]
> 
> Cool.

Thanx. The regex is based on a snippet described in Jeffrey E. F.
Friedl's "Regular Expressions" (you have surely noticed it already: I
love regular expressions ;-) ).
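
To show how the alternation quoted above copes with the three attribute
styles, here is a small Python sketch. Only the alternation is taken from
the quoted fragment; the surrounding tag-name and attribute-name parts
and the name TAG are assumptions and not the regex of the patch.

```python
import re

# Only the three-way alternation comes from the quoted fragment; the rest
# of the pattern is a hypothetical wrapper for illustration.
TAG = re.compile(r"""
    <\w+                      # opening bracket and tag name
    (?:
        \s+[\w:-]+            # attribute name
        (?:
            \s*=\s*
            (?:
                ".*?"         # double quoted attribute values
                |
                '.*?'         # single quoted attribute values
                |
                [^'">\s]+     # unquoted attribute values
            )
        )?
    )*
    \s*/?>                    # end of tag
""", re.VERBOSE)

samples = [
    '<img src="image.gif">',
    "<img src='image.gif'>",
    "<img src=image.gif>",
    '<img title="next >" src="next.gif">',
]

# True for every sample if the whole tag, and nothing less, is matched.
results = [TAG.fullmatch(sample) is not None for sample in samples]
print(results)
```

Because quoted values are consumed as a unit, a ">" inside the quotes no
longer ends the tag prematurely.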

Greetings
Jochen