[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly
Jochen Rau
j.rau at web.de
Sun Apr 6 02:28:57 CEST 2008
Hello Martin,
>>>> You can still crop plain text with the patched stdWrap.crop. So I
>>>> don't think we need a new feature like stdWrap.cropXML. The patch is
>>>> just a bugfix IMO.
>>>
>>> How come? If my content was "blah <foobar> whatever" the patch will
>>> not count the content of <foobar> and will add </foobar>, so this is
>>> a change of the behaviour.
>>
>> A bugfix should always change the behaviour ;-). But I agree with you
>> <foobar> should not be recognized as a tag.
>
> That wasn't my point. My point was that I want to crop my example string
> as plain text! The cropping simply shouldn't care about any < characters.
>
> That's why we either need a new stdWrap function or a new argument to
> stdWrap.crop, e.g. stdWrap.crop.html, that enables "html-mode":
There are two possible scenarios:
1) "blah <foobar> whatever" is parsed with parseFunc. Then it will be
counted as "blah <foobar> whatever". The delivered HTML-code will be
"blah &lt;foobar&gt; whatever", shown as "blah <foobar> whatever".
That's what is expected.
2) "blah <foobar> whatever" is cropped as plain text. Then it will be
counted as "blah <foobar> whatever" (leaving aside the
entities-discussion) delivered as "blah <foobar> whatever" and shown as
"blah whatever" (Firefox 2.0.0.13) or "blah whatever" (Safari 3.1) both
ignoring the unknown tag "<foobar>". That's also what is expected.
Can we close this part of the discussion? What do you think?
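
To make the difference between the two modes concrete, here is a minimal Python sketch (not the PHP code of the patch). The names crop_html, TAG and VOID are made up for this illustration, and the tag matcher is a deliberately simple assumption: anything that looks like <name ...> counts as a tag.

```python
import re

# Hypothetical, simplified tag matcher: an opening or closing tag whose
# name starts with a letter. Not the regex used in the patch.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*?(/?)>")
# Elements that never get a closing tag (illustrative subset).
VOID = {"br", "hr", "img", "input", "meta", "link"}


def crop_html(html, limit):
    """Keep at most `limit` text characters; tags are copied but not counted."""
    out, open_tags, count, pos = [], [], 0, 0
    while pos < len(html) and count < limit:
        match = TAG.match(html, pos)
        if match:
            closing, name, self_closing = match.group(1), match.group(2).lower(), match.group(3)
            if closing:
                # Only a closing tag that matches the innermost open tag is tracked.
                if open_tags and open_tags[-1] == name:
                    open_tags.pop()
            elif not self_closing and name not in VOID:
                open_tags.append(name)
            out.append(match.group(0))
            pos = match.end()
        else:
            out.append(html[pos])
            count += 1
            pos += 1
    # Close whatever is still open, innermost first.
    out.extend("</%s>" % name for name in reversed(open_tags))
    return "".join(out)


plain = "blah <foobar> whatever"[:8]              # plain-text crop
html_mode = crop_html("blah <foobar> whatever", 8)  # tag-aware crop
```

With this sketch the plain-text crop cuts inside "<foobar>", while the tag-aware crop skips the tag when counting and appends "</foobar>". That is exactly the change of behaviour under discussion.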
>>> As for readability: tastes vary ;-) I was rather confused when I saw
>>> the multiple for-loop that process the text/arrays.
>>
>> I know what you mean. I tried to read t3lib_cs ;-)
>
> I hope you like my parts better than Kasper's ;-)
Most of all I like the method utf8_substr() for its cascading return
statements ;-)
>>>>> A more serious problem is that you use html_entity_decode() which
>>>>> supports only a subset of the charset TYPO3 supports.
>>>>
>>>> The most important charsets are supported by html_entity_decode()
>>>> (like utf-8, ISO-8859-1, ISO-8859-15, GB2312 or EUC-JP). It also has
>>>> a fallback to ISO-8859-1 and throws a warning if a charset is not
>>>> supported.
>>>
>>> Sorry, but this is not enough. The code must work for all charsets.
>>
>> Ok, let's assume you use a scary charset. What happens is that the
>> entities are not decoded and thus counted as a bunch of chars. Since
>> the content is not affected by the decoding, only a few more chars
>> will be cropped.
>
> Hopefully not within the entity.
Ok, you got me. That is another bug in the original stdWrap.crop that is
still not fixed. A future patch should handle this, too.
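
One way such a future patch could avoid cutting inside an entity, without decoding anything, is to treat each complete entity as a single unit when counting. A minimal Python sketch of that idea follows. It is an assumption of mine, not code from the patch, and TOKEN and crop_entities are hypothetical names.

```python
import re

# One token is either a complete entity (&amp;, &#252;, &#xFC;) or a single
# character. A lone "&" that does not start an entity falls through to ".".
TOKEN = re.compile(
    r"&(?:#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);|.",
    re.DOTALL,
)


def crop_entities(text, limit):
    """Keep the first `limit` tokens, so an entity is never split."""
    return "".join(TOKEN.findall(text)[:limit])


naive = "M&uuml;ller &amp; Co"[:3]               # cuts inside the entity
safe = crop_entities("M&uuml;ller &amp; Co", 3)  # entity counts as one char
```

Because nothing is decoded, this approach does not depend on the charset at all. It only changes how the characters are counted.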
>> IMO the last possible solution to solve the charset problem is to drop
>> the html_entity_decode() in my patch.
Another solution is to implement an API function which handles
charset2entity and entity2charset conversions for all charsets allowed
in TYPO3 (see http://bugs.typo3.org/view.php?id=12 )
>>>> Furthermore it is not recommended (see PHP-doc) to use entities if
>>>> the charset is multibyte.
>>>
>>> Where?
>>
>> In the whole HTML-code.
>
> I meant, where has this advice been given?
Well, I read a bunch of documentation. I'm still searching in my brain
........
The most helpful documentation I found on this point is:
http://www.w3.org/TR/2000/REC-xml-20001006#entproc
http://www.w3.org/TR/2000/REC-xml-20001006#inliteral
>>> And with multibyte you mean utf-8? There are other multi-byte
>>> encodings that are not of the Unicode charset. Also you cannot assume
>>> the content passed to the function contains no entities just because
>>> it is "not recommended".
>>
>> I meant multi-byte. To be precise: It's not recommended to use
>> entities if the charset comprises the char represented by the entity.
>
> That amounts to "you don't need to use entities if your charset contains
> the desired characters".
Yes.
>> Ok, I read the specs once again. While &gt; is allowed in attribute
>> values, &uuml; for example is NOT. This should be &amp;uuml; because
>> only the "special five" (&amp; &quot; &apos; &lt; &gt;) are allowed
>> (see http://www.w3.org/TR/xhtml1/#C_12).
>
> You misinterpret this. This is about single & in an attribute, eg within
> a URL. Of course entities are allowed in attributes.
Well, you're right. But if you want to be safe you should only use the
"special five". See article
http://www.phpwact.org/php/i18n/charsets#common_problem_areas_with_utf-8
>> I believe that most of the editors typing in a title for an image are
>> not aware of entities.
>
> They should not need to write any markup in a title!!!
I agree. They should be able to enter ">" instead of "&gt;", for
instance. So that's what I voted for. This is not possible until we
have fixed the newly encountered bug described below.
>> Anyhow the tag <img title="next >" src="next.gif"> is not correctly
>> processed because of a bug in the parseFunc. I try to fix that bug at
>> the moment.
>
> Let me guess, the code doesn't like the > within the attribute.
You're right. It seems to be a combination of an issue in parseFunc and
t3lib_parsehtml::HTMLcleaner called by HTMLparser_TSbridge() in
tslib_content. I'm on the prowl for that bug ;-)
>>> Nitpicking: does the code cope with attributes enclosed in single
>>> quotes or unenclosed content like <img src='image.gif'> or <img
>>> src=image.gif>? They are valid too (though only as HTML and not as
>>> XHTML).
>> Yes, it does. This part of the main regex is capable of this:
>>
>> [...]
>> (?:
>> \".*?\" # double quoted attribute values
>> |
>> '.*?' # single quoted attribute values
>> |
>> [^'\">\s]+ # a string without " or ' followed by a space
>> )
>> [...]
>
> Cool.
Thanx. The regex is based on a snippet described in Jeffrey E. F.
Friedl's "Regular Expressions" (you surely have already detected it: I
love regular expressions ;-) ).
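
For readers who want to try the three alternatives in isolation, here is a small Python sketch built around the same value pattern. The attribute-name part and the helper names ATTR and attributes are my own additions for illustration. They are not taken from the patch.

```python
import re

# The value alternatives mirror the fragment quoted above; the name part
# is an assumption added so the pattern can be used on its own.
ATTR = re.compile(
    r"""
    ([A-Za-z_:][-A-Za-z0-9_:.]*)   # attribute name (assumed pattern)
    \s*=\s*
    (
        ".*?"          # double quoted attribute values
      |
        '.*?'          # single quoted attribute values
      |
        [^'">\s]+      # unquoted value: runs until quote, ">" or whitespace
    )
    """,
    re.VERBOSE | re.DOTALL,
)


def attributes(tag):
    """Return the attributes of one tag as a dict, with quotes removed."""
    result = {}
    for name, value in ATTR.findall(tag):
        if value[0] in "\"'":
            value = value[1:-1]
        result[name.lower()] = value
    return result


single = attributes("<img src='image.gif'>")
bare = attributes("<img src=image.gif>")
tricky = attributes('<img title="next >" src="next.gif">')
```

Note that the quoted alternatives happily swallow a ">" inside the value, so the title "next >" is read correctly by this pattern on its own.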
Greetings
Jochen