[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly

Sun Apr 6 14:53:26 CEST 2008

Jochen Rau wrote:
> 
> There are two possible scenarios:
> 1) "blah <foobar> whatever" is parsed with parseFunc, then it will be
> counted as "blah <foobar> whatever". The delivered
> HTML-code will be "blah &lt;foobar&gt; whatever" shown as "blah <foobar>
> whatever". That's what is expected.
> 2) "blah <foobar> whatever" is cropped as plain text. Then it will be
> counted as "blah <foobar> whatever" (leaving aside the
> entities-discussion) delivered as "blah <foobar> whatever" and shown as
> "blah  whatever" (Firefox 2.0.0.13) or "blah whatever" (Safari 3.1) both
> ignoring the unknown tag "<foobar>". That's also what is expected.
> 
> Can we hopefully close this part of the discussion? What do you think?

Besides that, there are obviously other use cases where the content is 
HTML but is not processed by parseFunc etc.

Let's agree to disagree. IMHO changing a plain text crop to an 
auto-tag-closing, HTML-entity-aware crop is a change of features and 
not a bug fix.

We agree that it would be cool to have an auto-tag-closing, 
HTML-entity-aware crop available in TYPO3, possibly in, but not limited 
to, stdWrap.
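
To make clear what I mean by such a crop, here is a minimal sketch in 
Python. It is not the patch under discussion and not TYPO3 code; the 
names crop_html, TOKEN and VOID are made up for illustration. It counts 
only visible characters, counts an entity as one character, and closes 
whatever tags are still open at the cut.

```python
import re

# Split the input into tags, entities, and single plain characters.
# Assumption: the simple tag pattern does not cope with a literal ">"
# inside an attribute value.
TOKEN = re.compile(r"<[^>]*>|&#?\w+;|.", re.DOTALL)
VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags that are never closed


def crop_html(text, limit, suffix="..."):
    """Crop to `limit` visible characters, then close any tags left open."""
    out, open_tags, count = [], [], 0
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token.startswith("<") and token.endswith(">"):
            name = re.match(r"</?([A-Za-z][\w:-]*)", token)
            if name:
                tag = name.group(1).lower()
                if token.startswith("</"):
                    if open_tags and open_tags[-1] == tag:
                        open_tags.pop()
                elif not token.endswith("/>") and tag not in VOID:
                    open_tags.append(tag)
            out.append(token)  # tags never count towards the limit
            continue
        if count == limit:
            out.append(suffix)
            break
        out.append(token)  # one entity or one plain character counts as one
        count += 1
    out.extend(f"</{tag}>" for tag in reversed(open_tags))
    return "".join(out)


print(crop_html("<b>Hello world</b>", 5))
```

A plain text crop would cut the same input after five bytes or 
characters of markup, which is exactly the change of behaviour I am 
talking about.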

> 
>>>> As for readability: tastes vary ;-) I was rather confused when I saw 
>>>> the multiple for-loops that process the text/arrays.
>>>
>>> I know what you mean. I tried to read t3lib_cs ;-)
>>
>> I hope you like my parts better than Kasper's ;-)
> 
> Most of all I like the method utf8_substr() for its cascading return
> statements ;-)

A matter of taste. Some believe it's better to have one $returnValue 
variable passed on through the complete function body. I don't.

>>> IMO the last possible solution to solve the charset problem is to 
>>> drop the html_entity_decode() in my patch.
> 
> Another solution is to implement an API function which handles
> charset2entity and entity2charset conversions for all charsets allowed
> in TYPO3 (see http://bugs.typo3.org/view.php?id=12 )

Possibly. Though the conversion as such is not, and should not be, 
needed. Why? It's perfectly legal (and that's why they exist) to have 
entities outside the scope of the current charset (e.g. Greek letters 
for math etc. that are not part of latin1).
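
A small illustration of that point, as a Python sketch rather than 
anything from TYPO3 (encode_for_charset is a made-up helper name): a 
character the charset cannot represent can still travel as a character 
reference, which is plain ASCII.

```python
import html


def encode_for_charset(text, charset):
    """Encode text; characters the charset lacks become numeric references."""
    return text.encode(charset, errors="xmlcharrefreplace")


# "é" exists in latin-1 and is encoded directly; "α" does not and
# falls back to a numeric character reference.
print(encode_for_charset("é and α", "latin-1"))

# The browser (here: html.unescape) turns the reference back into the
# character, whatever the charset of the page was.
print(html.unescape("&alpha; = &#945;"))
```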

>>>>> Furthermore it is not recommended (see PHP-doc) to use entities if 
>>>>> the charset is multibyte.
>>>>
>>>> Where?
>>>
>>> In the whole HTML-code.
>>
>> I meant, where has this advice been given?
> 
> Well, I read a bunch of documentation. I'm still searching in my brain
> ........
> 
> The most helpful documentation I found on this point is:
> http://www.w3.org/TR/2000/REC-xml-20001006#entproc
> http://www.w3.org/TR/2000/REC-xml-20001006#inliteral

I still see no reason why one shouldn't use character entities in 
Unicode encodings, besides the fact that they are (except for &lt; and 
&quot;) never necessary.

So you could convert all entities for utf-8 besides those two, but you 
would still have to deal with those two and count each of them as one 
character for cropping purposes.
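
Roughly like this, as a Python sketch (the names ENTITY, KEEP, 
decode_optional_entities and visible_length are made up, and this is 
not the code of the patch). As an assumption of the sketch, &amp; is 
kept too, because decoding it could turn "&amp;lt;" into a real "&lt;".

```python
import html
import re

ENTITY = re.compile(r"&#?\w+;")
# Entities that stay encoded. "&amp;" is an addition of this sketch.
KEEP = {"&lt;", "&quot;", "&amp;"}


def decode_optional_entities(text):
    """Decode every entity except the ones that must stay encoded."""
    def replace(match):
        entity = match.group(0)
        return entity if entity in KEEP else html.unescape(entity)
    return ENTITY.sub(replace, text)


def visible_length(text):
    """Length for cropping purposes: each remaining entity counts as one."""
    return len(ENTITY.sub("x", text))


decoded = decode_optional_entities("&alpha; &lt; &beta;")
print(decoded, visible_length(decoded))
```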

BTW, mbstring also takes the character width into account: Latin, 
Greek, Cyrillic, Hebrew, Arabic etc. are treated as having the width 1, 
whereas Asian glyphs have the width 2 - except for the so-called 
full-width Latin characters :-)
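
For the curious, the idea of counting columns instead of characters 
can be sketched in Python with the Unicode East Asian Width property. 
This is not a port of any mbstring function and its width table may 
differ from PHP's; display_width is a made-up name.

```python
import unicodedata


def display_width(text):
    """Sum of column widths: 2 for wide/full-width characters, else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


for sample in ("abc", "αβγ", "日本語", "\uff21"):  # last one: full-width "A"
    print(repr(sample), display_width(sample))
```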

>>> Anyhow the tag <img title="next >" src="next.gif"> is not correctly 
>>> processed because of a bug in the parseFunc. I am trying to fix that 
>>> bug at the moment.
>>
>> Let me guess, the code doesn't like the > within the attribute.
> 
> You're right. It seems to be a combination of an issue in parseFunc and
> t3lib_parsehtml::HTMLcleaner called by HTMLparser_TSbridge() in
> tslib_content. I'm on the prowl for that bug ;-)

That's what I thought: TYPO3 coders assume that a > has to be expressed 
as an entity within an attribute. For simplicity's sake, though, I'd 
vote to keep it at that. As we need entities (i.e. htmlspecialchars) 
anyway for ", we can easily make it a requirement that > has to be &gt;

That would relieve us of the burden of writing cool but (in TYPO3 
context) complex regular expressions.
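
To show the trade-off, a sketch in Python (not the expressions used in 
parseFunc or t3lib_parsehtml, which I have not copied here; NAIVE_TAG 
and QUOTE_AWARE_TAG are made-up names). The simple pattern is fine as 
long as > is written as &gt; inside attributes; handling a raw > needs 
the quote-aware variant.

```python
import re

# Simple: a tag ends at the first ">".
NAIVE_TAG = re.compile(r"<[^>]*>")
# Quote-aware: a ">" inside a quoted attribute value does not end the tag.
QUOTE_AWARE_TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

raw = '<img title="next >" src="next.gif">'
escaped = '<img title="next &gt;" src="next.gif">'

print(NAIVE_TAG.match(raw).group(0))        # cut off inside the attribute
print(NAIVE_TAG.match(escaped).group(0))    # the whole tag
print(QUOTE_AWARE_TAG.match(raw).group(0))  # the whole tag
```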

Masi