[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly
Martin Kutschker
martin.kutschker-n0spam at no5pam-blackbox.net
Sat Apr 5 19:41:59 CEST 2008
Jochen Rau wrote:
> Hello Martin,
>
>>> You can still crop plain text with the patched stdWrap.crop. So I
>>> don't think we need a new feature like stdWrap.cropXML. The patch is
>>> just a bugfix IMO.
>>
>> How come? If my content was "blah <foobar> whatever" the patch will
>> not count the content of <foobar> and will add </foobar>, so this is a
>> change of the behaviour.
>
> A bugfix should always change the behaviour ;-). But I agree with you
> <foobar> should not be recognized as a tag.
That wasn't my point. My point was that I want to crop my example string
as plain text! The cropping simply shouldn't care about any < characters.
That's why we either need a new stdWrap function or a new argument to
stdWrap.crop, e.g. stdWrap.crop.html, that enables "html-mode":
stdWrap.crop = <content>
stdWrap.crop.html = 1
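To make the difference between the two modes concrete, here is a rough sketch in Python rather than PHP. It is not TYPO3 code: crop_plain and crop_html are made-up names, the tag pattern is deliberately naive, and it does not attempt to close opened tags. It only shows that plain mode counts every character, while "html-mode" would skip the markup when counting.

```python
import re

# Hypothetical helpers, not part of TYPO3: they only illustrate the two modes.
TAG = re.compile(r"<[^>]*>")  # naive: anything between '<' and the next '>'

def crop_plain(text, limit):
    """Plain-text mode: every character counts, including '<'."""
    return text[:limit]

def crop_html(text, limit):
    """HTML mode: tags are copied through, only the text between them is counted."""
    out, used, pos = [], 0, 0
    for match in TAG.finditer(text):
        chunk = text[pos:match.start()]
        take = chunk[:limit - used]
        out.append(take)
        used += len(take)
        if used >= limit:
            return "".join(out)
        out.append(match.group(0))  # the tag itself is not counted
        pos = match.end()
    out.append(text[pos:][:limit - used])
    return "".join(out)
```

With my example string and a limit of 10, crop_plain cuts in the middle of <foobar>, while crop_html keeps <foobar> intact and counts only "blah " and " what".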
> I can't reproduce the closing </foobar>. Could you please give me a
> hint how you produced that?
OK, I didn't actually try it, but your description suggested auto-closing
of tags. So in theory foobar should have been closed.
>> As for readability: tastes vary ;-) I was rather confused when I saw
>> the multiple for-loops that process the text/arrays.
>
> I know what you mean. I tried to read t3lib_cs ;-)
I hope you like my parts better than Kasper's ;-)
>>>> A more serious problem is that you use html_entity_decode() which
>>>> supports only a subset of the charset TYPO3 supports.
>>>
>>> The most important charsets are supported by html_entity_decode()
>>> (like utf-8, ISO-8859-1, ISO-8859-15, GB2312 or EUC-JP). It also has a
>>> fallback to ISO-8859-1 and throws a warning if a charset is not
>>> supported.
>>
>> Sorry, but this is not enough. The code must work for all charsets.
>
> Ok, let's assume you use a scary charset. What happens is that the
> entities are not decoded and thus counted as a bunch of chars. Since the
> content is not affected by the decoding, there will only be a few more
> chars cropped.
Hopefully not within the entity.
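One way to avoid cutting inside an entity without decoding anything (and thus without depending on which charsets html_entity_decode() supports) would be to count each entity as a single unit. The following is only a Python sketch of that idea, not what the patch does; crop_keep_entities is a hypothetical name and the entity pattern is an assumption.

```python
import re

# Hypothetical helper: a named or numeric entity such as &uuml; or &#252;
# counts as one unit; any other character (including a bare '&') counts as one.
UNIT = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);|.", re.DOTALL)

def crop_keep_entities(text, limit):
    """Crop to `limit` units without ever splitting an entity."""
    units = UNIT.findall(text)
    return "".join(units[:limit])
```

A plain substring of "M&uuml;nchen" to 4 chars gives the broken "M&uu", while cropping to 2 units gives "M&uuml;".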
> IMO the last possible solution to solve the charset problem is to drop
> the html_entity_decode() in my patch.
>
>>> Furthermore it is not recommended (see PHP-doc) to use entities if
>>> the charset is multibyte.
>>
>> Where?
>
> In the whole HTML-code.
I meant: where has this advice been given?
>
>> And with multibyte you mean utf-8? There are other multi-byte
>> encodings that are not of the Unicode charset. Also you cannot assume
>> the content passed to the function contains no entities just because
>> it is "not recommended".
>
> I meant multi-byte. To be precise: It's not recommended to use
> entities if the charset comprises the char represented by the
> entity.
That amounts to "you don't need to use entities if your charset contains
the desired characters".
> Ok, I read the specs once again. While > is allowed in attribute
> values ü for example is NOT. This should be &uuml; because only
> the "special five" (& " ' < >) are allowed (see
> http://www.w3.org/TR/xhtml1/#C_12).
You misinterpret this. This is about a single & in an attribute, e.g. within
a URL. Of course entities are allowed in attributes.
> I believe that most of the editors typing in a title for an image are
> not aware of entities.
They should not need to write any markup in a title!!!
> Anyhow the tag <img title="next >" src="next.gif"> is not correctly
> processed because of a bug in the parseFunc. I try to fix that bug at
> the moment.
Let me guess, the code doesn't like the > within the attribute.
>> Nitpicking: does the code cope with attributes enclosed in single
>> quotes or unenclosed content like <img src='image.gif'> or <img
>> src=image.gif>? They are valid too (though only as HTML and not as
>> XHTML).
>
> Yes, it does. This part of the main regex is capable of this:
>
> [...]
> (?:
> \".*?\" # double quoted attribute values
> |
> '.*?' # single quoted attribute values
> |
> [^'\">\s]+ # a string without " or ' followed by a space
> )
> [...]
Cool.
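For anyone who wants to check the three alternatives quickly, here is a small Python sketch. Only the alternation for the attribute value is taken from the fragment quoted above; the surrounding name/equals-sign pattern and the function name attributes are assumptions made for the test, not the patch's actual regex.

```python
import re

# The three alternatives quoted above: double quoted, single quoted, unquoted.
ATTR_VALUE = r"""(?:".*?"|'.*?'|[^'">\s]+)"""

# Assumed wrapper (not from the patch): attribute name, '=', then the value.
ATTR = re.compile(r"""([\w:-]+)\s*=\s*(""" + ATTR_VALUE + ")")

def attributes(tag):
    """Return (name, raw value) pairs found in a single tag string."""
    return ATTR.findall(tag)
```

All three forms of <img src=...> from my question yield a src attribute, and a > inside a quoted title stays part of the value as far as this fragment is concerned.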
Masi