[TYPO3-core] RFC #7984: Bug: stdWrap.crop now closes opened tags and counts chars correctly
Jochen Rau
j.rau at web.de
Sat Apr 5 17:18:43 CEST 2008
Hello Martin,
>> You can still crop plain text with the patched stdWrap.crop. So I
>> don't think we need a new feature like stdWrap.cropXML. The patch is
>> just a bugfix IMO.
>
> How come? If my content was "blah <foobar> whatever" the patch will not
> count the content of <foobar> and will add </foobar>, so this is a
> change of the behaviour.
A bugfix should always change the behaviour ;-). But I agree with you
that <foobar> should not be recognized as a tag. Maybe I should
introduce a list of tags like

$tags = 'a|b|blockquote|body|div|em|font|form|h1|h2|h3|h4|h5|h6|i|li|'
      . 'map|ol|option|p|pre|sub|sup|select|span|strong|table|td|'
      . 'textarea|tr|u|ul|br|hr|img|input|area|link'; ;-)
The new patch should solve that problem.
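Something like this could work (just a sketch; isKnownTag() is my own
illustration, not part of the patch):

// decide whether a chunk is a real tag, using the $tags list above
function isKnownTag($chunk, $tags) {
    return (bool) preg_match('/^<\/?(?:' . $tags . ')(\s[^>]*)?\/?>$/i', $chunk);
}

isKnownTag('<foobar>', $tags);             // false -> count as plain chars
isKnownTag('<img src="next.gif">', $tags); // true  -> treat as a tag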
I can't reproduce the closing </foobar>. Could you please give me a
hint how you produced that?
BTW, this happens only if the content isn't parsed by parseFunc, which
converts the <> of all non-HTML and non-TYPO3 tags into &lt; and &gt;.
The only solution inside this patch is to run parseFunc on the content
after cropping it.
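Roughly the effect, illustrated (this is not the actual parseFunc code,
just what happens to unknown tags; the whitelist here is assumed):

$input  = 'blah <foobar> whatever <b>bold</b>';
$output = preg_replace_callback('/<\/?\w+[^>]*>/',
    function ($m) {
        // keep known tags, encode everything else
        return preg_match('/^<\/?(?:a|b|strong|em|img)\b/i', $m[0])
            ? $m[0]
            : htmlspecialchars($m[0]);
    },
    $input
);
// $output: 'blah &lt;foobar&gt; whatever <b>bold</b>'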
>> Before I posted the patch I tested
>> three solutions:
>> 1) crawling through the content only with strpos() and preg_match()
>> 2) utilizing a stack and a recursive function
>> 3) splitting the content using preg_split and crawling over the array
>>
>> IMO the last solution is the best because the code is fast, readable
>> and (relatively) easy to maintain and debug (the code blocks that
>> split, crop and close tags are clearly separated).
>
> I'm amazed that splitting the whole content is faster, but the reason is
> probably the size of the content. I agree that it is unlikely that you
> crop 500Kb of text, so it will maybe not matter much in real world
> situations.
The algorithm crawls backwards through the cropped text, which is
probably much shorter than 500Kb.
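To illustrate the core idea of solution 3, here is a simplified sketch
(not the patch itself; it ignores entities, charsets and the tag
closing discussed below):

$content    = 'Some <strong>bold</strong> text that gets cropped';
$cropLength = 20; // hypothetical stdWrap.crop value

// split into alternating text and tag chunks
$chunks = preg_split('/(<[^>]*>)/', $content, -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

$result = '';
$count  = 0;
foreach ($chunks as $chunk) {
    if ($chunk[0] === '<') { // a tag: copy it, but don't count it
        $result .= $chunk;
        continue;
    }
    // strlen() is byte-based; the real patch has to respect the charset
    $take    = min(strlen($chunk), $cropLength - $count);
    $result .= substr($chunk, 0, $take);
    $count  += $take;
    if ($count >= $cropLength) {
        break;
    }
}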
> As for readability: tastes vary ;-) I was rather confused when I saw the
> multiple for-loops that process the text/arrays.
I know what you mean. I tried to read t3lib_cs ;-)
> BTW, can you not build up a simple stack of open tags, so you easily
> close the tags without having to parse the text again.
Hm. That was my first idea, too. But I ran into trouble while cropping
backwards: I lost the attributes of the opening tags. So I have to
crawl through the content array to fetch the appropriate opening tag.
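To give an idea, closing the remaining open tags could look like this
when crawling forwards with a stack (sketch only; the patch crawls
backwards, and its regex also copes with ">" inside quoted attributes,
which this simple pattern does not):

$result   = 'Some <strong>bold text that was cropped'; // open <strong>
$openTags = array();
preg_match_all('/<(\/?)([a-z0-9]+)[^>]*?(\/?)>/i', $result, $all, PREG_SET_ORDER);
foreach ($all as $m) {
    if ($m[1] === '/') {
        array_pop($openTags); // closing tag: pop
    } elseif ($m[3] !== '/'
        && !preg_match('/^(?:br|hr|img|input|area|link)$/i', $m[2])) {
        $openTags[] = strtolower($m[2]); // opening tag: push
    }
}
foreach (array_reverse($openTags) as $tag) {
    $result .= '</' . $tag . '>'; // close in reverse order
}
// $result: 'Some <strong>bold text that was cropped</strong>'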
>>> A more serious problem is that you use html_entity_decode(), which
>>> supports only a subset of the charsets TYPO3 supports.
>>
>> The most important charsets are supported by html_entity_decode()
>> (like utf-8, ISO-8859-1, ISO-8859-15, GB2312 or EUC-JP). It also has
>> a fallback to ISO-8859-1 and throws a warning if a charset is not
>> supported.
>
> Sorry, but this is not enough. The code must work for all charsets.
Ok, let's assume you use a scary charset. What happens is that the
entities are not decoded and are thus counted as a bunch of chars.
Since the content itself is not affected by the decoding, there will
only be a few more chars cropped.
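To illustrate what "a few more chars" means:

echo strlen('&uuml;'); // 6 -> counted as 6 chars if the entity is not decoded
echo strlen(html_entity_decode('&uuml;', ENT_QUOTES, 'ISO-8859-1')); // 1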
IMO the only remaining way to solve the charset problem is to drop the
html_entity_decode() from my patch.
>> Furthermore it is not recommended (see PHP-doc) to use entities if the
>> charset is multibyte.
>
> Where?
In the whole HTML-code.
> And with multibyte you mean utf-8? There are other multi-byte
> encodings that are not of the Unicode charset. Also you cannot assume
> the content passed to the function contains no entities just because
> it is "not recommended".
I meant multi-byte in general, not just utf-8. To be precise: it's not
recommended to use entities if the charset comprises the char
represented by the entity. Entities in attributes are not a problem
for the patched stdWrap.crop, but they may cause strange FE output in
general if an old browser is used (see below).
>> Your suggested (and my former) regex doesn't take all valid HTML tags
>> into account, as the following example shows:
>>
>> <img title="next >" src="next.gif"> // valid HTML
>
> Hm, I'm used to using entities here, but you're right: the HTML 4.01
> spec only advises HTML authors to use &gt; within attributes.
>
> I fear that I am not alone with that. At least TYPO3-generated content
> will have all attribute values processed by htmlspecialchars().
Ok, I read the specs once again. While > is allowed in attribute
values, ü for example is NOT. It should be &uuml;, because only
the "special five" (& " ' < >) are allowed (see
http://www.w3.org/TR/xhtml1/#C_12).
The values of attributes should therefore not be processed by
htmlspecialchars() if the char is represented in the used charset.
I believe that most of the editors typing in a title for an image are
not aware of entities, so it should work for all cases.
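The difference is easy to see; htmlspecialchars() touches only the
"special five", while htmlentities() converts everything it can:

echo htmlspecialchars('next > ü', ENT_QUOTES, 'UTF-8'); // next &gt; ü
echo htmlentities('next > ü', ENT_QUOTES, 'UTF-8');     // next &gt; &uuml;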
Anyhow, the tag <img title="next >" src="next.gif"> is not processed
correctly because of a bug in parseFunc. I am trying to fix that bug
at the moment.
> Nitpicking: does the code cope with attributes enclosed in single quotes
> or unquoted values like <img src='image.gif'> or <img src=image.gif>?
> They are valid too (though only as HTML and not as XHTML).
Yes, it does. This part of the main regex handles these cases:
[...]
(?:
\".*?\" # double quoted attribute values
|
'.*?' # single quoted attribute values
|
[^'\">\s]+ # a string without " or ' followed by a space
)
[...]
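Assembled into a complete (simplified) tag pattern, all three variants
match; the variable names are mine, not the patch's:

$value = '(?:".*?"|\'.*?\'|[^\'">\s]+)';
$tag   = '~</?\w+(?:\s+\w+(?:\s*=\s*' . $value . ')?)*\s*/?>~';

preg_match($tag, '<img src="image.gif">', $m);               // matches
preg_match($tag, "<img src='image.gif'>", $m);               // matches
preg_match($tag, '<img src=image.gif>', $m);                 // matches
preg_match($tag, '<img title="next >" src="next.gif">', $m); // whole tag in $m[0]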
("Every nit picked is a bug fixed before it occurs." (anonymus) ;-))
Greetings
Jochen
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: bug_7984.diff
Url: http://lists.netfielders.de/pipermail/typo3-team-core/attachments/20080405/48d25800/attachment-0001.txt