[TYPO3-core] RFC #7942: Enable UTF-8 by default

Michael Stucki michael at typo3.org
Thu Nov 11 09:12:56 CET 2010


Hi Benni,

Many thanks for turning this into an RFC! I had a look at the patch in its
current state, and have some more or less minor notes about it:

config_default.php:

- At the beginning of the checks, I suggest adding a comment that "-1"
means "not defined in localconf.php", just to make things clearer.

- The deprecation warning says that "Only UTF-8 is supported since TYPO3
4.5". I think that is not correct, as we still support alternatives
until 4.7 - don't we? Deprecated does not mean unsupported.

- The stripos check for "SET NAMES utf8" should be changed to a
preg_match that also accepts other statements with the same intention,
like "SET CHARACTER SET utf8" (which is wrong but sometimes used, as I
wrote yesterday), and that optionally accepts quotes around the charset.
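
A rough sketch of what such a check could look like, written in Python
for brevity (the patch itself is PHP, where preg_match would play the
same role). The pattern and the function name init_requests_utf8 are
made up for illustration and are not taken from the patch. Note that
this sketch deliberately does not match longer charset names such as
utf8mb4; whether that is wanted is an open question.

```python
import re

# Hypothetical pattern, not taken from the patch.
# Accepts "SET NAMES utf8" and "SET CHARACTER SET utf8", case-insensitively,
# with optional matching single or double quotes around the charset name.
UTF8_INIT_RE = re.compile(
    r"SET\s+(?:NAMES|CHARACTER\s+SET)\s+(['\"]?)utf8(?!\w)\1",
    re.IGNORECASE,
)


def init_requests_utf8(set_db_init: str) -> bool:
    """Return True if the init string contains a statement selecting utf8."""
    return UTF8_INIT_RE.search(set_db_init) is not None
```

The back-reference \1 makes sure an opening quote is closed by the same
kind of quote, and the lookahead keeps "utf8" from matching as a prefix
of a longer name.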

class.t3lib_db.php:

- The check for setDBinit in connectionIsUtf8() should be changed
in the same way as above.

class.tx_install.php:

- Oh that's nice: A file is required but it's missing in your patch. I'm
wondering how others could try the patch if it cannot work? :-)

- Optimization: Run connectionIsUtf8() once and store the result instead
of running it 4 times in a row.

- There should be an additional check for the database charset, and a
warning should be triggered if utf8 is supposed to be used but the
database charset has a different setting. The database charset has no
influence on existing tables; it only sets the default for new tables
which have no charset specified.

I wouldn't go so far as to also check the charsets of all fields (which
may differ from the table charset), since that cannot happen
without manual intervention.

If the tables are non-utf8 but everything else is configured to use
utf8, we actually have an incomplete setup!

Generally, if the client+connection+result charsets are set to UTF-8 then
the setup should work OK even if TYPO3 and tables both use latin1.
Everything can be stored correctly, since MySQL will convert the content
on the fly and TYPO3 thinks that the MySQL data is latin1.
(Can someone else confirm this please? I haven't tested it, it's
just a thought.)

- This means that all the 4 checks about the settings are wrong or at
  least incomplete...
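
To illustrate the on-the-fly conversion mentioned above, here is a small
Python model; it is not TYPO3 or MySQL code. It only mimics the step
where the server converts between a UTF-8 connection and a latin1 table.
The function name store_and_load is hypothetical, and Python's latin-1
codec only approximates MySQL's latin1. Under these assumptions, the
round trip is lossless only for characters that latin1 can represent.

```python
def store_and_load(text: str, column_charset: str = "latin-1") -> str:
    """Model a write followed by a read over a UTF-8 connection.

    The server converts incoming text to the column charset for storage
    and converts it back on the way out. Characters the column charset
    cannot represent are replaced by "?" in this model.
    """
    stored = text.encode(column_charset, errors="replace")  # write path
    return stored.decode(column_charset)                    # read path


# Characters available in latin1 survive the round trip unchanged.
print(store_and_load("Grüße"))
# Characters outside latin1 are lost in this model.
print(store_and_load("Привет"))
```

So a check that only looks at the connection cannot tell whether the
table charset is able to hold everything the backend may send.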


All in all, I think the patch is good as it is for a start. The warnings
may be confusing and the update wizard is likely to become the most
complicated part of this whole change. We need to make sure that it
won't say everything is perfect while in reality the setup is broken.

I would still like to see the missing file but besides this, I think the
patch should be committed as it is, since fine-tuning is still needed
and things can quickly get very complicated...

- michael

On 10.11.2010 11:27, Benjamin Mack wrote:
> Hey,
> 
> this is an SVN patch request.
> 
> Type: Feature
> 
> Branch: trunk only
> 
> BT reference: http://bugs.typo3.org/view.php?id=7942
> 
> Problem:
> UTF-8 needs to be enabled by default.
> 
> Solution:
> What needs to be done in order to have TYPO3 be completely unicode:
> 
>  - TYPO3 needs to talk UTF-8 all through the core
>  - The connection to the database needs to be utf-8
> 
> Note: It doesn't matter if the DB is UTF-8 or not, because the database
> only needs to know in which format the data is going to be sent from and
> to TYPO3 (that is: the connection info). However, we encourage people to
> make their DB utf-8 by default.
> 
> 1) We're just talking about the TYPO3 Backend for now, because that's
> where you usually put data in the database. If a backend user
> chooses a language for the backend, TYPO3 picks the character set
> defined for that language in t3lib_cs->charSetArray. So by
> default English or Danish uses "iso-8859-1", and Russian uses
> "windows-1251". So far so good. The whole backend is rendered that way
> and TYPO3 also uses the chosen character set when saving to
> the database. This becomes a real mess if you have one backend user
> who works in English and another who works in Russian, because then
> there are datasets with different character sets in the DB!!! Anyway,
> the famous [BE][forceCharset] tells TYPO3 to always use "utf-8" (or
> something else) and not use t3lib_cs->charsetArray for that. This means:
> forceCharset allows TYPO3 to speak one charset regardless of what
> language a BE user has set.
> 
> 2) The UTF-8 connection is determined through the database. In MySQL
> this can be set in the server configuration (character_set_connection),
> but can also be overridden by sending "SET NAMES utf8" whenever a
> connection is established.
> 
> Imagine some evil setups:
> 
> - No forceCharset is set, so multiple users with different languages
> (that have different charsets in t3lib_cs->charsetArray) read and write
> datasets, even the same datasets. This is chaos.
> 
> - forceCharset is set, so TYPO3 always reads and writes data in utf-8,
> which is cool. However, if the DB connection is not set, or the DB
> server is configured so the connection is "latin1" by default, DB thinks
> the UTF-8 data that TYPO3 sends is "latin1", and then re-converts it to
> UTF-8 (if the DB is utf-8), or just stores the data as it is in the DB.
> This actually works and is no problem, AS LONG AS you don't change the
> DB connection to UTF-8, which would result in a mixed setup within the
> DB once you read and write again. Here you need a manual upgrade of your
> DB; some information can be found in BT issue #8227
> (http://bugs.typo3.org/view.php?id=8227)
> 
> These are cases where the TYPO3 installation is messed up big time, and
> require a lot of work to change.
> 
> 
> Advantages by having UTF-8 by default:
> 
>  * If your FE speaks UTF-8 by default as well, no charset conversion is
> needed anymore, which will speed up the whole rendering process.
>  * Having everything with UTF-8 allows a better transition to v5 (don't
> know what this will look like, but we know UTF-8 is better than any mixed
> setups :))
> 
> So. The attached patch does this:
> 
> Deprecation of any other character set than UTF-8. For two versions the
> installation can run in another setup, but in 4.7, the option
> "forceCharset" will go, because it should always be utf-8 anyway.
> Additionally "multiplyDBfieldSize" should have been deprecated
> long ago.
> 
> A) config_default.php
> First, the two important parameters "forceCharset" and "setDBinit" are
> set to "-1", because we need to find out if the parameter was changed in
> localconf.php or if the installation still uses the original default
> setting. So, if the options are still "-1" after the inclusion of
> localconf.php, the installation uses the default setup and has not
> modified anything. It is then checked whether the site has already been
> upgraded to 4.5 (compat_version). If the site has been upgraded to 4.5
> through the upgrade wizard, the user is on his own.
> 
> The whole code in config_default.php could be dropped again in 4.8 when
> migration is done for all installations (dunno yet).
> 
> B) Helper function in t3lib_db.php to determine if the current
> connection is UTF-8. This is useful because this can happen through the
> server configuration or be overridden via setDBinit.
> 
> C) When installing TYPO3 through the 1-2-3 installer create the new
> database with UTF-8 by default.
> 
> D) Small change in the update wizard code in order to allow displaying
> some information without having to show the "next" button all the
> time. Helpful to let people know what their setup is.
> 
> E) Upgrade wizard that shows information about the current
> setup and a link to a tutorial that explains complex scenarios
> and how people could upgrade their Backend + DB to UTF-8. We discourage
> doing this in an automated way.
> 
> TYPO3 thinks the site has been completely upgraded if:
>  - forceCharset has been unset in your localconf.php
>  - AND compat_version is set to 4.5
> 
> Thanks to Michael Stucki for getting this on the way and explaining
> everything. Thanks to Tolleiv Nietsch for testing the patch.
> 
> 
> All the best,
> Benni.


-- 
Use a newsreader! Check out
http://typo3.org/community/mailing-lists/use-a-news-reader/