[TYPO3-core] RFC #7942: Enable UTF-8 by default

Ernesto Baschny [cron IT] ernst at cron-it.de
Thu Nov 11 10:09:32 CET 2010


Hi Benni,

to facilitate our reviews, it would be good if you could let us know
whether you will have time in the next few days to come up with a new
patch that takes into account the comments you have received so far.

This would help me (and maybe others) decide whether we should review
it today or wait a few more days for the "v2".

Thanks!

Cheers,
Ernesto

Benjamin Mack wrote on 10.11.2010 11:27:
> Hey,
> 
> this is an SVN patch request.
> 
> Type: Feature
> 
> Branch: trunk only
> 
> BT reference: http://bugs.typo3.org/view.php?id=7942
> 
> Problem:
> UTF-8 needs to be enabled by default.
> 
> Solution:
> What needs to be done in order to have TYPO3 be completely Unicode:
> 
>  - TYPO3 needs to talk UTF-8 all through the core
>  - The connection to the database needs to be UTF-8
> 
> Note: It doesn't matter if the DB is UTF-8 or not, because the database
> only needs to know in which format the data is going to be sent from and
> to TYPO3 (that is: the connection info). However, we encourage people to
> make their DB utf-8 by default.
> 
> 1) We're just talking about the TYPO3 Backend for now, because that's
> where you usually put data into the database. When a backend user
> chooses a language for the backend, TYPO3 picks the character set that
> is defined for that language in t3lib_cs->charSetArray. So by default
> English or Danish uses "iso-8859-1", and Russian uses "windows-1251".
> So far so good. The whole backend is rendered in that character set,
> and TYPO3 also uses it when saving data to the database. This becomes
> a real mess if one backend user works in English and another in
> Russian, because then there are datasets with different character
> sets in the DB!!! Anyway, the famous [BE][forceCharset] tells TYPO3 to
> always use "utf-8" (or something else) and not to use
> t3lib_cs->charSetArray for that. This means: forceCharset allows TYPO3
> to speak one charset regardless of what language a BE user has set.
> 
> 2) Whether the connection is UTF-8 is determined by the database. In
> MySQL this can be set on the server (character_set_connection), but it
> can also be overridden by sending "SET NAMES utf8" each time a
> connection is established.
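
To make point 1) concrete, here is a minimal sketch of what happens when two
backend users with different language charsets touch the same record. The
language table and the save/load helpers are hypothetical stand-ins for
illustration only; they are not the real t3lib_cs data or API.

```python
# Hypothetical language -> charset table, loosely modelled on the mapping
# described above. It is NOT the real content of t3lib_cs->charSetArray.
CHARSET_BY_LANGUAGE = {
    "english": "iso-8859-1",
    "danish": "iso-8859-1",
    "russian": "windows-1251",
}


def save(text, language, force_charset=None):
    """Encode text the way the backend would before writing it to the DB."""
    return text.encode(force_charset or CHARSET_BY_LANGUAGE[language])


def load(raw, language, force_charset=None):
    """Decode bytes read from the DB using the reader's backend charset."""
    return raw.decode(force_charset or CHARSET_BY_LANGUAGE[language])


# A Russian-speaking editor saves a record without forceCharset ...
row = save("Привет", "russian")
# ... and an English-speaking editor opens the same record.
garbled = load(row, "english")
print(garbled == "Привет")  # False: same bytes, different charset

# With one forced charset, every user reads what was written.
row_utf8 = save("Привет", "russian", force_charset="utf-8")
print(load(row_utf8, "english", force_charset="utf-8") == "Привет")  # True
```

The bytes in the database never change here; only the charset used to
interpret them does, which is why a single forced charset removes the problem.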
> 
> Imagine some evil setups:
> 
> - No forceCharset is set, so multiple users with different languages
> (that have different charsets in t3lib_cs->charSetArray) read and write
> datasets, even the same datasets. This is chaos.
> 
> - forceCharset is set, so TYPO3 always reads and writes data in utf-8,
> which is cool. However, if the DB connection charset is not set, or
> the DB server is configured so that the connection is "latin1" by
> default, the DB thinks the UTF-8 data that TYPO3 sends is "latin1" and
> then either re-converts it to UTF-8 (if the DB is utf-8) or just
> stores the data as it is in the DB. This actually works and is no
> problem, AS LONG AS you don't change the DB connection to UTF-8, which
> would result in a mixed setup within the DB once you read and write
> again. In that case you need a manual upgrade of your DB; some
> information can be found in BT issue #8227
> (http://bugs.typo3.org/view.php?id=8227)
> 
> These are cases where the TYPO3 installation is messed up big time and
> which require a lot of work to fix.
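
The second evil setup can be simulated in a few lines. This is only a sketch
of the byte-level effect, assuming a UTF-8 table behind a "latin1" connection;
the three helper functions are hypothetical, and Python's latin-1 codec stands
in for MySQL's "latin1", which is a simplification.

```python
def store_via_latin1_connection(utf8_bytes):
    """Server believes the incoming bytes are latin1 and converts them to
    UTF-8 for a UTF-8 table (the "re-converts" case described above)."""
    return utf8_bytes.decode("latin-1").encode("utf-8")


def read_via_latin1_connection(stored):
    """Server converts the stored UTF-8 back to latin1 for the client."""
    return stored.decode("utf-8").encode("latin-1")


def read_via_utf8_connection(stored):
    """Server hands the stored UTF-8 bytes to the client unchanged."""
    return stored


original = "Müller"
stored = store_via_latin1_connection(original.encode("utf-8"))

# Same (wrong) connection charset in both directions: the errors cancel out.
same_conn = read_via_latin1_connection(stored).decode("utf-8")
print(same_conn == original)  # True

# Connection switched to UTF-8 later: the double encoding becomes visible.
switched = read_via_utf8_connection(stored).decode("utf-8")
print(switched == original)  # False
```

Rows written after the switch would be stored correctly, so old and new rows
end up encoded differently in the same table, which is the mixed setup
mentioned above.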
> 
> 
> Advantages of having UTF-8 by default:
> 
>  * If your FE speaks UTF-8 by default as well, no charset conversion is
> needed anymore, which will speed up the whole rendering process.
>  * Having everything in UTF-8 allows a better transition to v5 (we
> don't know what this will look like, but we know UTF-8 is better than
> any mixed setup :))
> 
> So. The attached patch does this:
> 
> Deprecation of any character set other than UTF-8. For two versions an
> installation can still run with a different setup, but in 4.7 the
> option "forceCharset" will go, because it should always be utf-8
> anyway. Additionally, "multiplyDBfieldSize" should have been
> deprecated a long time ago.
> 
> A) config_default.php
> First, the two important parameters "forceCharset" and "setDBinit" are
> set to "-1", because we need to find out whether a parameter was
> changed in localconf.php or whether the installation still uses the
> original default setting. So, if the options are still "-1" after the
> inclusion of localconf.php, the installation uses the default setup
> and has not modified anything. It is then checked whether the site has
> already been upgraded to 4.5 (compat_version). If the site has been
> upgraded to 4.5 through the upgrade wizard, the user is on his own.
> 
> The whole code in config_default.php could be dropped again in 4.8 once
> migration is done for all installations (not sure yet).
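
The "-1" trick in A) is a sentinel-default pattern. The sketch below shows
only the detection step, in Python rather than PHP; the function names and the
dictionary layout are assumptions made for illustration, and what the patch
does once it knows a value is untouched is not modelled.

```python
UNSET = "-1"  # sentinel meaning "never overridden by the local config"


def default_conf():
    """Defaults as they would look before the local config is included."""
    return {"forceCharset": UNSET, "setDBinit": UNSET}


def apply_local_conf(conf, local_overrides):
    """Merge the site's own settings over the defaults (hypothetical helper)."""
    merged = dict(conf)
    merged.update(local_overrides)
    return merged


def untouched_options(conf):
    """Return the names of options that still carry the sentinel value."""
    return sorted(name for name, value in conf.items() if value == UNSET)


fresh = apply_local_conf(default_conf(), {})
legacy = apply_local_conf(default_conf(), {"forceCharset": "iso-8859-1"})

print(untouched_options(fresh))   # ['forceCharset', 'setDBinit']
print(untouched_options(legacy))  # ['setDBinit']
```

A sentinel is needed because an explicitly configured empty value and a value
that was never configured would otherwise look the same after inclusion.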
> 
> B) Helper function in t3lib_db.php to determine whether the current
> connection is UTF-8. This is useful because the connection charset can
> come from the server configuration or be overridden via setDBinit.
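
The decision described in B) could look roughly like the sketch below. It is
not the helper from the patch: connection_is_utf8 and its two parameters are
hypothetical, and the sketch inspects configuration strings only, whereas a
real helper would more likely ask the server (for example with
SHOW VARIABLES LIKE 'character_set_connection').

```python
import re

# Matches e.g. "SET NAMES utf8" or "SET NAMES 'latin1'" inside setDBinit.
SET_NAMES = re.compile(r"SET\s+NAMES\s+['\"]?(\w+)", re.IGNORECASE)


def connection_is_utf8(server_charset, set_db_init):
    """Guess the connection charset from the two sources named above.

    server_charset: the server's character_set_connection value.
    set_db_init:    the setDBinit string, sent on every connect (may be empty).
    """
    match = SET_NAMES.search(set_db_init or "")
    if match:
        # An explicit SET NAMES overrides whatever the server defaults to.
        return match.group(1).lower().startswith("utf8")
    return server_charset.lower().startswith("utf8")


print(connection_is_utf8("latin1", "SET NAMES utf8"))  # True
print(connection_is_utf8("latin1", ""))                # False
```

Checking setDBinit first mirrors the order of precedence in the text: the
server default applies only when nothing overrides it at connect time.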
> 
> C) When installing TYPO3 through the 1-2-3 installer, create the new
> database with UTF-8 by default.
> 
> D) Small change in the update wizard code to allow displaying some
> information without having to show the "next" button all the time.
> Helpful to let people know what their setup is.
> 
> E) Upgrade wizard that shows information about the current setup and
> a link to a tutorial that explains complex scenarios and how people
> could upgrade their Backend + DB to UTF-8. We discourage doing this
> in an automated way.
> 
> TYPO3 thinks the site has been completely upgraded if:
>  - forceCharset has been unset in your localconf.php
>  - AND compat_version is set to 4.5
> 
> Thanks to Michael Stucki for getting this under way and explaining
> everything. Thanks to Tolleiv Nietsch for testing the patch.
> 
> 
> All the best,
> Benni.


