[TYPO3-core] RFC #7942: Enable UTF-8 by default

Benjamin Mack benni at typo3.org
Wed Nov 10 11:27:48 CET 2010


Hey,

this is an SVN patch request.

Type: Feature

Branch: trunk only

BT reference: http://bugs.typo3.org/view.php?id=7942

Problem:
UTF-8 needs to be enabled by default.

Solution:
What needs to be done in order to make TYPO3 completely Unicode:

 - TYPO3 needs to talk UTF-8 all through the core
 - The connection to the database needs to be UTF-8

Note: It doesn't matter whether the DB itself is UTF-8 or not, because
the database only needs to know in which format the data is sent from
and to TYPO3 (that is: the connection charset). However, we encourage
people to make their DB UTF-8 by default.

1) We're just talking about the TYPO3 Backend for now, because that's
where you usually put data into the database. When a backend user
chooses a language for the backend, TYPO3 picks the character set that
t3lib_cs->charSetArray defines for that language. So by default English
or Danish uses "iso-8859-1" and Russian uses "windows-1251". So far so
good. The whole backend is rendered in that character set, and TYPO3
also uses it when saving data to the database. This becomes a real mess
if one backend user works in English and another in Russian, because
then there are datasets with different character sets in the DB!!!
Anyway, the famous [BE][forceCharset] option tells TYPO3 to always use
"utf-8" (or something else) instead of looking at
t3lib_cs->charSetArray. This means: forceCharset makes TYPO3 speak one
charset regardless of which language a BE user has set.
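
To make the mixed-charset problem from 1) concrete, here is a minimal
Python sketch (not TYPO3 code). The table CHARSET_BY_LANGUAGE and the
functions backend_charset, save and load are hypothetical names made up
for illustration; only the two charsets come from the text above.

```python
# Hypothetical stand-in for the language -> charset lookup described
# above; only the two charsets named in the mail are listed.
CHARSET_BY_LANGUAGE = {"en": "iso-8859-1", "ru": "windows-1251"}


def backend_charset(language, force_charset=""):
    """Charset the backend would use for a user with this language."""
    if force_charset:
        # forceCharset wins over the per-language table.
        return force_charset
    return CHARSET_BY_LANGUAGE.get(language, "iso-8859-1")


def save(text, language, force_charset=""):
    """Encode text the way this user's backend would write it."""
    return text.encode(backend_charset(language, force_charset))


def load(raw, language, force_charset=""):
    """Decode stored bytes the way this user's backend would read them."""
    charset = backend_charset(language, force_charset)
    return raw.decode(charset, errors="replace")


title = "Привет"

# Without forceCharset: written as windows-1251, read as iso-8859-1.
garbled = load(save(title, "ru"), "en")

# With forceCharset = utf-8: both users agree on one charset.
intact = load(save(title, "ru", "utf-8"), "en", "utf-8")
```

In this sketch "garbled" no longer equals the original title, while
"intact" does, which is the whole point of forcing a single charset.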

2) Whether the connection is UTF-8 is determined by the database. In
MySQL this can be set in the server configuration
(character_set_connection), but it can also be overridden by sending
"SET NAMES utf8" each time a connection is established.

Imagine some evil setups:

- No forceCharset is set, so multiple users with different languages
(that have different charsets in t3lib_cs->charSetArray) read and write
datasets, even the same datasets. This is chaos.

- forceCharset is set, so TYPO3 always reads and writes data in UTF-8,
which is cool. However, if the DB connection charset is not set, or the
DB server is configured so that the connection is "latin1" by default,
the DB thinks the UTF-8 data that TYPO3 sends is "latin1" and then
either converts it to UTF-8 again (if the DB is UTF-8) or just stores
the data as it is. This actually works and is no problem, AS LONG AS
you don't change the DB connection to UTF-8, which would result in a
mixed setup within the DB once you read and write again. In that case
you need a manual upgrade of your DB; some information can be found in
BT issue #8227 (http://bugs.typo3.org/view.php?id=8227).

These are cases where the TYPO3 installation is messed up big time and
which require a lot of work to fix.
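
The second setup (forceCharset is UTF-8, but the connection is latin1
and the table is UTF-8) can be simulated with a few lines of Python.
This is only a sketch of the byte-level effect: it uses Python's
"latin-1" codec as a stand-in for the server-side conversion, which is
a simplification of what MySQL really does, and all variable names are
made up.

```python
text = "Grüße"

# What TYPO3 sends when forceCharset is utf-8.
sent = text.encode("utf-8")

# The connection claims "latin1" and the table is UTF-8, so the server
# reads every byte as one latin1 character and converts it to UTF-8.
# The stored value is now double-encoded.
stored = sent.decode("latin-1").encode("utf-8")

# Reading over the same latin1 connection reverses the conversion, so
# TYPO3 gets back exactly the bytes it sent. Nobody notices a problem.
read_over_latin1 = stored.decode("utf-8").encode("latin-1")

# After switching the connection to UTF-8 the server hands out the
# stored value unchanged, and the double encoding becomes visible.
read_over_utf8 = stored.decode("utf-8")
```

Here read_over_latin1 matches what was sent, while read_over_utf8 is
mojibake. New rows written over the UTF-8 connection would be stored
correctly, which is how the mixed setup inside one DB comes about.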


Advantages of having UTF-8 by default:

 * If your FE speaks UTF-8 by default as well, no charset conversion is
needed anymore, which will speed up the whole rendering process.
 * Having everything in UTF-8 allows a better transition to v5 (don't
know what this will look like, but we know UTF-8 is better than any
mixed setup :))

So. The attached patch does this:

Deprecation of any character set other than UTF-8. For two versions an
installation can still run with another setup, but in 4.7 the option
"forceCharset" will go, because it should always be UTF-8 anyway.
Additionally, deprecating "multiplyDBfieldSize" has been overdue for a
long time.

A) config_default.php
First, the two important parameters "forceCharset" and "setDBinit" are
set to "-1", because we need to find out whether a parameter was changed
in localconf.php or whether the installation still uses the original
default setting. So, if the options are still "-1" after the inclusion
of localconf.php, the installation uses the default setup and has not
modified anything. Then it is checked whether the site has already been
upgraded to 4.5 (compat_version). If the site has been upgraded to 4.5
through the upgrade wizard, the user is on his own.
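
The "-1" trick is a plain sentinel default. A minimal Python sketch of
the idea (not the code from the patch; default_config, apply_localconf
and untouched_options are hypothetical names) could look like this:

```python
# Sentinel meaning "localconf.php did not touch this option".
UNSET = "-1"


def default_config():
    """Defaults as config_default.php would set them in this sketch."""
    return {"forceCharset": UNSET, "setDBinit": UNSET}


def apply_localconf(config, overrides):
    """Stand-in for including localconf.php: overrides win."""
    merged = dict(config)
    merged.update(overrides)
    return merged


def untouched_options(config):
    """Names of the options that still carry the sentinel."""
    watched = ("forceCharset", "setDBinit")
    return [name for name in watched if config[name] == UNSET]


# A site that never configured anything.
fresh = apply_localconf(default_config(), {})

# A site that explicitly set forceCharset, even to an empty string.
explicit = apply_localconf(default_config(), {"forceCharset": ""})
```

The design point is that an explicitly empty value and an untouched
default can be told apart, which a plain empty-string default would not
allow.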

The whole code in config_default.php could be dropped again in 4.8 when
migration is done for all installations (dunno yet).

B) Helper function in t3lib_db.php to determine whether the current
connection is UTF-8. This is useful because the connection charset can
come from the server configuration or be overridden via setDBinit.
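
The decision such a helper has to make can be sketched in Python as
follows. This is not the helper from the patch: the function names are
hypothetical, the server default is passed in as a plain string instead
of being queried from MySQL, and setDBinit is assumed to hold one
statement per line.

```python
import re

# Matches a statement such as: SET NAMES utf8
SET_NAMES = re.compile(r"^\s*SET\s+NAMES\s+['\"]?(\w+)", re.IGNORECASE)


def connection_charset(server_default, set_db_init=""):
    """Charset in effect after the setDBinit statements have run.

    server_default is the value of character_set_connection as the
    server would report it; a later SET NAMES overrides it.
    """
    charset = server_default
    for statement in set_db_init.splitlines():
        match = SET_NAMES.match(statement)
        if match:
            charset = match.group(1)
    return charset.lower()


def is_connection_utf8(server_default, set_db_init=""):
    """True if the connection ends up talking utf8."""
    return connection_charset(server_default, set_db_init) == "utf8"
```

For example, is_connection_utf8("latin1", "SET NAMES utf8") is true,
while is_connection_utf8("latin1") is false.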

C) When installing TYPO3 through the 1-2-3 installer create the new
database with UTF-8 by default.

D) Small change in the update wizard code so that a wizard can display
some information without having to show the "next" button all the
time. Helpful to let people know what their setup is.

E) Upgrade wizard that shows information about the current setup and a
link to a tutorial that explains complex scenarios and how people could
upgrade their Backend + DB to UTF-8. We discourage doing this in an
automated way.

TYPO3 thinks the site has been completely upgraded if:
 - forceCharset has been unset in your localconf.php
 - AND compat_version is set to 4.5
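
Expressed as a small Python sketch (hypothetical names, not the wizard
code), the check reads as below. Treating versions newer than 4.5 as
upgraded too is an assumption of this sketch; the mail only names 4.5.

```python
def parse_version(version):
    """Turn a version string like '4.5' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def is_fully_upgraded(localconf, compat_version):
    """Both conditions from the list above must hold.

    localconf is a dict of the options set in localconf.php.
    """
    force_charset_unset = "forceCharset" not in localconf
    # Assumption: anything from 4.5 upwards counts as upgraded.
    compat_ok = parse_version(compat_version) >= (4, 5)
    return force_charset_unset and compat_ok
```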

Thanks to Michael Stucki for getting this on the way and explaining
everything. Thanks to Tolleiv Nietsch for testing the patch.


All the best,
Benni.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: utf8_by_default_v2.patch
URL: <http://lists.typo3.org/pipermail/typo3-team-core/attachments/20101110/8b884458/attachment-0001.txt>

