[TYPO3-core] DBAL and column collations

Jigal van Hemert jigal.van.hemert at typo3.org
Tue Jan 21 14:41:56 CET 2014


Hi,

On 14-1-2014 12:36, Ingo Schmitt wrote:
> while working on the SQL performance issues for TYPO3 CMS there was the
> discussion about changing the column collations for colums which carry
> only automatic generated hex values to ASCI.
> @see https://review.typo3.org/#/c/26628/2

This seems to be a patch which adds an index to a table; not sure if 
that's the patch you mean?

>  From the MySQL view this change is very fine. These columns are mostly
> used as an automatic key for e.G. the caching framework. By setting to
> ASCI we could reduce the amount of bytes needed per character by 3
> bytes. A 32 character would need 96 bytes less than the current
> situation. So more results could be stored in the query cache.

Not really. It very much depends on the the character set you use in the 
database. If you use UTF-8 then a character may take *up to* three 
bytes. UTF-8 is a character set that uses a different number of bytes 
for different characters. The low ASCII set and most of the ISO-8859-1 
characters only take a single byte. Some other characters take two bytes 
and certain characters take three bytes per character.
For hexadecimal representations of numbers only a single byte per 
character is used.
If you use UCS2 every character will use 2 bytes no matter which 
character it is.

For CHAR fields MySQL must reserve the number of bytes as defined by the 
length and the character set. For UTF-8 it means that it will reserve 3 
times the length.
For VARCHAR fields MySQL does not need to reserver any number of bytes 
per character as it's flexible by nature.

Other DBMS may use different techniques.

Besides storage there is also the issue of conversion. We use UTF-8 as 
character set for the database connection. If a column has a different 
character set then the database must convert the contents first to the 
character set used for the connection.
Even though the conversion from lower ASCII to UTF-8 is very simple 
there will be a call to a conversion function which will have to do a 
lookup in some character maps.

> Also this would lead to a smaller index for this column, depending on
> the storage engine / database engine.

Indexes can have a quite a big size (per row). If you run into the 
maximum size limitation it's mostly enough to use part of the field. The 
purpose of an index (not the primary index or a unique one) is to 
prevent full table searches. Even if there are records which lead to 
identical index values then there you'll end up with a very limited 
number of potential records and the query will be much faster.

> But having DBAL in mind, my knowledge ends if there is something like
> per column collations in other DBMS or even if this is defined in the
> SQL-92/SQL-99 Standard. And on which standard is TYPO3 build?

Maybe we can ask Xavier to make an overview of the storage requirements 
for the column types we use, also for the limitations of indexes, 
etcetera. If all current versions of the various DBMS are taken into 
consideration we can form guidelines for sizes of columns, indexes, rows.

-- 
Jigal van Hemert
TYPO3 CMS Active Contributor

TYPO3 .... inspiring people to share!
Get involved: typo3.org


More information about the TYPO3-team-core mailing list