utf8mb4_general_ci vs utf8mb4_unicode_ci #879

pakricard · 2026-03-22T23:27:09Z

pakricard
Mar 22, 2026
Collaborator

Should we move towards utf8mb4_unicode_ci?

My production server is set to utf8mb4_unicode_ci by default, so I checked whether I should keep it or align with the webERP standard and switch to utf8mb4_general_ci.

After a few searches, it seems that utf8mb4_unicode_ci is more "international": it improves sorting for non-English languages, and the overhead is minimal.

Thoughts?

R

Gemini said

In MySQL, utf8mb4 is the recommended character set as it supports the full range of Unicode, including emojis. The choice between general_ci and unicode_ci collations determines how the database compares and sorts these characters.

Key Differences

1. Accuracy and Sorting Rules

utf8mb4_unicode_ci: Based on the Unicode Collation Algorithm (UCA). It handles character expansions and contractions, where one character compares as several or several compare as one. For example, it treats ß as equal to ss.
utf8mb4_general_ci: A simplified set of rules that does not strictly follow the Unicode standard. It takes a shortcut by comparing characters one-to-one, so it treats ß as equal to a single s rather than ss.
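
The one-to-one versus expansion distinction can be sketched with a toy model. This is only an illustration, assuming two tiny hand-written mapping tables and the hypothetical helper names `general_like_key` and `unicode_like_key`; it is not MySQL's actual collation data or algorithm.

```python
import unicodedata

# Toy mapping tables for illustration only, NOT MySQL's real weight tables.
ONE_TO_ONE = {"ß": "s"}   # general_ci-style: one character maps to one weight
EXPANSIONS = {"ß": "ss"}  # unicode_ci-style: one character may expand to several


def _strip_accents(ch):
    """Return the character with combining marks removed (é -> e)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def general_like_key(text):
    """Comparison key that only ever maps one character to one character."""
    return "".join(ONE_TO_ONE.get(ch, _strip_accents(ch)) for ch in text.lower())


def unicode_like_key(text):
    """Comparison key that allows a character to expand (ß -> ss)."""
    return "".join(EXPANSIONS.get(ch, _strip_accents(ch)) for ch in text.lower())


if __name__ == "__main__":
    for word in ("Straße", "Strasse"):
        print(word, general_like_key(word), unicode_like_key(word))
```

Under this toy model, "Straße" and "Strasse" get the same key only with the expansion table, which mirrors the ß example above.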

2. Performance

utf8mb4_unicode_ci: Slightly slower because the comparison logic is more complex to ensure linguistic accuracy.
utf8mb4_general_ci: Historically faster because the calculations are computationally cheaper. However, on modern CPUs, this performance gap is usually negligible.

3. Language Support

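utf8mb4_unicode_ci: Better for applications serving a global audience or multiple languages, as it follows the default Unicode ordering, which works reasonably well across many languages. Language-specific rules need a dedicated collation such as utf8mb4_german2_ci.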
utf8mb4_general_ci: May lead to unexpected sorting results in non-English languages where accented characters or ligatures have specific meanings.

Summary Recommendation

Use utf8mb4_unicode_ci

This is the modern standard. Unless you are working on a legacy system with extreme performance bottlenecks or specific hardware constraints, you should use utf8mb4_unicode_ci (or even better, utf8mb4_0900_ai_ci if you are on MySQL 8.0+) to ensure data integrity and correct sorting across all languages.

Use utf8mb4_general_ci

Only use this if you are maintaining an older database where this collation is already established, or if you have benchmarked your specific application and found that the collation overhead is causing a significant bottleneck.

timschofield · 2026-03-23T07:32:08Z

timschofield
Mar 23, 2026
Maintainer

This is outside my area of expertise so I will be guided by others.

I always thought that it should be up to the individual organisation, but like I say I don't know enough about it.

Tim


dalers · 2026-03-23T20:49:34Z

dalers
Mar 23, 2026
Collaborator

What is the goal?

If the goal is lower processing cost to support older or less expensive servers[1], then we should continue to use utf8mb4_general_ci.
If the goal is linguistically correct sort order, then we should change to utf8mb4_unicode_ci. However, I suspect the difference in sort order has little relevance for webERP, which I'm guessing sorts by fields that are not linguistic by nature anyway, e.g. date, dollar amount, account number, vendor name, customer name, stockid, etc.

Another goal might be the least effort to implement and maintain. Is changing the collation from utf8mb4_general_ci to utf8mb4_unicode_ci worth the effort? Do we have to impose a collation in the first place? Could it be irrelevant to webERP whether one server defaults to utf8mb4_general_ci and another defaults to utf8mb4_unicode_ci?

I don't see an overwhelming reason to change, so I vote for staying as-is with utf8mb4_general_ci.

Fwiw, I checked table collation in the other apps I'm hosting.

Leantime: utf8mb4_unicode_ci
Mantis Bug Tracker: utf8mb3_general_ci
Moodle: utf8mb4_unicode_ci
ProjeQtOr: utf8mb4_general_ci
Nextcloud: utf8mb4_bin (utf8mb4_bin sorts by binary value, sorts are case sensitive and upper case sort before lower case among other differences, and I assume used because utf8mb4_bin is the fastest sort and Nextcloud deals with linguistic sorting some other way when necessary)
SeedDMS: utf8mb3_general_ci
SuiteCRM: utf8mb3_general_ci
TimeTracker: utf8mb4_general_ci
WackoWiki: utf8mb4_unicode_520_ci (a more modern, smarter version of unicode_ci)
WordPress: tables are a mix of utf8mb4_unicode_520_ci, utf8mb3_general_ci, utf8mb4_unicode_ci and latin1_swedish_ci (in general utf8mb4_unicode_ci seems used for content or comments to it).
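
A survey like the one above can be reproduced by reading TABLE_COLLATION from information_schema.TABLES. The sketch below is a rough outline, not how the list above was produced: the `%s` placeholder assumes a driver that uses that parameter style, and `summarize_collations` is a hypothetical helper that tallies the (table, collation) rows whichever client returns them.

```python
from collections import Counter

# Query text to run with any MySQL client; the schema name is bound as a
# parameter. The %s placeholder style is an assumption about the driver in use.
COLLATION_QUERY = (
    "SELECT TABLE_NAME, TABLE_COLLATION "
    "FROM information_schema.TABLES "
    "WHERE TABLE_SCHEMA = %s AND TABLE_TYPE = 'BASE TABLE'"
)


def summarize_collations(rows):
    """Count tables per collation, most common first.

    rows is an iterable of (table_name, table_collation) tuples, as returned
    by COLLATION_QUERY. Rows with no collation are skipped.
    """
    counts = Counter(collation for _table, collation in rows if collation)
    return counts.most_common()


if __name__ == "__main__":
    # Made-up rows standing in for a real query result.
    sample = [
        ("example_posts", "utf8mb4_unicode_520_ci"),
        ("example_options", "utf8mb4_unicode_520_ci"),
        ("example_legacy", "latin1_swedish_ci"),
    ]
    for collation, count in summarize_collations(sample):
        print(f"{collation}: {count} table(s)")
```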

[1] I thought low processing requirements was a stated goal for webERP but I don't see it listed on the weberp.org website (only "...scripts are written to maximise readability for business users" and "...any web server that supports PHP. Use your own server or a managed server from an ISP to avoid purchase and maintenance costs.").

1 reply

timschofield Mar 23, 2026
Maintainer

[1] I thought low processing requirements was a stated goal for webERP but I don't see it listed on the weberp.org website (only "...scripts are written to maximise readability for business users" and "...any web server that supports PHP. Use your own server or a managed server from an ISP to avoid purchase and maintenance costs.").

I think webERP had its origins in the days of dial-up internet and low-powered computers. The world moved on, and it seemed silly to carry on catering for such things.

pakricard · 2026-03-23T23:06:44Z

pakricard
Mar 23, 2026
Collaborator Author

OK, then. I will keep utf8mb4_unicode_ci on my install, and modify CreateTable accordingly.
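
For anyone converting an existing install rather than creating tables from scratch, one possible approach is to generate ALTER TABLE ... CONVERT TO CHARACTER SET statements and review them before running anything. This is a minimal sketch, not part of webERP: `build_convert_statements` and `quote_identifier` are hypothetical helpers, the table names are placeholders, and a backup beforehand is assumed.

```python
import re

# Collation names are restricted to letters, digits and underscores so the
# value can be placed in the statement text safely.
_COLLATION_RE = re.compile(r"^[A-Za-z0-9_]+$")


def quote_identifier(name):
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_convert_statements(tables, collation="utf8mb4_unicode_ci"):
    """Return one ALTER TABLE ... CONVERT TO statement per table name."""
    if not _COLLATION_RE.match(collation):
        raise ValueError(f"unexpected collation name: {collation!r}")
    # The character set is the part of the collation name before the first "_".
    charset = collation.split("_", 1)[0]
    return [
        f"ALTER TABLE {quote_identifier(table)} "
        f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for table in tables
    ]


if __name__ == "__main__":
    # Placeholder table names; substitute the real list from the database.
    for statement in build_convert_statements(["example_table_a", "example_table_b"]):
        print(statement)
```

Printing the statements instead of executing them keeps the step reviewable, which seems prudent because CONVERT TO rewrites every text column in the table.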

