utf8mb4_general_ci vs utf8mb4_unicode_ci #879
Replies: 3 comments 1 reply
This is outside my area of expertise, so I will be guided by others. I always thought that it should be up to the individual organisation but, like I say, I don't know enough about it.
Tim
What is the goal?
Another goal might be the least effort to implement and maintain. Is changing the collation from utf8mb4_general_ci to utf8mb4_unicode_ci worth the effort? Do we have to impose a collation in the first place? Can it be irrelevant to webERP whether one server defaults to utf8mb4_general_ci and another defaults to utf8mb4_unicode_ci? I don't see an overwhelming reason to change, so I vote for staying as-is with utf8mb4_general_ci. Fwiw, I checked table collation in the other apps I'm hosting.
[1] I thought low processing requirements was a stated goal for webERP but I don't see it listed on the weberp.org website (only "...scripts are written to maximise readability for business users" and "...any web server that supports PHP. Use your own server or a managed server from an ISP to avoid purchase and maintenance costs.").
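For anyone wanting to repeat that kind of check, here is a minimal sketch. The query reads the standard information_schema.TABLES view; the helper names and the sample rows are made up for illustration, and actually running the query is left to whichever MySQL driver you use.

```python
from collections import Counter

# Standard MySQL metadata query. %s is the placeholder style used by the
# common Python MySQL drivers; pass the database name as the parameter.
TABLE_COLLATION_SQL = (
    "SELECT TABLE_NAME, TABLE_COLLATION "
    "FROM information_schema.TABLES "
    "WHERE TABLE_SCHEMA = %s AND TABLE_TYPE = 'BASE TABLE'"
)


def summarise_collations(rows):
    """Count tables per collation from (table_name, collation) rows."""
    return Counter(collation for _table, collation in rows)


def tables_not_using(rows, wanted):
    """Return the sorted names of tables whose collation differs from `wanted`."""
    return sorted(table for table, collation in rows if collation != wanted)


# Hypothetical rows, shaped like the result of TABLE_COLLATION_SQL.
sample_rows = [
    ("orders", "utf8mb4_general_ci"),
    ("customers", "utf8mb4_unicode_ci"),
    ("items", "utf8mb4_general_ci"),
]
summary = summarise_collations(sample_rows)
mismatched = tables_not_using(sample_rows, "utf8mb4_unicode_ci")
```

With the sample rows above, `summary` counts two general_ci tables and one unicode_ci table, and `mismatched` lists `items` and `orders`.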
OK, then. I will keep utf8mb4_unicode_ci on my install, and modify CreateTable accordingly.
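If an install ends up with a mix of collations, one way to bring existing tables in line is MySQL's `ALTER TABLE ... CONVERT TO CHARACTER SET ... COLLATE ...`. The sketch below only generates the statements; it is not webERP's CreateTable, and the function name and the name check are my own assumptions. Converting rebuilds each table, so try it on a backup first.

```python
import re

# Assumption: table and collation names contain only letters, digits and
# underscores. Anything else is rejected rather than quoted.
_SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def convert_statements(tables, collation="utf8mb4_unicode_ci"):
    """Build one ALTER TABLE statement per table name (hypothetical helper)."""
    if not _SAFE_NAME.fullmatch(collation):
        raise ValueError(f"unexpected collation name: {collation!r}")
    statements = []
    for table in tables:
        if not _SAFE_NAME.fullmatch(table):
            raise ValueError(f"unexpected table name: {table!r}")
        statements.append(
            f"ALTER TABLE `{table}` "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Hypothetical table names, for illustration only.
for statement in convert_statements(["orders", "customers"]):
    print(statement)
```

Generating the statements instead of executing them leaves room to review the list before anything touches the database.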
Should we move towards utf8mb4_unicode_ci?
My production server is set to utf8mb4_unicode_ci by default, so I checked if I should keep it or move towards standard webERP and go to utf8mb4_general_ci.
After a few searches, it seems that utf8mb4_unicode_ci is more "international": it improves sorting for non-English languages, and the overhead is minimal.
Thoughts?
R
Gemini said
In MySQL, utf8mb4 is the recommended character set as it supports the full range of Unicode, including emojis. The choice between the general_ci and unicode_ci collations determines how the database compares and sorts these characters.
Key Differences
1. Accuracy and Sorting Rules
utf8mb4_unicode_ci: Based on the Unicode Collation Algorithm (UCA). It handles complex character expansions and contractions correctly. For example, in some German contexts, it knows that ß is equal to ss.
utf8mb4_general_ci: A simplified set of rules that does not strictly follow the Unicode standard. It performs a "short-cut" comparison by looking at characters one-to-one. It would not treat ß as ss; it compares ß as equal to a single s.
2. Performance
utf8mb4_unicode_ci: Slightly slower because the comparison logic is more complex to ensure linguistic accuracy.
utf8mb4_general_ci: Historically faster because the calculations are computationally cheaper. However, on modern CPUs, this performance gap is usually negligible.
3. Language Support
utf8mb4_unicode_ci: Better for applications serving a global audience or multiple languages, as it follows the language-neutral default ordering of the UCA. Rules tailored to one language need a language-specific collation instead.
utf8mb4_general_ci: May lead to unexpected sorting results in non-English languages where accented characters or ligatures have specific meanings.
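To make the ß example concrete, here is a toy model. It only imitates the behaviour described above for a handful of characters; the mapping tables and function names are invented for illustration and are not MySQL's actual weight tables.

```python
from typing import Mapping

# One-to-one folding, in the spirit of utf8mb4_general_ci: each character
# maps to exactly one comparison key, so "ß" can only equal a single letter.
GENERAL_FOLD = {"ß": "s", "é": "e", "ü": "u"}

# Expansion-aware folding, in the spirit of utf8mb4_unicode_ci: a character
# may expand to several keys, so "ß" compares like "ss".
UNICODE_FOLD = {"ß": "ss", "é": "e", "ü": "u"}


def fold(text: str, table: Mapping[str, str]) -> str:
    """Lower-case the text, then replace each character using the table."""
    return "".join(table.get(ch, ch) for ch in text.lower())


def equal_general(a: str, b: str) -> bool:
    """Compare two strings with the one-to-one toy rules."""
    return fold(a, GENERAL_FOLD) == fold(b, GENERAL_FOLD)


def equal_unicode(a: str, b: str) -> bool:
    """Compare two strings with the expansion-aware toy rules."""
    return fold(a, UNICODE_FOLD) == fold(b, UNICODE_FOLD)
```

Under these toy rules `equal_unicode("Straße", "Strasse")` is true while `equal_general("Straße", "Strasse")` is false, which is the difference a search or a unique index would see.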
Comparison Table
Summary Recommendation
Use utf8mb4_unicode_ci
This is the modern standard. Unless you are working on a legacy system with extreme performance bottlenecks or specific hardware constraints, you should use utf8mb4_unicode_ci (or even better, utf8mb4_0900_ai_ci if you are on MySQL 8.0+) to ensure data integrity and correct sorting across all languages.
Use utf8mb4_general_ci
Only use this if you are maintaining an older database where this collation is already established, or if you have benchmarked your specific application and found that the collation overhead is causing a significant bottleneck.