UTF-8 text encoding and self-hosted PHP / MySQL web applications
One of the best things about web development is that there is always something new to learn, no matter how much you know. If you want to really learn how to build a solid, portable application then I’m convinced that the best way is to release one.
Users will be swift to report bugs and incompatibilities for you. This isn’t negligent; it’s natural. A developer can’t anticipate all possible configurations and behaviour. Your response is what matters. You may be equally swift to issue a fix, but issues may sometimes go unaddressed because the effort outweighs the benefit.
Issues with simpleContact have been rare, but a self-started forum thread caught my attention last week. A member commented that they hadn’t been able to submit their name properly. I’ve researched the matter and in the interests of transparency I’m going to present what I found and what I plan to do.
Character encoding is a thorny issue that few bother to understand and implement in a considered way. There are articles on the web that cover the subject in detail, so rather than re-hash their points I’ll add some relevant links at the end.
The issue
simpleContact uses MySQL’s default character set and collation, latin1 and latin1_swedish_ci. That is fine for English and most Western European languages, but text in other languages (Greek, Japanese or Russian, for example) will not be stored or displayed properly. This is a matter I wish to address because I want as many people as possible to use my software.
The ideal course of action would be to process and store data in UTF-8 encoding. UTF-8 is a variable-width character encoding that uses between one and four bytes per character and supports the characters of just about every language in the world.
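To see why the multi-byte nature of UTF-8 matters in PHP, here is a minimal sketch comparing the byte length reported by strlen() with an actual character count. The function name utf8_char_count() is made up for this illustration and is not part of simpleContact; it assumes its input is already valid UTF-8.

```php
<?php
// Test strings written as explicit UTF-8 bytes, so the example does not
// depend on the encoding this file happens to be saved in.
$ascii = "a";             // U+0061, one byte
$greek = "\xCE\xB1";      // U+03B1 (Greek alpha), two bytes
$kanji = "\xE6\x97\xA5";  // U+65E5, three bytes

// Hypothetical helper: count characters in a valid UTF-8 string without
// mbstring by skipping continuation bytes (bit pattern 10xxxxxx).
function utf8_char_count($s)
{
    $count = 0;
    $len = strlen($s); // number of bytes, not characters
    for ($i = 0; $i < $len; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

printf("%d bytes, %d character(s)\n", strlen($ascii), utf8_char_count($ascii));
printf("%d bytes, %d character(s)\n", strlen($greek), utf8_char_count($greek));
printf("%d bytes, %d character(s)\n", strlen($kanji), utf8_char_count($kanji));
```

Any code that validates or truncates input by byte length will treat a one-character Greek or Japanese name as two or three characters, which is the kind of mismatch that byte-oriented string functions cause.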
Advice abounds for how to reconfigure your server and scripts to use UTF-8, but matters are more complicated when users will host an application on their own site. As developers of apps like WordPress and Mint know, ideal configurations in shared hosting services and availability of non-default PHP extensions like mbstring can’t be guaranteed. Some of the servers I have access to lack the necessary functions.
The easiest thing would be to put up a wall and say “If you don’t have x, you need to get a better host”. Meanwhile back in the real world, users are far more likely to reject an app than swap hosts. A middle ground has to be reached.
The solution
UTF-8 support must be added, but not without respecting a user’s situation. I plan to keep Latin-1 encoding and collation as a baseline standard for compatibility. It is better to support a subset of commonly spoken languages than not to work at all. For users whose server supports multi-byte string functions, the database will be silently upgraded to UTF-8 and appropriate methods will be used in PHP.
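The fallback described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that will ship: the sc_ function names and the messages table are hypothetical, and the MySQL names utf8 and utf8_general_ci are one possible choice (servers running MySQL 5.5.3 or later also offer utf8mb4).

```php
<?php
// Hypothetical helper: pick an encoding profile based on whether the
// mbstring extension is available. Pass true or false to override detection.
function sc_pick_encoding($hasMbstring = null)
{
    if ($hasMbstring === null) {
        $hasMbstring = extension_loaded('mbstring');
    }
    if ($hasMbstring) {
        return array('php' => 'UTF-8', 'mysql' => 'utf8', 'collation' => 'utf8_general_ci');
    }
    // Baseline: Latin-1, matching MySQL's old defaults.
    return array('php' => 'ISO-8859-1', 'mysql' => 'latin1', 'collation' => 'latin1_swedish_ci');
}

// Hypothetical wrapper: use mb_strlen() only when the UTF-8 profile is
// active; otherwise one byte is one character and strlen() is enough.
function sc_strlen($s, $enc)
{
    return ($enc['php'] === 'UTF-8') ? mb_strlen($s, 'UTF-8') : strlen($s);
}

// Build the statement that converts an existing table to the chosen
// character set. The table name is assumed to come from trusted code.
function sc_upgrade_sql($table, $enc)
{
    return sprintf(
        'ALTER TABLE `%s` CONVERT TO CHARACTER SET %s COLLATE %s',
        $table,
        $enc['mysql'],
        $enc['collation']
    );
}

$enc = sc_pick_encoding();
echo sc_upgrade_sql('messages', $enc), "\n";
```

In a real installation the connection character set would also need to match (for example by issuing SET NAMES with the chosen MySQL character set), and the pages would have to be served with a matching charset declaration.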
Beyond that, there isn’t much else. If you want a truly international solution and your host won’t support mbstring then you really do need to choose another host.
So what next?
I have fixed the feature set for simpleContact 2.0 Pro and don’t intend to add UTF-8 support to it. When 2.0 is done, UTF-8 support will become a high-priority feature in version 2.1, which will be a free upgrade. I will add it to simpleContact Lite in due course.
Useful links
- Joel Spolsky – The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- Nick Nettleton – PHP UTF-8 cheatsheet
- Gabriel Walt – StumbleUpon PHP: [Tutorial] Character encoding
- Wikipedia – ISO 8859-1 (Latin-1) and UTF-8

February 16th, 2008 at 5:35 pm
UM dude there are libraries out there that implement the mbstring functions in pure php if it’s not present.. so use that.
February 17th, 2008 at 12:21 pm
Thanks Wesley. phputf8 looks promising - I’d be pleased to stand corrected on this one.