MySQL character encodings

Saturday, 20 June 2009

I recently noticed that many of the comments and trackbacks on this website were composed entirely of question marks. At first I thought it might be plain old spam, but it turned out to be a character encoding problem. Here’s how I fixed it.

Excessive question marks often indicate a problem with character encoding. After a little investigation I realised that the character collation for most of the WordPress database tables was set to latin1_swedish_ci, indicating a character set of latin1. Clearly this was going to cause problems for non-western languages; I noticed problems with comments from Israel, Eastern Europe and East Asia amongst others.

The fix was pretty simple: change the character set used by the comments table to UTF-8 like this.

ALTER TABLE wp_comments CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

I think this happened when I restored the database from a backup; the collations were reset to latin1_swedish_ci, which is the MySQL default. Sadly, the encoding problem was happening when data went into the database, so all the current question marks will remain. At least future comments and trackbacks will now appear correctly, I hope.

There is one strange thing though — the comments table in another blog of mine has its collation also set to latin1_swedish_ci, but it can handle Japanese, Hebrew, Russian, and anything else I throw at it. The case continues.

Tags: ,

8 comments

You can leave a comment, or trackback from your own site.

  1. Hello mate,

    I like this trail of thought. Ive encountered the problem from another angle. When i update my wordpress blog with an article passed through the xmlrpc.php api the character encoding doesnt work for characters like ” and – instead returning html code or broken characters. The funny thing is the post titles work fine in wp-admin but on the theme front end they are broken… maybi the solution lies in the php code used for displaying post titles in the admin section.. because it appears to convert any type of character encoding back to what its supposed to look like.

  2. That is a bit mysterious. At least it seems that the characters are OK in the database, since they at least sometimes appear OK! It looks as if there might be a problem with your theme…

  3. Re: there is a plugin called wp-typography. If you know a little php i suggest you use it as a little modding will have this plugin fixing any character display problems 🙂

  4. @Frank, thanks for the pointer. Unfortunately, wp-typography wouldn’t have helped with my problem, because it only handles character display and requires that all text is encoded as UTF-8. My problem was that the characters were not correctly encoded in the first place.

    Looks like an interesting plugin though!

  5. i have similar problem but not in comments.
    In my wordpress post http://oscience.info/mathematics/introduction-to-set/ i have written some characters.
    But when i view my post from any browser then ? is displayed instead of the required character!
    Can you help me please?

  6. I solved the problem by typing the html code which i got from site http://www.w3schools.com/TAGS/ref_symbols.asp instead of using inbuilt wordpress character typing option.

Leave a comment