Skip to main content
Ben Nadel at CFUNITED 2008 (Washington, D.C.) with: Sean Corfield
Ben Nadel at CFUNITED 2008 (Washington, D.C.) with: Sean Corfield

MySQL Truncating Value At First Supplementary / Astral / utf8mb4 Character On INSERT And UPDATE

By
Published in , , Comments (1)

As I've been embracing [one of] my roles as the Database Administrator's understudy at InVision App, I've been trying hard to track down every vexing problem relating to our MySQL RDS instances. For months, we've been having a problem where data, coming out of a Node.js microservice, was being truncated on the first character that required the utf8mb4 character set. While I don't yet understand why our ColdFusion microservices and our Node.js microservices are handling this differently (I suspect the Java driver is just better than the Node.js driver), I think I finally figured out why Node.js is able to insert corrupted data: we're using a lenient "sql_mode" setting.

For context, this is for data going into a database table that only supports the utf8 character set, not the utf8mb4 character set (which requires up to four bytes for each code-point, using a pair of high / low surrogate characters). When the ColdFusion MySQL 5 driver goes to insert data that contains supplementary / astral plane characters into this table, it errors out with a message like:

Error Executing Database Query. Incorrect string value: '\xF0\x9F\x98\x8D, ...' for column 'value' at row 1.

Now, on the other hand, when the Node.js MySQL driver goes to insert the same data, there is no error. The INSERT succeeds; but, the data becomes corrupted, truncating the input at the same character that would have caused an error in the ColdFusion MySQL driver.

NOTE: The fact that we have two "microservices" accessing the same database is not lost on me; it makes me sad.

I don't know why the ColdFusion driver works better than the Node.js driver (except for the fact that ColdFusion is amazing). It could be something in the connection string. But, even when I have what I believe to be the same connection string in both drivers, the problem still persists.

That said, after much head-banging, experimentation, and Googling, I finally came across the blog post, "Data Corruption To Go: The Perils Of sql_mode = NULL," by Keyur Govande. This post cued me in on the concept of "sql_mode," which I had never heard of before. SQL modes affect the SQL syntax that MySQL supports and the data validation checks that it performs during mutation events.

By default, MySQL uses a sql_mode named "NO_ENGINE_SUBSTITUTION". This is a non-strict mode in which problematic data produces "warnings" not "erros". This is what one of our MySQL RDS instances is using. And, it's the reason data is being corrupted. To see this in action, we can set the current session to use NO_ENGINE_SUBSTITUTION and then try to insert problematic data:

-- This mode produces warnings, not errors for certain problematic
-- operations.
SET SESSION sql_mode = 'NO_ENGINE_SUBSTITUTION';

-- Try to insert data that requires the utf8mb4 character set (into
-- a table that only supports utf8).
INSERT INTO
	char_test
SET
	id = 100,
	value = 'ha ha <?> you crazy!'
;
show warnings;

-- Check to see what value was inserted.
SELECT
	value
FROM
	char_test
WHERE
	id = 100
;

When we run the above SQL, we get the following output:

sql_mode NO_ENGINE_SUBSTITUTION

As you can see, the inserted value is being truncated on the first occurrence of a character that requires utf8mb4. And, if run:

SHOW WARNINGS;

... we get the following:

Warning 1300 - Invalid utf8 character string: 'F09F98'
Warning 1366 -Incorrect string value: '\xF0\x9F\x98\x81 y...' for column 'value' at row 1

In the non-strict "NO_ENGINE_SUBSTITUTION" mode, we get warnings about the problem; but, that doesn't prevent the INSERT for taking place.

Now, if we switch over to "STRICT_ALL_TABLES" mode, which rejects invalid data, we start to see the desired error behavior:

-- This mode produces warnings, not errors for certain problematic
-- operations.
SET SESSION sql_mode = 'STRICT_ALL_TABLES';

-- Try to insert data that requires the utf8mb4 character set (into
-- a table that only supports utf8).
INSERT INTO
	char_test
SET
	id = 200,
	value = 'ha ha <?> you crazy!'
;

-- Check to see what value was inserted.
SELECT
	value
FROM
	char_test
WHERE
	id = 200
;

When we run the above SQL, we get the following output:

sql_mode STRICT_ALL_TABLES

As you can see, this time, the INSERT fails completely and throws the error:

Error : Incorrect string value: '\xF0\x9F\x98\x81 y...' for column 'value' at row 1

This is definitely what we want. Even if the underlying problem is really that our data table has problematic character set support, we'd rather have the statements fail entirely rather than "looking" like they worked and quietly resulting in corrupted data.

I don't know what the broader implications of changing the sql_mode may be (I'm just reading up on all of this for the first time). I suspect that this can only result in better application behavior. But, I'll definitely be reviewing it with the team before we take any action. As a final note, I would highly suggest reading Keyur Govande's blog post as it talks about a number of other problems that non-strict SQL modes can cause.

Want to use code from this post? Check out the license.

Reader Comments

15,841 Comments

@All,

Note: my blog doesn't support utf8mb4 characters either :D So, my SQL snippets are using <?> instead of the actual emoticon literal values.

I believe in love. I believe in compassion. I believe in human rights. I believe that we can afford to give more of these gifts to the world around us because it costs us nothing to be decent and kind and understanding. And, I want you to know that when you land on this site, you are accepted for who you are, no matter how you identify, what truths you live, or whatever kind of goofy shit makes you feel alive! Rock on with your bad self!
Ben Nadel