VARCHAR(Length) Limit Refers To Characters, Not Bytes, In MySQL
When you define a varchar
field in a MySQL database table, you can provide a length limit, ex varchar(255)
. For as long as I can remember, I thought this limit referred to the number of bytes that could be stored in the field. And, to be fair, back when I was only consuming ASCII characters, this assumption was coincidentally true—one ASCII character is represented by one byte. In reality, this length limit refers to the number of characters that can be stored in a field, regardless of how many bytes are needed to represent said character string.
This is an important distinction to understand because once you go beyond the basic ASCII character-set, you start dealing with variable-length character representations. That is, you start dealing with characters, such as emojis, that are represented under the hood by a series of bytes, not just a single byte.
Fortunately, if you define a varchar
field to have a limit of 255
, it doesn't matter which characters you insert into that field. If you insert 255 ASCII characters, it may only require 255 bytes of storage. And, if you insert 255 emoji characters, it may require far more than 255 bytes; but, you won't have to know or care about that (for the most part).
To see this in action, let's define a simple MySQL database table with a single varchar
field:
CREATE TABLE `varchar_test` (
`id` tinyint unsigned NOT NULL,
`value` varchar(10) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
This varchar
field can hold 10
characters, regardless of how many bytes that requires. To demonstrate, let's use ColdFusion to insert two rows: one with 10 ASCII characters and one with 10 multi-byte emojis:
<!--- Define 10-characters of 1-byte ASCII data. --->
<cfset asciiData = repeatString( "A", 10 ) />
<!--- Define 10-characters of multi-byte UTF-8 data ("raising hands" emoji). --->
<cfset emoji = canonicalize( "&##x1f64c;", true, true ) />
<cfset emojiData = repeatString( emoji, 10 ) />
<!--- Insert both character-strings into database using VARCHAR(10). --->
<cfquery name="results">
TRUNCATE TABLE
varchar_test
;
INSERT INTO
varchar_test
SET
id = 1,
value = <cfqueryparam value="#asciiData#" cfsqltype="cf_sql_varchar" />
;
INSERT INTO
varchar_test
SET
id = 2,
value = <cfqueryparam value="#emojiData#" cfsqltype="cf_sql_varchar" />
;
</cfquery>
When we run this ColdFusion code, it runs without error. And, once the data is stored, we can look in the MySQL database to see how it's being represented:
SELECT
id,
value,
-- Get the BYTES required to store the value.
LENGTH( value ) AS byteLength,
-- Get the CHARACTERS required to store the value.
CHAR_LENGTH( value ) AS charLength
FROM
varchar_test
;
And when we run this MySQL, we get the following results:
Both the ASCII data and the emoji data have been stored successfully. And, both report back as 10 characters according to the CHAR_LENGTH()
function. But, while the ASCII value has a byte length of 10, the emoji value has a byte length of 40—four times the limit of the field. Clearly the varchar(length)
limit refers to the characters, not the bytes.
Which is great because it means that, as application developers, we don't have to worry so much about what data the user is entering.
In fact, it seems that the maxlength
field on a <textarea>
control is even more restrictive. So, if we have a textarea with a maxlength="10"
, we may not be able to enter 10 emojis into that field even if we can store 10 emoji characters in the underlying database.
To see this in a more full-circle way, I'm going to query for the two records we inserted above; and then, try to copy-paste the values into a textarea control with a maxlimit
:
<cfquery name="results" returnType="array">
SELECT
id,
value
FROM
varchar_test
;
</cfquery>
<!--- Ensure the browser will render non-ascii character-sets. --->
<cfcontent type="text/html; charset=utf-8" />
<cfoutput>
<dl>
<div>
<dt>
<strong>Database ASCII:</strong>
</dt>
<dd>
#results[ 1 ].value#
</dd>
<dd>
#len( results[ 1 ].value )# chars
</dd>
<dd>
#arrayLen( charsetDecode( results[ 1 ].value, "utf-8" ) )# bytes
</dd>
<dd>
<input maxlength="10" />
</dd>
</div>
<div style="margin-top: 1rem ;">
<dt>
<strong>Database EMOJI:</strong>
</dt>
<dd>
#results[ 2 ].value#
</dd>
<dd>
#len( results[ 2 ].value )# chars
</dd>
<dd>
#arrayLen( charsetDecode( results[ 2 ].value, "utf-8" ) )# bytes
</dd>
<dd>
<input maxlength="10" />
</dd>
</div>
</dl>
</cfoutput>
If I run this ColdFusion page and then copy-paste the rendered database value into the relevant <textarea>
, I get the following output:
Notice that while I can copy-paste the entire ASCII value into the <textarea>
, I can only copy-paste half the emoji value into the <textarea>
despite it having the same "max length" as the underlying MySQL database field. Notice also that ColdFusion seems to think that the 10 character emoji value is actually 20 characters (which is apparently what the browser thinks as well, hence the truncation).
Ultimately, the browser is more restrictive than the MySQL database. Which means that if we use the same maxlength
in the browser as we do varchar(length)
in the database, we should never have to worry about the user entering more text information that the given row can store.
Want to use code from this post? Check out the license.
Reader Comments
Over on Facebook, Seb Duggan had a good point to watch out for the multi-emoji patterns. Where two or more emojis are combined to visually represent a single glyph. I don't know much about these - if they're the same behavior as special ligature behaviors? But, regardless, if I were to replace my
emoji
in my demo with a multi-emoji glyph, the demo would result in an field overflow / truncation.That said, the
<textarea maxlength>
would also take this into account. So, as long as you have amaxlength
that is less than or equal to thevarchar(length)
constraint, you should be fine.Post A Comment — ❤️ I'd Love To Hear From You! ❤️
Post a Comment →