Now Using utf8mb4 To Enable Emoji In My Comments
This post is mostly here so that I can test the comment feature in production. I just updated my blog_comment
table to use the utf8mb4
character set, which means that it can now support Emoji characters:
ALTER TABLE `blog_comment`
MODIFY COLUMN `content` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
MODIFY COLUMN `content_markdown` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL
;
From what I understand, the utf8mb4
character set enables support for characters in the Astral Plane, which is where many (but not all) emoji characters live. Specifically, it supports Emjoi that require 4-bytes to encode.
On it's own, utf8mb4
doesn't enable Emoji support in my running MySQL server. I also had to tell the INSERT
statement to use the appropriate character set:
SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;
INSERT INTO blog_comment
(
content,
content_markdown,
-- .... truncated for demo ....
I believe that there is a way to configure the MySQL server to use those settings by default; however, that's beyond my experience at this point. So, rather than go down that rabbit-hole this morning, I'm just using SET NAMES
, which I know works.
And now, hopefully, I'll be able to add Emoji in the comments below!
Want to use code from this post? Check out the license.
Reader Comments
And then, there were emoji:
... and lo, it was good.
Hmm, the comment was inserted; but, the page hung. Trying again.
Huh, it looks like the key-casing on my returned Struct suddenly changed? That's..... wonky.
Well, anyway, it works now. 👍
❣
@Jamie,
Woot! It's alive!!
It looks like something in the last Adobe CF Hotfix messed with my key-casing.
I like! ⭐️😁✅👾💩🤡
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
^ I read this once a year to refresh my memory. If my understanding is correct, unicode is a "character set" and utf8mb4 is an "encoding."
Unicode has "code points" to represent 1,112,064 different characters across all the world's languages (and some random stuff like emojis too). utf8 can store most of the code points of unicode (using up to 3 bytes), and utf84mb can store them all (including emojis, using up to 4 bytes).
This is how you can say hello in Mandarin/UTF8/binary:
$ echo -n "你好" | xxd -b
11100100 10111101 10100000 11100101 10100101 10111101
因為我學中文和我住在台灣,我知道這個。😅
@Casey,
Ha, I know that I've read that article a handful of times as well :D Somehow, this stuff never quite plants itself permanently in my brain, and I have to re-learn aspects of this over and over. I think you are right about the "character set" vs. "encoding". ... I think 🤷♂️
Yay🎉🥳 ... Hmmm didn't like the party hat emoji?
@Chris,
Hmm, yeah, that second comes through as one weird little binary chip or something :D
It showed up on my laptop, but it's a weird chip on my phone...which is the device I used to comment. Strange. Here's a screenshot of it working...
https://screencast.com/t/T4bdQmQMJk81
Nice 🤓!
I posted about this a few weeks ago as well. It gives a bit of an explanation: https://www.petefreitag.com/item/895.cfm
From my experience you don't need to do the SET NAMES thing, you can instead alter the DB schema
ALTER TABLE tableName
DEFAULT CHARACTER SET utf8mb4,
MODIFY columnName varchar(200)
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
MODIFY anotherColumn text
CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;
And you also need to make sure your database driver is using utf8, for mysql that might be adding useUnicode=true to the connection string.
@Pete,
I modified the schema - it might be this piece that I am still missing:
Though, I feel like we tried that at work, and we still needed to include the
SET NAMES
stuff. I'll have to play around.@All,
As a quasi-follow-up to this post, I started to think more deeply about how I might further leverage Emoji in my blog. The native
chr()
andasc()
functions in ColdFusion only work for the Unicode range0x0000
to0xffff
(or, 0 to 65,535). Once you go above that into the "supplemental" character range, these functions no longer work. As such, I believe you have to dip down into the Java layer:www.bennadel.com/blog/3804-printing-emoji-characters-from-unicode-codepoints-in-lucee-cfml-5-3-5-92.htm
I know that, these days, you can embed Emoji directly within your CFML code. But, that just feels janky to me (probably cause I'm an old dog learning new tricks). As such, I'm really interesting in being able to render Emoji characters from the Unicode CodePoint.
@All,
Being able to store emoji is one thing; making the emoji easier to use is another thing. I want to add some emoji short-cuts to the blog. And, as a precursor to that, I needed to figure out how to render emoji glyphs using codepoints:
www.bennadel.com/blog/4084-rending-emoji-glyphs-using-hexadecimal-codepoints-in-javascript.htm
Turns out, the
String.fromCodePoint()
function is perfect for this.I finally got around to upgrading my MySQL database version and my table schemas. Now, everything runs on
utf8mb4
. And, as @Pete pointed out, I can remove theSET NAMES
stuff from my SQL statements as long as I have havecharacterEncoding=UTF-8
in my connection string. I also took this as an opportunity cleanup that connection string as well (as this tends to rot over time).www.bennadel.com/blog/4218-upgrading-my-coldfusion-blog-from-mysql-5-7-10-to-mysql-8-0-28.htm