Lessons Learned From Sending 7 Million Emails In 5 Days Using ColdFusion
At work, I was asked to send an email notification to every one of our users. When originally presented with this problem, I said it was a terrible idea—that there are professional services that do exactly this thing; and, that I have no experience sending a massive amount of emails; and, that anything I built would end up being the "Dollar Store" version of such a feature. None of this persuaded the powers-that-be. And so, I ended up building a small ColdFusion system that just finished sending 7 million emails in 5 days. It was a lot less complicated than I had anticipated; but, we learned a number of important lessons along the way.
Bounce Rate Is the Limiting Factor
Of all the concerns that I had, it turns out that the only one that really mattered was the validity of email addresses. If I send an email to an invalid email address, it "Hard Bounces". Meaning, the email cannot be delivered for permanent reasons.
In and of itself, a hard bounce is not problematic. Email addresses tend to go stale over time; so, it's expected that some number of deliveries will fail. Hard bounces only become a problem when they account for a large percentage of outbound emails.
To be honest, I don't fully understand how the "politics" of email work. But, email services live and die by reputation. And, when too many outbound emails come back as a hard bounce (or a spam complaint), it hurts the reputation of the email service. And, this can be a huge problem.
In fact, it's such a huge problem that Postmark—our email delivery service—will shut down our message stream if our bounce rate goes above 10%. They do this to protect us, as the customer sending emails; and, to protect other Postmark customers who might be negatively impacted if Postmark's overall reputation for delivery is tarnished.
To be clear, I absolutely love the Postmark service and have been using them both professionally and personally for the last 13 years. This is not a slight against them—they rock and they're just doing what they have to do.
What I didn't expect was just how many invalid email addresses we had collected over 13 years. Many of these were for users who had moved on to a different company (and whose old email address was no longer active). But, we also had about 150 thousand temporary email addresses (ie, email addresses that are generated for testing purposes and are only valid for about 15 minutes).
Aside: There is a community-driven List of Disposable Email Domains that currently has 3,614 different domains that can be used to generate temporary emails. This list can be used to help prevent users from signing up for your application with a disposable email.
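If you wanted to put that list to work at sign-up time, a minimal sketch might look like this (the struct and function names are hypothetical, and only two sample domains are shown):
<cfscript>
    // Hypothetical in-memory lookup of blocked domains (one struct key per
    // disposable domain), loaded from the community-driven list. Only two
    // sample entries are shown here.
    disposableDomains = {
        "mailinator.com": true,
        "guerrillamail.com": true
    };
    // Returns true if the given email address uses a known disposable domain.
    function isDisposableEmail( required string email ) {
        var domain = lcase( listLast( arguments.email, "@" ) );
        return disposableDomains.keyExists( domain );
    }
</cfscript>
With this in place, calling isDisposableEmail( "someone@mailinator.com" ) would return true and the sign-up could be rejected.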
This is why Postmark recommends that services only send an email to a customer that has been successfully contacted within the last 3 months. Of course, I didn't have that luxury - I had to send an email to everyone.
Validating Email Addresses
In order to keep our bounce rate below the 10% cut-off, we started using a service called NeverBounce. NeverBounce provides an API to which you submit an email address; it then verifies the validity of said address using a variety of techniques.
At first, I tried to integrate a NeverBounce API call as a pre-send validation step in my ColdFusion workflow. Meaning, right before sending an email, I would first submit the email address to NeverBounce and examine the response. But, this turned out to be prohibitively slow. NeverBounce validates email addresses, in part, by making HTTP requests to the target email server. As such, any latency in those upstream requests is propagated down to my ColdFusion workflow. And, some of those requests would take hundreds of seconds to complete.
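For concreteness, here's roughly what that just-in-time check looked like. This is a hedged sketch: the end-point and result codes are based on NeverBounce's public v4 single-check API, and the neverBounceApiKey variable is assumed to be defined elsewhere:
<cfscript>
    // A minimal sketch of the just-in-time check. The 10-second timeout
    // guards against slow upstream mailbox checks; but, even that proved
    // far too slow to run inline for every send.
    cfhttp(
        method = "get",
        url = "https://api.neverbounce.com/v4/single/check",
        result = "apiResponse",
        timeout = 10
    ) {
        cfhttpparam( type = "url", name = "key", value = neverBounceApiKey );
        cfhttpparam( type = "url", name = "email", value = email );
    }
    response = deserializeJson( apiResponse.fileContent );
    // The result code is one of: valid, invalid, disposable, catchall, unknown.
    isSendable = ( response.result == "valid" );
</cfscript>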
Once we realized that a just-in-time validation step wasn't going to work, we started uploading text files through the NeverBounce admin. Each text file contained a list of 25 thousand email addresses. And, once uploaded to NeverBounce, each file would be processed over the course of several hours and a results file would (eventually) be made available for download.
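Generating those text files was itself just a simple loop. A sketch, assuming a hypothetical getAllEmailAddresses() helper that returns the full array of addresses:
<cfscript>
    // Write the email addresses to disk in 25k-address text files, one
    // address per line, for manual upload to the NeverBounce admin.
    allEmails = getAllEmailAddresses();
    chunkSize = 25000;
    for ( i = 1 ; i <= allEmails.len() ; i += chunkSize ) {
        chunk = allEmails.slice( i, min( chunkSize, ( allEmails.len() - i + 1 ) ) );
        fileWrite( "/exports/emails-#i#.txt", chunk.toList( chr( 10 ) ) );
    }
</cfscript>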
Unfortunately, half way through this process, NeverBounce locked our account due to "suspicious activity". To unlock it, they wanted our head of marketing to send in a photo of himself holding his passport next to his face. He never did this (because what kind of a crazy request is that?!); and, after several weeks of back-and-forth support requests, they finally unlocked the account.
That said, NeverBounce eventually stopped responding to subsequent support requests regarding future uploads. As such, we finished the validation process with a different service, Email List Verify. Email List Verify was half the cost of NeverBounce; but, it was also less robust (with one big limitation being that, at the time of this writing, it didn't support "+"-style email addresses, such as user+tag@example.com - marking them all as having invalid syntax).
Eventually, we passed every email through one of these services and cached the responses in a database table that looked like this:
CREATE TABLE `email_validation_cache` (
    `email` varchar(75) NOT NULL,
    `textCode` varchar(50) NOT NULL,
    PRIMARY KEY (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
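Populating that cache was a straightforward upsert. Here's a sketch, assuming a queryExecute()-based data layer; the cacheValidationResult() name is mine, not from the actual codebase:
<cfscript>
    // Cache (or refresh) the validation result for a given email address.
    // The ON DUPLICATE KEY UPDATE clause lets us safely re-run a validation
    // pass without tripping over the primary key on the email column.
    function cacheValidationResult( required string email, required string textCode ) {
        queryExecute(
            "
                INSERT INTO email_validation_cache ( email, textCode )
                VALUES ( :email, :textCode )
                ON DUPLICATE KEY UPDATE textCode = VALUES( textCode )
            ",
            {
                email: { value: arguments.email, cfsqltype: "cf_sql_varchar" },
                textCode: { value: arguments.textCode, cfsqltype: "cf_sql_varchar" }
            }
        );
    }
</cfscript>
Being able to safely re-run a validation pass is a theme that comes up again in the comments below.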
The textCode column held the result of the validation call. It contained values like:
- valid (NeverBounce validated)
- ok (Email List Verify validated)
- invalid
- invalid_mx
- ok_for_all
- accept_all
- email_disabled
- disposable
In order to keep our bounce rate down, we only sent an email to addresses that had strict validation (ie, valid and ok). And, we skipped catch-all emails with responses like ok_for_all and accept_all (pseudo code):
<cfscript>
    // Process the next chunk of emails.
    getNextChunkToSend().each(
        ( email ) => {
            var textCode = getEmailValidationResult( email );
            // If this email address failed verification (via NeverBounce or
            // Email List Verify), skip it - we need to keep a low bounce rate.
            if (
                ( textCode != "ok" ) &&
                ( textCode != "valid" )
            ) {
                return;
            }
            // ... send email ...
        }
    );
</cfscript>
And, even after doing all of that, we still ended up with a bounce rate of 4.2%. This might sound low; but, 4.2% of 7 million emails is roughly 294 thousand bounces. At least all of these precautions kept us well under Postmark's 10% cutoff.
Email Delivery Has to Be Throttled
Normally, when I have to perform a bulk operation in programming, I try to figure out how to execute that operation as fast as possible. Or, at least as fast as I can without overwhelming any upstream services (such as the database or a file system). With email delivery, more caution has to be taken. Because of the bounce-rate problem, and because of the delivery-reputation problem, emails have to be sent in moderation.
That said, "moderation" is a fuzzy science here. According to Postmark's best practices:
To help prevent ISPs, like Gmail, from thinking your emails are seen as spam we recommend sending your bulk Broadcast messages in smaller batches. We know, we know...this can seem a bit strange since isn't the point of sending in bulk mean sending many messages at once? Well, yes for sure but to a point.
Sending 200K - 300K messages out of the blue all at once is extreme for any sender and doing so can lead to delivery issues. Overwhelming an ISP with many incoming messages at once is a sure-fire way to see bounces, transient errors, and more.
For your first Broadcast send through Postmark we ask you to stick with batches of 20K messages per hour, for the first 12 hours to give receivers time to see engagement before ramping up.
After that we recommend breaking up any sends of 50,000+ recipients into small batches, spacing each out by 30 minutes or so. Taking this a step further, you can separate your sends into smaller segments of recipients based on recipient timezone or newest customers vs oldest customers.
My read from this - which has subsequently been validated with some Postmark customer support - is that when we send a massive amount of broadcast emails, we need to start with 20k/hour; and then, we can gradually ramp up to sending about 100k/hour, assuming that everything is looking healthy and our bounce rate is staying low. Which means, I needed a way to dynamically adjust the processing rate of my ColdFusion workflow.
After considering a few different approaches, I settled on a relatively simple technique using a ColdFusion scheduled task that would execute every 5 minutes. Each time this scheduled task executes, it would attempt to send N-number of emails. And, with the task executing every 5 minutes, I know that it will execute 12 times in one hour; which makes this relatively easy to math-out:
- 0k emails per task → 0k emails per hour.
- 1k emails per task → 12k emails per hour.
- 2k emails per task → 24k emails per hour.
- 3k emails per task → 36k emails per hour.
- 4k emails per task → 48k emails per hour.
- 5k emails per task → 60k emails per hour.
- 6k emails per task → 72k emails per hour.
- 7k emails per task → 84k emails per hour.
- 8k emails per task → 96k emails per hour.
To make this dynamic, I stored the sendLimit as meta-data on the scheduled task (we manage scheduled tasks using a database table). And, I created a small administrative UI in which I could update the sendLimit meta-data at any time. If I set the sendLimit to 0, the scheduled task execution becomes a no-op. And, if I set the sendLimit to 8000, the scheduled task sends 96k emails per hour - just shy of our 100k/hour ceiling.
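For illustration, a minimal sketch of those accessors, assuming a hypothetical scheduled_task table (our actual schema differs):
<cfscript>
    // Read the current send limit for the bulk emailer task. Reading this at
    // the top of every run makes throughput adjustable at any time without a
    // code deployment.
    function getSendLimit() {
        return queryExecute(
            "SELECT sendLimit FROM scheduled_task WHERE name = 'bulk_emailer'"
        ).sendLimit;
    }
    // Update the send limit (called from the small administrative UI).
    function setSendLimit( required numeric sendLimit ) {
        queryExecute(
            "UPDATE scheduled_task SET sendLimit = :sendLimit WHERE name = 'bulk_emailer'",
            { sendLimit: { value: arguments.sendLimit, cfsqltype: "cf_sql_integer" } }
        );
    }
</cfscript>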
Our timeline for sending looked (something) like this:
- Send 10 emails every execution for 20-mins.
- Send 100 emails every execution for 30-mins.
- Send 1,000 emails every execution for 30-mins.
- Send 2,000 emails every execution for 30-mins.
- Send 3,000 emails every execution for 30-mins.
- Send 4,000 emails every execution for 12-hours.
- Send 5,000 emails every execution for 2-hours.
- Send 6,000 emails every execution for 2-hours.
- Send 7,000 emails every execution for 2-hours.
- Send 8,000 emails every execution for rest of send.
We gradually ramped up the send in order to observe bounce rates (and react if necessary); and, to allow remote email systems time to absorb the increasing load, per Postmark's best-practices suggestions.
This ended up being a perfectly good solution. Very simple, very procedural, very easy to understand. All of the potentially complex mechanics around governing the throughput of the workflow were replaced with a series of brute-force intervals.
ColdFusion and SMTP Were Not the Limiting Factors
Because, of course they weren't!
ColdFusion gets a bad rap. And, the SMTP protocol gets a bad rap. But, I used both of these things to send 7 million emails and they both performed exceedingly well. In fact, even when my scheduled task was sending 8,000 emails at a time, our Lucee CFML task runner accomplished this in just over a minute.
And, this is without spooling enabled. Meaning, instead of using the CFMail tag to "fire and forget", the CFML thread must block-and-wait until the underlying email is delivered to Postmark (for subsequent processing).
If I were delivering these emails in series (ie, one after another), this would have taken a lot longer. But, ColdFusion has a variety of techniques for extremely simple parallel processing. In this case, I used the parallel array iteration feature with a maximum of 15 parallel threads (pseudo code):
<cfscript>
    // Get next block of email addresses to process during this execution of
    // the scheduled task.
    chunk = getNextChunkToSend( 8000 );
    // Process the block of emails in parallel.
    chunk.each(
        ( email ) => {
            emailer.sendEmail(
                email = email,
                async = false // Disable spooling.
            );
        },
        true, // Parallel iteration.
        15 // Max thread count.
    );
</cfscript>
This blazed through each block of emails leaving me with several minutes of wiggle room in between each scheduled task execution. ColdFusion is a power-house of an application runtime.
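One helper I've glossed over is getNextChunkToSend(). A minimal sketch, assuming hypothetical user and email_notification tables, and Lucee's columnData() member function: it simply pulls the next N addresses that don't yet have a notification record.
<cfscript>
    // Return the next block of email addresses that have not yet been
    // notified (table and column names here are illustrative).
    function getNextChunkToSend( required numeric sendLimit ) {
        return queryExecute(
            "
                SELECT u.email
                FROM user u
                LEFT OUTER JOIN email_notification n ON n.email = u.email
                WHERE n.email IS NULL
                LIMIT :sendLimit
            ",
            { sendLimit: { value: arguments.sendLimit, cfsqltype: "cf_sql_integer" } }
        ).columnData( "email" );
    }
</cfscript>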
Preparing for 24-Hours of Continuous Sending
In order to send 7 million emails within a reasonable time-frame, we had to send emails continuously throughout the day—all day and all night. But, of course, we had the threat of a 10% bounce rate looming over us. As such, I had to build an automatic kill-switch into the workflow.
The Postmark API is robust; and, among the many resources that it exposes, it provides a "statistics" end-point. This allows me to use a message stream ID and get back high-level stats about mail delivery on that message stream. This includes the bounce rate.
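A sketch of that pre-flight call, based on Postmark's outbound overview stats end-point (the postmarkServerToken variable is a placeholder):
<cfscript>
    // Ask Postmark for high-level delivery stats on the given message
    // stream and return just the bounce rate (as a percentage).
    function getPostmarkBounceRate() {
        cfhttp(
            method = "get",
            url = "https://api.postmarkapp.com/stats/outbound/overview",
            result = "local.apiResponse"
        ) {
            cfhttpparam( type = "header", name = "Accept", value = "application/json" );
            cfhttpparam( type = "header", name = "X-Postmark-Server-Token", value = postmarkServerToken );
            cfhttpparam( type = "url", name = "messagestream", value = "broadcast" );
        }
        return deserializeJson( local.apiResponse.fileContent ).BounceRate;
    }
</cfscript>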
At the top of each execution of the ColdFusion scheduled task, I added an API call to Postmark to read the bounce rate. And, if the bounce rate ever hit 5% (half of Postmark's threshold), the application would automatically set the sendLimit to 0, effectively halting the process (pseudo code):
<cfscript>
    sendLimit = getSendLimit();
    // If the send limit has been set to zero, there are no emails to process
    // during this task execution, short circuit the workflow.
    if ( ! sendLimit ) {
        return;
    }
    bounceRate = getPostmarkBounceRate();
    // If Postmark ever reports that the bounce rate on this message stream
    // is creeping up, override the send limit and short circuit the workflow.
    // This will effectively halt this and all future sends until we get a human
    // in the loop to evaluate next steps.
    if ( bounceRate >= 5 ) {
        setSendLimit( 0 );
        return;
    }
    // Get next block of email addresses to process during this execution of
    // the scheduled task.
    chunk = getNextChunkToSend( sendLimit );
    // ... rest of code ...
</cfscript>
With this pre-flight check in place, I could safely go to bed at night and not worry about waking up to a deactivated Postmark account.
Thankfully, our bounce rate never went above 4.9%; and, it ultimately settled at 4.2% once the email campaign was over.
Cross-Pollinate Suppression Lists When Planning a Send
Postmark is organized around "servers". And, each server is organized around "message streams": transactional (or outbound), inbound, and broadcast. Each one of these message streams tracks its own suppression list. A suppression is automatically created whenever an email bounces back or a recipient unsubscribes from the given message stream. If you try to send an email to an address that is currently being suppressed, Postmark will ignore your request.
Note: I am not sure if this "ignored request" is counted as part of the overall bounce rate of the message stream.
But, suppression lists are message stream specific. Meaning, if a given email is suppressed on the "transactional" stream, Postmark will still send an email to said address if you send the email on a different "broadcast" stream. Which means, any "hard bounce" suppressions in one stream don't get cross-pollinated with other streams.
I had to perform this cross-pollination programmatically. Thankfully, Postmark provides an API end-point that offers a bulk export of suppressions on a given message stream. I was able to use this API to compile a list of "hard bounce" suppressions from all of our message streams. I then used this compilation to further populate the email_validation_cache database table with more records.
Essentially, every email address with a "hard bounce" suppression was added to the email_validation_cache table with a textCode of suppressed. This way, as I was looping over the emails (in each scheduled task execution), these suppressed emails would get implicitly skipped—remember, I was only sending to emails with a textCode of valid (NeverBounce) or ok (Email List Verify).
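A sketch of that compilation step, assuming Postmark's suppression-dump end-point and reusing the cacheValidationResult() helper sketched earlier (the stream IDs are placeholders):
<cfscript>
    // For each of our message streams, pull down the full suppression list
    // and cache every hard-bounced address as "suppressed" so that the
    // main send loop will implicitly skip it.
    streamIds = [ "outbound", "broadcast" ]; // Placeholder stream IDs.
    for ( streamId in streamIds ) {
        cfhttp(
            method = "get",
            url = "https://api.postmarkapp.com/message-streams/#streamId#/suppressions/dump",
            result = "apiResponse"
        ) {
            cfhttpparam( type = "header", name = "Accept", value = "application/json" );
            cfhttpparam( type = "header", name = "X-Postmark-Server-Token", value = postmarkServerToken );
        }
        for ( suppression in deserializeJson( apiResponse.fileContent ).Suppressions ) {
            if ( suppression.SuppressionReason == "HardBounce" ) {
                cacheValidationResult( suppression.EmailAddress, "suppressed" );
            }
        }
    }
</cfscript>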
This cross-pollination of suppressions helped keep our overall bounce rate below the kill-switch threshold.
Plan for Failure and Be Idempotent
When processing data—especially a lot of data over a long period of time—it's critical to assume that failures will occur. This might be failures due to programming logic; or, it might be failures due to the fact that Amazon AWS suddenly doesn't want your pod to be running anymore. In any case, it's important that you plan to fail; and, more to the point, that you plan to recover.
To foster an idempotent email workflow, I recorded the delivery of each email; and, checked for an existing delivery prior to each send. This way, if the processing of a block of emails stopped half way and needed to be restarted, it would skip the first half of the block and start sending the second half (pseudo code):
<cfscript>
    // ... rest of code ...
    // Get next block of email addresses to process during this execution of
    // the scheduled task.
    chunk = getNextChunkToSend( sendLimit );
    // Process the block of emails in parallel.
    chunk.each(
        ( email ) => {
            // ... validate email code ...
            // If the given email address has already been notified, something
            // went wrong. Skip this email and move onto the next one.
            if ( hasExistingNotification( email ) ) {
                return;
            }
            // Record this notification for future idempotent check.
            recordNotification( email );
            emailer.sendEmail(
                email = email,
                async = false // Disable spooling.
            );
        },
        true, // Parallel iteration.
        15 // Max thread count.
    );
</cfscript>
Notice that at the top of each iteration, I'm checking to see if a notification record has already been recorded. And, if so, I short-circuit the processing of that email address. If not, I immediately record the new email.
The order of operations matters here. I have chosen to do this:
- Record notification.
- Send email.
But, I could have chosen to do this:
- Send email.
- Record notification.
The order of operations determines the idempotent characteristics of the workflow. Consider what happens in each case if the server suddenly crashes in between steps 1 and 2.
In my case, where I am recording the notification first, I am creating an "at most once" delivery mechanism. Which means, there is a possible failure case in which I never send an email to the given address.
If I were to record the notification second, I would have created an "at least once" delivery mechanism. Which means, there is a possible failure case in which I send the same email twice to the given address.
I chose to implement an "at most once" strategy because—at least in this particular case—I would rather err on the side of fewer emails. This isn't the "right" or "wrong" answer; it was just a judgement call on my part (and takes into account information that is not discussed in this blog post).
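For contrast, the "at least once" ordering would simply swap the two calls (sketch):
<cfscript>
    // "At least once": if the server crashes between these two lines, the
    // next run re-sends the email (a duplicate delivery). In the version
    // above, a crash between the two lines means the email is never sent.
    emailer.sendEmail( email = email, async = false );
    recordNotification( email );
</cfscript>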
In the end, looking through the error logs, it looks like the workflow never actually crashed or was interrupted in any way. So, this idempotent implementation was never actually put to the test.
Bringing It All Together
All of this makes sense to me in retrospect. But, it took weeks to put together; included a number of teammates all working in different services to compile and verify emails; entailed a lot of trial and error; and, benefited from Postmark's very helpful Support staff. And, in the end, the resultant workflow is—I think—as simple as it possibly can be given the constraints.
To bring it all together, here's the pseudo code for the one method that processes each ColdFusion scheduled task execution:
<cfscript>
    function sendNextBatchOfEmails() {
        var sendLimit = getSendLimit();
        // If the send limit has been set to zero, there are no emails to
        // process during this task execution (possibly due to a bounce rate
        // kill switch), short circuit the workflow.
        if ( ! sendLimit ) {
            return;
        }
        // If Postmark ever reports that the bounce rate on this message stream
        // is creeping up, override the send limit and short circuit the
        // workflow. This will effectively halt this send and all future sends
        // until we can get a human in the loop to evaluate next steps.
        if ( getPostmarkBounceRate() >= 5 ) {
            setSendLimit( 0 );
            return;
        }
        // Get next block of email addresses to process during this execution of
        // the scheduled task. PROCESS THE BLOCK OF EMAILS IN PARALLEL for best
        // performance - we have to fully process this entire block during the
        // 5-min window of this scheduled task.
        getNextChunkToSend( sendLimit ).each(
            ( email ) => {
                var textCode = getEmailValidationResult( email );
                // If this email address failed verification (via NeverBounce
                // or Email List Verify), skip it - we need to keep a low
                // bounce rate.
                if (
                    ( textCode != "ok" ) &&
                    ( textCode != "valid" )
                ) {
                    return;
                }
                // If the given email address has already been notified,
                // something went wrong (possible failure mid-execution). Skip
                // this email and move onto the next one.
                if ( hasExistingNotification( email ) ) {
                    return;
                }
                // Record this notification for future idempotent check.
                recordNotification( email );
                emailer.sendEmail(
                    email = email,
                    async = false // Disable spooling.
                );
            },
            true, // Parallel iteration.
            15 // Max thread count.
        );
    }
</cfscript>
This pseudo-code leaves out some application-specific details; like the fact that there's a "job" record for the overall notification workflow. But, the high-level concepts are all there; and, even in my internal version of the code, it looks and feels this simple—this top-down.
I didn't use any fancy technology. No message queues. No lambda functions. No array of "workers". No event-driven architecture. Just ColdFusion, a scheduled task, and a MySQL database to record validations and notifications. And, in 5 days this code sent close to 7 million emails without incident.
I'm actually quite proud of this endeavor. And, I'm thrilled that all of the old, boring technologies that I use were more than up to the task.
Reader Comments
Hi Ben!
Very nice work once again and thank you for sharing it all.
I would be curious to know if specialized platforms for sending emails would have agreed to have more than 7 million unfiltered contacts and how much it would have cost monthly for the subscription and/or sending the emails. That's an impressive quantity of emails!
@Tony,
Yeah, it's a very interesting question. Originally, they did look at using an external service (several actually); and, they were all going to be like tens-of-thousands of dollars to send this volume of email. Now, to be clear, by building it internally, it still cost money. Obviously, there's my wages and the opportunity cost to build something else during that time. But, it also cost money to use NeverBounce and EmailListVerify to pre-check the emails. But, those services were in the single-digit thousands, I think.
In the end, I think it was a net benefit in terms of cost. But, some of it is so hard to quantify.
We all have to deal with mass email notifications at some point... This is very good stuff Ben! Thank you for sharing your experience (as always). What did you use as your reply-to? I've always used my email address because I hate noreply@ based senders and prefer to let them reply and actually get back to me. But, I fear this has adversely affected the reputation of my email address and I've recently been rethinking the strategy.
@Chris,
Another great question - we decided on using the support@ email address. This is the same email that our normal support ticket process uses. And, I think that when you send to that email address it automatically generates a Zendesk ticket. As such, they have a bunch of logic and filtering in there somewhere to dismiss auto-responders and "Out of Office" replies. To be honest, I know very little about the Support system; but, I know that the Support team was OK to do it this way (or at least, that they would report back if things started to go sideways).

It was interesting to read how you handled this task, Ben. I've been sending 3 million emails per month for years (sometimes 200K in an hour) and the biggest problem we have is rare but devastating. It's when a major email provider or a big ISP suddenly flips and bounces our emails as spam. The only solution for us was to contact the ISPs and explain the nature of the emails (it wasn't marketing content, it was govt related) and thankfully they eventually whitelisted our external mail server's IP. It took a massive effort every time, but one by one we got there.
I coded our email send routine in the CF8 days and it's stood the test of time... with a few extra scheduled scripts to monitor the spool size and the spool service itself as precautions.
@Gary,
That sounds very impressive! We were extremely nervous to get even close to 100K an hour, let alone double that; and, as you saw in the post, we did that fairly incrementally. It's funny though - in our pre-send team huddle, when we were talking about what happens if we get blocked by an ISP, someone said, "We'll cross that bridge when we get to it."; and I was like, "To be clear, if we get there, there is no bridge to cross" 😂 They say that "hope is not a plan" ... well, I was definitely hoping that no one ended up blocking us 😨 But, it's good to know that there is a possible path forward even if that happens.
This was just the first send. We'll have to send this email a few more times (though, to fewer people each time as we filter-out hard-bounces and unsubscribes). I really do hope that we don't get screwed-over at some point.
As a follow-up post, I wanted to look at how I've improved my email validation cache logic to make sure that once an email address has been suppressed (ie, a user unsubscribed from the Postmark broadcast stream), their email isn't targeted again in the future:
www.bennadel.com/blog/4590-conditionally-updating-columns-when-using-on-duplicate-key-update-in-mysql.htm
Basically, I had to update my INSERT...ON DUPLICATE KEY UPDATE logic that populates my email_validation_cache table to conditionally update the textCode column under certain circumstances.

We're now in the middle of our 4th mass mailer at work, and I've continued to learn some lessons. This morning, I shared how I've been using the Postmark Bounces API to find previously failed emails and exclude them from subsequent mailings:
www.bennadel.com/blog/4636-paginating-the-postmark-bounces-api-in-coldfusion.htm
In our particular case, since we're sending out public service announcements - ie, not mission critical information - I'm erring on the side of aggressively removing emails from our internal list. Better safe than sorry (in terms of bounce rates).