Shared-Array Iteration Can Cause Thread Deadlocks In Lucee ColdFusion 5.2
Yesterday, as we migrated some ColdFusion code from the Adobe ColdFusion 10 engine over to the Lucee ColdFusion 5 engine, we saw a lot of JVM threads enter a BLOCKED state once the code was put under load. Upon investigation of the corresponding thread dumps, all of the BLOCKED threads appeared to be in a deadlock while trying to iterate over an empty Array. The root cause of the problem was the difference in the way in which each of the ColdFusion engines handles array-passing: by-value vs. by-reference. What was an isolated array in Adobe ColdFusion suddenly became a shared array in Lucee ColdFusion. And, as it turns out, attempting to iterate over a shared array reference in Lucee can cause deadlocks under load.
One of the big differences between Adobe ColdFusion and Lucee ColdFusion is the way in which Arrays are passed around. In Adobe ColdFusion, arrays are passed-by-value; meaning, a new shallow copy of the array data-structure is created whenever the array is assigned to a new variable or passed out of context. Conversely, in Lucee ColdFusion, arrays are passed-by-reference; meaning, only the pointer to the data-structure is copied whenever the array is assigned to a new variable or passed out of context. This means that what was not a shared memory space in Adobe ColdFusion can become a shared memory space in Lucee ColdFusion.
Generally, when dealing with shared memory space access and mutation, you need to take care to provide some sort of synchronization (ie, locking) around the data operations. For me, this lesson always becomes muddy when we start talking about read-only data. Meaning, a shared data structure that is read-from but never written-to. In such cases, I tend to believe that synchronization is unnecessary since no state is being changed. But, what I failed to understand in this particular case was that the very act of reading the data actually changed the internal - albeit encapsulated - state of the data-structure. And, it was this change in shared state that caused the deadlock.
To make this concrete, I had a ColdFusion component that kept an empty Array in the variables scope. This empty array was intended to provide some semantic documentation in subsequent method calls:
component {
public any function init() {
// This empty array is here to provide some semantic documentation to the URL
// generation calls. The very name of the variable will provide far more meaning
// than an in-line empty array definition.
variables.emptyQueryParameters = [];
}
// ---
// PUBLIC METHODS.
// ---
public string function getThisUrl() {
return( generateUrl( "/this", emptyQueryParameters ) );
}
public string function getThatUrl() {
return( generateUrl( "/that", emptyQueryParameters ) );
}
// ---
// PRIVATE METHODS.
// ---
private string function generateUrl(
required string resource,
required array queryParams
) {
// ...
for ( var queryParam in queryParams ) {
// ...
}
// ...
}
}
As you can see, the shared-value, "emptyQueryParameters", was an empty array that served no real purpose other than to define and fulfill a required argument for subsequent method invocations. In other words, it was there to make it clear to future developers why an empty-array was being passed with various method calls.
In Adobe ColdFusion, this worked perfectly well as the "emptyQueryParameters" variable was passed-by-value. However, in Lucee ColdFusion, this "emptyQueryParameters" variable became pass-by-reference. And, under load, it lead to stack-traces that looked like this:
"http-apr-8500-exec-245"
java.lang.Thread.State: BLOCKED
at java.util.Vector.size(Vector.java:318)
at lucee.runtime.type.util.ListIteratorImpl.hasNext(ListIteratorImpl.java:75)
at lucee.runtime.type.util.ListIteratorImpl.next(ListIteratorImpl.java:103)
at services.thinger_cfc$cf.udfCall3(/invision/services/Thinger.cfc:971)
at services.thinger_cfc$cf.udfCall(/invision/services/Thinger.cfc)
at lucee.runtime.type.UDFImpl.implementation(UDFImpl.java:107)
at lucee.runtime.type.UDFImpl._call(UDFImpl.java:357)
at lucee.runtime.type.UDFImpl.call(UDFImpl.java:226)
at lucee.runtime.type.scope.UndefinedImpl.call(UndefinedImpl.java:803)
at lucee.runtime.util.VariableUtilImpl.callFunctionWithoutNamedValues(VariableUtilImpl.java:756)
at lucee.runtime.PageContextImpl.getFunction(PageContextImpl.java:1718)
at services.thinger_cfc$cf.udfCall3(/invision/services/Thinger.cfc:1019)
at services.thinger_cfc$cf.udfCall(/invision/services/Thinger.cfc)
As you can see, the JVM thread is blocked on a call to java.util.Vector.size(). What happened was that the shared array was being consumed in a for-in loop. And, under the hood, the ColdFusion for-in loop needed to check the size of the array, which is really a Vector of some sort. And, in doing so, it ended up deadlocking on some internal state of the underlying Java data-structure.
Eventually, after 600-seconds, Lucee ColdFusion would interrupt these threads and kill the hanging requests:
request ... has run into a timeout (600 seconds) and has been stopped.
The point of this lesson is a reminder to always lock access to data in a shared memory space even if the data-access appears to be read-only. What is "read-only" on the surface may not be read-only under the hood. And, unless a data-structure is inherently synchronized (ie, synchronized internally to its implementation), synchronization needs to be explicitly handled by the developer.
I'm writing this post in part because I received no search results when I attempted to Google for this type of stack-trace. But, I'm also writing this post in part to drill it into my head that shared-memory access needs to be synchronized. Always. Hopefully, this helps anyone else who runs into the same problem.
Want to use code from this post? Check out the license.
Reader Comments
Just a comment,
Lucee != ColdFusion
as ColdFusion is a brand name for Adobe. So more correctly (IMHO, YMMV)Lucee CFML
or justLucee
@Mark,
Ahhh, gotcha. Sorry, I'm relatively new to the Lucee world. I had thought that "ColdFusion" was being referred to as the language, and "Adobe" and "Lucee" were the platform implementations.
And, for what it's worth, we're already seeing some improvement performance patterns in Lucee! Woot woot! :party-parrot:
@Ben,
Nice one! The by ref does free up a lot of memory if you are using lots of arrays and structures (as they aren't being copied every time you pass them around)
I need to check but I thought there was a setting for function arguments like
by-value
or we could have talked about it at Railo/Lucee but the benefits outweigh the usefulness.When I first was exposed to this I was confused as I had mis-understood ColdFusion and thought it WAS by reference and surprised I was wrong.
Hey Ben, which exact version of Lucee where you using?
there are a lot of Lucee 5.2 releases and it's always useful to know in which release you found the problem
@Mark,
I totally agree. I think the pass-by-reference makes complete sense for Arrays. It is in alignment with my other language of choice, JavaScript :D
@Zachary,
Good question. When I dump out the
server
scope, I get,5.2.9.31
. That said, to be clear, I am not calling this a bug in Lucee -- I am calling this a problem with my use of shared-state without properly synchronization. I've historically thought about "Read only" as safe to do without locking; but, this makes it clear that even such a mentality is ill-fated in some circumstances.I am in the process of migration of ColdFusion to Lucee. If you can write a blog on the your process of migration that would be helpful.
I am happy with overall performance of the Lucee on our staging environment.
@Vikrant,
Right now, our process is just "doing it slowly" :D We did the migration locally, in the development environments. So, all the engineers were working with Lucee locally and then deploying to Adobe ColdFusion.
Which, of course, caused some problems since some non-compatible issues were only discovered once we went to Staging or Production (depending on how complicated the issue was).
In production, we then have a "Bug Crowd" environment that runs in Lucee where people can get paid to find bugs.
And, in our "normal production" environment, we have the load-balancer setup so that Lucee only receives like 2% of incoming HTTP traffic. This way, we limit the scope of ways in which Lucee will affect users. That's the stage we're at right now - this 2% of traffic phase. And, we're slowly finding and fixing problems as they show up in the error logs or are reported by Support.
To be clear, all the production stuff is over my head - that's all managed by the SRE team who understands how Kubernetes works.
Now, keep in mind, this path is relevant for our type of app and size of our team and infrastructure. This could be way over-kill for smaller apps. That said, I think it was great that our local development was running on Lucee many weeks before we moved any Lucee functionality to production.
@Ben,
Thank you for your comment and sharing your experience.
Last week we fully migrated to Lucee on production environment. Luckily so far we do not have any issues regarding Lucee server.
There were some setting issues we had earlier but it was resolved. Also their latest PDF version have some formatting issues, so we are running older version.
Waiting for release of 5.3.
@Vikrant,
We're getting much closer on our end as well! We just start rolling out some more Lucee pods. We had to roll back for a while because we were running into some issues with Arrays being passed by reference, which is a "breaking change" from Adobe ColdFusion's pass by value. We had to go back and try to audit the code to find place where this divergence in behavior would have a negative impact. But, I think we good now -- fingers crossed!
Hi Ben, so you solved this by wrapping cflock around the code which accesses shared memory spaces?
@Michael,
In the end, I just re-defined the empty-array wherever it was needed. So, instead of using something like:
... where I was trying to re-use the empty array reference, I just used an inline empty array declaration:
This way, there was no more shared memory access. This got rid of the race condition. It was a bit of premature optimization, I suppose, to try and share the empty-array declaration. I doubt the inline usage had any negative impact.