Create A Running Average Without Storing Individual Values
I was thinking about how to best answer a new ColdFusion-based "Ask Ben" question about a rating system when I got to thinking about creating numeric averages. All my life, when creating an average, I have followed the simple formula of dividing the sum of a collection by the number of its entries:
Average = Sum / N
This, of course, requires you to have both the sum and the count of a collection of values. But what if we wanted to keep track of a set's average as the set of values changed over time (such as the ratings in a rating system)? Unless I was storing user-specific data with each individual rating, I wouldn't want to have to store each rating individually. So this got me thinking about weighted averages. I wondered: would it be possible to update an average simply by averaging a new value into it in a weighted manner? Meaning, given an average value based on N numbers, could I simply average in a new number by giving the current average a weight of N and the new number a weight of 1?
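In other words, the update I wanted to test looks like this (using the same notation as above, where N is the count of values behind the existing average):
NewAverage = ((Average * N) + NewValue) / (N + 1)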
I am sure that anyone with a decent background in math is reading this and saying, "D'uh," but to me, this question was not immediately obvious. As such, I ran some tests to see if this worked:
<!---
Create an array in which to hold our original set of
random numbers (for which we will be finding an average).
--->
<cfset randomNumbers = [] />
<!---
Now, let's create some random numbers to store in the array.
We're going to keep the count relatively small since a small
set will be influenced more by each new number.
--->
<cfloop
index="index"
from="1"
to="10"
step="1">
<!--- Add a new, random number to the collection. --->
<cfset arrayAppend(
randomNumbers,
randRange( 1, 10 )
) />
</cfloop>
<!---
At this point, we have our number collection. Let's figure
out the average of this collection, before we add anything
new. (NOTE: getting count for use later).
--->
<!--- Get the number of random numbers we created. --->
<cfset randomCount = arrayLen( randomNumbers ) />
<!--- Get the current average of the collection. --->
<cfset baseAverage = arrayAvg( randomNumbers ) />
<!---
Now, let's create a new random number that we want to use
to update our base average.
--->
<cfset newNumber = randRange( 1, 10 ) />
<!---
At this point, we have two options:
1. We can add the new number to the existing collection and
then take a new average.
2. We can take the existing average of the collection and
then combine it with the new number in a *weighted* fashion
such that we only need the average and the count.
--->
<!--- Method 1: Add number to the collection. --->
<cfset arrayAppend( randomNumbers, newNumber ) />
<cfoutput>
Method 1 Average: #arrayAvg( randomNumbers )#<br />
</cfoutput>
<!--- Method 2: Create weighted average based on count. --->
<cfset newAverage = (
((baseAverage * randomCount) + newNumber) /
(randomCount + 1)
) />
<cfoutput>
Method 2 Average: #newAverage#<br />
</cfoutput>
As you can see in the code, I am creating a collection of 10 random numbers and then an 11th random number. To figure out how the 11th number changes the average of the first 10, I try two different methods:
1. Simply adding the 11th number to the existing collection and taking a new average over all 11 values.
2. Multiplying the average of the first 10 by 10, adding the 11th number (creating a weighted sum), and then dividing by 11.
I ran this code a few times to make sure nothing happened by coincidence:
Method 1 Average: 6.27272727273
Method 2 Average: 6.27272727273

Method 1 Average: 6
Method 2 Average: 6

Method 1 Average: 5.27272727273
Method 2 Average: 5.27272727273

Method 1 Average: 3.81818181818
Method 2 Average: 3.81818181818

Method 1 Average: 6.45454545455
Method 2 Average: 6.45454545455
As you can see, after several runs, both average-augmentation methods come up with the same value each time. This is really awesome (and again, I'm sure obvious to the more Mathletic among you)! What this means is that if we need to grow an average over time, we don't actually need to store the individual values - we only need to store the set size and the set average at any given time. This should make my new "Ask Ben" answer a bit more straightforward!
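As a rough sketch, this kind of update could be wrapped up in a small ColdFusion function (the function and argument names here are just for illustration):

<cffunction
name="addToRunningAverage"
access="public"
returntype="numeric"
output="false"
hint="I fold a new value into an existing average using only the average and the count.">

<!--- Define arguments. --->
<cfargument name="currentAverage" type="numeric" required="true" />
<cfargument name="currentCount" type="numeric" required="true" />
<cfargument name="newValue" type="numeric" required="true" />

<!---
Weight the existing average by its count, add the new
value, and then divide by the new count.
--->
<cfreturn (
((arguments.currentAverage * arguments.currentCount) + arguments.newValue) /
(arguments.currentCount + 1)
) />

</cffunction>

With something like this, the only two values we would ever have to persist are the current count and the current average.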
Reader Comments
Most people are average.
Some are more weighted than others though.
What you're looking for is called an exponential moving average.
You can easily prove this by applying some calculus that even I can understand: ;)
Given
a) M = Sum / N;
If you add the value X to the collection, then
b) Sum = Sum + X;
c) N = N + 1
now fill in b) and c) into a) :
M = (Sum + X) / (N + 1)
But it is, of course, always good to check these things with some concrete implementation, to see if you really understand what is going on.
Whether or not it's obvious, I think it's good to post things like this. There are a surprising number of little math tricks that are handy to know, and you can save someone who doesn't know one a lot of time by sharing it.
This one I knew ... math is my strong point, plus with a gaming/sports background, I got into all kinds of this stuff growing up (batting average/winning percentage, ERA, etc.), and then working for a market research company, we got into more little tricks like flipping a scale: subtract a response from the sum of the top and bottom values to get the opposite value. (answers 1 to 5, a 2 becomes a 4: (5+1)-2)
Mmmm... looks like I missed a part in my above comment.
if
a) M = Sum / N; then
d) Sum = M * N; and
e) N = Sum / M
As I mentioned when you add a number X to the collection, then
b) Sum = Sum + X; and
c) N = N + 1
By inserting d) into b) you get
f) Sum = (M * N) + X
If you insert f) and c) into a) you get
g) M = ((M * N) + X) / (N + 1)
This is how I always want to implement average rating kind of features (with just two fields) but I usually have the requirement for ensuring a user can only add a rating once, hence resorting to the usual "store each value with a userID in an intermediate table" approach :P
@Phillip,
I like to think of myself as average, but only in a minority of situations :)
@Jim,
Ah, good to know this has a name.
@Martijn,
I'm just going to nod and say, Yes.
@Dave,
Most agreed. Plus, when we are solving problems, I think we often don't concentrate on every aspect in which we might find efficiencies. All to say, it's good to think about things that might seem obvious, simply because they might be overlooked as good solutions in different contexts.
@Justin,
Ahhh, good point! I lost sight of that as I was thinking about weighted averages, but you are absolutely, 100% correct.
@Dave Totally agree that tricks like this one by Ben and the one you just mentioned are incredibly handy sometimes. Do you know of any reference page where they collect things like these?
@Martijn, I don't know about a page like that ... but then I'm pretty good at setting them up myself, so I don't usually look. (If it's for work, I might look for a CFC, but at home I typically want to do it myself.)
It seems like there should be a language-specific collection somewhere, though.
Just thought I'd add a bit to the discussion :)
The only caveat with this approach is that it
becomes less accurate over time because you're storing the results with limited decimal precision. It's especially subject to variation if you use floats. Don't get me wrong - this approach is completely fine for calculations that don't require pinpoint precision, but maybe not for things that require accurate fractional values beyond a couple of decimal places, like financial or scientific number crunching.
With floats, the margin of error increases fairly quickly, since every single time you add a number into the average, the lack of precision in the float data type gets factored into your calculation as well. With decimals, the level at which your precision starts to suffer is dictated by the number of decimal places you're using. For example, if you're only storing 2 decimal places, the precision starts to get bad after a measly 100 values.
Again though, whether or not these margins of error are OK depends entirely on what kind of software you're writing.
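If you want to get a feel for the drift, a quick sketch along these lines (the loop size and value range are arbitrary) compares the running-average update against an exact sum and count:

<!--- Track the average two ways: incrementally, and from an exact running sum. --->
<cfset runningAverage = 0 />
<cfset runningCount = 0 />
<cfset exactSum = 0 />

<cfloop index="i" from="1" to="100000" step="1">

<!--- Use whatever values your application would actually see. --->
<cfset value = randRange( 1, 10 ) />

<!--- Incremental (weighted) update of the average. --->
<cfset runningAverage = ((runningAverage * runningCount) + value) / (runningCount + 1) />
<cfset runningCount = runningCount + 1 />

<!--- Exact sum kept for comparison. --->
<cfset exactSum = exactSum + value />

</cfloop>

<cfoutput>
Drift: #abs( runningAverage - (exactSum / runningCount) )#<br />
</cfoutput>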
Nice dodge! I dig the idea to force out the last bit of performance out of a chunk of code, even though it's such a minor thing.
Heard of this kinda approach in connection with "running sums".
@Roland Collins Yet, this approach still "looks" quite numerically stable (see the test runs). I think the use of floating point arithmetic in general is the caveat, rather than this particular approach.
@Martin - not just floating point though. Depending on what language you're working in, decimals can cause just as many headaches if they're not precise enough.
But again, for most applications, this approach works great. I'm just pointing out that there are several classes of problems for which the margin of error in this calculation is unacceptable.
If I wanted to do this and only carry two numbers, I'd keep track of the sum and N. Then you are pretty much accurate all the time.
average = (sum + new_number) / (N + 1)
But all this was in a former life as an analyst at DIA.
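A rough ColdFusion sketch of that two-field approach (the variable names and seed values here are just for illustration; in practice the sum and count would be loaded from, and saved back to, storage):

<!--- The only two values we persist: the running sum and the count. --->
<cfset ratingSum = 42 />
<cfset ratingCount = 10 />

<!--- Fold in a new rating by updating the sum and the count. --->
<cfset newRating = randRange( 1, 10 ) />
<cfset ratingSum = ratingSum + newRating />
<cfset ratingCount = ratingCount + 1 />

<!--- The average is always just sum / count. --->
<cfset ratingAverage = ratingSum / ratingCount />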
@Ben, I think you're going about this the wrong way. You're trying to use complicated techniques when there is a simple and beautiful technique readily available (a la Gary Funk's comment).
Instead of tracking the current average and the current count, and from it reverse-engineering the sum, you should keep track of the current sum and the current count, and straightforwardly calculate the average.
In the question of how to calculate new averages from old averages, obviously the difference in complexity is minimal. But reverse-engineering the old input from the old result in order to compute a new result can get very complicated very quickly, when the computations get even just a little harder. So why not make it a practice to always keep the flow of computation going from input to output?
@Martijn van der Woud
Exactly... this is probably one of the first questions they asked us when we very first started programming. And that's the solution we came up with. Quick and simple.
@Roland,
Good point regarding the precision over time. I hadn't thought of that as float precision has not been something that has bitten me before.
@Gary,
Oh man, that makes a lot of sense. That never even occurred to me, but you are absolutely correct - that is much better than my approach.
@Justice,
Yeah, Gary's idea is really good. But as far as reverse engineering goes, I wouldn't say that I am doing that - there would be no way to reverse engineer the values; I'm simply creating a new, weighted average.
::: WARNING - TAKE CAUTION IN READING THIS REPLY IT MIGHT MAKE YOUR HEAD HURT - OR NOT :::
Something to consider. Assume you have a website where you did in fact want to see how each of your users voted on a certain poll or question but still want to be able to calculate the averages. It is really simple when you see it in action, and once you start playing around with the idea the sky is the limit...
Okay, assume this is your database table, ratings or whatever:
POLLS, RATINGS or WHATEVER
ID | RATING | VOTERS
::::::::::::::::::::::::::::::::::::::::
1 | 4,3,2,5,1,3,4,3 | 1,8,2,4,5,3,7,9
----------------------------------------
Now of course you would query your database to get the voters and their ratings.
Once you have retrieved the data you can store the data into variables of your choosing.
For Example
<cfset myRatings = "#queryName.ratings#">
<cfset myVoters = "#queryName.voters#">
Okay Now to Continue...
Okay, so from this, how do we count how many votes there were?
Well, ColdFusion has a beautiful function called listLen; it will tell you how long your list is (the number of voters), which you will use later on to calculate your average.
<cfset myRatings_len = listLen(myRatings) >
After you have your list length, you can do a list loop in which you add each value together. Once you have each value totaled, you can then do your math by dividing your total by the length of the list.
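For example, a rough sketch of that loop (using the variables from above) might look something like this:

<!--- Total up the ratings by looping over the list. --->
<cfset myRatings_total = 0 />

<cfloop index="thisRating" list="#myRatings#">
<cfset myRatings_total = myRatings_total + thisRating />
</cfloop>

<!--- Divide the total by the list length to get the average. --->
<cfset myAverage = myRatings_total / myRatings_len />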
Now every time you have a new voter you can just grab the list and then do a listPrepend which would add the voters to the front of the list allowing you to see who voted, and in which order. You could even have another field that has the time of the vote, or ip to help prevent double voting.
If someone wants to see a working example, just let me know and I will have the source code ready and whatnot.
* To learn more about list and other fun stuff go to
http://www.quackit.com/coldfusion/tutorial/coldfusion_lists.cfm
@Jody: From a ColdFusion perspective, that's a reasonable approach to storing both values and users, but it's probably more useful for smaller datasets. With larger datasets, you may run into problems storing the lists, depending on the database you're using.
From a database perspective, if you want to keep the individual values, it is probably better practice to store the records separately rather than combining them into a single record. By doing so, you're able to use the native database functions for mean (average) and counts, as well as for more advanced metrics if you have a need for them. It is also considerably easier to modify the dataset or to store additional information if necessary when each vote is its own row in the table.
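For example, with one row per vote (the table, column, and variable names here are hypothetical), the query might look something like this:

<!--- Hypothetical "vote" table with one row per vote; pollID is the id of the poll being rated. --->
<cfquery name="voteStats" datasource="myDatasource">
	SELECT
		COUNT( * ) AS voteCount,
		AVG( rating ) AS averageRating
	FROM
		vote
	WHERE
		poll_id = <cfqueryparam value="#pollID#" cfsqltype="cf_sql_integer" />
</cfquery>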
Also, just as an FYI, you don't need to wrap your variables in quotes and pound signs. This will work just fine:
<cfset myRatings = queryName.ratings />
<cfset myVoters = queryName.voters />
In fact, when you do wrap them, you probably add overhead in this case due to string conversions.
@Roland Collins
It is safer to wrap string variables in quotes and pound signs. Integers I could see but never strings.
Sorry to disagree with you, but that's just not true. It doesn't do anything except cause extra string conversions. It doesn't add any "protection". In fact, the docs recommend against the use of unnecessary pound signs.
http://livedocs.adobe.com/coldfusion/6/Developing_ColdFusion_MX_Applications_with_CFML/Expressions3.htm
It's okay to disagree; it's human nature. But, I don't think so. I do believe it is better to stay consistent. Can you please give me one example where storing a variable in that manner ruined the application, or slowed it down? I bet you can't... I can't think of one time it gave me an error.
"However, surrounding all attribute values in quotation marks is more consistent with HTML coding style." <--- quote from adobe site...
So my logic for doing so is correct.
But it does say it's not necessary to use the pound signs... (which I already know). It said it wasn't necessary, not that you shouldn't do it.
@Jody.
I am sure that a bad technique for storing data will not give you an error if you also write corresponding code for storing and retrieving the data. But it does make many things far more difficult. For example, you can no longer ask the database to perform averages or standard deviations on the data (you will want this if you are tracking votes). It is far more difficult to add a column to the votes. In fact, what you are doing is inventing your own naive storage mechanism based on the database system, which means now you need to implement new techniques or conventions for all of the basic operations on data. The alternative, of course, is simply to create and use a table of votes, with one row per vote.
It is not better to stay consistent for the purpose of staying consistent, when that means you start doing the wrong things or you start doing things that make no sense. In this case, of course, that is string-interpolating expressions when there is absolutely no need to. Doing so is wrong in general. It happens to work sometimes because all simple data types - booleans, numerics, strings, dates and times - are stored in ColdFusion as strings anyway (if they were not stored as strings, string-interpolated numeric expressions would get you a string representation of a numeric when what you wanted was a numeric). But it will not work with complex data types - arrays, structs, queries, components, Java objects, COM objects, .NET objects, etc.
You reduce the purpose of databases to absurdity by storing lists in a single field. As Dave already stressed, it's better (for performance) to "roll out" the lists onto a sensible schema. Database management systems are super fast. Apart from totaling and maximization, it's possible to use regular expressions in the queries. It's always faster than iterating over the data "by hand".
How is it a bad technique? What if I don't want to use the database functionality? What if I think the database functionality blows? What if I want to reduce the number of rows of data I have? Are you saying that is a bad method? I have written some pretty robust systems (in PHP) for a few companies using this same technique. Not only did I increase performance, but the maintenance went down tremendously.
If you don't want to use the database functionality, then don't use a database ;)
I can't comment on what you've done in the past, and if your solution met the needs of the problem you were trying to solve, then that's great. I wouldn't ever presume that the work you did wasn't solid. As long as you're meeting the needs of your client or company, then more power to you.
We're just trying to point out that certain aspects of the way you solved the problem are not necessarily the best approach under usual circumstances. Without having seen the system that you were working on, it's really hard to provide any valid feedback. All we're saying is that, in general, the approach you have outlined doesn't really adhere to best practices. Again - this is not a condemnation of your work. We're merely trying to point out that, in general, based on what little we know of what you've worked on, the approach you took would not be the preferred course.
@Roland
I generally only like to use databases to store and retrieve information. I do not like to use the database to do my math for me. Why? I have more precision if I do it on my own. Also, to call my approach "naive" is a little bit rude. I can clearly tell that you are a very negative person who always has a complaint, even from your first post on this particular blog. Criticism is okay, but to ridicule an approach, especially when the approach works, is just mean -- yeah I said it, MEAN! -- (ha ha). You are going to have to remember something, Roland, and that is that there is always more than one way to do a job. I know many programmers who would tell you that using your own script, instead of using the database to calculate totals, is far more effective and precise. But hey, what do they know; they only work for a few of the top 500 companies in America. So bottom line, don't say anything if the approach works. Everyone is different; some people are confined to their little box, while others are outside the box looking for more effective ways to optimize and store, calculate and manipulate data. You can think of it like a car drive: surely if you take the highway you will get there easier, but if you take the back roads you generally would end up getting there faster. -- If you don't know what that means, you should take Philosophy.
Well enough said, I have work to do.
Jody Fitzpatrick
@Jody, @Roland
I think you both made your points clear. Now would be a good time to stop your catfight. I am getting bored...
Martijn
@Martijn,
Agreed. I think we all love lists and we all love databases... so, let's just move on.
Sorry Ben - last post, I promise.
Jody, at no point did I call your solution naive, nor did I ridicule you. In fact, I clearly stated that if you're getting the job done, then that's great.
I'm sorry if you took offense to anything. My only intent was to suggest that there are other ways of tackling the problem. Again, my apologies if I offended you.
@Roland,
No worries my man - all conversation is good conversation.
It's all good, I took no offense... It's the internet. If I took everything to heart that I read on the internet, I would be depressed forever.
@Jody,
Word up to that! The internet gives you a thick skin :)
Nice, but the need to continually increment the sample count will (or may) cause counter overflows etc., which is not good in spaceships etc. ... I started with your method, however I arrived at this one ....
loop: sum for n samples--do this for n samples
store sum in sum1
average = sum1/n
sum1 = sum1/2 --initialise sum1 to half sum1
goto loop --keep going forever
The disadvantage is that this method only updates the average every n samples.
Correction ... sum for n/2 samples, where n is the number of samples to be averaged.
loop: sum for n/2 samples --do this for n/2 samples
store sum in sum1
average = sum1/n
sum1 = sum1/2 --initialise sum1 to half sum1
goto loop --keep going forever
If you are afraid to overflow Sum, here's an equivalent formula:
M_0 = 0
M_{i+1} = M_i + (x_{i+1} - M_i)/(i+1)
In pseudo-code:
n = M = 0
...
n = n + 1
M = M + (x-M)/n
Here M is the (cumulative moving) average, x is the new value in the
sequence, n is the count of values.
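In ColdFusion terms, a sketch of that update might look like this (the loop size and value range here are arbitrary):

<!--- M is the running (cumulative moving) average; n is how many values it covers so far. --->
<cfset n = 0 />
<cfset M = 0 />

<cfloop index="i" from="1" to="10" step="1">

<!--- x is the new value in the sequence. --->
<cfset x = randRange( 1, 10 ) />

<cfset n = n + 1 />
<cfset M = M + ((x - M) / n) />

</cfloop>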
The downside of this method is that when averaging a huge sequence of values you may lose precision over time, because each time you get an imprecise value for (the floating-point) M.