Story Point Estimates – Under-Estimating Large Items

While doing Scrum at Danube, we’ve long espoused the value of relative story point estimates over absolute, time-based estimates. We’ve written papers on why macro metrics are better than granular task-based estimates, given the uncertainty inherent at the task level. And we eat our own dog food: the ScrumWorks team uses relative estimation units (labeled “headaches”, for fun) to estimate stories/backlog items.

Recently, though, a new team member (we’ll call him “Ed”) was baffled by our estimation scale. At a recent retrospective he pointed out some major problems with the way the team seemed to be estimating backlog items.

The team had been using a scale that started with “1” for trivial changes, “2” through “6” for work manageable inside a single sprint, and an “8” for the biggest item we’d want to take on in a sprint without breaking it down further. The team used “10”, “12”, “16”, and so forth to estimate items that were typically too big for a single sprint and needed to be broken down. The team’s velocity was pretty consistently in the neighborhood of 40 headaches per sprint.

Ed noticed, though, that our scale seemed to be logarithmically correlated to the effort the team perceived would be necessary to complete the items. That is, a “4” was not four times the size of a “1”, and an “8” was not twice the size of a “4”. The team might be able to do 60 “1”s in a sprint, but only 10 “4”s. Likewise, the team could only handle three or four “8”s in a sprint.

This wasn’t an after-the-fact realization, either. The team knew at the time of estimation that a “4” was way larger than four “1”s. A frequent statement during sprint planning was, “Well, we could do eight more points of smaller stuff, but not a single item estimated at 8!”
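To make the observation concrete, here is a minimal back-of-the-envelope sketch in Python, using only the per-sprint counts quoted above (taking “three or four” 8s as 3.5, which is my approximation). It treats one sprint as one unit of team capacity and works out how many sprints of effort each estimate implies per point; on a truly proportional scale that per-point figure would be flat, but here it climbs with item size.

    # Illustrative only: implied effort per estimation point, assuming one sprint
    # equals one unit of team capacity. Item counts per sprint are from the post;
    # "three or four" 8s is approximated as 3.5.
    items_per_sprint = {1: 60, 4: 10, 8: 3.5}   # estimate -> items finishable in one sprint

    for points, count in items_per_sprint.items():
        effort = 1.0 / count            # sprints of effort for one such item
        per_point = effort / points     # sprints of effort per estimation point
        print(f"'{points}': {effort:.3f} sprints of effort, {per_point:.4f} per point")

    # Output (rounded):
    #   '1': 0.017 sprints of effort, 0.0167 per point
    #   '4': 0.100 sprints of effort, 0.0250 per point
    #   '8': 0.286 sprints of effort, 0.0357 per point

In other words, by the team’s own throughput a “4” behaved like roughly six “1”s and an “8” like seventeen, which is exactly the compression Ed was describing.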

Why does this matter? If the team produces a stable velocity each sprint, isn’t that enough to forecast accurately? Actually, it isn’t. The team was getting a stable velocity because we usually had a mixed bag of estimate sizes each sprint: there would be some 1s, some 2s, some 4s, etc. However, our product backlog was mostly composed of large items with bigger estimates (8s, 10s, 12s, etc.).

But this means the 8s, 10s, and 12s on the backlog are actually much larger than their estimates suggest. So while there might be 100 points outstanding on the release backlog, because those estimates are in effect “low”, our ability to forecast is thrown off course.
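As a hedged illustration (not our actual release data), suppose the remaining 100 points really were mostly items estimated at 8, and take the “three or four 8s per sprint” figure at face value. A stable mixed-bag velocity of 40 headaches per sprint then produces a forecast that is noticeably too optimistic:

    # Illustrative only: a stable velocity can still mislead the release forecast
    # when the remaining backlog skews toward large, under-estimated items.
    velocity_mixed = 40        # headaches per sprint with a mixed bag (from the post)
    backlog_points = 100       # outstanding release backlog (from the post)

    # Naive forecast: assume the historical velocity applies to whatever is left.
    print(backlog_points / velocity_mixed)        # 2.5 sprints

    # If what's left is mostly 8s and the team only finishes about 3.5 of them
    # per sprint, the effective velocity on this backlog is much lower.
    effective_velocity = 3.5 * 8                  # 28 headaches per sprint
    print(backlog_points / effective_velocity)    # ~3.6 sprints, over 40% longer

And the gap only grows if the 10s and 12s are compressed even more severely than the 8s.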

Once the truth of Ed’s observations sank in, we immediately took action to create a new estimation scale that was completely linear. The Fibonacci sequence is a popular scale, but we actually settled on powers of two. We didn’t want much granularity at the low end of the scale, nor at the high end, and powers of two helped us achieve that goal.
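For illustration only, here is roughly what that looks like in practice; the `snap` helper below is hypothetical (not part of ScrumWorks), just to show how a raw relative-effort guess maps onto the nearest allowed value on a powers-of-two scale.

    # Hypothetical sketch of a powers-of-two estimation scale (1, 2, 4, 8, 16, 32).
    SCALE = [1, 2, 4, 8, 16, 32]

    def snap(relative_effort: float) -> int:
        """Round a raw relative-effort guess to the nearest value on the scale."""
        return min(SCALE, key=lambda s: abs(s - relative_effort))

    print(snap(3))    # 2 (ties go to the smaller value)
    print(snap(11))   # 8
    print(snap(20))   # 16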

Now the team is re-learning to estimate in a linear way. Yes, it takes some practice and getting used to, but I’m happy that our ability to forecast has been greatly improved!

Victor Szalvay

Victor Szalvay currently leads product development for CollabNet’s ScrumWorks® product suite. In that capacity, he works closely with customers, stakeholders, and the development teams to deliver high business value each release cycle. With more than 150,000 active users worldwide, ScrumWorks is used by more than half of the Fortune 100 and boasts the largest market share of any Agile management tool.

Posted in Agile
3 comments on “Story Point Estimates – Under-Estimating Large Items”
  1. When you said linear scale, I thought you meant the gaps between the allowable numbers, not the atomic unit of measure. This is a good post; I just wanted you to know that the term might confuse people due to its alternate use.

    You triggered me to put up a thorough post about what makes a good story point scale including this clarification.

  2. Victor Szalvay says:

    The title shouldn’t be “linear scale”, I’ll fix it up. Glad you found it useful.

  3. Michael James says:

    I think the underlying reason is that the technical risk increases with larger effort items, and humans are psychologically uncomfortable calling this out.

    It’s more comfortable to go from 8 to 13 to 21 (the Fibonacci scale) than from 8 to 16 to 32 (powers of two) even though the latter will turn out to be more accurate.

    –mj
