27 August 2012

How to screw up a release

As I write this, the QA group is beta testing Firestorm 4.2.2.29837. This has been one hell of a long day.

Yesterday, we were all hopeful and happy that we'd gotten 4.2.1 the hell out the door. We'd been beating on it to get the pathfinding tools in, along with a bunch of crash fixes, translation updates, and a few very highly requested features (Ctrl-Shift-E for Edit Linked Parts being my primary original contribution, though I also ported the Flickr snapshot upload from Exodus). We QA'd it, we beat on it, everything looked good. We pushed all the needed changes to the various repositories, told the users to grab it and have fun, and went to bed.

I got up about 5:40 AM my time (US Central, GMT-5/SLT+2). When I sat down in front of the computer, I saw in the support chat that there was a problem. It turned out that 4.2.1 had a bug that made most swinging doors look like they were misbehaving (though they actually worked correctly). They appeared to swing normally, then jump ahead, swinging farther all at once.

You can guess how many swinging doors there are in SL. You'll probably guess low. No, I don't have a number, but it's gotta be astronomical.

The thing that annoyed me especially: The front doors on my own house showed the bug! Worse, I'd noticed the issue a week or so ago, and blown it off!

Fortunately, our ace support person and walking JIRA encyclopedia, Whirly Fizzle, had already found the changesets that caused the issue. I whipped up a quick build and saw that yes, backing them out did fix the problem. A bit more jockeying, and I had a recommended course of action: back out three changesets directly in the release branch of the repository, bump the version number to 4.2.2, and ship it.

Oh, if it were that simple.

First, the build servers were behind a fiber cut as a result of an automobile accident in Boston. That delayed spinning the new release builds.

Then, while we were waiting on that, we discovered another problem, with another patch: some spinning objects stuttered and didn't show correctly if they updated while spinning. This is the problem that the patches we backed out were supposed to fix. We found the changeset that caused that and backed it out, and it seemed to fix the problem, with no side effects.

But we couldn't be sure. The LL JIRA that that changeset was reported to fix, PATH-542, was (and still is) secret. So how the hell do we decide? Have we reached the end of the string, or are there nasty side effects of not fixing that one? Without knowing what the problem is, we can't make an intelligent decision on what to do with it.

We spent a large chunk of the afternoon trying to figure out what to do next. This time was completely wasted because of the JIRA being kept secret. Finally, about dinnertime, we got enough of a hint as to what the problem was that we were able to exercise it - and decide that not only was the original behavior not a bug, at least at the level of the Firestorm codebase (LL 3.3.3), it was actually the way things should behave.

So we declared it fixed and built release binaries. That's what QA's poking at now.

Jessica Lyon is not at all happy that we had to back out a release. I'm not either. Worse, I feel some responsibility for not saying anything.

Where did we screw up? To examine this, we need to detour for a moment into the world of fail that's been the LL pathfinding release. The pathfinding code has been rather epically broken at just about every step of the way. The problems ranged from broken physics to sitting on the ground failing in rather entertaining ways to the world and minimaps being mis-scaled to the toolset in the viewer being very, very unstable. (This is the reason that LL 3.4.0 is taking so long. It's really, really not pretty.)

We fought a lot of this while putting the pathfinding tools into Firestorm. We saw the effects of a whole bunch of these problems, to the point we got to thinking "Oh, something else broke? Must be pathfinding fail." That is exactly what I thought when I saw my front doors broken a week or so ago...and it cost us.

I'm not the only one. More than a few of the support folks and beta testers report the same thinking.

The lesson is obvious: Even - no, especially - when dealing with known LL fail, we need to investigate every problem we see. No matter how much it seems that it's just another LL screwup. Every problem. Period.

There's another lesson, and that's that LL's entirely too secretive when it comes to many bugs. Yes, I can see keeping details of LL's infrastructure secret, and it goes without saying that SECurity JIRAs need to be secret. There's simply no good reason for the others, though, especially once they've been fixed. The only reason is to keep TPV developers in the dark and make us reinvent wheels.

I hate reinventing wheels. If you're lucky, you end up with a pentagon.

So here we are; before I go to bed, 4.2.2 will be released, full of goodness. But a lot of us wasted a lot of time because nobody said anything about a bug many of us saw. That's gotta stop. It will stop.

24 comments:

  1. Well, very well written, Toy! And I agree with you, but I don't feel happy when I must uninstall and then install again. But in the end, we're people behind keyboards... yeah, does it help to say that? Probably not, but everyone learns from mistakes.

  2. Hugs - and keep up this refreshingly open attitude towards mistakes in your great work. I love this Firestorm :-)

  3. Perhaps that LL is not very proud of the relative "success" of its viewer, heavy, very average quality graphics and impractical to use ... just have a look at the number of SL not LL viewers : the majority.
    For me Firestorm is actually the best.

  4. I don’t think this will stop.

    When Linden Lab changed their TPV policy last winter*, I got the idea that they want to get rid of the TPVs altogether in the long run. The new Havok-for-Second-Life-only deal and what goes with it seem to point in the same direction.

    I expect Linden Lab to reduce the open-source part of Second Life little by little until working on TPVs will be a complete mess and very unattractive for volunteering open-source developers. It’s kind of a “if you can’t throw them out, wear them down”-strategy. This may look short-sighted, and maybe it is. But from a corporate standpoint it makes sense: be in control of your product, raise your profit. Albeit an observer from the outside might wonder if that will work as planned. Looking at Linden Lab’s viewer market-share shows that their responsiveness to user needs and wants seems too feeble to achieve that aim. For a long time the TPVs have been fondling Second Life’s user base — Linden Lab’s customers — for free. That made Linden Lab’s shortcomings less apparent.

    As soon as Linden Lab’s TPV policy might start to backfire, there’s a good chance that the TPV developers and the Second Life residents who used to be using their front ends will already be past caring. And possibly elsewhere.

    * http://phoenixviewer.blogspot.com/2012/02/new-additions-to-third-party-viewer.html (Read the follow-up posts there, too.)

  5. According to your explanation it seems the Abilene Paradox occurred. It is common and it happens. The Abilene Paradox is basically the ability of a team leader to make a poor decision and the rest of the team going along with it. This is extremely bad for a team working on a project because the project will simply become worse the longer it takes to correct the bad decision. A team should work as a team and speak up if a decision seems to be a bad decision.

    It happens, and your team did an excellent job fixing it. Pathfinding as released by LL has been nothing short of a complete mess. I am a warbug pilot and builder in Second Life. Warbugs are small planes used in dogfights and bombing runs. Pathfinding makes the warbug planes unusable if turned on and causes occasional errors when it is turned off. LL should have done a better job before releasing it.

    1. Combine the Abilene paradox* with the Dunning–Kruger effect**, and you’ve got a perfect mess. And a pretty good idea why there’s so many wrong people in the wrong positions.

      * https://en.wikipedia.org/wiki/Abilene_Paradox
      ** https://en.wikipedia.org/wiki/Dunning-Kruger_effect

    2. I'm not sure the Abilene Paradox strictly applies, because there was no one poor decision involved; rather, it was more a series of inactions on the same issue by multiple team members, separately. Still, your point is very valid.

      When we decided to back out 4.2.1 and spin a new version, I said "This is where we get to show how nimble we are." We did...we could be nimbler, but we're still more nimble than LL.

      The team deserves a lot of credit for pulling together and getting the problem fixed quickly.

  6. The effort is appreciated.
    Thank you.

  7. dont quit your day job

    1. Easy to sit back and be snarky. What have you done for free lately that benefits others?

  8. Regardless of the issues, Firestorm works... Reality is that irrespective of all the principles, paradoxes and complaints, if you aren't making mistakes, you aren't doing anything. The world would be a better place if more people used their mistakes as learning opportunities rather than giving up. Great work Tonya and team!!!!

  9. I didn't have 4.2.1 long enough to notice these issues before I got the prompt to get 4.2.2 ...That's what I call expedient repair.

    What is really pleasing is that you are transparent about the mistake and what caused the slip-up, rather than wrapping what happened up in secrecy. We can see something went wrong, why, that it was fixed, and how...which is more than can be said for...others, lol.

    Glad the Firestorm team is on top of things like this. When I see an update right after an update, I don't think "they must have goofed" so much as I think "Good, they are looking out for us".

    Thank you for all the hard work you have done and that is to come.

  10. I don't know much about the technicalities & I don't care for the politics & bitching. I am just grateful that you & every member of the team, including those poor guys & gals in the support group, do what you do & do it with such grace. There should be a huge banner over every page & in the support group that says "give us a break, we do this for free ya know!". I wouldn't still be in SL if you hadn't all made such a great job of Firestorm; the LL viewer is unusable to me. I thank you sincerely for working so hard & giving us a viewer that may have a hiccup along the way, but my goodness, you all worked hard to fix it so quickly! If only LL learned from their mistakes like you will, then SL would be an even better place. Thank you :o)

  11. It is refreshing to see this transparency, especially when it reveals such a thorough ethic, and drive to "get it right, no matter what." It's so sad that so many big-name-developers don't think this way.

    *cough* ..micro.. *cough* ..soft.. *cough* ..updates.. *cough*

    Sorry, just had to clear my throat.

    Good job!

  12. Man I am glad you guys are here bringing us the best viewer currently available. Not striving to reinvent the wheels is a sign of apathy. I think you guys have reinvented the wheel and I sure appreciate it.

    Errors happen when you are human, and since you are human in rl, I blame your parents.

    Thanks.

  13. I am a medical professional... my mistakes are terrible ... the problems of SL are just mild irritations ... nobody died ... some doors swung ... all is fine and you are great..!!!

  14. No complaining here... you are the best there is out there. That's important to remember.

    Thanks to you all for your hard work.

  15. What I am curious about is whether you intend to put the same lighting and shadow effects back into Firestorm that were present in the last beta. I really hated leaving that one, simply because the textures on many of my horses look like plastic without those settings.

  16. I'm clueless about all this pathfinding stuff. I know it's important to builders and such. But can you please fix the CRASHING! We won't be able to do anything if we can't stay ONLINE!
    I had to go back to the Phoenix Viewer, as Firestorm & the LL Viewer threw me off all the time.

    I just hope this works

  17. Not only the crashing, but I have also noticed that I now have to click twice or more for an item to rez or to wear it. What's with that? And the objects I have bought on MP just won't show :(

  18. I've got a larger note to add. A lot of people are still having huge issues with not being able to upload without an instant crash or Firestorm freezing and locking up their computer. I, for one, have had this issue since the 4.0 releases. I've been updating nonstop in hopes it'd be fixed, but I still have the same issues. I would just switch to another viewer, but Firestorm is the only one that is lag-free for me and properly displays the higher end of graphics.

  19. I think it would be interesting to QA SL. Can I take ownership of testing RLV features? Please?! I can see it now: "What do you do for a living?" "Oh, I have kinky sex online in pixels...and I get PAID for it too!"

    Seriously though, I can see how easy it would be to blame LL for issues found during testing when there are so many things that behave badly due to lag etc. Hats off to the QA team that catches all the bugs!

  20. Thanks for the kind words, folks.

    Those of you who are crashing: Have you done a truly clean install as the instructions on the Phoenix wiki tell you? If not, do that first. It really does help.

    And, Joseph, the lighting and shadows are entirely Linden Lab's doing. Do they show up correctly on their viewer?

    Cherry: No, you can't have ownership of testing RLV. That's my job. :3

  21. I think an important lesson here is that where something is really, truly broken, it will cascade, and you can't avoid that. Really, you have to manage the problem and accept that crap will happen until you get that fixed.

    Secondly, and I say this as someone who has repeatedly found myself in the same sort of situation regarding broken code: if the pathfinding code is that broken, and has been for this period of time, that means it is probably poorly thought out and LL needs help. You might think about trying to get people together who have the related expertise, and approaching LL about helping them fix the structure of this code.

    Code that is very badly broken often needs to be rethought, re-architected. And until this happens everyone will suffer.
