Author Topic: Epic Bughunt Claims Sanity, Human Blood  (Read 21784 times)

0 Members and 1 Guest are viewing this topic.

Offline General Battuta

  • Poe's Law In Action
  • 214
  • i wonder when my postcount will exceed my iq
Epic Bughunt Claims Sanity, Human Blood
About fifty hours ago we were in the process of wrapping up another mission. This was a pretty FRED-intensive mission using some relatively complex stuff: variables, set-object-positions using relative coordinates, skybox changes and the like.

I was pretty pleased with the outcome. We'd iterated this mission a lot, and this was the third revision of the concept. Finally, after days of work stretched across months of real time, it felt like it had worked out pretty well.

I made a few changes and ran through the mission for the Nth time to make sure everything was working.

About halfway through the mission, right when a critical event fired, the world was devoured. The ships, the skybox, even the player's own vessel all vanished, and the screen was filled with inky blackness. All that was left was the HUD and the sun, and even those weren't working right: you could still target objects, but they were (like everything else) invisible, and the distance and speed monitors were both blank. Even the ambient hum of the engines was gone. The 3D radar, meanwhile, had migrated up into the spot where head .anis are usually displayed.

The game's rendering engine seemed like it had just quit. Given up the ghost. All that was left were the 2D elements. It was as if Ransom's Transcendent had decided to pay a visit and eat reality.

I stared at this in bemusement and horror, posted a brief note on IRC, and restarted the mission.

Same thing. Same point in the mission.

I looked at the event in question. There was a lot of stuff going on right then: skybox changes, coordinate changes, coordinates pulled from variables, engine subsystem repairs, play-sound-from-files, fadeouts. It was a nightmare. Any one of those elements (or a combination thereof) could hypothetically be pushing the engine too far.

But it wasn't worth panicking until it could be reproduced. I posted the mission on SVN and, sure enough, The_E and others got the same result too. But they didn't get it every time. I kept changing things in the mission and testing, and sometimes I'd see it work perfectly, and I'd commit it with a happy note like IT'S FIXED...and then The_E would try the 'fixed' version and hit the same bug.

Now, for a bit of background. We'd met this bug before. Previously, we'd always encountered it in relation to a bad asset: first the Vishnan Keeper, then the Anemoi. In both cases we figured we'd solved it by cleaning up the .pof.

Yet none of the assets in this mission were bad, so far as we could tell. For some reason, the mission was eating itself at this point.

First we tried debug logs. They were useless. That must have eaten up an hour or so.

Next, The_E built a debug filter to get additional information. While he was doing that, I started systematically pulling newer SEXPS away from the event to see what could be causing the quit. Surely the coordinate manipulation or skybox change could explain it. Or perhaps it was the every-time conditional used in a linked event; every-time is notoriously hazardous.

Nothing helped. I pared the event down to two SEXPs: an engine subsystem repair and a hud re-enable. Since the engine repair was a retail SEXP, and the hud problem wasn't causing it (I could move it around with no effect), we were right back where we started.

Now bear in mind I'd been working on the third iteration of this mission for a great many hours already. I was sick, I was exhausted, and somehow I'd managed to break a glass during the trial-and-error debugging process, cut myself, and start bleeding everywhere.

The_E reported in. He'd been backtracking a stream of errors, starting in the sound code and moving all the way back to what appeared to be the original cause. He figured out the cause of the problem, but not the trigger. Somehow the engine was losing track of the player's position, converting a valid vector into NaN (not a number.) There was no error checking here, so as I understand it, the error propagated through the whole engine, smashing things apart like a Shivan hitchhiker on a Thoth.

But he couldn't figure out what caused the NaN error in the first place. We knew it always happened at the time this event fired, but now that it had been stripped down to fundamentally 'safe' levels, how could it cause the problem?

Blinded by RAGE at this bizarre bug crippling my brilliant mission, I started swapping out all the custom assets for MediaVPs ones.

And the bug went away.

I boggled. We had been certain it wasn't caused by bad assets this time. But every time I tried the mission (time-compressing through to the point of the bug) with the Perseus swapped in for the optimized Uhlan, the bug didn't seem to happen.

Except for The_E. He was able to reproduce it with the HTL Perseus. Which meant it couldn't be the Uhlan either.

At this point we were all taking this bug very personally. The_E started doing something very technical which he could probably explain better, but which seemed to involve running the engine forward frame-by-frame, looking for the moment when everything went horribly wrong in the player eyepoint vector. This was an insanely laborious process that, to the best of my knowledge, ate up hours of his time.

Take a moment to think about the timeline here. By this point, more than 24 hours had passed since we'd first posted about this bug on an SCP internal forum and Mantised a test mission reproducing it. Nobody had a bloody clue what was going on.

Finally, almost 48 hours (if I remember right) after first encountering it, I did something nonsensical but ultimately productive: I tried deleting the repair-engine SEXP from the event. This was a retail SEXP, so it should have been safe.

The bug went away. The mission kept on happily rendering. There was no inky void of all-consuming night. Yet there was no way that SEXP could have caused it, because sometimes - maddeningly - the mission worked fine even with it in place.

Truly this was the bug from hell.

Mere minutes later, working independently, IssMneur on #SCP struck gold. "Hey," he said (I paraphrase here.) "Have you been matching speed with something during the mission?"

We all ran through the mission, matching speed with a certain ship. The rendering engine quit when the engine-repair SEXP fired.

We ran through again without matching speed. The mission worked fine.

Matching speed was the source of the problem. When I'd swapped in the Perseus for the Uhlan earlier, I'd simply time-compressed to the failure point without bothering to 'play' the mission, and thus never matched speed. That's why the bug had gone away there.

After some frenzied diagnosis, IssMneur and The_E explained what had happened. Apparently, if the player was matching speed with a target ship at the time that they were disabled, but then later repaired, the game engine would divide zero by zero at the moment of the repair and produce a nonsensical result which worked its way into the player's position. The rendering engine would fumble, choke, and stop working.

As far as I know this could have happened in retail.

They built a patch for the issue and promptly generated a new build that fixed it. I cleaned the blood spatter off my keyboard.

Then the mission started crashing for a totally different reason. But that's a story for another time, and now it's fixed too.

The_E may be able to provide more insight into the hilariously random and difficult-to-reproduce technical side of the bug. But that's my part of the story.
« Last Edit: April 16, 2010, 12:55:37 pm by General Battuta »

 

Offline Droid803

  • Trusted poster of legit stuff
  • 213
  • /人 ◕ ‿‿ ◕ 人\ Do you want to be a Magical Girl?
    • Skype
    • Steam
Re: Epic Bughunt Claims Sanity, Human Blood
The lesson:

Don't use match speed.
Use the Z key like a real man! :P

(I know I'm missing the point)
(´・ω・`)
=============================================================

  

Offline Spoon

  • 212
  • ヾ(´︶`♡)ノ
Re: Epic Bughunt Claims Sanity, Human Blood
Oh wow, just wow
Hilarious but at the same time I can completely understand the frustration this must have caused.

I encountered this black void of doom only once my self. I had just build a new ship and was going to test it out ingame, the ship warped in... and then vanished. Then I moved my own ship forward and everything went into a blackhole. That scared the living **** out of me but then it never showed up again.
Urutorahappī!!

[02:42] <@Axem> spoon somethings wrong
[02:42] <@Axem> critically wrong
[02:42] <@Axem> im happy with these missions now
[02:44] <@Axem> well
[02:44] <@Axem> with 2 of them

 

Offline Sushi

  • Art Critic
  • 211
Re: Epic Bughunt Claims Sanity, Human Blood
The real lesson:

Don't assume that retail stuff works right. :)


Also, major kudos to those of you who fought and defeated this behemoth.

 

Offline MatthTheGeek

  • Captain Obvious
  • 212
  • Frenchie McFrenchface
Re: Epic Bughunt Claims Sanity, Human Blood
I remember I also managed to reach the black void of doom while playing around with Fury's AI, when I made a type which assigned a ship to an AI class that didn't exist :/

Ah, the delights of modding... :D
People are stupid, therefore anything popular is at best suspicious.

Homeworld: Blue Planet     -     Help us help you     -     FSO Installer     -     BP extras portal     -     My legacy

666maslo666: Releasing a finished product is not a good thing! It is a modern fad.

SpardaSon21: it seems like you exist in a permanent state of half-joking misanthropy

Axem: when you put it like that, i sound like an insane person

bigchunk1: it's not retarded it's american!
bigchunk1: ...

batwota: steele's maneuvering for the coup de gras
MatthTheGeek: you mispelled grâce
Awaesaar: grace
batwota: oh right :P
Darius: ah!
Darius: yes, i like that
MatthTheGeek: the way you just spelled it it means fat
Awaesaar: +accent I forgot how to keyboard
MatthTheGeek: or grease
Darius: the killing fat!
Axem: jabba does the coup de gras
MatthTheGeek: XD
Axem: bring me solo and a cookie

 

Offline The E

  • He's Ebeneezer Goode
  • Global Moderator
  • 213
  • Nothing personal, just tech support.
    • Steam
    • Twitter
Re: Epic Bughunt Claims Sanity, Human Blood
The Bug From Hell, or: Is it 6AM again?

Here I am, explaining the technical bits.

And bringing you a shot of the bug in action:


The first inkling I had that something was wrong was a small Assertion while running the buggy mission. Now, usually, Assertions are good things to have when debugging, it means that a) there's a clear starting point, and b) another developer had the foresight to check his data and provide a warning in case it was bogus.
So, with the assertion in hand, I dug in. For those of you interested, the assertion was "im > 0.0f" from vecmat.cpp, line 543. This info, by itself, is useless, since the function that the game is processing at that time (vm_vec_normalize()) is called from literally dozens of places.
However, since I happen to have Visual Studio (Microsoft's developer environment), I was able to perform what amounts to a live dissection of the crashed program, yielding the ever-important callstack. This told me the state of the engine at the time of the crash.

What I saw, however, was extremely puzzling. The callstack pointed towards the function that is responsible for generating the three-dimensional sound environment1 , which needs the current camera position as a starting point.
In this case, the global variable that stores the camera's position was apparently corrupted. This is something that should never, ever happen, and I was astonished to see that this was the first visible error message thrown.

So, I did what a developer debugging a large codebase does in these cases. I sprinkled breakpoints2 everywhere I thought they were needed, and ran the mission again.
And this is basically what I did the next few hours. Run the game, inspect the data at the breakpoints, reset breakpoints, run game again, lather, rinse, repeat.
Gradually, through trial and error, it was revealed that we were dealing with a cascading failure.
One that was literally happening in the tiny space between two frames.
As far as I could determine, what happened was that at some point, the player's throttle setting got invalid. When this invalid throttle setting was fed into the physics engine, it tried to calculate the player's velocity based on that, which again yielded invalid results. Then, in the next frame, the physics engine tried to calculate the player's new position using the invalid speed data, which produced coordinates that lay somewhere outside of the known universe.
This data, in turn, was used by the rendering engine as input, which caused it to basically throw up its arms and do as it was told (the mental image I had was of a tired office worker, who basically says "If management wants me to do something stupid, I'll do something stupid").
And on and on, until that assertion in a part of the code that was so far removed from the actual cause as possible.

Which brings me to the cause.
I can't take credit for discovering it, IssMneur was the one who nailed it by stumbling across the strange fact that if the automatic speed matching was turned off, the bug didn't happen.
Coupled with Battutas' discovery that the difference between the bug appearing and it staying hidden was a repair-subsystem sexp, he investigated, tracing through the code until he found the cause of this error.
Now, what happened was:

1. Player's engines get sabotaged via sabotage-subsystem sexp
2. Seeing that the player no longer has engines, the game sets the maximum speed the player can reach to 0.

This is normal so far.
However, if the player happens to be matching speed with another ship, the game tries to calculate the throttle setting necessary to do that by dividing the target's speed by the player's current max speed.

Feel free to headdesk now. I know I did. Because the code never checked to see if the divisor was greater than zero, we had a classical error on our hands.

End result, after countless hours of staring at variables? Two new lines of code that should bury this beast for good.


1: Basically, the game tries to place each sound source that is active in a 3d environment around the currently active camera, and adjust the sound effects' volume accordingly.
2: A Breakpoint is a marker that enables the debugger (NOT the debug build, I'm talking about the Visual Studio debugger here) to automatically halt execution. They are invaluable when trying to analyze a running program.
Let there be light
Let there be moon
Let there be stars and let there be you
Let there be monsters and let there be pain
Let us begin to feel again
--Devin Townsend, Genesis

 

Offline mjn.mixael

  • Cutscene Master
  • 212
  • Anims: 420, Cutscenes: 10, Mainhalls: 7, Logos: 52
    • Steam
    • Twitter
    • Mix-Hai Productions
Re: Epic Bughunt Claims Sanity, Human Blood
I'm astonished. Truly. I would have given up hours and hours ago.

Maybe some sort of division error checking should be added to FSO... I'm not a coder so I don't even know if that is possible or plausible.
Cutscene Upgrade Project - Mainhall Remakes - MixaelANITools - Between the Ashes - MjnMixael's Render Boutique - Mix-Hai Productions
Youtube Channel - P3D Model Box - Photobucket Albums - Model Releases - Downloads
Between the Ashes is looking for committed testers, PM me for details.
Report MediaVP issues, now on the MediaVP Mantis! Read all about it Here!

 

Offline Commander Zane

  • 212
  • Spoot Knight of Anvils
Re: Epic Bughunt Claims Sanity, Human Blood
That **** sounds intense, nice job on cracking the problem.

 

Offline jr2

  • The Mail Man
  • 212
  • It's prounounced jayartoo 0x6A7232
    • Steam
Re: Epic Bughunt Claims Sanity, Human Blood
Awesome work, guys!  :nod:

 

Offline Droid803

  • Trusted poster of legit stuff
  • 213
  • /人 ◕ ‿‿ ◕ 人\ Do you want to be a Magical Girl?
    • Skype
    • Steam
Re: Epic Bughunt Claims Sanity, Human Blood
Oh, on a somewhat unrelated note, is this issue connected to the one where sometimes, upon re-trying a mission, the player ship is...gone?
(´・ω・`)
=============================================================

 

Offline The E

  • He's Ebeneezer Goode
  • Global Moderator
  • 213
  • Nothing personal, just tech support.
    • Steam
    • Twitter
Re: Epic Bughunt Claims Sanity, Human Blood
I'm astonished. Truly. I would have given up hours and hours ago.

Maybe some sort of division error checking should be added to FSO... I'm not a coder so I don't even know if that is possible or plausible.

Yes, well, it was either that, or redoing the mission again from the ground up, hoping that the bug would not trigger, and also hoping that noone else would hit it. Neither of those options was one we wanted to take.
In addition, we had finally gotten a combination of assets that would trigger the bug reliably, that was an opportunity I couldn't let go.

As for the added error checking, I'm kinda astonished that this was not caught before. Modern CPUs are supposed to do just that and start screaming when something as stupid as this happens.

Oh, on a somewhat unrelated note, is this issue connected to the one where sometimes, upon re-trying a mission, the player ship is...gone?

No, probably not. The cause for this, and the resulting error, was very, very specific, and did not seem to be connected to the mission restart part of the code.
Let there be light
Let there be moon
Let there be stars and let there be you
Let there be monsters and let there be pain
Let us begin to feel again
--Devin Townsend, Genesis

 

Offline Commander Zane

  • 212
  • Spoot Knight of Anvils
Re: Epic Bughunt Claims Sanity, Human Blood
I'm trying to remember how this worked, but I recall times where I'd be far from the normal location of a mission (Like 100km), if I'm using padlock view and I activate afterburner the screen wigs out and the results are fairly similar, everything vanishes and the HUD locks on the corner, it takes screwing around while playing in the mission for it to happen but it's just something I recall.

 

Offline Mongoose

  • Rikki-Tikki-Tavi
  • Global Moderator
  • 212
  • This brain for rent.
    • Minecraft
    • Steam
    • Something
Re: Epic Bughunt Claims Sanity, Human Blood
I knew I never used speed-matching for a reason...I just didn't realize the reason was that the engine has acquired its own "ghost." :p

 

Offline Ravenholme

  • 29
  • (d.h.f)
Re: Epic Bughunt Claims Sanity, Human Blood
I knew I never used speed-matching for a reason...I just didn't realize the reason was that the engine has acquired its own "ghost." :p

Likewise, but also because I'm fairly good at just adjusting the throttle on my joystick to match my target myself

>_>

(I admit, I didn't realise there was a match speed function, but I pro'ly wouldn't use it anyway)

Oh, I will say this - you guys are... dedicated. And that's an obscure retail bug, but at least it has now been put to the sword.
Full Auto - I've got a bullet here with your name on it, and I'm going to keep firing until I find out which one it is.

<The_E>   Several sex-based solutions come to mind
<The_E>   Errr
<The_E>   *sexp

 

Offline General Battuta

  • Poe's Law In Action
  • 214
  • i wonder when my postcount will exceed my iq
Re: Epic Bughunt Claims Sanity, Human Blood
I knew I never used speed-matching for a reason...I just didn't realize the reason was that the engine has acquired its own "ghost." :p

Likewise, but also because I'm fairly good at just adjusting the throttle on my joystick to match my target myself

>_>

(I admit, I didn't realise there was a match speed function, but I pro'ly wouldn't use it anyway)

Oh, I will say this - you guys are... dedicated. And that's an obscure retail bug, but at least it has now been put to the sword.

Given that the scenario here in question was a non-combat cruise, match speed is a pretty useful utility function no matter what your opinions of its use in combat may be.

In any case this was a combination of speed-matching with a pairing of events. Rather specific.
« Last Edit: April 16, 2010, 04:55:50 pm by General Battuta »

 

Offline High Max

  • Permanently banned
  • 29
Re: Epic Bughunt Claims Sanity, Human Blood
^^
« Last Edit: May 26, 2010, 01:21:50 am by High Max »
;-)   #.#   *_*   ^^   ^-^   ^_^

 

Offline Commander Zane

  • 212
  • Spoot Knight of Anvils
Re: Epic Bughunt Claims Sanity, Human Blood
The E, is that red and blue reticle thing part of that mouse script? I've been trying to get that to work and well, nobody's giving me any suggestions on the original thread.

 

Offline The E

  • He's Ebeneezer Goode
  • Global Moderator
  • 213
  • Nothing personal, just tech support.
    • Steam
    • Twitter
Re: Epic Bughunt Claims Sanity, Human Blood

As for the added error checking, I'm kinda astonished that this was not caught before. Modern CPUs are supposed to do just that and start screaming when something as stupid as this happens.

Did the modern CPU not do anything because of the fact that the game is still CPU limited? Is that another side effect of it being CPU limited?

No. It's probably a side effect from some optimisation somewhere or other.

See, the problem is, computers in ye olde days used to crash and burn when they encountered a division by zero. Modern Computers and operating systems have built-in checks to prevent this.
In this case, the prevention took the form of giving a special return value called "NaN" (acronym for Not A Number) for the division by zero. The problem was that a) noone checked the divisor before the division took place, thereby ensuring that the data was valid and b) there was no check to make sure the result was valid.

It's a very, very basic programming mistake that, unfortunately, can happen to everyone. It's the equivalent of leaving the house without your keys on the assumption that someone would be there to open the door for you when you return. In most cases, everything will be alright, but in some cases, you'll be locked out of your house.

The E, is that red and blue reticle thing part of that mouse script? I've been trying to get that to work and well, nobody's giving me any suggestions on the original thread.

No, that's the UI for the shield management script.
Let there be light
Let there be moon
Let there be stars and let there be you
Let there be monsters and let there be pain
Let us begin to feel again
--Devin Townsend, Genesis

 

Offline Commander Zane

  • 212
  • Spoot Knight of Anvils
Re: Epic Bughunt Claims Sanity, Human Blood
Well that's pretty neat.

 

Offline Kolgena

  • 211
Re: Epic Bughunt Claims Sanity, Human Blood
Wow. That was a very interesting read.

Something is wrong with me though. I have an urge to take The_E's screenshot, and make a "You divided by 0, didn't you?" demotivational poster of it :/