This is rubbish! Absolute rubbish! Multi-core CPUs are the wave of the future!
Intel demoed an 80-core CPU running on an FPGA years ago - even though it was running on an FPGA, it could run XP! It didn't run very fast, but because it was a prototype board all the signals had to go through a bunch of gates. Once they get a design that uses real transistors instead of gates it will be way faster.
These CPUs are coming out really soon. I even bet tomorrow they'll be announcing the
Core 3 32-o with 32 cores. Only this time it won't be a prototype - oh no, it will be a monster 4 GHz, 32-core computing machine. Should be really easy to do once they get their 32 nanometer process up and running. With the power savings I bet it will run with less power than
one of their 90 nanometer octo-cores.
The FS2SCP cannot ignore this trend. We need to be able to take advantage of these CPUs NOW, before we get beaten out of the gate by all the other space simulator projects, open- and closed-source. I hear the
DXX-Rebirth team has a huge update coming that will take Descent's ~15-year-old engine and make it look and run ten times better than anything currently on the market, or that will soon be on the market. And you know how they're going to do that? By taking advantage of multiple cores, and the special SSE5.1 instructions built into these upcoming Intel CPUs. Damn, I hear even
Orbiter is adding space combat. It's going to be awesome.
We have to make the FS2 engine more
parallel if we want to keep up. Our target should be to get it running perfectly on one of these soon-to-be-released Intel CPUs. I know your instinct is going to be to make use of
that awesome feature that takes all 32 cores and reconfigures them into one giant core to rule them all. But that's wrong! That's what all the other teams want you to try! What happens when you have that many instructions in flight at that kind of speed (we're talking terahertz here) is that you strangle even the best I/O bus money can buy. That's right, the
FSB,
HyperTransport, even
Intel's new QuickPath can't handle the raw amount of data put out by a CPU in this mode. Sure, there are some
narrow scientific uses nobody cares about that keep all the traffic on-die and require no access to the memory controller or the graphics controller, but FS2 is not one of those applications. That CPU is generating thousands of teraflops per second, and it needs to send those teraflops to the graphics card as fast as it can. The problem is that graphics cards have to process those teraflops really fast to put them on the screen, right? And ATi only
just came out with a video card that can handle only two teraflops per minute. So clearly it can't keep up with these Intel CPUs. The other teams know that, and they're already looking for another way to harness the power of these new CPUs.
Fortunately, I was recently fired by a team working on a
very promising space project which I'm not allowed to name. I know the method they're using to put more parallelism in their code! The trick isn't to try to get all of the cores on the CPU doing the same thing at the same time to make it faster, because each core will just end up waiting its turn for the
L2 cache, no matter how big it is. No, the trick here is like this: say you have a 32-core CPU. You load your executable into memory starting at address 0x00000000. Yeah, I know, that's really reserved kernel memory or something like that, but for the sake of argument let's pretend that the OS uses an addressing scheme that loads the executable at 0x00000000. Pretend that the OS realized I had this awesome idea and made all this space for me by throwing the kernel on
the stack (which usually starts at 0xFFFFFFFF and grows down). You know what I mean here.
So anyway, you have core 0 start executing at 0x00000000. Then you have core 1 start executing at s/32, where s is the size (in the address space) of your program. Core 2 starts at 2s/32, core 3 at 3s/32, and so on and so forth. All of the code in the FS2 executable will be executed
much faster. How much faster you ask? Try 32 times faster! That's right, we've finally found a way to get linear speedup out of multiple cores!
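The offset arithmetic behind this scheme can be sketched in a few lines. This is purely illustrative (the function name is invented, and real executables cannot actually be partitioned and executed this way); it just computes where each core would supposedly begin, given a program size and core count:

```python
# Hypothetical sketch of the per-core start-offset scheme described above.
# Assumes a program of size s bytes loaded at base 0x00000000 and n cores;
# core k would begin executing at base + k * (s // n). Illustrative only.
def core_start_offsets(program_size, num_cores=32, base=0x00000000):
    slice_size = program_size // num_cores
    return [base + k * slice_size for k in range(num_cores)]

offsets = core_start_offsets(0x00200000)  # a 2 MiB executable
# core 0 starts at 0x00000000, core 1 at 0x00010000, core 2 at 0x00020000, ...
```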
But how is this possible? Every computer science paper you've read, even the
most optimistic marketing you've seen, has shown less than linear speedup as you add cores. What's the secret? The secret is the
IOMMU, or I/O Memory Management Unit. It was initially developed because it would be useful for virtualization. The problem is that even with these massively powerful processors, virtual reality is years away yet. At least two. So in the meantime, we've found another use for this really fast memory management device. When it is implemented on the same die as the CPU, it can be used to pass messages between all 32 cores. The device literally stands on its head (okay, not literally, it would fall out of its socket) to pass just-in-time messages from one CPU to another, supplying it with information just as it needs it.
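Setting aside whether an IOMMU would ever be used this way, the "just-in-time message" idea itself can be sketched with an ordinary shared queue standing in for the hypothetical on-die mailbox. Everything here (the `mailbox` name, the fake frame data) is invented for illustration; one thread supplies data exactly as the other needs it:

```python
import queue
import threading

# Minimal sketch of just-in-time message passing between two "cores",
# using a shared FIFO queue as a stand-in for the hypothetical IOMMU mailbox.
mailbox = queue.Queue()

def producer():
    for frame in range(3):
        mailbox.put(("frame_data", frame))  # supply data just as it's needed

def consumer(results):
    for _ in range(3):
        tag, frame = mailbox.get()  # blocks until the producer delivers
        results.append(frame)

results = []
t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer, args=(results,))
t1.start(); t2.start()
t1.join(); t2.join()
# results is now [0, 1, 2], in the order the producer sent them
```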
So, I realize I kind of rambled on a bit here, but this new technology gets me excited. In short, here are the steps I'm proposing the SCP team take:
1. Design an engine which dynamically takes different slices of the game code and passes them to idle CPUs as needed.
2. Hack together an IOMMU driver to make the performance of the above solution acceptable. The goal should be no less than linear CPU scaling as CPUs are added. I think the best we can hope for is 2× linear, if we find a way to take advantage of
reverse-hyperthreading.
3. Make a complete switch to ray-tracing because rasterized graphics do so poorly when you spread the load out over more processing cores (just look at the size of GT200, and how it's beat in some cases by the smaller RV770!).
4. Take advantage of the SSE5.1 instructions, which are the most important part of what I just explained.
5. Beat the DXX-Rebirth team to market, because they're going to use all their CPU power to make sure their game has an awesome story and lots of plot twists.
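Step 1 above, minus the address-space magic, is basically a work pool: carve per-frame work into independent slices and hand each slice to whichever worker is idle. Here's a minimal sketch under those assumptions; `update_slice` and `run_frame` are invented stand-ins for chunks of game logic, not anything in the FS2 codebase:

```python
from concurrent.futures import ThreadPoolExecutor

# Invented stand-in for one independent slice of per-frame game work.
def update_slice(slice_id):
    return slice_id * 2  # pretend computation

# Dispatch all slices of a frame across a pool of idle workers.
# map() returns results in slice order, regardless of completion order.
def run_frame(num_slices=8, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(update_slice, range(num_slices)))
```

Note this gives at best linear scaling in the number of truly independent slices; nothing here (or anywhere) gets you the 2× linear of step 2.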