Originally posted by Sesquipedalian
Actually, "768 is a sum of two power-of-two numbers" is the only true statement in that post, I believe.
Okay, I was wrong about the hardware and API limitations, to some degree. You need to check for D3DPTEXTURECAPS_NONPOW2CONDITIONAL to see whether your card supports non-power-of-two textures, and hardware older than the Matrox G400 (which I believe includes the Riva TNT 2 line, which is popular) most definitely doesn't have that capability. On such cards, non-power-of-two textures must either be scaled to power-of-two sizes, or stored in the next larger power-of-two size with the wasted space that implies.
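For the curious, the cap check looks roughly like this - just a sketch against the DirectX 8 headers (the same flags exist in D3D9), with "device" being whatever IDirect3DDevice8 you've created:

    // A rough sketch of checking non-power-of-two texture support.
    #include <d3d8.h>

    bool NonPow2Support(IDirect3DDevice8* device)
    {
        D3DCAPS8 caps;
        device->GetDeviceCaps(&caps);

        bool pow2Only    = (caps.TextureCaps & D3DPTEXTURECAPS_POW2) != 0;
        bool conditional = (caps.TextureCaps & D3DPTEXTURECAPS_NONPOW2CONDITIONAL) != 0;

        // !pow2Only                 -> full non-power-of-two support
        //  pow2Only &&  conditional -> non-power-of-two works, with restrictions
        //  pow2Only && !conditional -> power-of-two only: scale or pad your textures
        return !pow2Only || conditional;
    }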
The origin of the power-of-two limit is of course in the way texture addresses are computed internally: if you force power-of-two sizes, you can do a shift instead of a multiply for each pixel drawn. That was most likely a considerable gain when it was decided upon in the original 3D accelerators (though it actually slowed down my software rasterizer on my old Pentium system, since shifts take both pipelines on a Pentium).
Edit: The point is, forcing power-of-two sizes was simply a necessity, for speed and transistor-count purposes, but recent hardware seems to have spared transistors for it. Of course, power-of-two sizes are also cache-friendly, which is good.
Why is the shift used? When a polygon is drawn, the 3D coordinates are transformed into screen-space (2D) coordinates.
Then a procedure called scan-conversion of the polygon occurs, in which the polygon is basically dissected into many horizontal lines (spans) by walking along the edges of the polygon.
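To make that concrete, here's a minimal sketch of scan-converting a flat-bottom triangle into spans - names and layout are mine, and a real rasterizer also interpolates texture coordinates and lighting down the edges:

    // A minimal sketch (not any particular hardware's method) of scan-converting
    // a flat-bottom triangle. (x0,y0) is the apex, (x1,y1) and (x2,y1) are the
    // bottom corners with x1 <= x2.
    void scan_convert_flat_bottom(float x0, float y0, float x1, float x2, float y1,
                                  void (*draw_span)(int y, int x_left, int x_right))
    {
        float dxl = (x1 - x0) / (y1 - y0);   // left-edge X step per scanline
        float dxr = (x2 - x0) / (y1 - y0);   // right-edge X step per scanline
        float xl = x0, xr = x0;

        for (int y = (int)y0; y < (int)y1; ++y) {
            draw_span(y, (int)xl, (int)xr);  // hand one horizontal line to the inner loop
            xl += dxl;
            xr += dxr;
        }
    }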
Now that you have many horizontal lines (with texture coordinates, lighting, etc. computed at the edges), you reach the most critical inner loop, which draws them. Basically, it interpolates the lighting values and texture coordinates across each line, reads from the texture according to the interpolated texture coordinates, modulates with the lighting, and writes to the destination buffer.
The interpolation of texture coordinates is a simple pair of additions, but the resulting texture coordinates are an X,Y pair - a point on the texture. That needs to be converted into an offset into the memory in which the texture is stored (the texture is generally stored as a simple linear array). The conversion is generally OFFSET = (X + Y*TextureWidth)*BytesPerPixel. BytesPerPixel is always a power of two, so that multiply is naturally a shift, and if TextureWidth is a power of two, that one is a shift too. Ta da!
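Put together, the inner loop looks something like this - a sketch with made-up names, assuming 32-bit texels and a power-of-two texture width, so the whole offset computation boils down to shifts and adds (indexing a 32-bit pointer already hides the *4 for BytesPerPixel):

    // Sketch of the span inner loop. widthShift is log2(TextureWidth),
    // light runs from 0.0 to 1.0.
    void draw_textured_span(unsigned int* dest, int count,
                            const unsigned int* texture, int widthShift,
                            float u, float v, float du, float dv,
                            float light, float dlight)
    {
        for (int i = 0; i < count; ++i) {
            int tx = (int)u;
            int ty = (int)v;
            unsigned int texel = texture[(ty << widthShift) + tx];  // shift, not multiply

            // modulate the texel with the interpolated lighting, channel by channel
            unsigned int r = (unsigned int)(((texel >> 16) & 0xFF) * light);
            unsigned int g = (unsigned int)(((texel >>  8) & 0xFF) * light);
            unsigned int b = (unsigned int)(( texel        & 0xFF) * light);
            dest[i] = (r << 16) | (g << 8) | b;

            u += du;  v += dv;  light += dlight;    // the "simple pair of additions"
        }
    }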

What an essay... Look at what you've done to me, Sesquipedalian, turning me into a nattering nitpicker!
Edit 2: I also forgot to mention that the scan-conversion process is what causes the requirement for triangles to be passed to the hardware in general, since convex polygons can be scan-converted much faster, and triangles are guaranteed to be convex (unless there's black magic involved). That's also why Triangle Strips and Fans exist - since the hardware can reuse the same scan-conversion data for more than one triangle (as adjacent triangles share edges), stuff goes a whole lot faster.
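Just to illustrate the sharing: in a strip, N+2 vertices describe N triangles, something like this (purely illustrative, not any API's actual enumeration code):

    // Each triangle i in a strip uses vertices i, i+1, i+2 and reuses the edge
    // it shares with the previous triangle, so edge setup can be reused too.
    #include <stdio.h>

    void list_strip_triangles(int numTriangles)
    {
        for (int i = 0; i < numTriangles; ++i) {
            int a = i, b = i + 1, c = i + 2;
            if (i & 1) { int t = a; a = b; b = t; }  // flip winding on odd triangles
            printf("triangle %d: vertices %d %d %d\n", i, a, b, c);
        }
    }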