This is a thread for ideas about how we can improve the performance of DX8 aside from HW TnL.
Currently the single most important optimization we could do is probably to batch up as many polygons as possible (in vertex buffers) before drawing them. We need to get away from DrawPrimitiveUP. EDIT: I've been reading further. The big performance hit is DrawPrimitiveUP combined with the situations where we only send 1 or 2 polys to be rendered per call.
My basic idea for doing this is to add a sorting layer between the polygon drawing code and the d3d_DrawPrimitive calls. Essentially we capture all calls to d3d_DrawPrimitive and start loading up a few vertex buffers with data as long as it's all of one type. When the type changes, we draw the buffer and start loading up a new one. As far as I can tell, this will improve performance and also make the game friendlier to HW TnL, which runs best under such conditions, with minimal communication to the graphics card.
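To make the batching idea concrete, here is a rough sketch of what that layer could look like. It's plain C++ with no D3D calls in it: Vertex, BatchKey and PolyBatcher are names I made up for illustration, and the flush callback is just a stand-in for wherever we'd actually fill a vertex buffer and issue the draw. It also assumes everything has already been turned into triangle lists, since fans and strips can't simply be appended to one another.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical vertex layout; the real one would match whatever FVF we use.
struct Vertex {
    float x, y, z, rhw;
    unsigned int color;
    float u, v;
};

// Hypothetical description of a batch "type". Anything that would force a
// render state change (texture, blend mode, ...) would need to go in here.
struct BatchKey {
    int texture_id;
    int blend_mode;
    bool operator==(const BatchKey& o) const {
        return texture_id == o.texture_id && blend_mode == o.blend_mode;
    }
};

class PolyBatcher {
public:
    // Called once per batch; stands in for "fill a vertex buffer and draw it".
    using FlushFn =
        std::function<void(const BatchKey&, const std::vector<Vertex>&)>;

    PolyBatcher(std::size_t max_vertices, FlushFn on_flush)
        : max_vertices_(max_vertices), on_flush_(std::move(on_flush)) {
        pending_.reserve(max_vertices);
    }

    // Stand-in for an intercepted draw call. Expects a triangle list,
    // so count should be a multiple of 3.
    void submit(const BatchKey& key, const Vertex* verts, std::size_t count) {
        // Draw what we have if the type changes or the batch would overflow.
        // Flushing on every type change keeps the original submission order.
        if (!pending_.empty() &&
            (!(key == key_) || pending_.size() + count > max_vertices_)) {
            flush();
        }
        key_ = key;
        pending_.insert(pending_.end(), verts, verts + count);
    }

    // Must be called at end of frame (and before any state change made
    // outside this layer) so nothing is left undrawn.
    void flush() {
        if (pending_.empty()) return;
        on_flush_(key_, pending_);
        pending_.clear();
    }

private:
    std::size_t max_vertices_;
    FlushFn on_flush_;
    BatchKey key_{};
    std::vector<Vertex> pending_;
};
```

Because it only merges consecutive calls of the same type, draw order is untouched, which matters for anything alpha blended. The flip side is that it only helps when same-type polys already arrive back to back; actually sorting by type would batch more but would need care with transparency.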
Does anyone see anything wrong with this (including complications that I'm probably overlooking), or should I go ahead and start implementing it?
EDIT: I've just realized that this idea is essentially very similar to creating a software-implemented execute buffer. The real question is: should we do things this way (which is relatively quick and easy to implement), or do a massive overhaul using VBs and such from a very early stage in the pipeline (which would probably break compatibility with any other API unless we do some creative coding)?