@Uchuujinsan: Meh. I'm coming from a C#/Java background, wherein stuff works the way I'd expect. And where private + virtual isn't a thing.
Anyway, gotta make GPU physics go faster.
My original scheme had two shader programs: one which evaluated constraints and output the new linear & angular velocity data for the two rigid bodies involved, and a second which updated a master array of rigid bodies' velocity data, copying from the previous state or from the outputs of the first shader according to an index.
My second and current scheme has one shader program which is kind of a combination of the two shader programs in the first scheme. It updates rigid bodies' entries in the master array, using an index to determine which ones should be copied from the previous state and which ones are changed. But instead of copying the changed ones from another array, it does the computation on the spot. The disadvantage of this is that it has to process each constraint twice... but it was still faster than the first scheme.
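To make the "process each constraint twice" cost concrete, here is a minimal CPU-side sketch of how the second scheme behaves, with one loop iteration standing in for one shader invocation per rigid body. Every name in it (`Velocity`, `Constraint`, `evaluate`, `update_batch`) is hypothetical and the solver is a placeholder; it is only meant to show the copy-or-compute decision, not the real shader.

```cpp
#include <cstddef>
#include <vector>

struct Vec4 { float x, y, z, w; };

// Hypothetical CPU stand-ins for the GPU data; not the real layout.
struct Velocity { Vec4 linear; Vec4 angular; };
struct Constraint { int body_a; int body_b; };
struct ConstraintResult { Velocity a; Velocity b; };

// Placeholder solver: a real one would compute an impulse from both bodies.
// Here it just adds 1 to the x component of each linear velocity.
ConstraintResult evaluate(const Constraint& c, const std::vector<Velocity>& prev)
{
    ConstraintResult r{ prev[c.body_a], prev[c.body_b] };
    r.a.linear.x += 1.0f;
    r.b.linear.x += 1.0f;
    return r;
}

// index[i] is the constraint touching body i in this batch, or -1 if none.
// Returns how many times evaluate() ran, to expose the duplicated work.
int update_batch(const std::vector<Velocity>& prev,
                 const std::vector<Constraint>& constraints,
                 const std::vector<int>& index,
                 std::vector<Velocity>& next)
{
    int evaluations = 0;
    next.resize(prev.size());
    for (std::size_t i = 0; i < prev.size(); ++i)
    {
        if (index[i] < 0)
        {
            next[i] = prev[i];  // untouched body: copy previous state
            continue;
        }
        const Constraint& c = constraints[index[i]];
        ConstraintResult r = evaluate(c, prev);  // computed on the spot
        ++evaluations;
        // Keep only this body's half of the result; the other half is
        // recomputed when the loop reaches the other body.
        next[i] = (static_cast<int>(i) == c.body_a) ? r.a : r.b;
    }
    return evaluations;
}
```

With one constraint between bodies 0 and 2, `update_batch` runs `evaluate` twice, once per involved body, which is the redundancy described above.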
I've got an idea for a third scheme which I think might be even faster. It would be just a single shader program, with a geometry shader to evaluate the constraints and emit what are basically RGBA32F pixels/texels... to write a specific vec4 to a specific pixel/texel of the master velocity data array. If the constraint doesn't apply an impulse, it wouldn't emit any pixels. If it applies an impulse, it would write the new linear and angular velocities of the two involved rigid bodies to the appropriate places in the array.
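As a rough CPU-side illustration of that third scheme, the sketch below treats each constraint as something that emits either four slot writes or none. The names (`Write`, `Solved`, `emit_writes`, `scatter`) are made up for the example, and the layout (linear velocities in the first half of the array, angular in the second half) is an assumption, not something confirmed by the real code.

```cpp
#include <cstddef>
#include <vector>

struct Vec4 { float x, y, z, w; };

// One emitted "texel": destination slot in the master array plus its new value.
struct Write { std::size_t slot; Vec4 value; };

// Hypothetical result of solving one constraint.
struct Solved
{
    bool applies_impulse;
    std::size_t body_a, body_b;
    Vec4 linear_a, angular_a, linear_b, angular_b;
};

// Plays the role of the geometry shader: emits four writes or none.
// Assumed layout: linear velocities in slots [0, n), angular in [n, 2n).
std::vector<Write> emit_writes(const Solved& s, std::size_t num_bodies)
{
    if (!s.applies_impulse)
        return {};  // no impulse: nothing is emitted
    return {
        { s.body_a,              s.linear_a  },
        { num_bodies + s.body_a, s.angular_a },
        { s.body_b,              s.linear_b  },
        { num_bodies + s.body_b, s.angular_b },
    };
}

// Plays the role of the rasterizer writing each emitted texel to its slot.
void scatter(const std::vector<Write>& writes, std::vector<Vec4>& master)
{
    for (const Write& w : writes)
        master[w.slot] = w.value;
}
```

In this sketch, slots that no constraint writes to simply keep whatever was already in the destination array, so each constraint is evaluated once instead of twice.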
One thing that has been an issue with both of the schemes I've tried, and I believe will still be an issue with the third scheme, is that I can't use the same buffer as both an input and an output. So I've been having to switch back and forth between two buffers instead. I don't know for sure, but I somewhat suspect this is contributing to how long it takes to process.
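For reference, the back-and-forth bookkeeping can be kept in one place so the buffer and its texture never get out of step. This is only a sketch: `Handle` stands in for `GLuint` so it compiles without OpenGL headers, and `VelocityTarget` and `PingPong` are hypothetical names.

```cpp
#include <utility>

// Stand-in for GLuint so the sketch compiles without OpenGL headers.
using Handle = unsigned int;

// A buffer object plus the buffer texture that views it.
struct VelocityTarget { Handle buffer; Handle texture; };

struct PingPong
{
    VelocityTarget read;   // sampled by the shader during this pass
    VelocityTarget write;  // captured by transform feedback during this pass

    // Call once after each pass: the data just written becomes the next input.
    void flip() { std::swap(read, write); }
};
```

After any number of passes followed by `flip()`, the most recent results are always in `read`, which is the one to read back from.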
I have determined that the amount of time it takes is basically proportional to the number of batches processed, i.e. iterations * batches.size() in the snippet below. This seems to be independent of the number of rigid bodies, or the average number of constraints per batch.
for(unsigned int i = 0; i < iterations; ++i)
    for(unsigned int j = 0; j < batches.size(); ++j)
    {
        // do constraint shader stuff
        glActiveTexture(GL_TEXTURE1);
        glBindTexture(GL_TEXTURE_BUFFER, active_vtex);
        glTexBufferEXT(GL_TEXTURE_BUFFER, GL_RGBA32F, active_vdata);
        glUniform1i(u_velocity_data, 1);
        GLDEBUG();

        // set up outputs for transform feedback
        glBindBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 0, inactive_vdata, 0, num_rigid_bodies * 4 * sizeof(float));
        glBindBufferRange(GL_TRANSFORM_FEEDBACK_BUFFER, 1, inactive_vdata, num_rigid_bodies * 4 * sizeof(float), num_rigid_bodies * 4 * sizeof(float));
        GLDEBUG();

        glBeginTransformFeedback(GL_POINTS);
        glDrawArrays(GL_POINTS, num_rigid_bodies * j, num_rigid_bodies);
        glEndTransformFeedback();
        GLDEBUG();

        glFlush();

        // change which direction the copying is going (back and forth)... can't use one buffer as both input and output or it will be undefined behavior!
        swap(active_vdata, inactive_vdata);
        swap(active_vtex, inactive_vtex);
    }
Little help?
Edit: I was forced to replace glFlush(); with glFinish();, and now it is even slower.
Edit II: Apparently I can leave those all as glFlush, and add a glFinish before calling glGetBufferSubData (which is one of the first things I do immediately after the snippet I pasted)... so it's not as bad. But I still need to call glFlush every iteration, and I still need to swap back and forth between two buffers.