This is a rather complex change, adding a lot of code but ideally cleaning up quite a bit as we go.
Within a batch (a single frame), we currently emit four distinct Mali job types: SET_VALUE, VERTEX, TILER, and FRAGMENT (eventually more, for compute and geometry shaders). Each hardware job has a mali_job_descriptor_header, which contains three fields of interest: a job index, a dependencies list, and a next job pointer.
The next job pointer in each job is used to form a linked list of submitted jobs. Easy enough.
The job index and dependencies list, however, are used to form a dependency graph (a DAG, where each hardware job is a node and each dependency is a directed edge). Internally, this sets up a scoreboarding data structure that lets the hardware dispatch jobs in parallel. For example, vertex shaders from different draws can execute in parallel, while tiling the geometry of a draw still strictly depends on running that draw's vertex shader.
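To make the two structures concrete, here is a minimal sketch of a job header carrying both the linked-list pointer and the dependency edges. The struct below is a simplified stand-in, not the real mali_job_descriptor_header layout; all sketch_* names, the two dependency slots, and the use of index 0 to mean "no dependency" are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified job types; names are illustrative only. */
enum sketch_job_type {
    SKETCH_SET_VALUE,
    SKETCH_VERTEX,
    SKETCH_TILER,
    SKETCH_FRAGMENT
};

/* Simplified stand-in for a job header; NOT the real hardware layout. */
struct sketch_job_header {
    enum sketch_job_type type;
    uint16_t index;                 /* 1-based; 0 is assumed to mean "none" */
    uint16_t deps[2];               /* indices of the jobs this one waits on */
    struct sketch_job_header *next; /* linked list of submitted jobs */
};

/* Append a job to the tail of the chain and give it the next free index. */
static void
sketch_link(struct sketch_job_header **tail, uint16_t *count,
            struct sketch_job_header *job)
{
    job->index = ++(*count);
    job->deps[0] = job->deps[1] = 0;
    job->next = NULL;

    if (*tail)
        (*tail)->next = job;

    *tail = job;
}

/* Record a directed edge: `job` waits on `dep`.
 * Returns 0 on success, -1 if both dependency slots are already used. */
static int
sketch_depend(struct sketch_job_header *job,
              const struct sketch_job_header *dep)
{
    for (int i = 0; i < 2; i++) {
        if (job->deps[i] == 0) {
            job->deps[i] = dep->index;
            return 0;
        }
    }

    return -1;
}
```

Linking a vertex job and then a tiler job yields indices 1 and 2, and sketch_depend(&tiler, &vertex) stores index 1 in the tiler job's first dependency slot.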
For a while, we got by with an incredible series of total hacks, manually coding indices, lists, and dependencies. That worked for a moment, but combinatorial kaboom kicked in and it became an unmaintainable mess of spaghetti code.
We can do better. This commit explicitly handles the scoreboarding by providing high-level manipulation for jobs. Rather than a command like "set dependency #2 to index 17", we can express quite naturally "add a dependency from job T on job V". Instead of some open-coded logic to copy a draw pointer into a delicate context array, we now have an elegant exposed API to simply "queue a job of type XYZ".
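As a rough illustration of what such a high-level interface might look like, the sketch below queues jobs by type and records dependencies by handle rather than by raw index. It is not the API added in pan_scoreboard.c; every sketch_* name, the fixed capacity, and the limit of two dependencies per job are assumptions.

```c
#include <assert.h>

#define SKETCH_MAX_JOBS 64 /* arbitrary capacity chosen for the sketch */
#define SKETCH_MAX_DEPS 2  /* assumed number of dependency slots per job */

/* Hypothetical job types; names are illustrative only. */
enum sketch_type {
    SKETCH_TYPE_SET_VALUE,
    SKETCH_TYPE_VERTEX,
    SKETCH_TYPE_TILER,
    SKETCH_TYPE_FRAGMENT
};

struct sketch_job {
    enum sketch_type type;
    int deps[SKETCH_MAX_DEPS]; /* handles of jobs this one waits on */
    int dep_count;
};

/* Per-batch job state, kept on the batch rather than the context. */
struct sketch_batch {
    struct sketch_job jobs[SKETCH_MAX_JOBS];
    int job_count;
};

/* "Queue a job of type XYZ": returns a handle, or -1 if the batch is full. */
static int
sketch_queue_job(struct sketch_batch *batch, enum sketch_type type)
{
    if (batch->job_count >= SKETCH_MAX_JOBS)
        return -1;

    struct sketch_job *job = &batch->jobs[batch->job_count];
    job->type = type;
    job->dep_count = 0;

    return batch->job_count++;
}

/* "Add a dependency from job T on job V": `depender` waits on `dependency`.
 * Returns 0 on success, -1 if the depender has no free slot left. */
static int
sketch_add_dependency(struct sketch_batch *batch, int depender, int dependency)
{
    assert(depender >= 0 && depender < batch->job_count);
    assert(dependency >= 0 && dependency < batch->job_count);

    struct sketch_job *job = &batch->jobs[depender];

    if (job->dep_count >= SKETCH_MAX_DEPS)
        return -1;

    job->deps[job->dep_count++] = dependency;
    return 0;
}
```

With handles v and t returned by sketch_queue_job, the sentence "add a dependency from job T on job V" becomes sketch_add_dependency(&batch, t, v); the caller never touches a slot number or a hardware index.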
The design is influenced by both our current requirements (standard ES2 draws and u_blitter) and the need for more complex scheduling in the future. For instance, blits can be optimized to use only a tiler job, without a vertex job first (since the screen-space vertices are known ahead-of-time), resulting in tiler-only jobs. Likewise, when using transform feedback with rasterizer discard enabled, vertex jobs are created (to run vertex shaders) with no corresponding tiler job. Both of these cases break the original model and could not be expressed with the open-coded logic. More generally, this will make it easier to add support for compute shaders, geometry shaders, and fused jobs (an optimization available on Bifrost).
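The three draw shapes described above (vertex plus tiler, tiler-only, vertex-only) can be sketched as a small planning function. This is purely illustrative and not code from the driver: sketch_plan_draw, its two flags, and the returned struct are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical job kinds a single draw may need. */
enum sketch_kind {
    SKETCH_KIND_VERTEX,
    SKETCH_KIND_TILER
};

struct sketch_draw_jobs {
    enum sketch_kind kinds[2];   /* jobs to queue, in order */
    int count;                   /* how many entries of kinds[] are valid */
    bool tiler_waits_on_vertex;  /* whether a dependency edge is needed */
};

/* Decide which jobs a draw needs:
 *  - screen-space vertices known ahead of time (blit): no vertex job
 *  - rasterizer discard (e.g. with transform feedback): no tiler job */
static struct sketch_draw_jobs
sketch_plan_draw(bool known_screen_space_vertices, bool rasterizer_discard)
{
    struct sketch_draw_jobs plan = { .count = 0, .tiler_waits_on_vertex = false };

    bool want_vertex = !known_screen_space_vertices;
    bool want_tiler = !rasterizer_discard;

    if (want_vertex)
        plan.kinds[plan.count++] = SKETCH_KIND_VERTEX;

    if (want_tiler)
        plan.kinds[plan.count++] = SKETCH_KIND_TILER;

    /* The edge only exists when both jobs are present. */
    plan.tiler_waits_on_vertex = want_vertex && want_tiler;

    return plan;
}
```

A standard draw plans two jobs joined by one edge, while the blit and rasterizer-discard cases each plan a single job with no edge, which is exactly what a model hard-wired to vertex/tiler pairs cannot represent.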
Incidentally, this moves quite a bit of state from the driver context to the batch, which helps with Rohan's refactor to eventually permit pipelining across framebuffers (one important outstanding optimization for FBO-heavy workloads).
v2: Add comment explaining the meaning of "primary batch" as suggested by Tomeu (trivial - not reviewed).
f0854745fd1 panfrost: Implement command stream scoreboarding
src/gallium/drivers/panfrost/meson.build | 1 +
src/gallium/drivers/panfrost/pan_context.c | 132 +-------
src/gallium/drivers/panfrost/pan_context.h | 20 +-
src/gallium/drivers/panfrost/pan_drm.c | 6 +-
src/gallium/drivers/panfrost/pan_job.c | 18 +-
src/gallium/drivers/panfrost/pan_job.h | 71 ++++
src/gallium/drivers/panfrost/pan_scoreboard.c | 453 ++++++++++++++++++++++++++
7 files changed, 558 insertions(+), 143 deletions(-)