SoftGPU perf opportunities #17613

@fp64

What should happen

As mentioned, some (possibly dumb; I'm trying to wrap my head around it) observations on SoftGPU.

First off, here's what performance looks like (Linux, 32-bit x86 with SSE2 but without SSE4):

$ perf record ./PPSSPPSDL
$ perf report --stdio | head -n 100
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:u'
# Event count (approx.): 2130104713894
#
# Overhead  Command          Shared Object            Symbol
# ........  ...............  .......................  ..............................
#
    10.82%  PoolWorker 0     PPSSPPSDL                [.] Rasterizer::DrawTriangleSlice<false, false>
    10.67%  PoolWorker 1     PPSSPPSDL                [.] Rasterizer::DrawTriangleSlice<false, false>
     3.02%  PPSSPPSDL        r600_dri.so              [.] 0x002bcd15
     3.01%  PPSSPPSDL        r600_dri.so              [.] 0x002bcd23
     2.65%  PoolWorker 0     PPSSPPSDL                [.] Sampler::SampleLinearLevel
     2.54%  PoolWorker 1     PPSSPPSDL                [.] Sampler::SampleLinearLevel
     2.23%  Emu              PPSSPPSDL                [.] Rasterizer::DrawTriangleSlice<false, false>
     2.07%  PoolWorker 0     PPSSPPSDL                [.] .L426
     2.06%  PoolWorker 0     PPSSPPSDL                [.] Rasterizer::GetPixelColor
     2.04%  PoolWorker 1     PPSSPPSDL                [.] .L426
     1.96%  PoolWorker 1     PPSSPPSDL                [.] Rasterizer::GetPixelColor
     1.75%  PoolWorker 0     PPSSPPSDL                [.] Sampler::LookupColor
     1.67%  PoolWorker 0     PPSSPPSDL                [.] Rasterizer::SetPixelColor
     1.63%  PoolWorker 1     PPSSPPSDL                [.] Sampler::LookupColor
     1.60%  PoolWorker 1     PPSSPPSDL                [.] Rasterizer::SetPixelColor
     1.41%  PoolWorker 0     PPSSPPSDL                [.] Rasterizer::ApplyTexturing
     1.39%  PoolWorker 1     PPSSPPSDL                [.] Rasterizer::ApplyTexturing
     1.36%  PoolWorker 0     PPSSPPSDL                [.] Sampler::TransformClutIndex
     1.27%  PoolWorker 0     PPSSPPSDL                [.] .L427
     1.27%  PoolWorker 1     PPSSPPSDL                [.] Sampler::TransformClutIndex
     1.17%  PoolWorker 1     PPSSPPSDL                [.] .L427
     1.15%  PoolWorker 0     PPSSPPSDL                [.] Sampler::SampleLinear
     1.14%  PoolWorker 0     PPSSPPSDL                [.] Rasterizer::DrawSinglePixel<false, (GEBufferFormat)1>
     1.02%  PoolWorker 1     PPSSPPSDL                [.] Sampler::SampleLinear
     1.01%  PoolWorker 0     PPSSPPSDL                [.] Sampler::SampleNearest
     1.01%  PoolWorker 1     PPSSPPSDL                [.] Rasterizer::DrawSinglePixel<false, (GEBufferFormat)1>
     0.99%  PoolWorker 1     PPSSPPSDL                [.] Rasterizer::DrawSinglePixel<true, (GEBufferFormat)1>
     0.99%  PoolWorker 0     PPSSPPSDL                [.] Rasterizer::DrawSinglePixel<true, (GEBufferFormat)1>
     0.92%  PoolWorker 0     PPSSPPSDL                [.] .L422
     0.91%  PoolWorker 0     PPSSPPSDL                [.] .L774
     0.88%  PoolWorker 1     PPSSPPSDL                [.] .L422
     0.87%  PoolWorker 1     PPSSPPSDL                [.] .L774
     0.85%  Emu              libm-2.31.so             [.] ceilf32
     0.74%  Emu              PPSSPPSDL                [.] Lighting::ProcessSIMD<false>
     0.74%  PoolWorker 1     PPSSPPSDL                [.] Sampler::SampleNearest
     0.73%  PoolWorker 0     PPSSPPSDL                [.] .L490
     0.67%  PoolWorker 1     PPSSPPSDL                [.] .L490
     0.59%  PoolWorker 0     PPSSPPSDL                [.] Sampler::GetTextureFunctionOutput
     0.54%  PoolWorker 0     PPSSPPSDL                [.] .L724
     0.53%  PoolWorker 1     PPSSPPSDL                [.] Rasterizer::DrawRectangle
     0.53%  PoolWorker 0     PPSSPPSDL                [.] Rasterizer::DrawRectangle
     0.50%  PoolWorker 1     PPSSPPSDL                [.] .L724
     0.49%  PoolWorker 0     PPSSPPSDL                [.] .L721
     0.48%  PoolWorker 1     PPSSPPSDL                [.] Sampler::GetTextureFunctionOutput
     0.48%  Emu              PPSSPPSDL                [.] TransformUnit::ReadVertex
     0.46%  PoolWorker 1     PPSSPPSDL                [.] Rasterizer::CheckDepthTestPassed
     0.44%  PoolWorker 1     PPSSPPSDL                [.] .L721
     0.43%  PoolWorker 0     PPSSPPSDL                [.] Rasterizer::CheckDepthTestPassed
     0.43%  Emu              PPSSPPSDL                [.] BinManager::AddTriangle
     0.42%  PPSSPPSDL        r600_dri.so              [.] 0x000ef115
     0.41%  PoolWorker 0     PPSSPPSDL                [.] .L775
     0.41%  PoolWorker 0     PPSSPPSDL                [.] __x86.get_pc_thunk.bx
     0.41%  Emu              PPSSPPSDL                [.] .L338
     0.40%  PoolWorker 1     PPSSPPSDL                [.] .L775
     0.39%  Emu              PPSSPPSDL                [.] SoftGPU::FastRunLoop
     0.37%  PoolWorker 0     PPSSPPSDL                [.] Rasterizer::DrawSprite
     0.36%  PoolWorker 1     PPSSPPSDL                [.] __x86.get_pc_thunk.bx
     0.34%  Emu              PPSSPPSDL                [.] Sampler::SampleLinearLevel
     0.32%  Emu              PPSSPPSDL                [.] ClipToScreenInternal<true, false>
     0.29%  PoolWorker 0     PPSSPPSDL                [.] Sampler::SampleNearest<1>
     0.28%  PoolWorker 1     PPSSPPSDL                [.] Rasterizer::DrawSprite
     0.28%  PoolWorker 0     PPSSPPSDL                [.] .L44
     0.27%  Emu              PPSSPPSDL                [.] .L427
     0.27%  PoolWorker 1     PPSSPPSDL                [.] __x86.get_pc_thunk.ax
     0.27%  PoolWorker 0     PPSSPPSDL                [.] __x86.get_pc_thunk.ax
     0.25%  PoolWorker 0     PPSSPPSDL                [.] .L485
     0.25%  PoolWorker 1     PPSSPPSDL                [.] .L44
     0.23%  PoolWorker 1     PPSSPPSDL                [.] .L485
     0.23%  Emu              PPSSPPSDL                [.] Clipper::ProcessTriangle
     0.23%  PPSSPPSDL        r600_dri.so              [.] 0x002c0488
     0.23%  Emu              PPSSPPSDL                [.] .L285
     0.23%  Emu              PPSSPPSDL                [.] Rasterizer::ApplyTexturing
     0.22%  PoolWorker 0     PPSSPPSDL                [.] __x86.get_pc_thunk.si
     0.22%  Emu              PPSSPPSDL                [.] Sampler::LookupColor
     0.21%  PoolWorker 1     PPSSPPSDL                [.] __x86.get_pc_thunk.si
     0.21%  PoolWorker 0     PPSSPPSDL                [.] .L377
     0.21%  PoolWorker 1     PPSSPPSDL                [.] Sampler::SampleNearest<1>
     0.21%  PoolWorker 0     PPSSPPSDL                [.] .L748
     0.21%  Emu              PPSSPPSDL                [.] Rasterizer::GetPixelColor
     0.19%  Emu              PPSSPPSDL                [.] .L426
     0.19%  PoolWorker 1     PPSSPPSDL                [.] .L748
     0.19%  Emu              PPSSPPSDL                [.] Sampler::TransformClutIndex
     0.18%  PoolWorker 1     PPSSPPSDL                [.] .L374
     0.18%  PoolWorker 0     PPSSPPSDL                [.] .L374
     0.18%  Emu              PPSSPPSDL                [.] ConvertBGRA5551ToABGR1555
     0.17%  Emu              PPSSPPSDL                [.] Rasterizer::DrawSinglePixel<false, (GEBufferFormat)1>
     0.17%  Emu              PPSSPPSDL                [.] Rasterizer::SetPixelColor
     0.17%  Emu              PPSSPPSDL                [.] Rasterizer::CalculateRasterStateFlags
     0.17%  Emu              PPSSPPSDL                [.] Sampler::SampleNearest

This is me playing a couple of minutes of "Soulcalibur: Broken Destiny", mostly a 3D scene.

The profile is flat (recorded without --call-graph), so it only shows where, and how often, samples land.

Even without JIT, Rasterizer::DrawSinglePixel accounts for surprisingly little, much less than Sampler::SampleLinearLevel (though some games may not use linear filtering much).
A lot of time is spent in Rasterizer::DrawTriangleSlice itself (and whatever is inlined into it).

Now, looking at the code, the biggest thing is that most of the processing is done per pixel. DrawTriangleSlice actually works on 2x2 pixel quads, but most of the actual processing then happens per pixel inside the quad. All of state.drawPixel, state.nearest, and state.linear are per-pixel, even when JIT-ed. I assume this is because the first order of business was to get it right, which is easier with per-pixel functions.

Converting it to be entirely quad-based seems like a lot of work (especially since it involves both the JIT and non-JIT parts, kept in sync). It also seems lucrative for performance. Even purely scalar quad-based versions are likely to be faster than their per-pixel counterparts, since the various if(state.whatever) checks would be amortized over four pixels. And the texture lookup seems like the only thing that does not SIMD-ify readily, pre-AVX2 (and emulating gather in plain scalar+SSE is not even that bad). Going by names like DrawSinglePixel, the idea seems to have been there all along. As an aside, I don't have statistics for how common tiny triangles are, or what percentage of 2x2 quads are full.
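To make the quad idea concrete, here is a rough sketch of what I mean by emulating gather and amortizing a state check. Everything in it (Quad32, QuadState, clearAlpha, GatherTexels, ShadeQuad) is made up for illustration and is not PPSSPP's actual code or state layout. The loads stay scalar, only the packing uses SSE2, and there is a plain fallback for other targets.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define QUAD_SKETCH_SSE2 1
#endif

// Four 32-bit colors, one per pixel of a 2x2 quad (hypothetical type).
struct Quad32 {
    uint32_t v[4];
};

// Hypothetical state with a single flag; the real state has many more.
struct QuadState {
    bool clearAlpha;
};

// Emulated gather: four independent scalar loads packed into one vector.
// SSE2 has no gather instruction, so only the packing is SIMD.
inline Quad32 GatherTexels(const uint32_t *texels, const int idx[4]) {
    assert(texels != nullptr);
    Quad32 out;
#ifdef QUAD_SKETCH_SSE2
    // _mm_set_epi32 takes lanes from highest to lowest.
    __m128i r = _mm_set_epi32(static_cast<int>(texels[idx[3]]),
                              static_cast<int>(texels[idx[2]]),
                              static_cast<int>(texels[idx[1]]),
                              static_cast<int>(texels[idx[0]]));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.v), r);
#else
    for (int i = 0; i < 4; ++i)
        out.v[i] = texels[idx[i]];
#endif
    return out;
}

// Quad-based processing: the state flag is tested once per quad
// instead of once per pixel.
inline Quad32 ShadeQuad(const QuadState &state, const uint32_t *texels, const int idx[4]) {
    Quad32 q = GatherTexels(texels, idx);
    if (state.clearAlpha) {
        for (uint32_t &c : q.v)
            c &= 0x00FFFFFFu;
    }
    return q;
}
```

With a per-pixel function the `if (state.clearAlpha)` branch would run four times for the same quad; here it runs once, which is the amortization mentioned above.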

The Vec3<...> and Vec4<...> are used extensively, but some operations seem missing (notably bitwise stuff). Also, only x86 has SIMD paths for operations on them, and I don't think the default paths auto-vectorize, since auto-vectorization is under -O3, but PPSSPP uses -O2. Not sure if performance on ARM is a concern.
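For the bitwise part, something along these lines is what I have in mind. This is a sketch on a made-up Vec4u type, not the actual Vec4<...> template, so the names and layout are assumptions. It has an SSE2 path and a scalar fallback, like the existing operations.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define VEC_SKETCH_SSE2 1
#endif

// Hypothetical 4-lane unsigned vector; a stand-in, NOT PPSSPP's Vec4<int>.
struct Vec4u {
    uint32_t x, y, z, w;
};

#ifdef VEC_SKETCH_SSE2
inline __m128i Load(const Vec4u &a) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i *>(&a));
}
inline Vec4u Store(__m128i v) {
    Vec4u r;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(&r), v);
    return r;
}
#endif

// Lane-wise AND.
inline Vec4u operator&(const Vec4u &a, const Vec4u &b) {
#ifdef VEC_SKETCH_SSE2
    return Store(_mm_and_si128(Load(a), Load(b)));
#else
    return Vec4u{a.x & b.x, a.y & b.y, a.z & b.z, a.w & b.w};
#endif
}

// Lane-wise OR.
inline Vec4u operator|(const Vec4u &a, const Vec4u &b) {
#ifdef VEC_SKETCH_SSE2
    return Store(_mm_or_si128(Load(a), Load(b)));
#else
    return Vec4u{a.x | b.x, a.y | b.y, a.z | b.z, a.w | b.w};
#endif
}

// Lane-wise logical right shift by the same count in every lane.
inline Vec4u operator>>(const Vec4u &a, int shift) {
    assert(shift >= 0 && shift < 32);
#ifdef VEC_SKETCH_SSE2
    return Store(_mm_srl_epi32(Load(a), _mm_cvtsi32_si128(shift)));
#else
    return Vec4u{a.x >> shift, a.y >> shift, a.z >> shift, a.w >> shift};
#endif
}
```

The lanes are unsigned here so that the scalar and SSE2 paths agree on the shift being logical rather than arithmetic.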

I do not see why

Vec4<float> wsum_recip = EdgeRecip(w0, w1, w2);

is computed per pixel (or rather per quad). Normally, the invariant w0+w1+w2=const holds for the entire triangle (the entire screen, actually), unless some weird per-edge scaling is done. When I tried computing it once per DrawTriangleSlice call, no visible problems appeared.
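A small sketch of why I expect the sum to be constant, using the textbook edge function on integer coordinates. Point, Edge, WeightSum and TriangleWsumRecip are names I made up for this example. It assumes plain, unscaled edge functions, so it says nothing about whatever extra bias or scaling the real EdgeRecip path might apply.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2D point in integer (e.g. subpixel) coordinates.
struct Point {
    int x, y;
};

// Standard edge function: twice the signed area of triangle (a, b, p).
inline int64_t Edge(Point a, Point b, Point p) {
    return int64_t(b.x - a.x) * (p.y - a.y) - int64_t(b.y - a.y) * (p.x - a.x);
}

// w0 + w1 + w2 at pixel p, where each weight is the edge function
// of the edge opposite the corresponding vertex.
inline int64_t WeightSum(Point v0, Point v1, Point v2, Point p) {
    int64_t w0 = Edge(v1, v2, p);
    int64_t w1 = Edge(v2, v0, p);
    int64_t w2 = Edge(v0, v1, p);
    return w0 + w1 + w2;
}

// The sum does not depend on p: it equals twice the triangle's signed area,
// so the reciprocal can be computed once per triangle.
inline float TriangleWsumRecip(Point v0, Point v1, Point v2) {
    int64_t area2 = Edge(v0, v1, v2);
    assert(area2 != 0);  // degenerate triangles have no valid reciprocal
    return 1.0f / static_cast<float>(area2);
}
```

For the triangle (0,0), (4,0), (0,4), WeightSum gives 16 at any p, inside or outside the triangle, which matches Edge(v0, v1, v2). Adding a constant fill-rule bias to each edge would shift the sum but still keep it constant across pixels.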

Who would this benefit

Platform (if relevant)

None

Games this would be useful in

Other emulators or software with a similar feature

No response
