Description
What should happen
As mentioned, some (possibly dumb; I'm trying to wrap my head around it) observations on SoftGPU.
First off, here's what performance looks like (Linux, 32-bit x86 with SSE2 but without SSE4):
$ perf record ./PPSSPPSDL
$ perf report --stdio | head -n 100
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3M of event 'cycles:u'
# Event count (approx.): 2130104713894
#
# Overhead Command Shared Object Symbol
# ........ ............... ....................... ..............................
#
10.82% PoolWorker 0 PPSSPPSDL [.] Rasterizer::DrawTriangleSlice<false, false>
10.67% PoolWorker 1 PPSSPPSDL [.] Rasterizer::DrawTriangleSlice<false, false>
3.02% PPSSPPSDL r600_dri.so [.] 0x002bcd15
3.01% PPSSPPSDL r600_dri.so [.] 0x002bcd23
2.65% PoolWorker 0 PPSSPPSDL [.] Sampler::SampleLinearLevel
2.54% PoolWorker 1 PPSSPPSDL [.] Sampler::SampleLinearLevel
2.23% Emu PPSSPPSDL [.] Rasterizer::DrawTriangleSlice<false, false>
2.07% PoolWorker 0 PPSSPPSDL [.] .L426
2.06% PoolWorker 0 PPSSPPSDL [.] Rasterizer::GetPixelColor
2.04% PoolWorker 1 PPSSPPSDL [.] .L426
1.96% PoolWorker 1 PPSSPPSDL [.] Rasterizer::GetPixelColor
1.75% PoolWorker 0 PPSSPPSDL [.] Sampler::LookupColor
1.67% PoolWorker 0 PPSSPPSDL [.] Rasterizer::SetPixelColor
1.63% PoolWorker 1 PPSSPPSDL [.] Sampler::LookupColor
1.60% PoolWorker 1 PPSSPPSDL [.] Rasterizer::SetPixelColor
1.41% PoolWorker 0 PPSSPPSDL [.] Rasterizer::ApplyTexturing
1.39% PoolWorker 1 PPSSPPSDL [.] Rasterizer::ApplyTexturing
1.36% PoolWorker 0 PPSSPPSDL [.] Sampler::TransformClutIndex
1.27% PoolWorker 0 PPSSPPSDL [.] .L427
1.27% PoolWorker 1 PPSSPPSDL [.] Sampler::TransformClutIndex
1.17% PoolWorker 1 PPSSPPSDL [.] .L427
1.15% PoolWorker 0 PPSSPPSDL [.] Sampler::SampleLinear
1.14% PoolWorker 0 PPSSPPSDL [.] Rasterizer::DrawSinglePixel<false, (GEBufferFormat)1>
1.02% PoolWorker 1 PPSSPPSDL [.] Sampler::SampleLinear
1.01% PoolWorker 0 PPSSPPSDL [.] Sampler::SampleNearest
1.01% PoolWorker 1 PPSSPPSDL [.] Rasterizer::DrawSinglePixel<false, (GEBufferFormat)1>
0.99% PoolWorker 1 PPSSPPSDL [.] Rasterizer::DrawSinglePixel<true, (GEBufferFormat)1>
0.99% PoolWorker 0 PPSSPPSDL [.] Rasterizer::DrawSinglePixel<true, (GEBufferFormat)1>
0.92% PoolWorker 0 PPSSPPSDL [.] .L422
0.91% PoolWorker 0 PPSSPPSDL [.] .L774
0.88% PoolWorker 1 PPSSPPSDL [.] .L422
0.87% PoolWorker 1 PPSSPPSDL [.] .L774
0.85% Emu libm-2.31.so [.] ceilf32
0.74% Emu PPSSPPSDL [.] Lighting::ProcessSIMD<false>
0.74% PoolWorker 1 PPSSPPSDL [.] Sampler::SampleNearest
0.73% PoolWorker 0 PPSSPPSDL [.] .L490
0.67% PoolWorker 1 PPSSPPSDL [.] .L490
0.59% PoolWorker 0 PPSSPPSDL [.] Sampler::GetTextureFunctionOutput
0.54% PoolWorker 0 PPSSPPSDL [.] .L724
0.53% PoolWorker 1 PPSSPPSDL [.] Rasterizer::DrawRectangle
0.53% PoolWorker 0 PPSSPPSDL [.] Rasterizer::DrawRectangle
0.50% PoolWorker 1 PPSSPPSDL [.] .L724
0.49% PoolWorker 0 PPSSPPSDL [.] .L721
0.48% PoolWorker 1 PPSSPPSDL [.] Sampler::GetTextureFunctionOutput
0.48% Emu PPSSPPSDL [.] TransformUnit::ReadVertex
0.46% PoolWorker 1 PPSSPPSDL [.] Rasterizer::CheckDepthTestPassed
0.44% PoolWorker 1 PPSSPPSDL [.] .L721
0.43% PoolWorker 0 PPSSPPSDL [.] Rasterizer::CheckDepthTestPassed
0.43% Emu PPSSPPSDL [.] BinManager::AddTriangle
0.42% PPSSPPSDL r600_dri.so [.] 0x000ef115
0.41% PoolWorker 0 PPSSPPSDL [.] .L775
0.41% PoolWorker 0 PPSSPPSDL [.] __x86.get_pc_thunk.bx
0.41% Emu PPSSPPSDL [.] .L338
0.40% PoolWorker 1 PPSSPPSDL [.] .L775
0.39% Emu PPSSPPSDL [.] SoftGPU::FastRunLoop
0.37% PoolWorker 0 PPSSPPSDL [.] Rasterizer::DrawSprite
0.36% PoolWorker 1 PPSSPPSDL [.] __x86.get_pc_thunk.bx
0.34% Emu PPSSPPSDL [.] Sampler::SampleLinearLevel
0.32% Emu PPSSPPSDL [.] ClipToScreenInternal<true, false>
0.29% PoolWorker 0 PPSSPPSDL [.] Sampler::SampleNearest<1>
0.28% PoolWorker 1 PPSSPPSDL [.] Rasterizer::DrawSprite
0.28% PoolWorker 0 PPSSPPSDL [.] .L44
0.27% Emu PPSSPPSDL [.] .L427
0.27% PoolWorker 1 PPSSPPSDL [.] __x86.get_pc_thunk.ax
0.27% PoolWorker 0 PPSSPPSDL [.] __x86.get_pc_thunk.ax
0.25% PoolWorker 0 PPSSPPSDL [.] .L485
0.25% PoolWorker 1 PPSSPPSDL [.] .L44
0.23% PoolWorker 1 PPSSPPSDL [.] .L485
0.23% Emu PPSSPPSDL [.] Clipper::ProcessTriangle
0.23% PPSSPPSDL r600_dri.so [.] 0x002c0488
0.23% Emu PPSSPPSDL [.] .L285
0.23% Emu PPSSPPSDL [.] Rasterizer::ApplyTexturing
0.22% PoolWorker 0 PPSSPPSDL [.] __x86.get_pc_thunk.si
0.22% Emu PPSSPPSDL [.] Sampler::LookupColor
0.21% PoolWorker 1 PPSSPPSDL [.] __x86.get_pc_thunk.si
0.21% PoolWorker 0 PPSSPPSDL [.] .L377
0.21% PoolWorker 1 PPSSPPSDL [.] Sampler::SampleNearest<1>
0.21% PoolWorker 0 PPSSPPSDL [.] .L748
0.21% Emu PPSSPPSDL [.] Rasterizer::GetPixelColor
0.19% Emu PPSSPPSDL [.] .L426
0.19% PoolWorker 1 PPSSPPSDL [.] .L748
0.19% Emu PPSSPPSDL [.] Sampler::TransformClutIndex
0.18% PoolWorker 1 PPSSPPSDL [.] .L374
0.18% PoolWorker 0 PPSSPPSDL [.] .L374
0.18% Emu PPSSPPSDL [.] ConvertBGRA5551ToABGR1555
0.17% Emu PPSSPPSDL [.] Rasterizer::DrawSinglePixel<false, (GEBufferFormat)1>
0.17% Emu PPSSPPSDL [.] Rasterizer::SetPixelColor
0.17% Emu PPSSPPSDL [.] Rasterizer::CalculateRasterStateFlags
0.17% Emu PPSSPPSDL [.] Sampler::SampleNearest
This is me playing a couple of minutes of "Soulcalibur: Broken Destiny", mostly a 3D scene.
The profile is flat, recorded without --call-graph - it just shows where (and how often) samples land.
Even without JIT, Rasterizer::DrawSinglePixel accounts for surprisingly little - much less than Sampler::SampleLinearLevel (there might be some games that do not use linear much).
A lot of time is spent in Rasterizer::DrawTriangleSlice itself (and whatever is inlined into it).
Now, looking at the code, the biggest thing is that most of the processing is done per pixel. DrawTriangleSlice actually works on 2x2 pixel quads, but then most of the actual processing is per-pixel inside the quad. All of state.drawPixel, state.nearest, and state.linear are per-pixel, even when JIT-ed. I assume this is because the first order of business was to get it right, which is easier with per-pixel functions.
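To make the shape concrete, here is a rough sketch contrasting a per-pixel function with a quad-level one. Everything in it (PixelState, Quad, DoubleRGB, ShadePixel, ShadeQuad) is a hypothetical stand-in and not the actual SoftGPU code; the two flags merely play the role of any state test that is currently evaluated once per pixel.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical, heavily simplified state; the real SoftGPU state has many more fields.
struct PixelState {
    bool applyColorDoubling;  // stand-in for any "if (state.whatever)" test
    bool alphaTest;
    uint8_t alphaRef;
};

using Quad = std::array<uint32_t, 4>;  // four 8888 colors of one 2x2 quad, alpha in the top byte

// Doubles the three low channels with saturation, keeps alpha.
static uint32_t DoubleRGB(uint32_t c) {
    uint32_t out = c & 0xFF000000u;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t ch = (c >> shift) & 0xFFu;
        ch = ch * 2 > 255 ? 255 : ch * 2;
        out |= ch << shift;
    }
    return out;
}

// Per-pixel shape: every flag is tested again for every pixel.
static bool ShadePixel(const PixelState &state, uint32_t &color) {
    if (state.alphaTest && (color >> 24) < state.alphaRef)
        return false;
    if (state.applyColorDoubling)
        color = DoubleRGB(color);
    return true;
}

// Quad shape: every flag is tested once per quad, and the inner loops
// are straight-line work over four lanes. Returns the updated coverage mask.
static int ShadeQuad(const PixelState &state, Quad &colors, int mask) {
    if (state.alphaTest) {
        for (int i = 0; i < 4; ++i)
            if ((colors[i] >> 24) < state.alphaRef)
                mask &= ~(1 << i);
    }
    if (state.applyColorDoubling) {
        // Lanes that already failed are processed too; their results are
        // simply discarded by the mask, which keeps the loop branch-free.
        for (int i = 0; i < 4; ++i)
            colors[i] = DoubleRGB(colors[i]);
    }
    return mask;
}
```

The four-lane loops in ShadeQuad are also the natural places where a SIMD implementation could later be dropped in without changing the callers.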
Converting it to be entirely quad-based seems like a lot of work (especially since it involves both the JIT and non-JIT parts, kept in sync). It also seems promising for performance. Even purely scalar quad-based versions are likely to be faster than their per-pixel counterparts, since the various if(state.whatever) checks would be amortized over four pixels. And the texture lookup seems like the only thing that does not SIMD-ify readily pre-AVX2, which is where hardware gather instructions appear (and emulating gather in plain scalar+SSE is not even that bad). Going by names like DrawSinglePixel, the idea seems to have been there all along. As an aside, I don't have statistics for how common tiny triangles are, or what percentage of 2x2 quads are full.
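For reference, this is roughly what I mean by emulating gather with scalar loads plus SSE2. It is a minimal sketch under my own assumptions: Texel4 and Gather4 are hypothetical names, the texture is assumed to be a flat array of 32-bit texels, and the indices are assumed to be already clamped or wrapped. Without SSE2 it falls back to plain scalar code.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_USE_SSE2 1
#endif

// Hypothetical container for the four texels of one quad.
struct Texel4 {
    uint32_t v[4];
};

// Fetches texels[idx[0..3]]. The caller must guarantee the indices are in range.
static Texel4 Gather4(const uint32_t *texels, const int32_t idx[4]) {
    Texel4 out;
#ifdef SKETCH_USE_SSE2
    // Four scalar loads, each moved into the low lane of its own register...
    __m128i t0 = _mm_cvtsi32_si128(static_cast<int>(texels[idx[0]]));
    __m128i t1 = _mm_cvtsi32_si128(static_cast<int>(texels[idx[1]]));
    __m128i t2 = _mm_cvtsi32_si128(static_cast<int>(texels[idx[2]]));
    __m128i t3 = _mm_cvtsi32_si128(static_cast<int>(texels[idx[3]]));
    // ...then merged with three unpacks into one register: t0 t1 t2 t3.
    __m128i lo = _mm_unpacklo_epi32(t0, t1);
    __m128i hi = _mm_unpacklo_epi32(t2, t3);
    __m128i all = _mm_unpacklo_epi64(lo, hi);
    // A real quad pipeline would keep working on 'all'; it is stored here
    // only so the result is easy to inspect.
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.v), all);
#else
    for (int i = 0; i < 4; ++i)
        out.v[i] = texels[idx[i]];
#endif
    return out;
}
```

The cost is four loads and three shuffles per quad, after which filtering and blending can stay in SIMD registers.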
Vec3<...> and Vec4<...> are used extensively, but some operations seem to be missing (notably bitwise ones). Also, only x86 has SIMD paths for operations on them, and I don't think the default paths auto-vectorize, since GCC (before version 12) enables auto-vectorization only at -O3, while PPSSPP uses -O2. I'm not sure whether performance on ARM is a concern.
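As an illustration of the kind of bitwise helpers I have in mind, here is a small sketch. Int4 is a hypothetical stand-in and not PPSSPP's actual Vec4<int>; the operators and ShiftRightLogical are names I made up. There is an SSE2 path and a scalar fallback that other architectures would use.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_USE_SSE2 1
#endif

// Hypothetical 4 x int32 vector; not PPSSPP's Vec4<int>.
struct Int4 {
    int32_t v[4];
};

#ifdef SKETCH_USE_SSE2
static inline __m128i Load(const Int4 &a) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i *>(a.v));
}
static inline Int4 Store(__m128i x) {
    Int4 r;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(r.v), x);
    return r;
}
inline Int4 operator&(const Int4 &a, const Int4 &b) { return Store(_mm_and_si128(Load(a), Load(b))); }
inline Int4 operator|(const Int4 &a, const Int4 &b) { return Store(_mm_or_si128(Load(a), Load(b))); }
inline Int4 operator^(const Int4 &a, const Int4 &b) { return Store(_mm_xor_si128(Load(a), Load(b))); }
// Logical (zero-filling) right shift of all lanes; count is expected to be 0..31.
inline Int4 ShiftRightLogical(const Int4 &a, int count) {
    return Store(_mm_srl_epi32(Load(a), _mm_cvtsi32_si128(count)));
}
#else
inline Int4 operator&(const Int4 &a, const Int4 &b) {
    return Int4{{a.v[0] & b.v[0], a.v[1] & b.v[1], a.v[2] & b.v[2], a.v[3] & b.v[3]}};
}
inline Int4 operator|(const Int4 &a, const Int4 &b) {
    return Int4{{a.v[0] | b.v[0], a.v[1] | b.v[1], a.v[2] | b.v[2], a.v[3] | b.v[3]}};
}
inline Int4 operator^(const Int4 &a, const Int4 &b) {
    return Int4{{a.v[0] ^ b.v[0], a.v[1] ^ b.v[1], a.v[2] ^ b.v[2], a.v[3] ^ b.v[3]}};
}
// Logical (zero-filling) right shift of all lanes; count is expected to be 0..31.
inline Int4 ShiftRightLogical(const Int4 &a, int count) {
    Int4 r;
    for (int i = 0; i < 4; ++i)
        r.v[i] = static_cast<int32_t>(static_cast<uint32_t>(a.v[i]) >> count);
    return r;
}
#endif
```

With helpers like these, unpacking a channel from four packed colors becomes ShiftRightLogical(c, 8) & mask instead of four scalar expressions.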
I do not see why the line Vec4<float> wsum_recip = EdgeRecip(w0, w1, w2); (ppsspp/GPU/Software/Rasterizer.cpp, line 963 in a56f74c) is evaluated per pixel (per quad). Normally, the invariant w0+w1+w2=const holds for the entire triangle (for the entire screen, actually), unless some unusual per-edge scaling is done. When I tried computing it once per DrawTriangleSlice, there were no visible problems.
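The invariant is easy to check numerically. The sketch below uses the textbook edge function, not PPSSPP's own code; Point, Edge, and WeightSum are hypothetical names. Each edge function is affine in the sample position, and in the sum of the three the x and y coefficients cancel, leaving twice the signed area of the triangle.

```cpp
#include <cassert>
#include <cstdint>

struct Point {
    int32_t x, y;
};

// Textbook edge function: twice the signed area of the triangle (a, b, p).
static int64_t Edge(Point a, Point b, Point p) {
    return int64_t(b.x - a.x) * (p.y - a.y) - int64_t(b.y - a.y) * (p.x - a.x);
}

// Sum of the three unnormalized barycentric weights at sample position p.
static int64_t WeightSum(Point v0, Point v1, Point v2, Point p) {
    int64_t w0 = Edge(v1, v2, p);
    int64_t w1 = Edge(v2, v0, p);
    int64_t w2 = Edge(v0, v1, p);
    return w0 + w1 + w2;
}
```

For any p, inside or outside the triangle, WeightSum(v0, v1, v2, p) equals Edge(v0, v1, v2). A constant bias added to an edge (for a fill rule, say) only shifts that constant. So, assuming EdgeRecip returns the reciprocal of this sum, it could be computed once per triangle and reused for every quad.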
Who would this benefit
Platform (if relevant)
None
Games this would be useful in
Other emulators or software with a similar feature
No response