Monday, December 1, 2014

Analysis of GPU performance of mobile SoCs based on GFXBench results

In this post, I am analysing the GPU performance of different GPUs and SoCs based on the results database of GFXBench, one of the leading mobile GPU benchmarks. Apart from providing a GPU performance comparison for different SoCs, GFXBench results provide sufficient detail to get an impression of metrics like fill rate, triangle rate and shader performance, allowing one to draw conclusions about what the bottleneck is in a particular implementation.

GFXBench results table for mobile SoCs


The following table shows detailed GFXBench 3.0 results for a large number of mobile SoC platforms and devices. The results are grouped by smartphone and tablet devices, and further grouped for similar chips (smartphone table) or in alphabetical order by chip (tablet table).

For a high-resolution version, view/copy/save the image above using the browser.

The same table is shown below, but sorted on the T-Rex Offscreen benchmark score in descending order, which provides a reasonable device-independent indication of GPU performance.

For a high-resolution version, view/copy/save the image above using the browser.
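As a rough sketch of how such a sorted ranking can be reproduced, the snippet below orders a handful of T-Rex offscreen scores that are quoted later in this post. The dictionary and the function name `sort_by_score` are my own illustration and not part of GFXBench; scores described as "about" in the text are approximate.

```python
# T-Rex offscreen scores quoted elsewhere in this post.
# Values described as "about N" in the text are approximate.
trex_offscreen = {
    "MT8382": 220,
    "RK3188T (Mali-400 MP4)": 350,
    "MT8127 (Mali-450 MP4)": 500,
    "Leadcore L1860 (Mali-T628 MP2)": 580,
    "MT8135V": 740,
    "Snapdragon 615 (Adreno 405)": 850,
    "Intel Atom Z3745G (Acer A1-840)": 853,
    "Intel Atom Z3745F (Acer A1-840 FHD)": 1181,
}


def sort_by_score(scores):
    """Return (chip, score) pairs ordered from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for chip, score in sort_by_score(trex_offscreen):
    print(f"{score:>5}  {chip}")
```

Because the offscreen test renders at a fixed 1920x1080 resolution, this ordering does not depend on the screen of each device.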


Top-performing SoCs: Apple A8/A8X, Snapdragon 805, NVIDIA Tegra K1 and Exynos 7 Octa


Apple's A8 and A8X SoCs, NVIDIA's Tegra K1 (both the Cortex-A15-based version and the NVIDIA Denver-based version) and Qualcomm's Snapdragon 805 lead the pack for mobile GPU performance. What most of these chips have in common is a large number of GPU pixel processing cores and a wide DRAM interface (especially in the case of the Apple A8X and Snapdragon 805) to achieve high memory bandwidth. The Apple A8X has been reported by AnandTech to contain an eight-cluster PowerVR Series 6 GPU, twice the number of clusters of the GPU inside the Apple A8.

In the OpenGL ES 2.0-based T-Rex offscreen benchmark, the Apple A8X as used in the iPad Air 2 leads, closely followed by the respective versions of Tegra K1 in the HTC Nexus 9 and the NVIDIA Shield Tablet. The Apple A8 and Snapdragon 805 are significantly slower (although still very fast for most purposes) and perform comparably to each other in T-Rex offscreen, even though Snapdragon 805 shows significantly higher low-level metrics such as fillrate, alpha blending bandwidth and shader processing throughput. Snapdragon 805 (with Adreno 420 GPU) has an effective 128-bit memory interface (similar to Apple A8X), which suggests the Apple A8 (with 64-bit memory interface) has greater efficiency within the limitations of the lower memory bandwidth, probably helped by the use of large on-chip caches (including the L3 cache). Samsung's Exynos 7 Octa (Exynos 5433, with Mali-T760 MP6) is somewhat slower than Apple A8 and Snapdragon 805, and so is the slowest of the high-performance processors in terms of GPU power (while being near the lead in terms of CPU performance).

In the OpenGL ES 3.0-based Manhattan benchmark (offscreen, so that the results are largely independent of screen resolution), the Apple A8X and NVIDIA Tegra K1 provide comparable performance (a score just above 2000), while the Snapdragon 805 follows at a considerable distance with a score of about 1200, similar to the score achieved by the Apple A8 inside the iPhone 6 and iPhone 6 Plus. Samsung's Mali-T760 MP6-based Exynos 7 Octa (as represented by the Exynos-based version of the Galaxy Note 4) follows with a score of about 1100.

High-end: Snapdragon 801, Exynos 5 Octa, Apple A7


Qualcomm's Snapdragon 801 with Adreno 330 GPU has been widely used in performance-oriented devices for some time and provides relatively high performance for the segment. Part of the reason for the wide adoption of the high-powered Snapdragon 801 is that Qualcomm has not had a convenient SoC offering between the Snapdragon 801 and Snapdragon 400, which are separated by a large performance and cost gap. In addition, Qualcomm's patent royalty leverage gives it control over the high-performance smartphone market, which has allowed it to convince customers to use the Snapdragon 801 in a wide range of devices (as it did previously with the Snapdragon 800), even though the SoC provides more performance than really necessary in many cases.

In the OpenGL ES 2.0-based T-Rex (offscreen) test, Snapdragon 801 scores approximately the same as Apple's previous-generation Apple A7 SoC. Samsung's recent Exynos 5 Octa (Exynos 5430, with Mali-T628 MP6) used in the Galaxy Alpha also scores about the same. The results for the OpenGL ES 3.0-based Manhattan benchmark are also comparable for these three SoCs.

PowerVR's Rogue Han (G6200) GPU with two clusters inside MediaTek's recent MT6595 does not match the performance of the other high-end chips mentioned above, although it still provides performance clearly above current and upcoming mid-range solutions. This GPU is also implemented in Allwinner's A80 chip, which shows somewhat lower scores in a benchmark entry for an A80 OptimusBoard development board.

Cost-sensitive SoCs: Snapdragon 410 vs Snapdragon 400 vs MT6582


Rather than showing an evolutionary improvement in GPU performance, the quad-core Cortex-A53-based Snapdragon 410's Adreno 306 GPU actually shows 10% to 20% lower GPU performance than the Adreno 305 in Snapdragon 400 based on metrics like fillrate and the offscreen T-Rex benchmark. This provides evidence that Snapdragon 410 is also a cost-reduction effort in comparison with Snapdragon 400, with a smaller die size for the GPU to reduce cost. This also helps to explain why Qualcomm has aggressively pitched the Snapdragon 410 for low-end 4G smartphones as well as somewhat higher segments, with Snapdragon 410 reported to be Qualcomm's current main volume driver.

When looking at previous-generation chips, the Adreno 305 in Snapdragon 400 scores higher than MediaTek's MT6582 in the offscreen T-Rex benchmark (approximately 40% better), while some low-level metrics are lower than those of the MT6582. For example, GFXBench's Driver Overhead score is relatively low for both Snapdragon 400 and Snapdragon 410, reflecting mediocre performance when rendering lots of small objects. The fillrate score is also a little lower than that of the MT6582. The higher T-Rex benchmark performance is probably due to the larger and more optimized cache memory subsystem used in Snapdragon 400 and 410. Exactly how Snapdragon 400/410 compares with the MT6582 and other solutions in other benchmarks and games is beyond the scope of this article.

The next generation of efficient Cortex-A53-based mid-range SoCs: Snapdragon 610 and 615, MT6732 and MT6752


Several new chips for the mid-range performance segment are emerging that use a quad-core or octa-core Cortex-A53 CPU configuration. The use of Cortex-A53 cores at a relatively high clock frequency promises to significantly improve power efficiency and cost for this segment (which might previously have required the use of more costly SoCs such as Snapdragon 801). This CPU configuration provides adequate single-core performance and (in the case of an octa-core CPU) great multi-core performance.

Both Qualcomm and MediaTek have introduced SoCs in this class, which also introduce new GPU architectures. Qualcomm's Snapdragon 610 (quad-core Cortex-A53) and Snapdragon 615 (octa-core Cortex-A53) utilize the new Adreno 405 GPU, while MediaTek's quad-core MT6732 and octa-core MT6752 utilize a Mali-T760 MP2 GPU (Mali-T760 has also been adopted by Samsung and others).

T-Rex offscreen performance of Snapdragon 615's Adreno 405 GPU (as represented by an entry for a Lenovo device) with a score of about 850 clearly puts the chip in the performance-oriented segment, since Snapdragon 400 and 410 score not much more than 300 in this benchmark. The OpenGL ES 3.0 Manhattan offscreen benchmark score is similarly significantly higher (about three times higher than Snapdragon 400/410). Low-level metrics are all fairly high for a mid-range device, with only fillrate being limited by the 32-bit DRAM interface.
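To make the size of this gap concrete, here is a minimal sketch that computes the ratio between two scores. The helper `speedup` is a name I made up for illustration, and the inputs are the rounded figures from the paragraph above (about 850 versus roughly 300), so the result is only indicative.

```python
def speedup(score, baseline):
    """Return how many times higher `score` is than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return score / baseline


# Rounded T-Rex offscreen scores from the text (approximate values).
snapdragon_615 = 850
snapdragon_400_410 = 300

print(f"Snapdragon 615 vs 400/410: {speedup(snapdragon_615, snapdragon_400_410):.1f}x")
```

With these rounded inputs the T-Rex ratio comes out a little under three, in line with the roughly threefold difference seen in Manhattan.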

MediaTek's MT6752 with Mali-T760 MP2 (as represented by a Gionee device entry) shows scores for T-Rex and Manhattan that are comparable with Snapdragon 615. Raw low-level metrics such as ALU, Alpha Blending and fillrate are clearly lower than Snapdragon 615, with only Driver Overhead being superior, suggesting that new ARM optimization technologies such as ARM Framebuffer Compression, Smart Composition and Transaction Elimination are already having a positive effect on real-world performance, especially within the bounds of a 32-bit DRAM interface, keeping device cost down.

In terms of cost, the ability of MediaTek's MT6752 to provide good performance for a mid-range device, comparable to Snapdragon 615, with an economical 32-bit DRAM interface, makes the chip look very attractive. This also provides evidence that ARM has made somewhat of a breakthrough in terms of performance efficiency with Mali-T760 and the associated optimization techniques mentioned above, mostly based on compression techniques, which will revolutionize performance for economical devices with a 32-bit memory interface that have limited memory bandwidth.

MediaTek's quad-core MT6732 (as represented by an Asus device entry), which also has a Mali-T760 MP2 GPU (but clocked lower than in the MT6752), scores lower but still very respectably for a mid-range device (especially in the real-world T-Rex and Manhattan benchmarks). There have been reports, though, suggesting that the Mali-T760's efficiency benefits come at the cost of a relatively large chip die size for a cost-sensitive device, so that a chip such as the MT6732 is not suitable for the high-volume entry-level 4G market (for which Snapdragon 410 is likely to be much more suitable). MediaTek is addressing this with its upcoming MT6735 with cheaper Mali-T720 GPU, which does not appear to offer the bandwidth optimization techniques of the Mali-T760.

MT6592 still has competitive GPU performance


MediaTek's octa-core MT6592 smartphone chip (which was released almost a year ago) with a T-Rex offscreen score in excess of 700 has GPU performance that roughly matches that of the upcoming mid-range chips described above, which are addressing approximately the same segment. The high GPU clock speed of the Mali-450 MP4 GPU probably drives the high scores.

The disadvantages of the MT6592 are a lack of OpenGL ES 3.x support and a likely greater memory bandwidth bottleneck when running at high screen resolutions such as 1920x1080, which also impacts power efficiency. GFXBench's battery life benchmarks when running T-Rex long-term are mediocre for most MT6592-based devices, including devices using a 1280x720 resolution, although it is likely that less demanding 3D applications exhibit better battery life. The Cortex-A7 CPU cores (typically clocked at 1.7 GHz) are also slower than the eight Cortex-A53 cores inside a chip like the MT6752 (but still provide plenty of performance).

RK3288's Mali-T764 GPU: Exact nature unclear


Rockchip's RK3288 is a relatively high performance SoC intended primarily for tablets but currently mainly implemented in devices such as media boxes and development boards. For a long time, Rockchip has advertised its RK3288 SoC as featuring an ARM Mali-T764 GPU. This is confusing because ARM has never announced a GPU with that name. ARM's Mali-T760, also used in new SoCs from other companies such as Exynos 5433 (Exynos 7 Octa) and several new MediaTek SoCs, comes close, and one could assume Rockchip means a Mali-T760 MP4 configuration.

However, in the GFXBench results database, all device entries (mainly representing Android TV box devices, but also including tablets such as the Teclast P90HD) for the RK3288 show a set of GL_EXTENSIONS that is identical to that of devices with a Mali-T628 or Mali-T624 GPU. In particular, the GL_EXT_disjoint_timer_query, GL_EXT_sRGB and GL_EXT_sRGB_write_control extensions, which seem to be associated with Mali-T760-class devices, are missing. Whether this means that the RK3288 actually does not contain a Mali-T760-class GPU but instead an older generation Mali-T62x GPU, or this simply reflects non-optimal drivers, is unclear, but there certainly is a suggestion that the GPU inside the RK3288 is actually of an older (Mali-T62x generation) type.
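The comparison described above amounts to a simple set difference. The sketch below checks a space-separated GL_EXTENSIONS string for the three extensions named in this section. The constant `T760_CLASS_MARKERS`, the function `missing_markers` and the abbreviated example string are hypothetical and for illustration only; they are not actual RK3288 output.

```python
# Extensions that, per the observation above, seem to be associated
# with Mali-T760-class devices in the GFXBench results database.
T760_CLASS_MARKERS = {
    "GL_EXT_disjoint_timer_query",
    "GL_EXT_sRGB",
    "GL_EXT_sRGB_write_control",
}


def missing_markers(gl_extensions):
    """Return the marker extensions absent from a space-separated extension string."""
    reported = set(gl_extensions.split())
    return T760_CLASS_MARKERS - reported


# Hypothetical, heavily abbreviated extension string (not real device data).
example = "GL_OES_texture_npot GL_OES_depth24 GL_EXT_sRGB"
print(sorted(missing_markers(example)))
```

An empty result would be consistent with a Mali-T760-class driver, while missing entries only hint at an older GPU or driver, since drivers are free to omit optional extensions.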

Earlier, Rockchip was not exactly forthcoming about the exact CPU cores inside the RK3288, which have been proven to be Cortex-A12 instead of Cortex-A17. ARM later helped Rockchip by declaring that Cortex-A12 will also be referred to as Cortex-A17, even though it is technically a different core for which Rockchip was one of the few known customers. CPU performance in benchmarks such as Geekbench suggests that the version of the Cortex-A12 core inside the RK3288 does not quite perform as fast as a real Cortex-A17, clock-for-clock.

While RK3288 does support OpenGL ES 3.0 (as do both Mali-T62x and Mali-T760), GFXBench does not allow the OpenGL ES 3.0 Manhattan benchmark to run on this chip for several TV box devices, which one would normally expect to be possible even if the GPU is technically Mali-T62x class. However, the Teclast P90HD tablet entry does show Manhattan benchmark results, which are consistent with a Mali-T62x MP4 GPU (or perhaps Mali-T7xx) configuration, while also showing reasonable sustained GPU performance and power efficiency.

Other tablet solutions


MediaTek's MT8382 chip for 3G tablets shows performance similar to that of the MT6582 smartphone chip, as expected, with a T-Rex offscreen score of about 220. MediaTek's previous generation WiFi-only MT8125 with PowerVR 544MP shows limited performance, lower than Mali-400 MP2 based designs, and slightly less than its previous-generation MT6589T smartphone chip with a similar GPU.

MediaTek's WiFi-only MT8127 with Mali-450 MP4, intended for somewhat higher performing tablets, shows higher performance with a T-Rex offscreen score of about 500, higher than the typical score of 350 of the popular RK3188T with Mali-400 MP4, which has commonly been used in tablets. However, the performance of the Mali-450 MP4 GPU appears to be clearly lower than the similar GPU configuration in the octa-core MT6592 smartphone chip, which scores more than 700 in T-Rex offscreen and scores higher in low-level metrics such as fillrate, probably due to the lower GPU clock speed of the MT8127. The MT8135V used in recent Amazon Kindle Fire tablets shows good mid-range performance with a T-Rex offscreen score of 740. This results in good performance given the low screen resolution of the Kindle tablets, but performance is otherwise low for a PowerVR Rogue class GPU.

As mentioned, Rockchip's popular RK3188T chip with Mali-400 MP4 clocked at about 400 MHz scores about 350 in T-Rex offscreen, which is higher than typical cost-sensitive tablet processors, and also scores higher in low-level metrics such as fillrate.

Thanks to the PowerVR 544 MP2 GPU, Allwinner's aging A31s processor still shows higher performance than Mali-400 MP2-based chips such as MT8382. Allwinner's more recent mass-market chips such as A23 and A33 with Mali-400 MP2 have been slow to come to market, and I haven't yet analyzed their GPU performance, but it is unlikely to be spectacular.

An entry for Leadcore's L1860 with Mali-T628 MP2 GPU shows a T-Rex offscreen score of about 580, and it is compatible with OpenGL ES 3.0. The score reflects a fillrate that might still allow higher resolutions such as 1920x1080 to be used in tablets using this chip, with reasonable but not great GPU performance to be expected, helped by a relatively high GPU clock speed.

Intel's Atom Z3745 processor for the tablet market shows high performance for its class, with the Acer A1-840 FHD (which uses the higher-end Z3745F variant with 64-bit memory interface) scoring a fairly impressive 1181 in the T-Rex offscreen benchmark. The more commonly used cost-sensitive Z3745G with 32-bit memory interface, as used in the Acer A1-840, scores a still very reasonable 853 in T-Rex offscreen. Both processors have relatively good OpenGL ES 3.0 performance, resulting in relatively high Manhattan benchmark scores for their class (higher than chips such as Snapdragon 610/615).
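The two Atom entries give a rough impression of how much the T-Rex score depends on memory interface width. The sketch below compares the score ratio with the interface width ratio, using only the figures quoted above. The function `scaling` is an illustrative name of my own, and any other differences between the two devices (clock speeds, memory type, thermal limits) are not accounted for.

```python
def scaling(score_wide, score_narrow, width_wide_bits, width_narrow_bits):
    """Return (score ratio, interface width ratio) for two chip variants."""
    return score_wide / score_narrow, width_wide_bits / width_narrow_bits


# T-Rex offscreen scores and memory interface widths quoted in the text.
score_ratio, width_ratio = scaling(1181, 853, 64, 32)

print(f"Score ratio: {score_ratio:.2f}")
print(f"Interface width ratio: {width_ratio:.1f}")
```

The score rises by well under the factor of two by which the interface width increases, which would be consistent with T-Rex being only partly limited by memory bandwidth on this chip.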

Finally, the results for the Actions ATM7021, a fairly recent ultra-low-end tablet processor, show signs of blatant benchmark cheating, with the offscreen (1920x1080) T-Rex score being several times higher than the on-screen score for a device with a screen resolution of 1024x768 (one would expect the offscreen score to be several times lower).
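This kind of plausibility check can be sketched in a few lines. The heuristic below is my own simplification and not something GFXBench implements: when the offscreen target has more pixels than the device screen, an offscreen score above the on-screen score is flagged. The function names and the example scores are hypothetical, chosen only to illustrate the pattern.

```python
OFFSCREEN_RES = (1920, 1080)  # resolution of the GFXBench offscreen tests


def pixel_ratio(screen_res, offscreen_res=OFFSCREEN_RES):
    """Return offscreen pixel count divided by on-screen pixel count."""
    return (offscreen_res[0] * offscreen_res[1]) / (screen_res[0] * screen_res[1])


def offscreen_looks_suspicious(onscreen_score, offscreen_score, screen_res):
    """Flag results where rendering more pixels offscreen somehow scores higher."""
    if pixel_ratio(screen_res) <= 1.0:
        # The screen has at least as many pixels as the offscreen target,
        # so a higher offscreen score is not surprising.
        return False
    return offscreen_score > onscreen_score


# Hypothetical scores for a 1024x768 device (illustration only).
print(f"Pixel ratio: {pixel_ratio((1024, 768)):.2f}")
print(offscreen_looks_suspicious(100, 400, (1024, 768)))
```

One caveat: on-screen results are capped by the display refresh rate, so on fast GPUs an offscreen score above the on-screen score can be legitimate. An ultra-low-end chip should be nowhere near that cap, which is what makes the pattern suspicious here.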

Note about T-Rex benchmark and cost-sensitive GPUs


Because GFXBench's T-Rex benchmark targets a fairly detailed and advanced level of rendering that requires a reasonably high-end GPU for good results, the T-Rex benchmark is likely to understate practical GPU performance for low-end devices. Part of the reason for this is the much smaller L2 cache associated with low-end GPUs like Mali-400 MP2 and especially Mali-400 MP, which is not likely to be enough to satisfy the T-Rex benchmark's relatively large textures and other demands, resulting in much more expensive external RAM access and a relatively low benchmark score. More typical, less demanding GPU applications of the Angry Birds and Temple Run type are likely to perform better in relative terms on these platforms (although there will still be variation between chips), and GFXBench's low-level benchmarks provide some information on this.

GFXBench's battery life benchmark is also likely to understate practical battery life for devices such as Mali-400 MP and probably Mali-450 MP because of its higher than typical rendering complexity and relatively large texture working set, with battery life for less demanding GPU applications likely to be significantly better.

Sources: GFXBench results database

Updated December 4, 2014 (Make corrections and add comments about Snapdragon 615's 64-bit memory interface vs MT6752's 32-bit memory interface, add section about T-Rex benchmark's complexity negatively affecting low-end GPU scores).
Updated December 25, 2014 (Correct Snapdragon 615 memory interface width).
Updated December 26, 2014 (Provide slightly updated, sorted GPU benchmark results tables).
