Hot Chips Samsung has revealed the blueprints to its mystery M1 processor cores at the heart of its S7 and S7 Edge smartphones.
International versions of the top-end Android mobiles, which went on sale in March, sport a 14nm FinFET Exynos 8890 system-on-chip that has four standard 1.6GHz ARM Cortex-A53 cores and four M1 cores running at 2.3 to 2.6GHz. Only two M1 cores are allowed to kick it up to the maximum frequency at any one time to avoid draining batteries and overheating pockets. Each M1 typically consumes less than three watts.
The M1, codenamed Mongoose, was designed from scratch in three years by a team in the US, and it runs 32-bit and 64-bit ARMv8-A code. In benchmarks, the Exynos 8890 SoC is behind Apple's iPhone 6S A9 chip in terms of single-core performance, but pushes ahead in multi-core tests.
Mongoose has so far remained under wraps. The lid on the microarchitecture was lifted by Samsung in a presentation at the Hot Chips conference in Cupertino, California, today.
One thing that caught our eye was a mention that the branch predictor uses a neural network to take a good guess at the twists and turns the software will take through its code. If your CPU can predict accurately which instructions an app is going to execute next, you can continue priming the processing pipeline with instructions rather than dumping the pipeline every time you hit a jump.
"The neural net gives us very good prediction rates," said Brad Burgess, who is Samsung's chief CPU architect and is based at the South Korean giant's R&D center in Austin, Texas.
Branch predictor design is one of the most closely guarded secrets in the semiconductor industry. It is a key differentiator between competing architectures, and vendors often don't even bother to patent their technology because it would spill the details in public with no easy or reliable way of proving infringement, given the staggering complexity of branch prediction logic in modern processors.
A basic branch prediction system works by building an internal table that has a two-bit counter per recently seen branch instruction. If a branch is taken, you add one to its counter up to a maximum of three, and if it isn't, you subtract one until you reach zero. So if a branch's counter is stuck at three then it is regularly taken, and you can anticipate that. If it is sitting on zero, it is rarely taken, so you can ignore it and continue fetching and decoding the following instructions.
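The counter scheme described above can be sketched in a few lines of Python. This is a toy model, not any vendor's design: the class name `TwoBitPredictor`, the table size, indexing by the low bits of the branch address, the starting counter value, and the rule that counter values of two or three mean "predict taken" are all assumptions made for illustration.

```python
class TwoBitPredictor:
    """Toy table of two-bit saturating counters, one per table slot."""

    def __init__(self, table_bits=10, initial=1):
        # Assumption: the table is indexed by the low bits of the branch address.
        self.mask = (1 << table_bits) - 1
        self.table = [initial] * (1 << table_bits)

    def predict(self, pc):
        # Assumption: counters at 2 or 3 predict "taken", 0 or 1 "not taken".
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at three
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at zero


def count_hits(predictor, pc, outcomes):
    """Predict each outcome, then train on it; return the correct guesses."""
    hits = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            hits += 1
        predictor.update(pc, taken)
    return hits


# A loop branch: taken nine times, then falls through, repeated ten times.
loop_outcomes = ([True] * 9 + [False]) * 10
hits = count_hits(TwoBitPredictor(), 0x400, loop_outcomes)
print(f"{hits} of {len(loop_outcomes)} predicted correctly")
```

On this loop pattern the counter stays near three, so the only regular miss is the final fall-through of each loop. A strictly alternating branch, by contrast, gives a counter like this much more trouble, which is one reason designers reach for history-based schemes.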
Intel, AMD and other chip designers don't stop there, though, and use advanced prediction techniques, some of which will resemble neural network designs with decision-making perceptrons. However, because no one really likes to talk about it in detail, descriptions of artificially intelligent branch predictors loiter in computer science papers and academic studies.
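To give a flavor of what a perceptron-style predictor does, here is a minimal single-table sketch in Python in the spirit of the academic designs. It is not Samsung's, AMD's or Intel's implementation, none of which are public in this detail. The class name, history length, table size and training threshold are all illustrative assumptions, and real hardware would use small saturating weights rather than unbounded integers.

```python
class PerceptronPredictor:
    """Toy perceptron branch predictor: one weight vector per table slot."""

    def __init__(self, history_len=8, table_size=64, threshold=20):
        # All three parameters are arbitrary values chosen for illustration.
        self.table_size = table_size
        self.threshold = threshold
        # weights[slot][0] is the bias; the rest pair up with history bits.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history of recent outcomes: +1 for taken, -1 for not taken.
        self.history = [-1] * history_len

    def _output(self, pc):
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc):
        # A non-negative weighted sum means "predict taken".
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction, or when the sum is not yet confident.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self.weights[pc % self.table_size]
            w[0] += t
            for i, xi in enumerate(self.history, start=1):
                w[i] += t * xi
        # Shift the newest outcome into the history.
        self.history = self.history[1:] + [t]


def run(predictor, pc, outcomes):
    """Return a list of booleans: was each prediction correct?"""
    results = []
    for taken in outcomes:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results


# A branch that strictly alternates between taken and not taken.
alternating = [i % 2 == 0 for i in range(200)]
results = run(PerceptronPredictor(), 0x400, alternating)
late_hits = sum(results[100:])
print(f"correct in the last 100 branches: {late_hits}")
```

The weights learn how each past outcome correlates with the next one, so a pattern such as strict alternation is picked up after a short warm-up. A hashed perceptron, as the name suggests, spreads the weights over several tables indexed by hashes of the branch address and history; this sketch sticks to the simpler single-table form.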
It is interesting that Samsung has broken ranks by publicly declaring that its prediction engine is a neural network. David Kanter, Linley Group's microprocessor analyst, told us today's state-of-the-art branch prediction systems are based on neural-network-like designs: for example, AMD's Jaguar and Bobcat predictors use similar technology. Burgess, now a veep at Samsung, was formerly chief architect of AMD's Bobcat microarchitecture.
AMD's Zen architect Mike Clark confirmed to us his microarchitecture uses a hashed perceptron system in its branch prediction. "Maybe I should have called it a neural net," he added.
So, Samsung says it has a neural network branch predictor in its mobile phone chip. Intel and AMD quietly have them too in their desktop and server processors. Yet again, we're reminded that today's top smartphones are yesterday's PCs in your pocket. Perhaps Samsung highlighted its neural network because it has done something novel with the design; perhaps because it sounds cool.
Here's a rundown of what else is under the hood in Samsung's M1 cores:
Each core has a 64KB instruction cache. Its caches are described as low latency and low power.
The decode stage can crunch four instructions per cycle. Most instructions map directly to a single micro-op. The core is capable of full out-of-order execution including loads and stores, with a multi-stream and multi-stride prefetcher.
The integer processing unit has a 96-entry physical register file, which has a lower latency than a virtual register file that is remapped on the fly.
The floating-point unit needs five cycles to perform a multiply-and-accumulate operation, four for a multiply and three for an add.
The load-store unit talks to a 32KB data cache.
The pipeline, in basic form, has two branch prediction stages, three fetch, three decode, two register renaming and one dispatch, then four or six stages of execution, followed by arbitration and cache tagging.
On Samsung's floor plans of a single core and the whole M1 quad-core cluster, the block labeled TBW is the TLB walker. There is a lot of logic bussing data from the CPUs to the 2MB 16-way shared L2 cache, which is split into four banks. Accessing the L2 incurs a 22-cycle latency.
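For a rough sense of scale, the L2 figures above can be turned into a back-of-the-envelope calculation. Samsung did not state the cache line size in the details reported here, so the 64-byte line below is purely an assumption, as are the even split of sets across banks and the idea that the 22 cycles are counted at the core clock. The function names are made up for this sketch.

```python
def cache_geometry(size_bytes, ways, banks, line_bytes):
    """Derive line and set counts for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        # Assumption: sets are divided evenly among the banks.
        "sets_per_bank": sets // banks,
    }


def latency_ns(cycles, clock_ghz):
    """Convert a cycle count to nanoseconds at a given clock speed."""
    return cycles / clock_ghz


# 2MB, 16-way, four banks are from the article; 64-byte lines are assumed.
geom = cache_geometry(2 * 1024 * 1024, ways=16, banks=4, line_bytes=64)
print(geom)

# Assumption: the 22-cycle L2 latency is measured in core clock cycles.
for ghz in (2.3, 2.6):
    print(f"{ghz}GHz: about {latency_ns(22, ghz):.1f}ns")
```

Under those assumptions the cache works out to 2,048 sets, or 512 per bank, and an L2 hit costs somewhere around eight to ten nanoseconds.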
Finally, for those wondering if the M1 will form the basis of a Samsung ARM data center processor design, Burgess said the M1 isn't really suited for that: it lacks a large L2 cache and a beefy vectorization unit, for example. ®