AMD Zen Architecture Interview with Sam Naffziger & Chief Architect Mike Clark
Posted on July 23, 2017
Every now and then, a content piece falls by the wayside and is archived indefinitely -- or just lost under a mountain of other content. That’s what happened with our AMD Ryzen pre-launch interview with Sam Naffziger, AMD Corporate Fellow, and Michael Clark, Chief Architect of Zen. We interviewed the two leading Zen architects at the Ryzen press event in February, were placed under embargo on releasing the interview, and then simply had too many other content pieces to make a push for this one.
The interview covers the uOp cache on Ryzen CPUs, power optimizations, shadow tags, and victim cache. Parts of the interview have been transcribed below, though you’ll have to check the video for discussion on L1 writeback vs. writethrough cache designs and AMD’s shadow tags.
Our first question to Clark was about Zen’s op cache, Ryzen’s micro-op (uOp) cache. The uOp cache was a major discussion piece during the pre-launch press briefing, so getting more detail on its role in Zen was the first objective:
Michael Clark: “One of the hardest problems of trying to build a high-frequency x86 processor is that the instructions are a variable length. That means to try to get a lot of them to dispatch in a wide form, it’s a serial process. To do that, generally we’ve had to build deep pipelines, very power hungry to do that. We actually call it an op-cache because it stores [instructions] in a more dense format than the past; what it does is, having seen [the instructions] once, we store them in this op-cache with those boundaries removed. When you find the first one, you find all its neighbors with it. We can actually put them in that cache 8 at a time so we can pull 8 out per cycle, and we can actually cut two stages off that pipeline of trying to figure out the instructions. It gives us that double-whammy of a power savings and a huge performance uplift.”
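To make the idea concrete, here is a minimal sketch of our own (not AMD's design): a toy model that counts how often a stream of instruction addresses needs the expensive decode versus how often it hits in an op cache. The function name, the one-entry-per-instruction layout, the LRU replacement policy, and the loop workload are all assumptions made for illustration; the real op cache stores groups of decoded ops and its organization isn't described in the interview.

```python
from collections import OrderedDict

def run_with_op_cache(fetch_addresses, capacity):
    """Count full decodes vs. op-cache hits for a stream of instruction addresses.

    Toy model (assumption): one cache entry per instruction address,
    least-recently-used replacement.
    """
    cache = OrderedDict()  # address -> placeholder for already-decoded ops
    decodes = 0
    hits = 0
    for addr in fetch_addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            decodes += 1                   # pay for the variable-length decode
            cache[addr] = "decoded"
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return decodes, hits

# Hypothetical workload: a 20-instruction loop body executed 10 times.
loop = list(range(20)) * 10
decodes, hits = run_with_op_cache(loop, capacity=64)
hit_rate = hits / (decodes + hits)
print(f"decodes={decodes} hits={hits} hit_rate={hit_rate:.0%}")
```

In this toy model only the first pass through the loop pays for decode; every later pass is served from the cache. Shrink `capacity` below the loop size and, with LRU replacement, every pass pays for decode again.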
Naffziger jumped in to add:
Sam Naffziger: “x86 decode, the variable length instructions, are very complex -- requires a ton of logic. I mean, guys make their career doing this sort of thing. So you pump all these x86 instructions in there, burns a lot of power to decode them all, and in our prior designs every time you encounter that code loop you have to go do it again. You have this expensive logic block chunking away. Now we just stuff those micro-ops into the op-cache, all the decoding done, and the hit-rate there is really high [Clark: up to 90% on a lot of workloads], so that means we’re only doing that heavy-weight decode 10% of the time. It’s a big power saver, which is great. The other thing we did is the write-back L1 cache. We aren’t consistently pushing the data through to the L2, there are some simplifications if you do that, but we added the complexity of a write-back so now we keep stuff way more local. We’re not moving data around, because that wastes power.
“One of the things that I highlighted earlier today is the effort the team put in to squeeze down the overhead power. In a CPU core, these things running over 4GHz, very hard to get the clocks out to all those billions of transistors with picosecond accuracy. Takes a lot of wires, a lot of big drivers to do that. We invest a ton of engineering to optimize that down and cut 40% out of that clock network. We’d worked really hard cutting power out in prior generations, but we got 40% more this time. We also optimized the sequential elements that move the data in between the logic. They’re kind of like the glue that holds the logic together. We optimized the crap out of those things, made them really small and power efficient, and the net is that when you look at the power breakdown for the core -- most processors, you have clock power, sequential power, the little bit that’s the logic gates doing actual work.”
Naffziger continued to tie the power consumption to the efficiency and operations execution:
Naffziger: “What we did on this core is we grew that logic gate percentage by 35%, so now it’s bigger than the other two overhead pieces. Those are a couple of the things: Efficient microarchitecture, allocating more power to useful work, and a bunch of other things that got all that IPC enhancement. We talked 52%+ IPC, a rule of thumb with experienced processor architects is that you pretty much pay 1% power for 1% IPC. It’s easier to do a lot worse than that. You push your designers, you’re gonna grow power as you push more instructions through the pipe. Makes sense. You’re doing more work, you’re switching more gates, eating more instructions, running that decoder -- burns power. What we did here, we burn no additional power for all that increased IPC. That’s a hell of an accomplishment.”
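The arithmetic behind that claim can be sketched in a few lines. This is our own back-of-the-envelope calculation, not AMD data: it takes the 52% IPC figure and the "1% power for 1% IPC" rule of thumb from the quote, assumes a fixed clock speed, and treats energy per instruction as power divided by IPC. All variable and function names are ours.

```python
ipc_gain = 0.52                  # "52%+ IPC" from the interview
relative_ipc = 1.0 + ipc_gain    # new IPC relative to the prior core

# Rule of thumb from the quote: 1% more power for each 1% of IPC.
rule_of_thumb_power = 1.0 + ipc_gain
# Naffziger's claim: no additional power for the added IPC.
claimed_power = 1.0

def energy_per_instruction(relative_power, relative_ipc):
    """Relative energy per instruction, assuming an unchanged clock speed."""
    return relative_power / relative_ipc

rule_epi = energy_per_instruction(rule_of_thumb_power, relative_ipc)
claimed_epi = energy_per_instruction(claimed_power, relative_ipc)

print(f"rule of thumb: {rule_epi:.3f}x energy per instruction")
print(f"as claimed:    {claimed_epi:.3f}x energy per instruction")
print(f"saving vs. prior core: {1.0 - claimed_epi:.1%}")
```

Under those assumptions, following the rule of thumb leaves energy per instruction flat, while holding power constant cuts it by roughly a third.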
The rest of the video explains L1 write-through versus write-back cache designs, for those curious about that distinction, and AMD’s shadow tags.
We’re glad this video resurfaced. While cleaning up storage on our rendering machine, we discovered the video never made it to publication, polished off the intro, and re-rendered it. This is pre-camera upgrade for us, so you’ll have to bear with the 1080p and choppier shots, but the content is excellent.
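For readers who want the short version of that distinction, here is a rough sketch of our own, not taken from the video. It counts how many writes reach L2 under each policy for a made-up sequence of stores. The function name and the store sequence are hypothetical, and the model assumes an L1 big enough that nothing is evicted early, with dirty lines written back once at the end.

```python
def l2_writes(stores, policy):
    """Count writes that reach L2 for a sequence of stored-to cache lines.

    Toy model (assumption): L1 holds every line touched, so there are no
    capacity evictions; dirty lines are written back once at the end.
    """
    if policy == "write-through":
        return len(stores)       # every store is also forwarded to L2
    if policy == "write-back":
        dirty = set(stores)      # a store only marks its L1 line dirty
        return len(dirty)        # each dirty line reaches L2 once
    raise ValueError(f"unknown policy: {policy}")

# Hypothetical store stream that keeps hitting the same two cache lines.
stores = [0x40, 0x40, 0x40, 0x80, 0x40, 0x80]
print("write-through:", l2_writes(stores, "write-through"))
print("write-back:   ", l2_writes(stores, "write-back"))
```

Repeated stores to the same line are where write-back saves data movement: six stores become two L2 writes in this example, at the cost of tracking which lines are dirty.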
Editorial: Steve Burke
Video: Keegan Gallick & Andrew Coleman