AMD's dual core



jasong
04-21-2005, 06:11 PM
I haven't read the article as of this posting, but you can find it here. (http://www.hexus.net/content/reviews/review.php?dXJsX3Jldmlld19JRD0xMTI5)

they've got the article covered in an advertisement, but copy and paste still works, NYAAAH, NYAAAH!!!!



It's all about the dual-core these days. From Intel launching a split-core Smithfield processor range, from the Pentium Extreme Edition 840 down, to today's official launch of AMD's dual-core Opteron range, all the hype in the processor world revolves around putting more than one core inside a processor package. The current approaches of AMD and Intel to multiple core processors aren't new, with other CPU vendors having had multi-core processor designs for a long time. However, those processors aren't x86 designs, so they fall outside the usual HEXUS remit and outside the view of consumers and many workstation and server vendors.

Before I dive into things, there are some concepts and terminology to digest that I'll be using to describe multiple core processors in this article. Intel's Smithfield processor is made up of two separate processor dies on the same physical package, under the same heatspreader. So if you were to take the heatspreader off a Pentium D or a Pentium Extreme Edition 840, you'd see two separate dies. AMD's approach is different, with both cores occupying one die. Take the heatspreader off the Opteron 875 that I'll show you soon and you'd see just one die, albeit one which houses two processor cores.

While technically they're both dual-core processors, the two current approaches show that there's a significant difference between the two ways of getting more than one processor core into one physical package. Both ways of doing chip multiprocessing, where you place multiple physical cores on the same package, have the cores sharing an interconnect to the rest of the system. The cores on Intel's Smithfield simply ride the same GTL+ bus that connects the processors to the memory controller and the rest of the system. AMD's approach has the cores sharing a memory controller and HyperTransport links to the rest of the system.

Regardless of the approach, multi-core processors are designed to exploit threading models and thread-level parallelism on modern operating systems. Common sense tells you that doing more work in the same amount of time on a computer system will increase performance. That's the driving mantra behind all multi processor systems, where if there's another complete set of execution resources on another processor to use, you should always do so wherever possible.

The barrier to massive widespread adoption of multi-processor systems is the software. Most consumer software is single-threaded: the application does all its work in one thread, which can only be run on one processor at any given time. If you've got multiple processors, the single-threaded application you're using will ignore all but one of the CPUs. However, x86 CPU vendors are running out of ways to increase single-threaded performance, which essentially come down to clock speed and larger caches, because of the limits of the current processes the CPUs are built on. So they've had to go wide, building multi-core processors that keep sane clock speeds and cache sizes but let you run multiple threads of execution.

The mass-market introduction of multi-core x86 processors should therefore force operating system and application vendors to seriously consider multi-threading wherever possible. There's massive scope for parallelising many consumer software applications, so pervasive multi-core processors in systems worldwide will only increase the number of well-written multi-threaded applications, which in turn will help drive the reverse: further adoption of PCs, workstations and servers with multi-core processors because more applications are available to exploit them. And while single-threaded performance can theoretically rise just by having the OS you're running have access to more than one CPU, the large gains are with explicitly multi-threaded applications running on that same OS.
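To make the single-threaded versus multi-threaded distinction concrete, here is a minimal Python sketch. The function names and the sum-of-squares workload are illustrative assumptions, not anything from the article. One caveat: CPython's global interpreter lock stops threads from speeding up pure-Python arithmetic like this, so the sketch shows how the work is split into independent chunks rather than demonstrating a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(start, stop):
    """Work on one independent slice of the problem."""
    return sum(n * n for n in range(start, stop))


def single_threaded(limit):
    # One thread does all the work; any extra cores sit idle.
    return sum_of_squares(0, limit)


def multi_threaded(limit, workers=2):
    # Split the range into one contiguous chunk per worker.
    step = max(1, -(-limit // workers))  # ceiling division, never zero
    chunks = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is handed to a worker thread; results are combined at the end.
        partials = pool.map(lambda chunk: sum_of_squares(*chunk), chunks)
        return sum(partials)
```

Both functions return the same answer; the difference is only in whether the work is structured so that more than one core could be given something to do.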

In the server and workstation world, they're most of the way there already with the software, and multi-core processors will simply allow vendors to pack more processing power into the same space. That affords them benefits in terms of power (you can double your processing power in the same number of chassis with dual-core processors, so you're not spending any more of a power budget on things like disks and memory), size (to double your processing power in the same size of chassis, just use a dual-core processor) and therefore money.

Hopefully the benefits of a multi-core processor, like Intel's new dual-core desktop and workstation processors, and AMD's dual-core Opteron range, are obvious, when paired with software that can exploit it. With Tarinder looking at dual-core for the average consumer recently, it's my turn to look at dual-core Opteron for the workstation and server markets.

Let's jump right in with a look at how AMD engineered the dual-core Opteron processor.


Sharing a memory controller
I mentioned on the previous page that AMD's dual-core processor approach has both cores sharing the same die space. They've done so to allow both cores to share a memory controller, which is on the CPU with AMD's K8 generation of CPUs, of which dual-core Opteron is a member. Each core therefore has all the current benefits of AMD's on-die memory controller that a single-core processor does. So while available memory bandwidth doesn't increase when you add the second core, since there's only one memory controller, both cores get all the benefits of the low-latency controller, keeping performance as high as possible.


Sharing HyperTransport links to the system and other CPUs
Every Opteron processor has one HyperTransport link that allows it to connect to devices in the system. Then, depending on the Opteron model, there may be one or two other HyperTransport links available for communicating with other processors in the system. A 1-series Opteron doesn't have any other links for communication with other Opterons in the system, since it's the single processor version. 2-series Opterons have one link, allowing them to connect to one other processor for a maximum of two in the same system. 8-series Opterons have two links and, depending on the topology employed by the mainboard they sit in, that allows you to connect up to eight Opterons together.

Dual-core Opteron doesn't change any of that, with the cores sharing those links via glue logic on the die. You can still place up to eight physical processor packages in a system with a dual-core 8-series Opteron, but the dual-core nature of the processors gives you sixteen processor cores to do work on. So it's not quite as optimal as having sixteen physical CPUs, each with its own memory controller, but it allows you to double processing power in any existing system that supports existing Opteron CPUs.
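The core-count arithmetic above can be sketched in a few lines of Python. The series labels and the `MAX_SOCKETS` table are hypothetical names that simply restate the socket limits given in the text.

```python
# Maximum processor packages per system for each Opteron series,
# as described in the text (labels are illustrative).
MAX_SOCKETS = {"1xx": 1, "2xx": 2, "8xx": 8}


def max_cores(series, cores_per_package=2):
    """Total cores in a fully populated system of the given series."""
    return MAX_SOCKETS[series] * cores_per_package


def memory_controllers(series):
    # One on-die memory controller per package, shared by its cores.
    return MAX_SOCKETS[series]
```

For a fully populated 8-series system this gives sixteen cores behind eight memory controllers, which is why it is not quite equivalent to sixteen single-core CPUs each with its own controller.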


Cache coherency with MOESI
In any multi-processor system, the caches for each processor core need to be able to talk to each other to maintain coherency, should any processor in the system need data from the cache of any other CPU.

Back in the days of the Athlon MP, AMD implemented the MOESI cache coherency protocol. MOESI stands for Modified, Owned, Exclusive, Shared, Invalid. Each of those is a state that a line of cached data can occupy, depending on what the CPU cores are doing with it. For example, say that core one updates some data in its cache but hasn't yet written it back out to memory. Core two is always snooping core one's traffic, and when it spots that happening the line is marked Modified, to indicate that the cached copy and main memory are no longer coherent. In a MESI cache coherency scheme, which lacks the Owned state, if core two wanted to read that memory, it would have to ask core one for it, and core one would tell core two to hang on a short while while it writes the data back out to main memory.

However, the Athlon MP, single-core SMP Opteron, and now dual-core Opteron all use the Owned state. In the case above, the Owned state allows core one to pass the data that core two wanted over the core-to-core interconnect and update the cache on the other CPU directly, without writing it back out to main memory, with the caches then marked as Shared. You can see how that would increase performance.

There's less latency when cache data needs to be updated, since you don't need two trips out to main memory, one per core, for a read and write to get the caches back in sync. It's worth noting that Intel's multi-processor Xeon systems currently implement the MESI protocol, so they do have to go out to main memory if cache data is marked Invalid or Modified.
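The difference can be sketched as a toy model in Python. This is a simplified illustration under stated assumptions, not AMD's or Intel's implementation: it tracks a single cache line across two cores, follows the textbook MOESI convention in which the supplying core moves to Owned (the article's Owner state) while the reader becomes Shared, and uses a hypothetical function name.

```python
def read_after_remote_write(protocol):
    """Core one holds a dirty line; core two then reads it.

    Returns (core_one_state, core_two_state, memory_writes).
    """
    if protocol not in ("MESI", "MOESI"):
        raise ValueError("protocol must be 'MESI' or 'MOESI'")

    core_one, core_two = "M", "I"  # core one has modified the line
    memory_writes = 0

    if protocol == "MESI":
        # No Owned state: the dirty data is written back to main
        # memory before core two can hold a copy of it.
        memory_writes += 1
        core_one, core_two = "S", "S"
    else:
        # MOESI: core one supplies the data cache-to-cache and stays
        # responsible for writing it back later.
        core_one, core_two = "O", "S"

    return core_one, core_two, memory_writes
```

Calling it with `"MESI"` reports one write to main memory for the read, while `"MOESI"` reports none, which is the latency saving described above.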

So there's a fast core-to-core link that allows the cores in any dual-core Opteron system, even one with multiple processors, to update each other's caches as fast as possible, with little latency. If the caches to be updated reside on separate physical packages, the cache updates are conducted over HyperTransport. The important thing to keep in mind is that they don't need to hit main memory to do so, unlike current Xeon and dual-core Pentium D and Extreme Edition.

jasong
04-21-2005, 06:24 PM
HyperThreading
If you've kept an eye on Intel's Pentium 4 or Xeon processors since late 2002, when Intel launched the 3.06GHz Pentium 4, you'll know about HyperThreading. HyperThreading is the ability of Intel's Netburst processor architecture to run two threads of execution concurrently on one processor core. By duplicating the front-end logic of Netburst, two threads can be run on the CPU in tandem, boosting performance, provided some caveats aren't hit. Since the two threads are sharing the processor's execution units and the cache memory, if either thread wants to use the resources or cache that the other is using, the pipeline of the CPU stalls and performance drops off.

However, HyperThreading laid the groundwork for application and OS vendors to support multi-threading. Dual-core Opteron exploits that by advertising itself as a HyperThreading processor, so that any HT-aware OS or application uses the two cores on a dual-core Opteron as it would a HyperThreaded Intel processor, to boost performance. Since there are no caches or execution resources to share, performance never need drop off. And while properly HT-aware applications split their threads up in a way that minimises use of execution resources the other thread is using, there's still a speedup to be had with Opteron, even if each thread isn't doing the same work on each core. The simple fact that there are two cores helps speed things up regardless.
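From the software side, the OS simply reports a number of logical processors, and an application can size its pool of worker threads from that. A minimal Python sketch, assuming a hypothetical `worker_count` helper: `os.cpu_count()` returns the logical processor count and does not by itself say whether those are HyperThreading siblings or full cores.

```python
import os


def worker_count(requested=None):
    """Pick how many worker threads to start on this machine."""
    # os.cpu_count() may return None if the count cannot be determined.
    logical = os.cpu_count() or 1
    if requested is None:
        return logical
    # Never start fewer than one worker or more than the OS reports.
    return max(1, min(requested, logical))
```

Code written this way picks up a second core automatically, with no change to the application.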


Summary
So it's two full Opteron cores, each with its own L1 and L2 caches, sharing a memory controller and HyperTransport links to the rest of the system and other CPUs, while advertising itself as supporting HT technology in order to take advantage of the software investment made for Intel's multi-threading-on-a-single-core technology. Pretty simple. Let's see how that works out in a physical sense.



Built using AMD's 90nm process technology in their fabs in Dresden, dual-core Opteron's ~200 million transistors fit into the same die area as a single 130nm Opteron or Athlon FX does today. That has a thermal impact, with AMD using strained Silicon (where the Silicon lattice is stretched by around 1% by binding it to another lattice, usually Germanium) and Silicon-on-insulator (where the Silicon is placed on an insulator bed to prevent current leak and help the transistor switch faster) for its 90nm process, which helps an equivalent 90nm processor have a better set of thermal properties than a 130nm-built counterpart CPU at the same clock frequency and with the same cache sizes.

The x75 Opteron series processors, if they begin with a processor ID of OSA, have a TDP of around 100W. For a pair of 2.2GHz cores each with a large 1MiB L2 cache, that's quite the achievement. Two cores do not mean a massive heat output for any thermal solution to deal with, with the x75 Opteron 'OSA' processors having the same maximum output as a 2.6GHz Athlon FX-55, pretty much. That means current thermal solutions can be used to remove the CPU's heat. Indeed, AMD have overengineered the PIB cooler for recent single-core Opterons with the dual-core versions in mind, so that one set of recent PIB coolers will do for both CPUs.


Drop in replacement?
While Intel are pulling the trick of tying new core logic to the new Pentium D and dual-core Pentium Extreme Edition processors, to force the user that wants it from them into a hydra-like purchase of at least processor and mainboard (which people already invested in recently released Pentium 4 core logic will love), AMD have said that they require only a BIOS upgrade for any existing Socket 940 mainboard for it to support a dual-core Opteron processor.

That's almost true. There are electrical compatibility reasons to consider, too. Only if a board can run single-core Opterons at 2.4GHz or above will it have the circuitry necessary to support a dual-core Opteron. While that's probably 100% of released Socket 940 mainboards, it's worth mentioning. As always, contact your mainboard vendor for absolute confirmation at the same time you ask them for the BIOS to support one.


SSE3
Being a Socket 940 processor, its basic feature set outside of being a dual-core CPU is almost identical to that of the Socket 940 processors that precede it. That means a dual-channel memory controller, support for the x86-64 instruction set architecture (ISA) pioneered by Opteron nearly two years ago and class-leading clock-for-clock performance to put Intel's gigahertz-craving opposition to shame.

However, just duplicating the entire front-end and execution resources of a current Opteron processor isn't what the dual-core variants are all about. Being produced in AMD's 90nm fabs in Dresden, Germany, they use the latest stepping of the basic Opteron/Athlon 64 core, and that means support for Intel's SSE3 instruction set, first given to the world with their Prescott-1M core Pentium 4.

SSE3 is a set of mostly SIMD (single instruction, multiple data) instructions that help a certain class of programming tasks execute in fewer cycles on the processor. SIMD processing is somewhat parallel, operating as it does on multiple chunks of data with the same instruction, outputting multiple pieces of data at the end. Those paying attention in CPU class will have realised that a couple of SSE3's instructions are related to HyperThreading.
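As a conceptual model only, the idea of one instruction operating on several data elements can be mimicked with four-element lists in Python. Real SSE code runs on 128-bit registers in hardware; the function names here are hypothetical, and `horizontal_add` is modelled on the lane layout of SSE3's HADDPS instruction.

```python
def packed_add(a, b):
    """One 'instruction', four additions: lane i of a plus lane i of b."""
    return [x + y for x, y in zip(a, b)]


def horizontal_add(a, b):
    """Add adjacent pairs within each operand (HADDPS-style layout)."""
    return [a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]]
```

In scalar code each of those additions would be a separate instruction; the SIMD forms do all four in one.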


Summary
A dual-core x75 model Opteron has a thermal design power of around 100W, will drop right in to existing mainboards after a BIOS update, supports SSE3 and HyperThreading technologies created by Intel, and implements in full the x86-64 instruction set architecture that enables 64-bit x86 computing, recently made famous by Microsoft's announcement that the first 64-bit version of Windows XP Professional is finished. Of course, 64-bit Windows Server and many Linux variants have been treading a 64-bit path for quite some time.

Let's have a quick look at the processors and the system they were housed in for benchmarking, before I have a look at performance.

jasong
04-21-2005, 06:24 PM
Armari's Gravistar XR

Armari supplied the test platform for my look at dual-core Opteron. A Supermicro SC733T-645 SATA Workstation Chassis, in black, housed all the components. The SC733T is a mid-sized chassis that can take Extended ATX mainboards, with a pair of 120mm 4-pin PWM fans providing the airflow from the front-bottom of the chassis to the exhaust at the rear.

Integrated into the SC733T was a Tyan Thunder K8WE mainboard. The K8WE is based on nForce4 Pro and provides a pair of PEG16X slots for SLI-ready graphics, with each slot receiving the full 16 lanes of PCI Express due to the use of two nForce4 ASICs (2200 and 2050). An AMD 8131 provides the PCI-X segment bridges. The mainboard supports up to 16GiB of memory with four slots connected to each processor. For maximum memory density, you need to use 2GiB modules, and the mainboard requires the use of registered DIMMs, with or without ECC ability. Being dual-channel, each memory controller sports 6.4GB/sec of theoretical memory bandwidth, for a 12.8GB/sec system total if you're using two CPUs.
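The 6.4 figure per memory controller follows from simple arithmetic: DDR400 performs 400 million transfers per second over a 64-bit (8-byte) channel, and each controller drives two channels. A small Python sketch of that calculation, with a hypothetical function name and results in decimal gigabytes per second:

```python
def peak_bandwidth_gb_s(mega_transfers=400, bus_bytes=8, channels=2):
    """Theoretical peak bandwidth of one memory controller in GB/sec."""
    bytes_per_second = mega_transfers * 1e6 * bus_bytes * channels
    return bytes_per_second / 1e9


per_controller = peak_bandwidth_gb_s()  # one dual-channel DDR400 controller
system_total = 2 * per_controller       # two CPUs, one controller each
```

These are theoretical peaks; measured bandwidth will be lower.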

Supplied with four 1GiB sticks, two for each processor, the Armari test platform therefore had a full 4GiB in total. Two Seagate Barracuda 7200.8 250GB hard disks were connected to the 2200 ASIC's SATA controller for 500GB of total storage. Armari installed Windows XP Professional (32-bit) on one hard disk, with XP Pro 64-bit on the other, leaving me free to just change the boot order in the BIOS depending on what OS I wanted to test.

Two Opteron 875 processors sat in the pair of sockets, each with a pair of 2.2GHz cores with 128KiB of L1 cache and 1MiB of L2. L2 on Opteron is just a retirement cache for L1 in essence, and they're exclusive, so L1 isn't mirrored in L2.

Armari also supplied an NVIDIA Quadro FX 3400 and an ATI FireGL 7100, both on PCI Express, but testing time pressure means an article featuring them will have to wait until another day. Pictures? Thought you'd never ask.


Pictures
[images not preserved in this copy]

Summary
Armari came up trumps for us again, with a really well integrated test system in their favourite SATA-only chassis.

My thanks to Dan and his team for building the test box at such short notice, especially since they had no 275s to hand and had to use the £1500 875s instead. With them knowing I'd have the heatsinks off and the CPUs out in my grubby mitts for pictures, that's a fine display of faith in my cack-handed ability to handle hardware.




Product: AMD's Dual-Core x75 Opterons
Author: Ryszard
Date: 21st April 2005
Sample Provider: Armari


Hardware and Software
Test Platforms
Dual-core Opteron system
Processor(s): 2 x AMD Opteron 875 (2.2GHz, 1MiB, dual-core)
Mainboard: Tyan Thunder K8WE
Memory: 4 x 1GiB DDR400 ECC Registered, 3.0-3-3-8 @ 400MHz
BIOS Version: 2003Q2 - 28th March 2005
Disk Drive: 2 x 250GB Seagate 7200.8 SATA
Graphics Card: NVIDIA GeForce 6800 Ultra - PEG16X - 71.84 ForceWare (32-bit) / 71.84 ForceWare 64-bit
Operating System: Windows XP Professional, SP2 / Windows XP Professional 64-bit, SP1
Mainboard Software: NVIDIA nForce4 Platform Driver 6.53 / NVIDIA nForce4 Platform Driver 6.39 BETA 64-bit

Intel Pentium Extreme Edition 840 system
Processor(s): Intel Pentium Extreme Edition 840 (3.2GHz, 1MiB, dual-core)
Mainboard: Intel D955X Express
Memory: 2 x 512MiB Micron DDR2-667, 5-5-5-15
BIOS Version: 18th March 2005
Disk Drive: 160GB Western Digital PATA
Graphics Card: ATI RADEON X850 XT PE - PEG16X - CATALYST 5.3
Operating System: Windows XP Professional, SP2
Mainboard Software: Intel INF Update Utility 7.0.0.1014

AMD Athlon FX system
Processor(s): AMD Athlon FX-55 (2.6GHz, 1MiB, single-core)
Mainboard: DFI LanPartyUT nF4 SLI-D
Memory: 2 x 512MiB Corsair XMS3200 XL Xpert, 2-2-2-5
BIOS Version: 9th February 2005
Disk Drive: 300GB Maxtor 6B300S0 SATA
Graphics Card: NVIDIA GeForce 6800 Ultra - PEG16X - 71.84 ForceWare
Operating System: Windows XP Professional, SP2
Mainboard Software: NVIDIA nForce4 Platform Driver 6.53

Benchmark Software
3D Studio Max 6 - HEXUS Superstress v2
Realstorm 2004
Canopus ProCoder v1.50
ScienceMark 2.0
Kribibench v1.1
picCOLOR 4.0
Cinebench 2003
LAME 3.96

Notes
Despite the CPUs being Opteron 875s, I've labelled them on the graphs as Opteron 275s, since it's a dual CPU system.

With testing done on 64-bit Windows XP as well as the regular 32-bit version, there's some explanation needed as to which applications were tested on which operating system. Given that Windows XP 64-bit can run 32-bit binaries unmodified, all the 32-bit tests were run on that OS. Where a 64-bit binary was available, in the case of picCOLOR and ScienceMark, that was also run to see what effect it had on performance.

Not all 32-bit tests were run on all systems. Tarinder did the testing of the Extreme Edition 840 and he used a slightly different set of benchmarks during his analysis, so some graphs will be missing an Extreme Edition score.

Secondly, I emulated an Opteron 175 system by pulling a CPU out of the dual-core Opteron test system and moving that CPU's memory over onto the remaining CPU. So there was one socket empty and the remaining CPU had four 1GiB sticks attached to its memory controller. Frustratingly, that setup wasn't entirely stable, with the BIOS revision on the Tyan seemingly happiest when both CPUs were in and working, rather than there just being one. So some graphs are missing an Opteron 175 score. Bleh.

More bleh at the fact that my Xeon mainboard is M.I.A, so I've got a pair of Noconas and nowhere to shove them. As soon as the board gets to me, we'll do a followup piece on Xeon performance, too. My apologies for the somewhat hodgepodge nature of putting comparison numbers together, but we've had to do our best with limited resources. Rest assured we'll flesh things out in due course.


CPU-Z System Information
CPU
Memory
Mainboard
SPD Memory Timings
Caches
Running one CPU instead of two


Windows Information
Device Manager with both CPUs installed, XP64
An amusing typo by whoever set the box up for me at Armari :P
System Properties, XP64

jasong
04-21-2005, 06:26 PM

ScienceMark 2.0

ScienceMark 2.0 provides us with a number of interesting benchmarks, including memory subsystem tests and single and multi-threaded scientific simulations.


ScienceMark 2.0 Memory Bandwidth


The use of registered memory with 3.0-3-3-8 timings on the Opterons robs them of ~700MB/sec of bandwidth as measured by ScienceMark 2.0, compared to the 2.0-2-2-5 DDR400 low-latency stuff used with the FX-55. The single Opteron 175 measures slightly higher than the dual processor system, and all the AMD systems outpace the lacking Pentium 4 in raw terms.


ScienceMark 2.0 Memory Latency


Under Windows XP 64-bit, the single Opteron 175 can access system memory slightly faster than the dual Opteron 275. Registered memory and lax timings do the rest to lag the FX-55 by over 30%. The Pentium 4 brings up the rear. On-die memory controllers do the business.


ScienceMark 2.0 Molecular Dynamics


The MolDyn benchmark is multi-threaded and under Windows XP 32-bit the FX-55 trails the Opteron 275 system, four cores accelerating the test faster than the desktop powerhouse can.

Under XP 64-bit the test speeds up massively. There are only two computation threads in the MolDyn test, with the single Opteron 175 matching the dual Opteron 275 for performance. The combination of a 64-bit OS and CPU, and multi-threading, makes for massive performance leaps. The 840 EE wasn't around long enough for us to run MolDyn for comparison.


ScienceMark 2.0 Primordia Silicon


Primordia is multi-threaded, but the test shows the 2.6GHz single-core FX-55 besting the four-core Opteron 275 test system under Windows XP 32-bit. The element we chose, Silicon, doesn't seem to allow the multi-threaded nature of the test, with its two threads, to surface. Lots of core clock and low-latency memory are faster.

Under XP 64-bit performance increases nearly twofold. The single Opteron 175 is as fast as the dual Opteron 275, as you'd expect from a two-thread test.

So initially, what we're seeing is a dual-core processor effectively doing as well as two discrete single-core processors would, when threading is limited to only two threads. Along with that, we're seeing huge gains to be made by running a native 64-bit binary of the same program under a 64-bit OS.
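The pattern of a single dual-core CPU matching the four-core system on a two-thread test fits a very simple model: a workload can use no more cores than it has runnable threads. A rough Python sketch, assuming perfectly parallel threads and ignoring memory and scheduling effects (the function name is hypothetical, and this is an idealised ceiling, not a prediction of the measured scores):

```python
def ideal_speedup(threads, cores):
    """Upper bound on speedup over one core for perfectly parallel threads."""
    if threads < 1 or cores < 1:
        raise ValueError("threads and cores must both be at least 1")
    # Extra cores beyond the number of threads have nothing to run.
    return min(threads, cores)
```

With two threads, two cores and four cores give the same ceiling, so the second processor package has nothing to contribute.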

IronBits
04-21-2005, 08:53 PM
Thanks for the info. Links to the original Author's work would suffice ;)

http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2397&p=10