Final EVGA VRM Torture Test: VRM Thermals Not the Killer of Cards

Posted on November 23, 2016

Two EVGA GTX 1080 FTW cards have now been run through a few dozen hours of testing, each passing through real-world, synthetic, and torture testing. We've been following this story since its onset, initially validating preliminary thermal results with thermal imaging, but later stating that we wanted to follow up with direct thermocouple probes to the MOSFETs and PCB. The goal with which we set forth was to create the end-all, be-all set of test data for VRM thermals. We have tested every reasonable scenario for these cards, including SLI, and have even intentionally attempted to incinerate the cards by running ridiculous use scenarios.

Thermocouples were attached directly to the back-side of the PCB (the hotspot previously discovered), the opposing MOSFET (#2, counting from the bottom up), and MOSFET #7. The seventh and second MOSFETs are those which seem to be most commonly singed or scorched in user photos of allegedly failed EVGA 10-series ACX 3.0 cards, including the GTX 1060 and GTX 1070. Our direct probe contact to these MOSFETs will provide more finality to testing results, with significantly greater accuracy and understanding than can be achieved with a thermal imager pointed at the rear-side of the PCB. Even just testing with a backplate isn't really ideal with thermal cameras, as the emissivity of the metal begins to make for questionable results -- not to mention the fact that the plate visually obstructs the actual components. And, although we did mirror EVGA & Tom's DE's testing methodology when checking the impact of thermal pads on the cards, even this approach is not perfect (it does turn out that we were pretty damn accurate, but more on that later). The pads act as an insulator, again hiding the components and assisting in the spread of heat across a larger surface area. That's what they're designed to do, of course, but for a true reading, we needed today's tests.

Video Version of this Content: EVGA GTX 10-Series Temperatures Not the Issue

Note: We will almost certainly not make return on this testing investment. Content like this is made possible only with the support of our readers, like through Patreon.

Some of the testing content in this article will be straight from the script of the video -- no reason to write it twice -- but we do also have a handful of extra charts that won't be found in the video content. One set of charts, for instance, is thermal testing of the cards with the thermal pads installed, but without the VBIOS update.

Recap of EVGA VRM "Issues"

Let's recap the basics again. We recently validated a Tom's Hardware test, which suggested that EVGA ACX devices were heating up on the back-side of the VRM to north of 100C. Note that VRMs can handle 100C no problem, but the temperatures that Tom's had shown -- hitting 114C in some reports -- were beginning to enter a range of being concerning. The fact that a few users began sharing photos of scorched PCBs furthered this concern of temperature-related damage to EVGA cards.

We validated their methods by deploying a thermal camera, just as they did, but noted that emissivity and the delta between the back-side PCB and front-side VRMs could be significant, and we decided that thermal imaging was not sufficient to fully evaluate the situation. EVGA offered an optional thermal pad mod and a VBIOS update that made the fan speed curve more aggressive. We declared that both of these, by thermal imaging, were enough to fix the problem.

Until today, this problem was largely assumed to be because of EVGA's lack of thermal pads between the baseplate and heatsink on the VRM side, and between the PCB and the backplate on the back-side. But we've got new findings which definitively indicate that this is not the only cause of failure.

Now, in addition to the tests posted by Tom's DE and by our later follow-up, some users have complained that high VRM temperatures are causing black screen defects. This is not true. EVGA had black screen issues on the first ~4% of its shipping product, resolved a few months ago, but they were entirely unrelated to the VRM temperature. If a VRM gets too hot, it will not do so with grace. There will be no "black screen" that can be resolved by a restart. The FETs / power stages will go up in a puff of smoke, and the card will never turn on again. These are two unrelated issues. The black screen defect -- for which we own one card exhibiting the issue -- was already resolved.

A few users have also indicated that VRAM contact is not sufficient between the heatsink and the VRAM thermal pads. We have not observed this on our cards. That is not to say that there is no such issue, but does mean that we can't validate it. VRAM modules can handle pretty high heat, anyway, and EVGA has begun shipping VRAM pads in addition to the VRM pads.

New Testing Procedures

So, then, we need to investigate the impact of thermals on card life. EVGA's issuance of thermal pads might suggest that there is something more to learn here, and so we'll be performing the following tests:

Tests with the card stock, with the old VBIOS and no thermal pads
Tests with just the VBIOS update
Tests with thermal pads only (no VBIOS update)
And tests with the thermal pads and the VBIOS update

This is all being done on our pair of EVGA GTX 1080 FTWs.

In addition to these test categories, we will run about a half-dozen tests on each configuration:

Kombustor's implementation of FurMark
Metro: Last Light and DiRT Rally
Overclocking and overvolting
Brief SLI testing
And high ambient torture tests

A few additional tests were performed, like FurMark (non-Kombustor) testing, 3DMark, and a few other games, but we began scrapping a few of the less useful tests as we narrowed the useful data-set to the above passes.

In our previous video on EVGA test planning, we explain that our new tests apply K-type thermocouples directly to the rear-side of the PCB and to hotspot MOSFETs numbers 2 and 7 when counting from the bottom of the PCB. The thermocouples used are flat and are self-adhesive (from Omega), as recommended by thermal engineers in the industry -- including Bobby Kinstle of Corsair, whom we previously interviewed.

K-type thermocouples have a known tolerance of approximately ±2.2C. We calibrated our thermocouples by placing them in an "ice bath," then in a boiling water bath. This provided us the information required to understand and adjust results appropriately.
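For readers who want to apply the same kind of adjustment, a two-point check like this is commonly turned into a linear correction. The following is a minimal sketch, not our exact procedure: the function name and the raw readings are hypothetical, and the 100C reference assumes water boiling at sea-level pressure.

```python
def make_two_point_correction(raw_ice, raw_boil, ref_ice=0.0, ref_boil=100.0):
    """Build a linear correction from two reference baths.

    raw_ice / raw_boil are what the thermocouple reported (in C) in the
    ice bath and the boiling bath. ref_boil=100.0 assumes sea-level
    pressure, which is an assumption of this sketch.
    """
    if raw_boil == raw_ice:
        raise ValueError("calibration readings must differ")
    gain = (ref_boil - ref_ice) / (raw_boil - raw_ice)
    offset = ref_ice - gain * raw_ice

    def correct(raw_c):
        # Map a raw reading onto the reference scale.
        return gain * raw_c + offset

    return correct


# Hypothetical raw readings, for illustration only.
correct = make_two_point_correction(raw_ice=0.8, raw_boil=101.3)
print(f"raw 94.0C -> corrected {correct(94.0):.1f}C")
```

The correction is only as good as the two reference points, so it narrows the offset error of a given probe rather than removing the tolerance entirely.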

As for other concerns, these were largely discussed in that EVGA test planning content. We'd mostly have to look out for (1) thermal conductivity and the impact of a thermocouple in its area of placement, and (2) electrical conductivity and avoiding inadvertent damage to components by accidentally causing an electrical short.

With Kinstle's help, we were able to locate flat thermocouples with an adhesive that will not prohibit transfer of heat between the MOSFET casing and its present thermal pads. As a reminder, EVGA included thermal pads between the FETs and the base plate on all cards from the get-go. The only places in which pads were not provided were between the base plate and the heatsink, and between the backplate and the PCB.

Our next point of concern was smaller, as it'd be easier to resolve and spot: EMI caused by inductors or the power plane PCB. We were able to avoid electromagnetic interference by routing the thermocouple wiring right, toward the less populated half of the board, and then down. The cables exit the board near the PCI-e slot and avoid crossing inductors. This resulted in no observable/measurable EMI with regard to temperature readings.

We decided to deploy AIDA64 and GPU-Z to measure direct temperatures of the GPU and the CPU (this becomes relevant during torture testing, when we dump the CPU radiator's heat straight into the VRM fan). In addition to this, fan speeds, VID, vCore, and other aspects of power management were logged. Because VRM temperatures are not measurable through software, our direct thermocouples will handle that aspect of testing.

The test platform is detailed below:

GN Test Bench 2015	Name	Courtesy Of	Cost
Video Card	EVGA GTX 1080 FTWs	EVGA	~$740
CPU	Intel i7-5930K CPU 3.8GHz	iBUYPOWER	$580
Memory	Corsair Dominator 32GB 3200MHz	Corsair	$210
Motherboard	EVGA X99 Classified	GamersNexus	$365
Power Supply	NZXT 1200W HALE90 V2	NZXT	$300
SSD	HyperX Savage SSD	Kingston Tech.	$130
Case	Top Deck Tech Station	GamersNexus	$250
CPU Cooler	NZXT Kraken X61 CLC	NZXT	$110

What The VRM Can Theoretically Handle

As for the VRM, TjMax on EVGA's parts is 150C, and the VRM probably -- one would hope -- initiates OTP (over-temperature protection) at 180C. The power stages best operate at 100C continuous, but have a tCase of 125C. Inductors really don't much matter for heat dissipation since they can take so damn much -- it's just copper wire coiled inside of a natural heatsink -- but they do heat up neighboring components. The FETs have a thermal pad contacting the baseplate; it's just that the original cards had no pad to allow transfer of heat from the base plate to the heatsink.

And after our tutorial on applying those thermal pads, we saw some misguided comments about EVGA's suggested placement of the pad atop the chokes. That is the best way to get contact to the fins of the heatsink and transfer heat and, even though it's not as much surface area as a coldplate, it still performs damn well -- a far cry better than the stock configuration. Some folks seemed to think that this pad placement stopped air from getting down there; well, air never got down there anyway, and the thermal pads flanking the chokes mean that it couldn't get beyond the inductors to begin with. Air also has a terribly low thermal conductivity -- you're looking at something like 0.026W/mK at 25C, as opposed to thermal pads (minimally 10W/mK, though we don't have an exact number) and aluminum (~205W/mK at 25C).
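To put those conductivity figures in perspective, here is a rough sketch using Fourier's law for steady one-dimensional conduction, q = k * A * dT / d. The geometry (a 1 cm^2 contact patch, a 1 mm gap) and the 10C temperature difference are hypothetical values chosen for illustration, not measurements from the card. The air figure is the commonly cited value for still air of about 0.026W/mK, and the pad figure is the rough 10W/mK estimate from the text. Convection is ignored entirely.

```python
def conducted_watts(k_w_per_mk, area_m2, thickness_m, delta_t_k):
    """Steady-state conduction through a flat layer (Fourier's law)."""
    return k_w_per_mk * area_m2 * delta_t_k / thickness_m


# Conductivities in W/mK. The pad value is the article's rough estimate.
MATERIALS = {
    "still air": 0.026,
    "thermal pad": 10.0,
    "aluminum": 205.0,
}

# Hypothetical geometry for illustration only.
AREA_M2 = 1e-4       # 1 cm^2 contact patch
THICKNESS_M = 1e-3   # 1 mm gap
DELTA_T_K = 10.0     # 10C across the gap

for name, k in MATERIALS.items():
    watts = conducted_watts(k, AREA_M2, THICKNESS_M, DELTA_T_K)
    print(f"{name:12s} {watts:8.3f} W")
```

Whatever geometry is chosen, the ratio between materials stays the same: for an identical gap, the pad conducts hundreds of times more heat than trapped air.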

The argument about air not "being able to get to the VRM with the new pads" is uninformed. Ignore it. We will prove that with testing later.

Then, of course, there are rumor mills like WCCFTech, which take no shame in pumping-out headlines like "VRM Burn-Out Issue Caught On Camera" without ever even attempting to validate if the VRMs are the issue. Other sites have used the word "explode" generously in headlines.

We're here to bring some actual testing to the discussion, hopefully bringing it back down to reality.

EVGA VRM Noise Test with VBIOS Profile Update

First, let's recap on noise. Our original noise tests on the EVGA VRM fan were conducted using preliminary information on the new VBIOS and its more aggressive fan speed profile. Since that time, EVGA's publicly issued VBIOS update reduced the fan speed profile from what we were initially provided. The final, maximum fan speed for a single card seems to sit around 1900~2050RPM, rather than the initially planned 2200RPM. The impact on noise is somewhat substantial, since our first tests showed a ~10dBA increase from the ~1600RPM of the original VBIOS. Here's the update with the final, public VBIOS profile:

EVGA VRM Thermal Testing (FurMark Kombustor): Old VBIOS, No TPADS

Let's start with the complete stock card, as it originally shipped from EVGA.

This first test is the stock card without overclock, running Kombustor FurMark as a burn-in. Remember that FurMark is sort of a power virus, and loads the VRM more heavily than any game will ever do. Also note that FurMark doesn't blast the clock as much as a game would, but load is still heavy.

Here's the chart. The colors will be the same for every chart shown, so memorize them now. Yellow is MOSFET 7, counting bottom-up, which is a significant hotspot on the card. Orange is MOSFET 2, a common scorch point in photos we've seen online, toward the bottom of the card. PCB is cyan, and is measured on the hotspot on the rear-side of the video card with the backplate on. GPU temperature is white and measured by software. The ambient temperature is also critical to these tests, as we'll later double ambient. That's the darker blue line at the bottom.

We're seeing the PCB achieving temperatures just shy of 100C after a one-hour burn-in. The MOSFETs are both at around 90-94C, with MOSFET 7 running a bit warmer. Ambient was in the low 20s. Case ambient, as we show in our 570X review, can be upwards of 40C in some enclosures. That would account for some gains in temperature, but not a 1-to-1 gain. We'll test for this situation later in this article.

So far, though, these are all numbers that the card is built to handle -- and that's with FurMark.

EVGA VRM Thermal Testing (Gaming): Old VBIOS, No TPADS

Here's Metro: Last Light running a burn-in. We're seeing temperatures closer to 85C for the PCB backside and MOSFET #7, with MOSFET #2 around 80C. That's about 10-20C cooler than with FurMark. Other games show similar performance results.

EVGA VRM Thermal Testing (Overclocking): Old VBIOS, No TPADS

This chart shows the overclocking impact on a 1080 FTW without the VBIOS update and without thermal pads, as benchmarked using FurMark. Temperatures get a little warmer here, now nearing 105C on the PCB and MOSFET 7. That's hot enough that high case ambient would decrease your efficiency as you near 110C, but you will still be within safe operating range as we show in the forthcoming high ambient tests. The overclock was +30% power (engaging the master switch), +100% OV allowance, and +125MHz core / +450MHz memclock.

Now, before that high ambient test, and before applying thermal pads and VBIOS updates, the next goal is to test SLI 1080 FTWs with a one-slot spacing between them.

Continue to the next page for SLI testing and VBIOS/TPAD testing.


EVGA VRM Thermal Testing (SLI 1080 FTWs): Old VBIOS, No TPADS

This next chart had some manual tuning going on about half-way through the test, so keep that in mind as you read the content. It's not a set run-and-done benchmark, as the others were. We manually tuned the fan profile and settings to attempt to create worse (but not worst) scenarios for testing. The worst case scenario will be our torture test, on the next page.

The first half of this test was without any tester interference at all. We're at around 90-94C for the FETs and the PCB temperatures, with GPU temperatures maintained around 81C -- the thermal limit to which fan RPM will slave -- and that triggers a fan RPM of about 80%.

This is where we decided to start playing with the idea of torture testing, and so we manually dropped the fan speed from auto back down to the 60% that a single card would exhibit. This pushes the temperatures up to about 105-107C and is quieter, but not realistic for a user.

Here's the interesting part. You might have noticed that the VRM component temperatures for SLI with FurMark are actually lower than the temperatures we saw with a single card. That's because of the GPU. The fans are not aware of the temperature for any other component on the board. The entirety of PWM for the fans is governed by the GPU, so the hotter the GPU core gets, the faster the fans will run. The fans, for all intents and purposes, have no idea that the VRM exists. Because SLI basically guarantees that at least one device is constantly under threat of exceeding 90C GPU core, let alone its 81C threshold, the fans spin-up and run faster than what a single card's profile will normally allow. Our speeds hit around 80% when auto-controlled. That cools the GPU, but the VRM fan is also spinning faster, thus cooling the VRM components.
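The behavior described above can be sketched as a fan curve whose only input is GPU core temperature. This is an illustration of the concept, not EVGA's firmware: the curve points are hypothetical (loosely shaped around the ~60% single-card and ~80% SLI speeds we observed), and the function names are ours.

```python
# Hypothetical fan curve: (GPU core temperature in C, fan duty in %).
# These points are illustrative and are not taken from the VBIOS.
FAN_CURVE = [(40, 20), (60, 40), (75, 60), (81, 80), (90, 100)]


def fan_duty(gpu_temp_c, curve=FAN_CURVE):
    """Linearly interpolate fan duty from GPU core temperature."""
    if gpu_temp_c <= curve[0][0]:
        return curve[0][1]
    if gpu_temp_c >= curve[-1][0]:
        return curve[-1][1]
    for (t0, d0), (t1, d1) in zip(curve, curve[1:]):
        if t0 <= gpu_temp_c <= t1:
            return d0 + (d1 - d0) * (gpu_temp_c - t0) / (t1 - t0)


def control_step(gpu_temp_c, vrm_temp_c):
    """One controller update. vrm_temp_c is accepted but never consulted,
    which is the whole point: the fans do not know the VRM exists."""
    return fan_duty(gpu_temp_c)


# Same GPU temperature, very different VRM temperatures: same fan speed.
print(control_step(75, vrm_temp_c=90))
print(control_step(75, vrm_temp_c=120))
# A hotter GPU core (as in SLI) is what actually raises fan speed.
print(control_step(81, vrm_temp_c=90))
```

Under this model, anything that heats the GPU core, such as a neighboring card in SLI, indirectly buys the VRM more airflow.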

So, if you've got an SLI configuration, it's not so grim. You should still update VBIOS, but this isn't a doomsday scenario. With a hot case, you'll run into more issues -- it's still not terrible, though.

The results for thermal pads alone, without the VBIOS update, are further down this page.

NEW VBIOS TESTING SECTION

EVGA Thermal Testing (FurMark Kombustor): New VBIOS, No TPADS

Here's the first chart of the VBIOS updates. This one shows the FTW VRM with the new VBIOS, but without the thermal pad mod. We're hitting temperatures of around 85C on the PCB and around 80C on the 2nd MOSFET. Compared to the stock card with the original VBIOS, this more aggressive fan speed profile improves performance by about 6-10C on the VRMs and about 15C on the PCB. This is mostly the same as what we reported with the original PCB temperature during thermal imaging, and reconfirms our statement that the VBIOS update alone is enough to resolve issues.

Metro: Last Light VRM Thermals with New VBIOS, No TPADS

And here's what it looks like with a video game. This is Metro: Last Light at some of its most demanding settings on the GPU side, looped for a full hour. The outcome is temperatures around 75C on the PCB and MOSFETs. That is more than just acceptable -- these VRMs can handle way higher heat than that. And this, mind you, is with an ambient temperature of about 23-25C, as opposed to the earlier ambient values nearer 20C.

Overclocking VRM Thermals with New VBIOS, No TPADS

Last chart for the VBIOS updates; at least, before we move on to the thermal pad mod. This chart shows overclocking of +30% with the Master switch, and overvolting of +100%. The VRMs are hitting temperatures of around 85-90C, which is an improvement of more than 10C when looking at the old VBIOS and its OC performance. Again, those are acceptable temperatures.

NEW THERMAL PADS SECTION (Old VBIOS)

FurMark (Kombustor) VRM Thermals with Old VBIOS, New TPADS

With FurMark and the thermal pads, but the original VBIOS, the PCB backside drops below 90C, with MOSFET #7 right around 80C. MOSFET #2 is closer to ~77-79C.

Metro: Last Light VRM Thermals with Old VBIOS, New TPADS

For Metro, similar scaling is seen. The temperatures are below 80C across the board. If you wanted, based on these results, you could go with just the thermal pad change and forgo the VBIOS, if the quieter fan profile is preferred.

NEW VBIOS WITH THERMAL PADS SECTION

FurMark Kombustor VRM Thermals with New VBIOS, New TPADS

And here's a look at thermal pads in addition to the new VBIOS, when running FurMark and stock clocks. The card is way below where it was stock -- somewhere to the tune of 20-30C, depending on which test and component you look at.

Temps are around 85-87C for the backplate and are around 75-80C for the FETs while FurMark burn-in is running.

Continue to the final page for the torture test, where we actively try to set the card on fire.


Just to give you an idea, our video of this content has some footage of our CPU radiator dumping heat straight into the VRM fan. That's done with Prime95 running LFFT crunching on an overclocked CPU that's operating at ~170W, and with only one of the radiator fans turned on. This is more than just a simulation of a high ambient case temperature, it's directed, hot airflow straight at the VRM side of the card. The effective temperature that the card was breathing was often 40C of pure heat from the CPU radiator.

We even lowered the fan RPM on the EVGA card to 50%, provided 30% more power, overvolted it 100% of allowance, and overclocked it by 125MHz. Another test that we'd tried was connecting a case fan pointed strictly at the GPU, such that the GPU could run a colder temp and thus a higher clock, but the VRM would still be incinerated by lack of cool air.

First Torture Test: Old VBIOS, No TPADS

None of it worked. This chart is the worse of the two. There are no new thermal pads installed and the old VBIOS is in use, and after overclocking (seen around the 3000-second mark), we were able to hit about 126.8C max on the PCB backside, and about 108-109C on the VRMs. The ambient fluctuation you see is from when the radiator was moved closer to the card, so that it would incinerate itself as much as possible.

Moving on to the torture test with the new VBIOS and thermal pads, the worst we got was about 82C on the PCB. That's mostly the thermal pads doing work, too, so don't listen to those comments saying that the thermal pads do nothing. They are utterly, completely false, and based on nothing. We've done dozens of hours of tests at this point, and we can confidently tell you that the thermal pads massively contribute to performance. Just looking at the fan speeds of those two charts, you'll see that we're even below the overall speed when we were manually tuning the cards for worst case scenarios.

How Many EVGA Cards Have Failed? If Not Thermals, Then Why?

In speaking with EVGA, it sounds like they're at about a 200 DPPM (Defective Products Per Million) rate for their cards. This means that, for every one million cards shipped, about 200 are defective. We're told that this number is fairly consistent with previous generations; it's just that the defects are more noticeable this time because of the way the internet works.
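As a quick sanity check on that figure, the conversion from DPPM to a percentage is simple arithmetic. In this short sketch, the 200 DPPM input is the number EVGA gave us, while the 50,000-unit shipment is a hypothetical quantity used only to show the scale.

```python
def dppm_to_percent(dppm):
    """Convert defective parts per million to a percentage."""
    return dppm / 1_000_000 * 100


def expected_defects(dppm, units_shipped):
    """Expected number of defective units for a given shipment size."""
    return dppm * units_shipped / 1_000_000


print(f"{dppm_to_percent(200):.2f}% of shipped cards")
# Hypothetical shipment size, for illustration only.
print(f"{expected_defects(200, 50_000):.0f} defective units per 50,000 shipped")
```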

In our video, this is the point where Actually Hardcore Overclocking's Buildzoid joins us to analyze the VRM from the electrical side. Check the video for that (~20 minutes in).

It's clear to us, at this point, that there's no thermal issue with EVGA's cards. Any defects we're seeing are being caused by other factors, like workmanship or manufacturing defects. If you have a problem, contact EVGA -- but it's probably not being caused by thermals, and it does still seem like the internet has amplified these occurrences to a point where it sounds worse than it is in reality.

That's not to let EVGA off the hook. Overlooking thermal pads was silly, if only because their performance is measurably worse in at least one aspect when compared to competition. It's good that they've acted quickly to add thermal pads, but that never really was the issue to begin with. People see temperatures hitting 100C, 105C, and immediately worry, we think; that's normally the TjMax of a CPU or above TjMax on a GPU, so it sounds bad. But the VRM can handle 125C ambient / 150C TjMax, and tCase on the power stages is 125C (~100C recommended continuous by OnSemi).
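The headroom argument is easier to see as subtraction. This sketch compares approximate peak MOSFET readings quoted earlier in this article (old VBIOS, no thermal pads, rounded to the upper end of each range) against the 125C tCase limit and the ~100C recommended continuous figure. The structure and names are ours, for illustration.

```python
T_CASE_MAX_C = 125.0              # tCase limit on the power stages
RECOMMENDED_CONTINUOUS_C = 100.0  # recommended continuous operation

# Approximate peak MOSFET readings from this article (old VBIOS, no pads).
PEAK_MOSFET_C = {
    "FurMark, stock": 94.0,
    "FurMark, overclocked": 105.0,
    "Torture test": 109.0,
}


def headroom(measured_c, limit_c=T_CASE_MAX_C):
    """Degrees C remaining before the given limit is reached."""
    return limit_c - measured_c


for scenario, temp_c in PEAK_MOSFET_C.items():
    print(
        f"{scenario:22s} {temp_c:5.1f}C  "
        f"vs tCase: {headroom(temp_c):+5.1f}C  "
        f"vs continuous: {headroom(temp_c, RECOMMENDED_CONTINUOUS_C):+5.1f}C"
    )
```

A negative number in the last column means the run exceeded the recommended continuous figure while still sitting below the tCase limit, which is where the overclocked and torture runs land.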

Any failed cards -- although DPPM sounds low, from EVGA's conversation with GamersNexus -- seem to be attributable to manufacturing, craftsmanship, or other defects. Bad caps that eventually short into the board would mostly explain the issue, since that could occur regardless of load and heat -- it'd just be a gradual decline of the capacitor's ability to function properly, eventually leading to a pop and scorch mark. We think that's most likely.

For those who may be concerned that the capacitors are overheating: EVGA's poscap and solid-state caps are both rated for 105C. Even in hour-long stress tests with gaming workloads, we were not achieving temperatures above roughly 85C on the FETs and the back-side of the PCB (which will be the hottest and show some delta vs. the front-side). You would probably have to run FurMark for a long period, overclock/volt the card, and stick it into a case with something like a 40C ambient to cause a capacitor burn-out. It's just not a likely point of failure for gaming and real-world workloads.

Recapping what was stated in the video conclusion: If you already own one of these devices, stop worrying about it. Apply the VBIOS update or thermal pads, just because "free" thermal performance is always good, but don't worry about the VRMs popping. If something dies, it would have happened anyway, and that chance is well below 1% of shipping product (per EVGA RMA numbers). Far below it, actually -- closer to 0.02%.

If you were looking at buying one of these cards, but feel uncomfortable with a purchase because of previous black screen and thermal concerns (though now debunked, they didn't make the company look good), then skip the purchase. Buy something else. There's no harm in that.

That said, if you had your heart set on the cards, then we see no thermal/VRM reason to forgo a purchase at this time. All cards sold after November 1 contain the VBIOS update and thermal pads, and EVGA has informed us that they've pulled some cards back from retailers to apply them. Now, again, that wasn't really ever the issue. Try to keep that in mind. It's just that they should have been there to begin with, but not because the heat was causing failures. All the failures that are being posted to web forums are likely an amplification of what normally happens privately through RMA emails, but louder because of how the internet works.

To restate: This isn't saying EVGA is in the right. The card could have been designed better, and there are still failures, it's just not the reason everyone seemed to think. Maybe bad caps, maybe the usual mix of workmanship / supply-side quality control, but not the VRM temperatures.

Editorial, Test Lead: Steve "Lelldorianx" Burke
Video Production: Andrew "ColossalCake" Coleman
Additional Analysis: "Buildzoid" of Actually Hardcore Overclocking