
Octane Dual-CPU SPECfp95 Performance
Comparison Using Different R10000s

Last Change: 26/May/1998

"People use statistics as a drunk uses a lamp post: for support rather than illumination."

SPEC's Introduction to SPEC95

Note: the 2D bar graphs shown here for the various SPECfp95 tests have been drawn to the same scale. They are also at the same scale as other 2D bar graphs for dual-CPU systems, but they are not to the same scale as any 2D bar graph for a single-CPU system, or 4-CPU system, etc.


Objectives

This analysis compares the SPECfp95 performance of different dual-CPU Octane configurations. At present, this means comparing dual-R10K/195 to dual-R10K/250.

SPECint95 is not covered because the MIPSpro Auto Parallelizing Option does not appear to be relevant for running integer tasks on multi-CPU systems. At least, that's what I must infer from the lack of any multi-CPU SPECint95 results. I don't think this means integer tasks can't be parallelised; rather, I believe it's simply the case that, at present, the compilers do not deal with parallel integer optimisation.

There are many other tasks which can benefit from dual CPU systems, eg. LSDyna (engineering analysis), Performer, etc. However, data on those tasks is hard to come by, so this page deals only with SPECfp95.

Note that I do not have any SPECfp95 data for dual-R10K/175MHz. Since many systems will be using this CPU, please contact me if you have any relevant detailed data.

As with all these studies, a 3D Inventor model of the data is available (screenshots of this are included below). Load the file into SceneViewer or ivview and switch into Orthographic mode (ie. no perspective), rotate the object horizontally then vertically, etc.

All source data for this analysis came from www.specbench.org.

Given below is a comparison table of available dual-CPU R10000 SPECfp95 test results for Octane. Faster configurations are leftmost in the table (in the Inventor graph, they're placed at the back). After the table and 3D graphs is a short-cut index to the original results pages for the various systems.

          R10000    R10000       % Increase
         2x250MHz  2x195MHz   (2x195 -> 2x250)

tomcatv    49.1      45.7           7.44%
swim       63.6      58.9           7.98%
su2cor     20.0      17.0           17.7%
hydro2d    15.6      14.4           8.33%
mgrid      32.2      28.4           13.4%
applu      19.9      17.3           15.0%
turb3d     16.4      13.4           22.4%
apsi       15.7      12.7           23.6%
fpppp      37.7      29.8           26.5%
wave5      29.3      21.7           35.0%

Peak Avg:  26.6      22.7           17.2%
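
For reference, the percentage column is the straightforward relative change between the two configurations, and the 'Peak Avg' row is the geometric mean of the ten individual SPEC ratios (the standard SPEC95 composite). A minimal Python sketch, using the figures from the table above, reproduces both:

    # SPECfp95 ratios copied from the table above.
    r250 = {'tomcatv': 49.1, 'swim': 63.6, 'su2cor': 20.0, 'hydro2d': 15.6,
            'mgrid': 32.2, 'applu': 19.9, 'turb3d': 16.4, 'apsi': 15.7,
            'fpppp': 37.7, 'wave5': 29.3}
    r195 = {'tomcatv': 45.7, 'swim': 58.9, 'su2cor': 17.0, 'hydro2d': 14.4,
            'mgrid': 28.4, 'applu': 17.3, 'turb3d': 13.4, 'apsi': 12.7,
            'fpppp': 29.8, 'wave5': 21.7}

    def geomean(values):
        # The SPEC95 composite figure is the geometric mean of the ten ratios.
        product = 1.0
        for v in values:
            product *= v
        return product ** (1.0 / len(values))

    for test in r250:
        pct = 100.0 * (r250[test] - r195[test]) / r195[test]
        print('%-8s  %5.1f  %5.1f  %+5.1f%%' % (test, r250[test], r195[test], pct))

    print('Peak Avg: %.1f vs %.1f' % (geomean(r250.values()), geomean(r195.values())))
    # Prints roughly 26.6 vs 22.7, matching the table.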

      Octane SPECfp95 Comparison

[Left Isometric View] [Right Isometric View]

(click on the images above to download larger versions of the views shown)

[Test Suite Description | 2x250MHz | 2x195MHz]


Next, a separate comparison graph for each of the ten SPECfp95 tests:

[Per-test comparison graphs: tomcatv | swim | su2cor | hydro2d | mgrid | applu | turb3d | apsi | fpppp | wave5]

Observations

Do the above results look rather confusing compared to the Octane single-CPU comparison results and the Octane vs. Origin (single-CPU) comparison results? If so, good! I say this because I don't like the fact that SPECfp95 allows autoparallelisation. Why? My rationale is simple: not all of the ten tests can be parallelised to any great degree; in fact, some cannot be parallelised at all, as SPEC's own documents state. There are thus two totally different statistical factors at work which decision makers need to be aware of, and they are mixed together in the results: the raw per-CPU speed increase from the faster clock, which benefits every test, and the degree to which each test can be parallelised across both CPUs, which ranges from not at all to very substantial.

So what does this all mean? It means you can't tell what on earth is going on! For good evidence of this, look at the factor obtained by dividing the 2x250MHz figures (or the 2x195MHz figures; it makes little difference which) by the percentage increases they correspond to (reorganised here to give the highest factor first):

          R10000      % Increase           Factor
         2x250MHz  (2x195 -> 2x250)

swim       63.6          7.98%              8.0
tomcatv    49.1          7.44%              6.6
mgrid      32.2          13.4%              2.4
hydro2d    15.6          8.33%              1.9
fpppp      37.7          26.5%              1.4
applu      19.9          15.0%              1.3
su2cor     20.0          17.7%              1.1
wave5      29.3          35.0%              0.8
turb3d     16.4          22.4%              0.7
apsi       15.7          23.6%              0.6
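
For completeness, here is a small self-contained Python sketch (purely illustrative) that reproduces the factor column from the figures above; the factor is just the 2x250MHz ratio divided by its percentage increase:

    # Factor column: dual-250MHz SPEC ratio divided by its percentage increase.
    # Ratios and percentage gains copied from the tables above.
    data = {  # test: (2x250MHz ratio, % increase over 2x195MHz)
        'swim': (63.6, 7.98),   'tomcatv': (49.1, 7.44), 'mgrid': (32.2, 13.4),
        'hydro2d': (15.6, 8.33), 'fpppp': (37.7, 26.5),  'applu': (19.9, 15.0),
        'su2cor': (20.0, 17.7),  'wave5': (29.3, 35.0),  'turb3d': (16.4, 22.4),
        'apsi': (15.7, 23.6),
    }
    for test, (ratio, pct) in sorted(data.items(), key=lambda kv: -kv[1][0] / kv[1][1]):
        print('%-8s  factor %.1f' % (test, ratio / pct))
    # swim ~8.0 at the top, apsi ~0.6 at the bottom, matching the table.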

I hope you can see what is happening: in general, the tests which were already doing well on the lower-clocked dual-CPU system because of parallelisation show the lowest percentage increases when upgrading (and thus have the highest factors), while the tests which might as well be running on a single-CPU system, because they are hard to parallelise (and had lower original scores as a result), show the highest percentage increases. This produces the bizarre situation whereby the tests with the highest absolute SPEC ratios give the worst percentage gains from an upgrade, simply because a) the non-parallelisable tests behave as if they were running on a single-CPU system, and b) for the parallelised tests the percentage increase applies to a figure which already reflects both CPUs, so although the percentage looks small, the absolute SPEC ratio remains far higher than anything single-CPU performance can reach.

In other words, if one focuses too much on percentage increases in performance for parallelisable tests, one can easily lose sight of the fact that the absolute performance is still enormous compared to single-CPU performance.

What irritates me is the way the various results, ie. those which gain from dual CPUs and those which don't, are all mixed together to produce a final SPECfp95 peak average which statistically has nothing to do with the actual variance and behaviour of the individual tests, because we're effectively mixing single- and dual-CPU performance metrics. This is like combining results from surveys that used different sample sizes (and therefore have different error margins). Ouch!

So which tests do actually benefit from autoparallelisation? Here is a comparison of single vs. dual Octane/195 (complete analysis available separately):

          R10000    R10000        % Increase
         2x195MHz  1x195MHz    (1x195 -> 2x195)

tomcatv    45.7      25.3            80.6%
swim       58.9      40.6            45.1%
su2cor     17.0      9.64            76.4%
hydro2d    14.4      9.97            44.4%
mgrid      28.4      15.9            78.6%
applu      17.3      11.2            54.5%
turb3d     13.4      13.8            -2.9%
apsi       12.7      12.8            -0.8%
fpppp      29.8      29.7            0.3%
wave5      21.7      22.4            -3.1%

Well, it couldn't be clearer in my opinion: turb3d, apsi, fpppp and wave5 are not affected, while all the other tests gain to a significant degree. This doesn't mean that the non-accelerated tests cannot be parallelised; it merely means that SGI's compilers don't currently affect them (whether the tests genuinely cannot be accelerated is a separate issue). I say this because results from other vendors show that different vendors' autoparallelising compiler options behave in very different ways. Note: if you're wondering why some tests appear to slow down slightly, my assumption is that the autoparallelising option interferes slightly with the optimisations which would normally occur for those tests on a single-CPU system (besides, the differences are well within standard margins of error anyway).
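
To make that split mechanical rather than eyeballed, one could simply threshold the 1-CPU to 2-CPU speedup. Here is a minimal Python sketch along those lines (figures copied from the table above; the 10% cut-off is my own arbitrary choice):

    # Split the tests into 'parallelised' and 'effectively single-CPU' groups.
    dual   = {'tomcatv': 45.7, 'swim': 58.9, 'su2cor': 17.0, 'hydro2d': 14.4,
              'mgrid': 28.4, 'applu': 17.3, 'turb3d': 13.4, 'apsi': 12.7,
              'fpppp': 29.8, 'wave5': 21.7}
    single = {'tomcatv': 25.3, 'swim': 40.6, 'su2cor': 9.64, 'hydro2d': 9.97,
              'mgrid': 15.9, 'applu': 11.2, 'turb3d': 13.8, 'apsi': 12.8,
              'fpppp': 29.7, 'wave5': 22.4}

    THRESHOLD = 10.0   # arbitrary cut-off, in percent
    for test in dual:
        gain = 100.0 * (dual[test] - single[test]) / single[test]
        label = 'parallelised' if gain > THRESHOLD else 'effectively single-CPU'
        print('%-8s  %+6.1f%%  %s' % (test, gain, label))

Any sensible threshold gives the same split: tomcatv, swim, su2cor, hydro2d, mgrid and applu gain substantially, while turb3d, apsi, fpppp and wave5 sit within noise of their single-CPU scores.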


Variance in SPECfp95 for most RISC systems is weird enough already (eg. R10K O2 shows almost an order of magnitude difference between its highest and lowest individual results), but allowing autoparallelisation makes the situation for decision makers much worse. For the dual-250MHz Octane, the final peak result is 26.6, yet the individual figures range from 15.6 (hydro2d) to 63.6 (swim), ie. from well under two-thirds of the average to nearly 2.5 times as much. Of what use then is the final peak average? It conveys no useful information whatsoever about the system's overall performance. SPEC95 is supposed to be a useful guide, but we now have a situation where people could be comparing single- vs. multi-CPU results without being aware of it, or comparing multi- vs. multi-CPU results for systems with totally different autoparallelisation profiles.

If you're thinking ahead by now, you'll realise that this situation becomes considerably worse when looking at multi-CPU Origin systems with 4, 8, 16, etc. CPUs. Individual results range from 20 to more than 400, making complete nonsense of the final average. During early 1998, SGI held the 'absolute' SPECfp95 record simply because of the accidental effects of statistical averaging. If a rival vendor's system had accelerated a smaller number of the tests, each to a proportionally greater degree (eg. accelerating 5 tests by an average of 80% each instead of 6 tests by an average of 63.3% each), then the rival could have held the record even though one could argue their autoparallelising software was of less use because it applied to fewer code types. This is insane! It's the Jacob's Ladder of statistical nightmares! Note that since I first wrote this page, DEC has released the 21264 and so now holds the absolute SPECfp95 record, at least for the moment; of course, their results show a similar spread of affected and non-affected tests.
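
To see how close those two hypothetical profiles really are, recall that the composite is a geometric mean over all ten tests; a quick check (my own illustration, using the example figures from the paragraph above) shows they boost the overall average by essentially the same factor:

    # The SPECfp95 composite is a geometric mean over the ten tests, so an
    # acceleration applied to k of the 10 tests raises it by factor**(k/10).
    print(1.800 ** (5 / 10.0))   # 5 tests boosted by 80%   -> ~1.34
    print(1.633 ** (6 / 10.0))   # 6 tests boosted by 63.3% -> ~1.34

Two very different acceleration profiles, a virtually identical composite boost: that is exactly the problem.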

So, I say never use autoparallelised multi-CPU peak final averages in conversations about system performance, whether comparing with other vendors or even with the same vendor. The results are meaningless, the average is meaningless and any conclusions drawn will be meaningless. This is a case where one must absolutely break away from the final average and deal with the individual test results. Only then does one see the interesting phenomena which are important to decision makers. From the studies I've carried out, there is one important fact that a decision maker should know about autoparallelised results (burn this into your brain if you are such a person): the final peak average tells you nothing about which of the ten tests were accelerated, or by how much; only the individual test results can tell you whether your particular kind of code will benefit.

If you think about this for a minute, the reason why the final averages are utterly meaningless should become clear: two different systems from two different vendors can give detailed results which are radically different (ie. different selections of the ten fp tests accelerated to different degrees), yet the blind averaging process can squash all that detail away to give final peak numbers implying the two systems are very similar. This is statistical madness. A decision maker who is interested only in a particular type of task may be faced with similar averages from two vendors when in fact the individual results show that vendor A's 8-CPU system will make their code eight times faster whereas vendor B's system won't accelerate their code at all. Now I hope you can see why this page has the humorous subtitle. :)

For a typical example of this, examine the individual results for the AlphaServer 8400 5/625; this 8-CPU system gives a final peak result of 56.7 (not that far off the 8-CPU Origin2000's result of 66.4), yet the individual results show the DEC system to be accelerating its own selection of the ten fp tests in a very different way; eg. comparing 8-CPU results to 1-CPU results, tomcatv is accelerated on the Alpha by 143% compared to the Origin2000's 628%. On the other hand, turb3d is accelerated on the Alpha by 436%, compared to the Origin's 0.5%; both systems accelerate swim by a good margin: Alpha runs it 12X faster whilst Origin runs it 9X faster. Here's a quick index for you to compare results at your leisure:

[ AlphaServer 8400 (8-CPU) | AlphaServer 8400 (1-CPU) ]

[ Origin2000 (8-CPU) | Origin2000 (1-CPU) ]

So, watch out! Autoparallelised results can damage your health (read 'wealth' :)

Some general guidelines:

But there's more confusion, I'm afraid. Differences in compiler revisions have caused further oddities. Out there in SPEC-land are multi-CPU results which do not use the autoparallelising option, mostly older results from the early part of 1997, eg. 2-CPU and 32-CPU Origin2000; these cannot meaningfully be compared with later results which do use the autoparallelising option. One can end up wondering why a 32-CPU system appears to be much slower than a 16-CPU system...

Thus, when looking at system results on the large SPEC95 summary page, note the number of CPUs listed and, if more than 1 CPU is in use, take special note of whether or not the 'Software Information' section on the detailed results page includes any mention of any kind of compiler parallelising option. These two factors should determine how you treat a particular result.
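
If you keep your own notes on SPEC results, that rule of thumb amounts to a three-way bucket. A hypothetical Python sketch (the record format here is my own invention, not SPEC's) might look like this:

    # Bucket SPEC95 results so incomparable configurations are never
    # averaged or compared directly. Field names are hypothetical.
    def bucket(result):
        if result['cpus'] == 1:
            return 'single-CPU'
        if result['autoparallel']:   # any parallelising compiler option listed
            return 'multi-CPU, autoparallelised'
        return 'multi-CPU, not autoparallelised'

    example = {'system': 'Origin2000', 'cpus': 32, 'autoparallel': False}
    print(bucket(example))    # -> 'multi-CPU, not autoparallelised'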

Well, that's enough moaning for now. Suffice to say that I think mixing parallelised results with non-parallelised results is really silly. It's made an already confusing situation much worse. There ought to be separate benchmarks for parallelised results, eg. SPECPAR95 (SPECpar_int95 and SPECpar_fp95, or something similar). I hope this happens with SPEC98.

