More than 80% of the variance in SPC-2 storage benchmark results appears to derive from the capacity of the storage system alone (controlling for the rotational speed of the non-SSD drives). This statistic makes sense when considering the variables at hand. A fire hose can expel more water per second than a straw because the pipe is wider. Not surprisingly, the same is true of storage: the more disks available to a system, the more IOPS and MBPS it can achieve. When scanning headlines of SPC-2 results, though, that message is not entirely clear.
The Y-axis in the graph above shows the SPC-2 results for the subset of benchmarked systems that use 15K RPM drives. When the X-axis (capacity) has undergone a standard transformation, a trend line can be fitted that gives a coefficient of determination (the R² value at the top) of 0.8929. In layman’s terms, the interpretation is that 89% of the variation in the results is explained by capacity alone. In fact, though, there are likely to be other variables in play. Confusingly, the correlation between MBPS and capacity is even stronger than the correlation between MBPS and the number of disks. That discrepancy probably comes from storage technology upgrades. For example, 146 GB drives used a 4 Gb/s FC standard while 300 GB drives typically use 8 Gb/s FC.
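The fit described above can be sketched in a few lines. This is only an illustration under stated assumptions: the post does not say which transformation was applied to the X-axis, so a natural logarithm is assumed here, and the capacity and throughput numbers are synthetic placeholders generated from a formula, not SPC-2 results. The function name fit_log_linear is hypothetical.

```python
import math

def fit_log_linear(capacity, mbps):
    """Least-squares fit of mbps = a + b * ln(capacity).

    Returns (a, b, r_squared). The log transform is an assumption;
    the original analysis only mentions a "standard" transformation.
    """
    x = [math.log(c) for c in capacity]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(mbps) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, mbps))
    b = sxy / sxx                  # slope of the trend line
    a = mean_y - b * mean_x        # intercept
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, mbps))
    ss_tot = sum((yi - mean_y) ** 2 for yi in mbps)
    return a, b, 1.0 - ss_res / ss_tot   # R² = 1 - SS_res / SS_tot

# Synthetic placeholder data (NOT benchmark results): throughput is built
# to follow an exact log relationship, so R² comes out at 1 up to rounding.
capacities = [10, 20, 40, 80, 160]
throughputs = [1000 + 2000 * math.log(c) for c in capacities]
a, b, r2 = fit_log_linear(capacities, throughputs)
print(f"intercept={a:.1f} slope={b:.1f} R^2={r2:.4f}")
```

With real data the points would scatter around the line, and R² would fall below 1 by the share of variation the line leaves unexplained.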
Capacity also shows a much higher correlation with MBPS than the application storage units (ASUs), or any of the other variables tested during research. The explained variance above also does not control for the mixture of configurations (RAID 5, RAID 6, and mirroring) in the results. When systems are divided into RAID categories, i.e. when RAID type is controlled for, the relationship between capacity and MBPS becomes even stronger. However, the number of results within each RAID group is not large, so the trend is less convincing. It is possible that the modeling above can be improved upon. For instance, the results could be tested for heteroscedasticity (i.e. residual variance that changes across the range of the data instead of staying constant). If there is heteroscedasticity, then a technique that models it, such as a variation of autoregressive conditional heteroscedasticity, might prove a better modeling approach.
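One simple way to run such a check is a Breusch-Pagan style test in Koenker's n·R² form: fit the trend line, regress the squared residuals on the same predictor, and multiply the R² of that second regression by the number of observations. The sketch below is a minimal version of that idea, not the analysis used for this post. The helper names are hypothetical and both data sets are synthetic, built only to show a case where the spread grows with the predictor and a case where it stays constant.

```python
def ols(x, y):
    """Simple one-variable least squares; returns (intercept, slope, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

def breusch_pagan_lm(x, y):
    """LM statistic = n * R² of the regression of squared residuals on x."""
    intercept, slope, _ = ols(x, y)
    squared_residuals = [(yi - (intercept + slope * xi)) ** 2
                         for xi, yi in zip(x, y)]
    _, _, aux_r2 = ols(x, squared_residuals)
    return len(x) * aux_r2

# Synthetic placeholder data (NOT benchmark results).
x = list(range(1, 21))
# Spread grows with x: alternating error proportional to x.
y_growing = [2 * xi + (-1) ** xi * 0.5 * xi for xi in x]
# Spread is constant: alternating error of fixed size.
y_constant = [2 * xi + (-1) ** xi * 3 for xi in x]

CHI2_CRITICAL_1DF_5PCT = 3.84  # chi-square critical value, 1 degree of freedom, 5% level

lm_growing = breusch_pagan_lm(x, y_growing)
lm_constant = breusch_pagan_lm(x, y_constant)
print("growing spread :", lm_growing > CHI2_CRITICAL_1DF_5PCT)
print("constant spread:", lm_constant > CHI2_CRITICAL_1DF_5PCT)
```

A statistic above the critical value suggests the residual spread depends on the predictor, in which case the plain trend line's error estimates should be treated with caution.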
More telling results could be produced by segmenting systems into capacity cohorts (i.e. ranges of capacity) and determining what causes variation within a given range. Comparing systems that have been normalized in this way would show which system performs better at a given capacity. For the time being, though, just knowing how much storage is required to achieve a particular SPC-2 result may be useful information for end users to consider.
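The cohort idea can be sketched as follows. Everything here is assumed for illustration: the system names, the capacities (taken to be in TB), the MBPS figures, and the cohort boundaries are all made up, and summarize_cohorts is a hypothetical helper, not part of any benchmark tooling.

```python
from bisect import bisect_right
from statistics import mean, pstdev

def summarize_cohorts(systems, edges_tb):
    """Group (name, capacity_tb, mbps) tuples into capacity cohorts.

    edges_tb is an ascending list of boundaries; a capacity below the first
    edge lands in cohort 0, between the first and second in cohort 1, etc.
    """
    cohorts = {}
    for name, capacity_tb, mbps in systems:
        index = bisect_right(edges_tb, capacity_tb)
        cohorts.setdefault(index, []).append((name, mbps))

    summary = {}
    for index, members in sorted(cohorts.items()):
        values = [mbps for _, mbps in members]
        summary[index] = {
            "mean_mbps": mean(values),
            "spread_mbps": pstdev(values),   # variation left within the cohort
            "best": max(members, key=lambda m: m[1])[0],
        }
    return summary

# Made-up placeholder systems (NOT benchmark results).
systems = [
    ("A", 20, 3000), ("B", 40, 5000),
    ("C", 60, 6000), ("D", 90, 9000),
    ("E", 150, 10000), ("F", 200, 14000),
]
summary = summarize_cohorts(systems, edges_tb=[50, 100])
for index, stats in summary.items():
    print(index, stats)
```

The within-cohort spread is the part that capacity does not account for, which is where differences between architectures would be expected to show up.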






I am not sure where we are in disagreement. I said that MBPS is correlated with capacity (and drive count). You said that drive count is fundamentally limited by the controller. If I were in the fireman business, I would say that we have two variables: the diameter of the hose and the amount of water available. It would be strange to say a fireman couldn't save a house from burning down because of the diameter of the hose and not the amount of water available.
That said, I try to be careful, while still writing to lay-people, about the causation-correlation link. I believe there are probably other more fundamental variables than capacity. But, to some extent, who cares?
If you are in a windowless building, and can only see the people entering through the door, a good way to check if it is raining is by counting how many come in with umbrellas. Imagine a co-worker and I are about to walk outside in this situation. I say, “I need to grab an umbrella.” They retort, “But there is no causation between an umbrella and rain.” I would probably tell them “good luck with that” and still grab my umbrella, because it is a really good indicator of what awaits: rain.
A good indicator is what we are trying to find here - even without causation.
Posted by: Chris Gaun | August 08, 2011 at 10:10 PM
Don't you think you might be confusing correlation with causation? This is something EMC does in its criticism of SPC-1 also. Vendors are primarily marketing controllers here, so they will keep stacking their controllers with drives until the controllers or pipes choke, i.e. performance is correlated with drive count, but not limited by drive count. So perhaps it is reasonable to consider that SPC does show you at what drive count the controller architecture will choke?
Posted by: Storagebuddhist.wordpress.com | August 08, 2011 at 03:29 PM