Deduplication Doubling

Doubling has been a big theme in the last few weeks. Deduplication vendors EMC, Exagrid, NEC, Quantum, and SEPATON all touted at least 2x performance improvements in their deduplication disk appliances.

EMC Data Domain made headlines promoting its fastest single-node performance. In this case, EMC’s DD860 performs at 5.1 TB/hr in a standard configuration and 9.8 TB/hr with Networker- and Symantec OST-enabled DD Boost, while the DD890 posts throughput of 8.1 TB/hr and 14.7 TB/hr for standard and DD Boost configurations, respectively.

Exagrid recently upgraded the processors and IO subsystems in its EX Series systems and enhanced memory and connectivity to speed deduplication processing of backup streams. With throughput of 1.8 TB/hr per node, combining 10 nodes delivers aggregate throughput of 18 TB/hr.

NEC enhanced its HYDRAstor HS8-3000 with a 50% performance increase over its previous version, now at 2.7 TB/hr per node, and set a new mark for throughput scalability with up to a whopping 148.5 TB/hr for the largest supported configuration.

After updating DXi hardware platforms in 2010, Quantum kicked off 2011 with updated DXi 2.0 software. The enhanced DXi software includes new file system technology, is architected for new processors, and deduplicates data in memory before it is written to disk, all adding to the performance improvements. Now, for example, Quantum claims up to 4.6 TB/hr on a DXi6500 with OST and up to 4.3 TB/hr with a NAS interface.

SEPATON’s S2100-ES2 1910 and 2910 models doubled throughput versus previous S2100-ES2 systems, with throughput of 5.4 TB/hr per node in a 10 GbE and Symantec OST configuration, and a tested maximum ingest rate of 43.2 TB/hr in an eight-node configuration (although the eight-node configuration is not the maximum potential for the solution).

The distinction among these solutions is the underlying architecture. Exagrid, NEC, and SEPATON have a grid architecture in which both throughput performance AND storage capacity scale by adding nodes to the configuration. EMC Data Domain and Quantum offer some modularity within a specific system, but ultimately have maximums on throughput and capacity. The advantages of the grid's modular approach are that IT organizations don’t have to over-buy in their initial purchase in order to meet future data growth requirements and don’t have to do a forklift upgrade to the next model when requirements go beyond the upper thresholds of an initial purchase.
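To make the grid math concrete, here is a rough sketch of how many nodes each grid vendor would need to match a given throughput target. It uses the per-node figures quoted above and assumes throughput scales linearly with node count, which is how the vendors present their aggregate numbers; real-world scaling may differ. The function names are illustrative only.

```python
import math

# Per-node throughput in TB/hr, taken from the vendor claims quoted above.
PER_NODE_TB_HR = {
    "Exagrid EX Series": 1.8,
    "NEC HYDRAstor HS8-3000": 2.7,
    "SEPATON S2100-ES2": 5.4,
}


def aggregate_tb_hr(nodes, per_node_tb_hr):
    """Aggregate throughput, ASSUMING perfectly linear scaling per node."""
    return nodes * per_node_tb_hr


def nodes_needed(target_tb_hr, per_node_tb_hr):
    """Smallest node count whose aggregate throughput meets the target."""
    return math.ceil(target_tb_hr / per_node_tb_hr)


if __name__ == "__main__":
    target = 14.7  # TB/hr: the DD890 with DD Boost, as quoted above
    for name, rate in PER_NODE_TB_HR.items():
        n = nodes_needed(target, rate)
        total = aggregate_tb_hr(n, rate)
        print(f"{name}: {n} nodes for {total:.1f} TB/hr (target {target})")
```

Under that linear-scaling assumption, matching the DD890's DD Boost figure takes nine Exagrid nodes, six NEC nodes, or three SEPATON nodes.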

Are the throughput numbers confusing? You bet. It’s not an apples-to-apples comparison. What’s more confusing is everyone claiming “fastest” status … with qualifiers, of course. EMC, for example, touts “fastest single controller” and “fastest inline deduplication” in its most recent announcement, deflecting comparison with multi-node solutions or those that deduplicate “post-process.” Does it matter if it takes four nodes to match the throughput of a single-controller solution? In my opinion, only if it’s more expensive. Until vendors start promoting a cost/TB/hr metric, buyers will have to do the comparison themselves.
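For buyers doing that comparison themselves, a cost/TB/hr calculation might look like the sketch below. Every price and throughput figure in it is a made-up placeholder, not a vendor quote, and the function and configuration names are hypothetical; plug in the numbers from your own quotes.

```python
def cost_per_tb_hr(total_cost, throughput_tb_hr):
    """Dollars paid for each TB/hr of rated throughput."""
    if throughput_tb_hr <= 0:
        raise ValueError("throughput must be positive")
    return total_cost / throughput_tb_hr


# HYPOTHETICAL configurations: (label, nodes, TB/hr per node, price per node).
# None of these figures come from a vendor; they are placeholders.
CONFIGS = [
    ("single controller", 1, 8.0, 400_000),
    ("four-node grid", 4, 2.0, 90_000),
]


def rank_by_cost(configs):
    """Return (label, cost per TB/hr) pairs, cheapest first."""
    results = []
    for label, nodes, per_node_tb_hr, price_per_node in configs:
        total_cost = nodes * price_per_node
        total_tb_hr = nodes * per_node_tb_hr  # assumes linear scaling
        results.append((label, cost_per_tb_hr(total_cost, total_tb_hr)))
    return sorted(results, key=lambda pair: pair[1])


if __name__ == "__main__":
    for label, cost in rank_by_cost(CONFIGS):
        print(f"{label}: ${cost:,.0f} per TB/hr")
```

With these placeholder numbers, both configurations deliver 8 TB/hr, and the four-node grid comes out cheaper per TB/hr. Swap the prices and the answer flips, which is exactly why the metric matters more than the node count.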