AES Algorithm Efficiency

AES Algorithm Efficiency

This page gives the results I have obtained in implementing  AES candidate algorithms from an efficiency perspective.

Results

The figures given in the table below are in clock cycles for key set-up and clock cycles per block for encryption and decryption.  The rows giving speed figures provide encryption and decryption speeds (and their average) for 128 bit keys on the 200 MHz Pentium Pro reference platform without any I/O byte/word/block inversion costs. The key set-up timings are for computing both the encryption and decryption key schedules.   For Rijndael and CRYPTON the cost of the key schedule for decryption is much higher than that for encryption so two figures are given for these algorithms (the encryption and decryption key schedule cycle counts respectively).   The encryption and decryption speeds for 128 bit keys and full input/output byte order conversion are provided below the main figures where these are different (note, however, that these are not fair for comparison purposes since many of the AES algorithms chose arbitrary byte order conventions).

Here (as a result of many requests) are the source code files used to obtain these results:

For comparison DES in C (des.h, des.c and key.c), optimised for the Pentium Pro, requires 456 cycles for each 64 bit data block, corresponding to a speed of 28 Mbits/second on this platform running at 200 Mhz (faster than many AES design teams appear to believe).

Algorithm RC6 Rijndael MARS Twofish CRYPTON CAST E2 Serpent DFC HPC SAFER LOKI FROG DEAL MAGENTA
Source Code rc6.c rijndael.c mars.c twofish.c crypton.c crypton1.c cast.c e2.c serpent.c dfc.c hpc.c safer.c loki.c frog.c deal.c magenta.c
Key Set-up (128) 1632 305:1389 4316 8414 531:1369 744:1270 4333 9473 2402 5222 120749 4278 7430 1416182 8635 30
Key Set-up (192) 1885 277:1595 4377 11628 539:1381 748:1284 4342 9540 2449 5203 120754 7426 7303 1422837 8653 25
Key Set-up (256) 1877 374:1960 4340 15457 552:1392 784:1323 4325 9913 2349 5177 120731 11313 7166 1423613 11698 37
Encrypt (128) 270 374 369 376 474 476 633 687 952 1203 1429 1722 2134 2417 2339 6539
Encrypt (192) 267 439 373 376 473 469 633 696 952 1288 1477 2555 2138 2433 2358 6531
Encrypt (256) 270 502 369 381 469 470 639 691 952 1178 1462 3391 2131 2440 3115 8711
Decrypt (128) 226 352 376 374 474 470 634 691 914 1244 1599 1709 2192 2227 2365 6534
Decrypt (192) 235 425 379 374 470 470 633 693 914 1235 1599 2530 2189 2255 2363 6528
Decrypt (256) 227 500 376 374 483 469 638 706 914 1226 1526 3338 2184 2240 3102 8705
Encryption speed (128) 94.8 68.4 69.4 68.1 54.1 53.8 40.4 37.3 26.9 21.3 17.9 14.9 12.0 10.6 10.9 3.9
Decryption speed (128) 113.3 72.7 68.1 68.4 54.1 54.5 40.4 37.0 28.0 20.6 16.0 15.0 11.7 11.5 10.8 3.9
Mean speed (128) 103.2 70.2 68.7 68.3 54.1 54.1 40.4 37.2 27.4 20.9 16.9 14.9 11.8 11.0 10.9 3.9
Encryption speed (128, full)             37.8 35.1 26.0 20.5 17.3 14.6 11.8 10.4    
Decryption speed (128, full)             38.3 35.4 26.8 20.9 16.0 14.8 11.5 11.2    
Min. Secure Rounds 20 8 10 12 11 11? 10 10 17 9 8? 7 >36

8?

6 >10
Min. Secure Cycles 245 344 240 300 440 440 550 590 483 1353 1466 1498 4840

2500

2550 11000
Diffusion, 1   round 15 17 21 17 12 12 41 39 13 17 64 57 13 47 17 18
Diffusion, 2 rounds 42 64 47 49 48 48 64 48 58 49 64 64 41 64 49 49
Overall Diffusion Count 420 332 370 388 288 288 495 468 940 192 512 468 328 376 147 147

Comments

1. The order of the algorithms in the table is based on mean encryption/decryption speed for 128 bit keys.

2. Different designers have chosen a different number of rounds for their algorithms and this clearly impacts in a major way on performance.  This is illustrated by Serpent in particular, which has a much higher diffusion count than other algorithms.

3. Both DFC and HPC make use of 64 bit arithmetic. Since this will be the norm by the time the AES algorithm is selected it makes little sense to judge the speed of these algorithms on 32 bit architectures.

4. I suspect that there is quite a bit that can still be done for E2 and SAFER+ since I have not spent a great deal of time on these.  I am grateful for suggestions from Kazumaro Aoki of NTT Japan for speeding up my E2 implementation.

5. For TWOFISH I have pushed encryption/decryption speed without concern for the impact on the cost of the key schedule or memory use (I employ 8K+ bytes of table space).   There are many optimisation alternatives for this algorithm and major reductions in key schedule and memory cost can be achieved if high volume encryption/decryption speed is not the primary concern. My thanks to Niels Ferguson for suggesting some useful speed ups for my original implementation.

6.  Where an algorithm meets a superset of the AES requirements I have only implemented the AES compliant elements.  Hence Rijndael is only implemented for a block size of 128 bits and for HPC only HPC(medium) has been coded. 

7. For Crypton I have implemented both the version submitted for AES and the revised version 1 (issued on December 21st 1998). 

8. The DEAL timings are based on the DES source code in which the IP/FP permutations have been implemented in DEAL and not in DES (as described in the DEAL specification).  

9.  I have implemented Loki97, Frog and MAGENTA for completeness although it appears that they have all been broken. 

Security versus Performance Issues (HEALTH WARNINGS)

The work presented here is primarily about performance but it is not sensible to consider this in complete isolation from security since there are inevitably trade-offs between these attributes in the real world. One obvious example is the number of rounds employed by each algorithm and another is the quality of each round in achieving the 'mixing' that the cipher is intended to provide.   The lower rows in the table are a first attempt to consider the ways in which security and performance are being traded off by each of the AES algorithms.

Diffusion.  The lower rows dealing with diffusion should not be seen as a direct measure of algorithm security.  They simply give an indication of how good the algorithms are at spreading out the influence of input bits on output bits.  Obviously, diffusion counts say nothing about the 'quality' of the diffusion achieved and this alone means that this sort of analysis needs to be treated with extreme caution.

Minimum Number of Secure RoundsEli Biham has suggested that AES algorithm performance should be measured by timing the 'minimum number of secure rounds' for each candidate - that is the estimated number of rounds needed to make a brute force key search the most efficient form of attack.  This is a controversial suggestion that some AES participants do not accept even in principle  (for example, see the paper by Bruce Schneier and the Twofish team).  In my view, however, we have to look at this in two steps - it seems to me that the principle has merit but that there is (and may continue to be) a problem in obtaining impartial and widely accepted values for the minimum number of secure rounds for each of the AES algorithms.  

At the moment the table uses figures equivalent to those suggested by Eli Biham.   However as an AES algorithm contributor, Eli will not been seen by all as an impartial observer and it would hence be preferable to have a wider community basis for setting the values to be used in such a process.  For the present therefore it is important to treat these figures (and any conclusions derived from them) with caution.

Implementation Assurance.  It seems certain that there will be a significant number of the AES algorithms that will be judged to be secure at an algorithm level.  In practice, however, most failures in cryptographic systems derive not from weaknesses in the algorithms used but rather from the exploitation of subtle flaws in the way the algorithms are implemented or through the exploitation of interactions between algorithm implementations and the environments in which they operate.  This means that it is vital to judge implementations not simply from the perspective of the performance that they provide but also from the perspective of 'implementation assurance' - can we be confident that the implementation does what the algorithm requires, no more and no less, and can we also be confident that it does not interact with the wider environment in ways that can be exploited to undermine the security that the algorithm itself is otherwise capable of providing.

To answer these questions we need much more hard evidence derived from the implementation of the AES candidates in environments that are representative of practical use. In particular we have very little evidence at the moment of how the AES candidates behave in low end environments of the kind that may apply in much of electronic commerce and consumer applications.

We also need to understand whether the different AES algorithms are architecturally different in their ability to support implementation assurance techniques.  None of the existing specifications are precise enough to support formal design and analysis methods and it even seems possible that at least some of the algorithms on offer are sufficiently complex to make these techniques difficult or even impossible to apply.    This may also be true of other approaches for implementation assurance.    This is an area where much more work is needed before any final algorithm choices are made.

Performance Measurement Approach

All the implementations provided and discussed here have been coded from scratch using the AES specification documents without prior reference to the source code provided by the design teams.  However, comparisons with the reference code has sometimes been necessary because the specification and reference code for some algorithms are not consistent.

I have tried to keep a fairly consistent style across the source code to aid fair comparison.   All the routines have been biased towards encryption and decryption speed without regard to key set-up or to memory use in tables.  I have also used VTune, the Intel performance tuning software, to look for optimised code sequences.  This was kindly supplied by Intel and I much appreciate their generous support.  The code has been optimised for the Pentium Pro/II and it is hence likely to be relatively poor on the Pentium.  Although I have pushed for speed, I have not used some techniques where the resulting source code has become especially obscure.

All the algorithms and results reported are based on implementation using Microsoft Visual C++ version 6.    I have generally coded in standard C but I have used non-standard features where this is reasonable and would be done in practice.  For example I have used the intrinsic circular rotate operations that Microsoft C provides.  Some of the algorithms are endian neutral, others are definitively either big-endian or little-endian.  Most of the difficulties I have had with implementation relate to confusion about endianness.  I am running on a little endian machine (a Pentium II) and I have checked out the code in this context only.  My inputs and outputs to algorithms are in 32 bit words, not bytes, and this impacts on the byte numbering conventions that are used.   The algorithm source code provided below contains code to invert byte order on input and output to match the standard test values but timings do not include these sections of code.  

Please bear in mind that there may be mistakes in the source code - please let me know if you find any. If you use any of it, be especially careful about endianess since I have not designed to eliminate such problems.

I am happy for the source code given here to be used provided that any conditions set by the algorithm designers are complied with.  My only additional requirement if you do use it is that you acknowledge its origin.  At the request of NTT, E2 source code is available for academic study purposes only.

Timing

Obtaining accurate and repeatable timings has proved more difficult than I expected.   After many experiments I have settled on the use of the Pentium Pro/II cycle counter as the basis for timing.  For each of the three routines in question (key set-up, encrypt and decrypt) I time two code sequences containing respectively single and double calls to the routines in question (with random key and data inputs).  After running the routines to eliminate initialisation effects I then time these two sequences 100 times each and take the difference between their minimum timings as the cycle count for the routine in question.   My thanks to Doug Whiting for a useful exchange of views on timing issues.

Fair Algorithm Comparison - Timing and Byte Order

There is scope for confusion in the definition of the AES interface on whether any required changes to byte order lie are internal or external to this interface.  The programming interface is byte oriented and suggests that byte order changes are not needed. But the test vectors are 'numeric' and this suggests that such changes are needed as a part of the AES algorithms.  I cover this in more detail here.

A consequence of this uncertainty is that some design teams have implemented byte order inversion for input and output blocks whereas others have not.  Since byte order change involves processing costs this can impact on performance comparisons.

I am running on a little-endian machine (Pentium II) and this means that algorithms which involve byte, word or block order inversion ('big-endian' algorithms)  are at a disadvantage if their speed is compared with 'little-endian' algorithms in my environment.  Equally if I were using a big-endian machine it would be 'little-endian' AES algorithms would be put at a disadvantage.  In order to obtain fair comparisons I have therefore chosen to base my timings on those that can be achieved when an algorithm is run on a machine that matches its own endianness.  In practice this means that I do not include the processing cost of byte, word or block order inversions on input and output in my timing measurements.  For some algorithms there are more deeply buried endian issues but, fortunately, these turn out to be small in comparison with the costs involved in input/output byte order changes. I hence believe that the timings given here are a fair assessment of what the AES algorithms can achieve when run in environments that match their own design endianess.  

Since other comparisons, for example, that being compiled by Helger Lipmaa, include results for both big and little endian processors it is important that they use full timing data including endian conversion costs (however care has to be taken to ensure that results on big and little endian machines are given equal weight).  The figures I have provided for Helger are hence full timings that include I/O conversion costs.

Diffusion

I have attempted to measure how effective the algorithms are at mixing up bits by measuring their 'diffusion'.  I do this as follows.  After setting up random key and data input blocks I run the encryption round function for the algorithm in question once or twice.  I then change just one bit in the data input block and count how many bits in the output block change. I do this for each bit in the input block and I repeat this for a large number of random inputs to obtain average values.  I then calculate for each input bit how many output bits change on average; I also calculate for each output bit how many input bits change its value (these two outputs are very similar on average but they are sometimes very different for particular bits).  Since some algorithms operate using 'half-rounds' I use the best of the 1 and 2 round diffusion values to calculate an overall diffusion count for each algorithm.

Acknowledgements

A number of people have contributed ideas for improvements in the source code and the analysis presented here.  I gratefully acknowledge the contributions made by the following people:

Revision History

Changes:   29th October        added ranking rows to table and additional comments
31st October corrected Twofish cycle count per round figure (thanks to Doug Whiting for finding this)
31st October added diffusion per round values
3rd November changed diffusion rows and added a health warning
12th November added performance estimates for the minimum number of secure rounds
30th November added results for HPC and DEAL
6th December revised (improved) RC6 speed
11th December revised speed figures and simplified the table layout
13th December added Frog and Loki97
15th December added MAGENTA - completed implementation of all 15 AES candidates
19th December added improved Serpent figures based on new S box decompositions
19th December corrected CAST figures (error spotted by Helger Lipmaa)
20th December expanded on 'endian' timing issues (at Helger Lipmaa's suggestion)
21st December corrected error in the rc6 implementation and changed timing figures
23rd December added figures for the revised version of Crypton
3rd January 1999 improved implementations of RC6 and MARS
13th January 1999 corrected an error in MARS spotted by Louis Granboulan
14th January 1999 published final (pre-AES2) timing data (including improved Twofish figures)
20th January 1999 updates to timings (improved DEAL, DES and Rijndael)
20th January 1999 changed page format to make results easier to access
22nd January 1999 improved timings for Twofish key setup
23rd January 1999 minor updates to timing data and added figures for 'full cost' encryption
28th January 1999 added text on the (as yet) inadequately discussed issue of implementation assurance
16th February 1999 improved DFC speed

Back to Brian Gladman's Home Page