The radix sort is a bin-based sort. This means it doesn't compare elements against each other, so it can, and probably will, run under the O(N*log N) lower limit for comparison-based sorts. Bin-based sorts work best on fixed-length data. A bin sort works by creating an array of bins, one bin for each unique element value. Then, using each element as an index into the bin array, you append the key to that bin.


Code:
  ElementCount: Number of unique values a key can take
  source:       List of keys
  dest[ ElementCount ]:  ElementCount lists of keys. Each list should have
                         enough space to hold source.length elements.

In pseudo-code, the sort looks like this:

  for i = 0 to source.length-1 do
    dest[ source[i] ].append( source[i] )
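
In Pascal, a runnable version of that sketch might look like this (a minimal sketch; the name BinSort is my own, and since we are sorting bytes, ElementCount is 256):

[pascal]
procedure BinSort(var source : array of byte);
var
  Bins     : array [0..255] of array of byte; // one bin per unique value
  BinCount : array [0..255] of integer;       // keys placed in each bin
  index, bin, outPos : integer;
begin
  fillchar(BinCount, sizeof(BinCount), 0);
  // worst case: every key lands in the same bin
  for bin := 0 to 255 do
    SetLength(Bins[bin], Length(source));
  // scatter: use each key as an index into the bin array
  for index := 0 to High(source) do
    begin
    Bins[ source[index] ][ BinCount[ source[index] ] ] := source[index];
    inc( BinCount[ source[index] ] );
    end;
  // gather: reading the bins back in order yields the sorted data
  outPos := 0;
  for bin := 0 to 255 do
    for index := 0 to BinCount[bin]-1 do
      begin
      source[outPos] := Bins[bin][index];
      inc(outPos);
      end;
end;
[/pascal]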
As you can see, this is very fast (it just moves data around), but incredibly memory intensive. It requires (number of unique key values) * (length of the source data) * (size of each key) bytes of memory. So to sort:
100 bytes would require: 2^8 * 100 = 25 kb
10k bytes would require: 2^8 * 10000 = 2.4 mb
100 longwords would require: 2^32 * 100 * 4 = 1.56 terabytes
10k longwords would require: 2^32 * 10000 * 4 = 156 terabytes

As you can see, it requires a vast amount of memory, enough that sorting even a small number (100) of 32-bit numbers just isn't going to work. Heck, sorting 100 bytes is prohibitively expensive, memory-wise. But there is a way to make it usable. One of the bigger problems is that it requires ElementCount * source.length elements' worth of bin space. This is a waste, since at the very most only one bin can ever be full at any time.

So we need to count how many of each element value we have.

Assuming we are dealing with bytes:
Code:
var
  Histogram : array [0..255] of integer;
...
// Initialize the histogram to all zeros  
fillchar(Histogram[0], sizeof(Histogram), 0 );
// Read through the source data & count the elements & store them in the histogram
for index := 0 to length(source)-1 do
  inc( Histogram[ source[index] ] );
This gives us a table which tells us exactly how much memory we need, which is exactly the length of the source data. So instead of allocating lots of little buffers, we allocate one big buffer and output all the data there. This is probably the best we can do to reduce memory requirements. But now we need some way to tell us where to move the data. We need some type of offset table:

Code:
var
  OffsetTable : array [0..255] of integer;
...
OffsetTable[0] := 0;
for index := 1 to 255 do
  OffsetTable[ index ] := OffsetTable[ index-1 ] + Histogram[ index-1 ];
This creates a table which allocates a block of space in the dest buffer for each byte value. Byte values which don't exist in the source data don't get a unique location, but since they don't exist in the source data, those offsets will never be used. However, to ensure each byte that does get moved gets a unique spot, we need to update the offset table. The easiest way is to do that as we are moving the data.
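
For example (the values here are mine, purely for illustration), sorting the four bytes 3, 1, 3, 0:

Code:
  Histogram   = 1, 1, 0, 2, 0, 0, ...  (counts for the values 0, 1, 2, 3, ...)
  OffsetTable = 0, 1, 2, 2, 4, 4, ...  (value 0 starts at slot 0, value 1 at
                                        slot 1, and the two 3s fill slots 2 & 3)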

Now the code to actually move the data in order:
Code:
for index := 0 to length(source)-1 do
  begin
  // move the data
  dest[ offsetTable[ Source[ index ] ] ] := Source[ index ];
  // update the offset table, so the next byte that is equal to 
  // the current byte gets a unique location
  inc( offsetTable[ Source[ index ] ] );
  end;
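Putting the three steps together, a complete sort for an array of bytes might look like this (a minimal sketch; the name ByteSort is mine):

[pascal]
procedure ByteSort(var source : array of byte);
var
  Histogram   : array [0..255] of integer;
  OffsetTable : array [0..255] of integer;
  dest        : array of byte;
  index       : integer;
begin
  if Length(source) < 2 then exit;
  SetLength(dest, Length(source));
  // step 1: count how many times each byte value occurs
  fillchar(Histogram, sizeof(Histogram), 0);
  for index := 0 to High(source) do
    inc( Histogram[ source[index] ] );
  // step 2: turn the counts into starting offsets into dest
  OffsetTable[0] := 0;
  for index := 1 to 255 do
    OffsetTable[index] := OffsetTable[index-1] + Histogram[index-1];
  // step 3: move each byte to its slot, bumping the offset as we go
  for index := 0 to High(source) do
    begin
    dest[ OffsetTable[ source[index] ] ] := source[index];
    inc( OffsetTable[ source[index] ] );
    end;
  // copy the sorted data back over the source
  for index := 0 to High(source) do
    source[index] := dest[index];
end;
[/pascal]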
But this is only for an array of bytes, and it isn't practical to extend the offset table & histogram to accommodate the larger basic types. However, we are sorting via the most fundamental element: a byte. Wouldn't it be good to do this on the larger types, since they are just a collection of bytes? The answer is yes, you can.

A basic radix sort to sort an array of longwords (4 bytes, 32 bits unsigned):

[pascal]
Type
  TArrayOfLongword = Array of longword;

Procedure byteRadixSort(ShiftCount : integer; source, Dest : TArrayOfLongword);
var
  Histogram   : array [0..255] of integer;
  OffsetTable : array [0..255] of integer;
  Index       : integer;
  Value, ValueToIndex : Longword;
begin
  // build the histogram
  fillchar(Histogram[0], sizeof(Histogram), 0 );
  for index := 0 to length(source)-1 do
    begin
    Value := source[index];
    ValueToIndex := ( Value shr ShiftCount ) and $FF;
    inc( Histogram[ ValueToIndex ] );
    end;
  // build the offset table
  OffsetTable[0] := 0;
  for index := 1 to 255 do
    OffsetTable[index] := OffsetTable[index-1] + Histogram[index-1];
  // do the inner loop of the radix sort
  for index := 0 to length(source)-1 do
    begin
    Value := source[index];
    ValueToIndex := ( Value shr ShiftCount ) and $FF;
    Dest[ OffsetTable[ ValueToIndex ] ] := Value;
    inc( OffsetTable[ ValueToIndex ] );
    end;
end; {byteRadixSort}

Procedure RadixSort(Source : TArrayOfLongword);
var
  tmp : TArrayOfLongword;
begin
  if length(Source) < 2 then exit;
  // create the tmp buffer
  SetLength(tmp, length(Source));
  // sort based on the 1st (least significant) byte: source -> tmp
  byteRadixSort(0, Source, tmp );
  // sort based on the 2nd byte: tmp -> source
  byteRadixSort(8, tmp, Source );
  // sort based on the 3rd byte: source -> tmp
  byteRadixSort(16, Source, tmp );
  // sort based on the 4th byte: tmp -> source, leaving the result in Source
  byteRadixSort(24, tmp, Source );
end; {RadixSort}
[/pascal]
Since the byteRadixSort procedure only sorts based on byte values, we need to extract the byte we are going to sort on, which is what the ShiftCount parameter does. We take advantage of the fact that each pass is stable (keys with equal bytes keep their relative order), so we can sort in multiple passes, from the least significant byte up to the most significant, instead of in one single pass.
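
A quick usage sketch (the test values are mine):

[pascal]
var
  data : TArrayOfLongword;
begin
  SetLength(data, 4);
  data[0] := 42;
  data[1] := $FFFFFFFF;
  data[2] := 0;
  data[3] := 7;
  RadixSort(data);
  // data now holds 0, 7, 42, $FFFFFFFF
end;
[/pascal]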

This can be optimized significantly. It currently does 8 reads & 4 writes of the source data (each pass reads it twice, once for the histogram and once for the move); is there any way to improve it? The answer is yes. All the histograms & offset tables can be calculated at once, in a single read of the source data. A separate offset table array isn't even needed; you can transform the histogram table into the offset table in the same memory space. This alone drops the number of reads down to 5! And since the main limitation is how fast you can read the memory in, you might as well do some calculations along the way. Thus the loop which builds all the histograms can detect when a pass is not needed (all keys share the same byte in that position) and even when the data is already sorted. Each skipped pass saves one read and one write of the data, so this can cut off up to 4 of the reads and up to 4 of the writes. Another optimization is to hint to the processor to preload the source data into the cache during the histogram building & the actual inner loop. This will drastically reduce cache misses during reads, which will help a lot.
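
As a rough sketch of the combined histogram pass (illustration only; the procedure name and the PassNeeded flags are mine):

[pascal]
procedure BuildAllHistograms(const source : TArrayOfLongword);
var
  Hist       : array [0..3, 0..255] of integer; // one histogram per byte position
  PassNeeded : array [0..3] of boolean;
  Value      : longword;
  index, pass : integer;
begin
  fillchar(Hist, sizeof(Hist), 0);
  // a single read of the source fills all four histograms at once
  for index := 0 to length(source)-1 do
    begin
    Value := source[index];
    inc( Hist[0,  Value          and $FF ] );
    inc( Hist[1, (Value shr 8)   and $FF ] );
    inc( Hist[2, (Value shr 16)  and $FF ] );
    inc( Hist[3, (Value shr 24)  and $FF ] );
    end;
  // a pass is a no-op when one bin holds every key, i.e. all keys
  // share the same byte in that position
  for pass := 0 to 3 do
    begin
    PassNeeded[pass] := true;
    for index := 0 to 255 do
      if Hist[pass, index] = length(source) then
        PassNeeded[pass] := false;
    end;
  // the PassNeeded flags would then control which byteRadixSort passes run
end;
[/pascal]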