Nearest Power of 2 quickly [Archive] - Pascal Game Development

jdarling

10-01-2007, 02:02 PM

For my Tile Splicer I need a routine to calculate the nearest power of 2 for the output image. I know that the following code will work just fine, but it seems that their really should be a faster way (btw: I know that a shift is better then a *, but the optimizer takes care of it in this case).

if frmGenerate.SizePower2 then
begin
p2 := 2;
while p2 < imgOut.Width do
p2 := p2 * 2;
imgOut.Width := p2;
p2 := 2;
while p2 < imgOut.Height do
p2 := p2 * 2;
imgOut.Height := p2;
end;

I know this is a simple, stupid question, but I've been stuck for a bit trying to find a better way :)

WILL

10-01-2007, 02:57 PM

Have you tried using a pre-calculated table?

You can make it start at 2 and run up to whatever the highest size would be then you'd simply only have to run through the array until you find a size that fits. :)

dmantione

10-01-2007, 03:28 PM

This is a typical situation where some assembler can be usefull. Of course, always {$ifdef} the assembler code and put a Pascal version in the {$else} part to not hurt portabilty.

if frmGenerate.SizePower2 then
begin
w:=imgout.width;
h:=imgout.height;
asm
mov ebx,w
bsr ecx,ebx
mov ebx,1
inc ecx
shl ebx,ecx
mov w,ebx

mov ebx,h
bsr ecx,ebx
mov ebx,1
inc ecx
shl ebx,ecx
mov h,ebx
end;
end;

10-01-2007, 04:32 PM

Here's an alternative code I've been distributing in Asphyre package:

function NextPowerOfTwo(Value: Integer): Integer;
begin
Result:= 1;
asm
xor ecx, ecx
bsr ecx, Value
inc ecx
shl Result, cl
end;
end;

function CeilPowerOfTwo(Value: Integer): Integer;
asm
mov ecx, eax
xor eax, eax
dec ecx
bsr ecx, ecx
cmovz ecx, eax
setnz al
inc eax
shl eax, cl
end;

JernejL

10-01-2007, 07:03 PM

OH for crying out loud :P Here are two no-assembly, portable algorythms i use, first one i came up with, it's based on idea of second algorythm, which is from wikipedia (or the other way round).

if (number and (number - 1) <> 0) // FAST POT CHECK

trunc(power(2, trunc(Log2(number)))); // MAKE NUMBER A POT, ROUND UP

dmantione

10-01-2007, 07:15 PM

:shock: :shock: Do you have any idea how slow a trunc is?

No, for speed you need to do it with integer logic. This is a way how you can do it in Pascal:

dec(v);
v:=v or (v shr 1) or (v shr 2) or (v shr 4) or (v shr 8 ) or (v shr 16);
inc(v);

... but even this one is slower than the assembler one. The ancient art proves itself usefull once more!

JernejL

10-01-2007, 09:09 PM

:shock: :shock: Do you have any idea how slow a trunc is?

why would trunc be slow? it executes on the FPU, right?

dmantione

10-01-2007, 09:32 PM

:shock: :shock: Do you have any idea how slow a trunc is?

why would trunc be slow? it executes on the FPU, right?

This is the trunc implementation in FPC

{$define FPC_SYSTEM_HAS_TRUNC}
function fpc_trunc_real(d : ValReal) : int64;assembler;compilerproc;
var
oldcw,
newcw : word;
res : int64;
asm
fnstcw oldcw
fwait
movw oldcw,%cx
orw $0x0f00,%cx
movw %cx,newcw
fldcw newcw
fldt d
fistpq res
fwait
movl res,%eax
movl res+4,%edx
fldcw oldcw
end;

Note all the hackery with the FPU control word, necessary to make the FPU truncate instead of round. Tinkering with the fpucw is terribly expensive.

cragwolf

11-01-2007, 01:13 AM

Never mind.

Nitrogen

11-01-2007, 09:24 AM

I cant think of any situation where a power of 2 calculation is not fast enough...

You normally use this when loading textures in which case the disk access is going to slow it down no matter what. I know jdarling didnt ask for the fastest thing out there, but it's a classic example of the 90% - 10% optimisation mantra:

90% of the code you write will be executed once or twice
the other 10% will be executed very many more times (thousands of times per second?)

And to optimize code that falls into the first 90% is a waste of time. You need to figure out if this code really falls into that last 10% before you start worrying that the trunc() procedure is not fast enough :wink:

Just for luck, this is how I implemented it.. using the suggested lookup-table.

const
Pow2: array[0..12] of integer = (2,4,8,16,32,64,128,256,512,1024,2048,4096,8192 );

Function GetNextPow2(Wid: integer): integer;
var I: integer;
begin
Result := Wid;

for I := 1 to 12 do
begin
If Pow2[I] >= Wid then
begin
Result := Pow2[I-1];
exit;
end;
end;

end;

Function GetPrevPow2(Wid: integer): integer;
var I: integer;
begin
Result := Wid;

for I := 12 downto 1 do
begin
If Pow2[I] <= Wid then
begin
Result := Pow2[I-1];
exit;
end;
end;

end;

Paizo

11-01-2007, 09:24 AM

with recent compilers improvements do you think that assembler is still faster?
I mean that compilers can improve execution speed on differents architectures (ie amd,intel...), assembly can't be optimized by compiler (i think). just for know what you think about :)

dmantione

11-01-2007, 09:52 AM

Oh yes, especially in this cases like this where assembler wizard tricks can be applied.

Generally, compilers do a good job, but still... FPC, Delphi, GCC, Borland C, MSVC are beaten easily by assembler programmers. There exist a few compilers that are hard to beat, for example Pathscale really kicks ass.

dmantione

11-01-2007, 09:56 AM

Just for luck, this is how I implemented it.. using the suggested lookup-table.

What is the advantage over the original method? I.e. your algorithm compares with 2,4,8,16,32, which was exactly what the original did, only yours needs more memory.

cairnswm

11-01-2007, 11:00 AM

Now this is how to do it with a lookup table :)

Const
Twos : Array[1..11] of LongInt = (1,2,4,8,16,32,64,128,256,512,1024);

var
N : Array[1..1024] of Integer;

procedure PreparePowerTwos;
Var
I,L : Integer;
begin
L := 1;
For I := 1 to 1024 do
Begin
If Twos[L+1] - I < I - Twos[L] then
L := L+1;
N[I] := L;
End;
end;

Function ClosestTwo(InNum : Integer) : Integer;
begin
Result := Twos[N[InNum]];
end;

Seems fast :)

Nitrogen

11-01-2007, 11:38 AM

What is the advantage over the original method? I.e. your algorithm compares with 2,4,8,16,32, which was exactly what the original did, only yours needs more memory.

Well, I never claimed it was better, just showing a different method.
But I do think it's better to "waste" a whole 52 bytes of memory rather than doing a multiply operation for every power.

But, it's still a fun exercise to think up ridiculously optimised functions, so keep em coming.

Now this is how to do it with a lookup table :)

Cheater! ;)

Now that's more of a waste of memory (a whole 4.14kb *shock* *horror*).. But should run very nicely!

I should also point out that most games these days are using up to 2048x2048 textures so you actually need a massive 8.4kb of memory...

dmantione

11-01-2007, 12:23 PM

What is the advantage over the original method? I.e. your algorithm compares with 2,4,8,16,32, which was exactly what the original did, only yours needs more memory.

Well, I never claimed it was better, just showing a different method.
But I do think it's better to "waste" a whole 52 bytes of memory rather than doing a multiply operation for every power.

For a real multiply, yes. But a multiply with 2 wil be translated to a "add register,register", which is one of the cheapest operations in the CPU. Luckily a table lookup is in general as fast, unless you have a cache miss, which not something to really bother about in this kind of code. Therefore, in practise both algorithms will perform similar.

Nitrogen

11-01-2007, 12:32 PM

I ran a few benchmarks, using the average as reported by GetTickCount over 5 runs of 1,000,000 randomly generated numbers each.

5th place was Delfi's trunc(power(2,trunc(log2(number)))); with a dreadful 146.80 ms average :)

4th place was my code :oops: with a 40.60ms average...

3rd place was jdarling's original code! Well Done! with an average of 25.00ms.

2nd place was Lifepower's NextPowerOfTwo() at 18.80ms
(which returns the next highest power if given a power, so 1024 returns 2048...)

And the winner is... Cairnswm! with an average of just 6.20ms, calling the PreparePowerTwos function once at the start of every run.

For a real multiply, yes. But a multiply with 2 wil be translated to a "add register,register", which is one of the cheapest operations in the CPU

I guess I stand corrected. Lookups are not really faster in this case unless you go all the way!

dmantione

11-01-2007, 01:04 PM

Can you benchmark my Pascal solution as well:

dec(v);
v:=v or (v shr 1) or (v shr 2) or (v shr 4) or (v shr 8 ) or (v shr 16);
inc(v);

?

Paizo

11-01-2007, 01:10 PM

pretty nice read benchmarks result :D
like a little competition :D

jdarling

11-01-2007, 01:40 PM

Well, ask and you shall recieve :). Seems there are more ways to calculate the nearest power of 2 then I would have thought.

In the end, I realized that the code was (and is) one of the least ran bits, so it didn't make much sense to optimize it just yet. You can see that in the download of the tile splitter in my other post. Though, I might take a look at a different implementation down the road a bit.

Since I am going for cross platform code, I wanted to stay away from assembly code as much as possible.

Thanks everyone, and I'd love to see some compos here on PGD for fastest implementations of different common tasks. In fact, WILL and I were talking over it on IM the other day.

BTW: The fastest way to multiply by 2 is a shift, fastest way to divide by 2 is also a shift :). Single clock cycle to execute (considered free). An add takes a minimum 2 clock cycles (also considered free by many people).

dmantione

11-01-2007, 01:50 PM

Adding uses 1 cycle on any cpu since the Pentium 1. Shifting too, with the exception of the Pentium 4, where it is very slow. Therefore, FPC has been changed to use an add for a multiply by 2. FPC will not avoid a shift for other powers of 2, as this would hurt performance on other cpu's.

JernejL

11-01-2007, 02:00 PM

seems i need new trunc implementation, any ideas how to make trunc faster? :D

dmantione

11-01-2007, 02:11 PM

You can use SSE3. But that is not an acceptable solution yet. Trunc performance is a known issue unfortunately...

Nitrogen

11-01-2007, 02:36 PM

seems i need new trunc implementation, any ideas how to make trunc faster? :D

Are you sure it's the trunc()'s causing the problems?
I would have guessed that nasty Log2() would slow it down quite a bit?

Can you benchmark my Pascal solution as well:
Sure, I thought it was just an implementation of the trunc() function, but I'll run it through the test when I get back from work...

JSoftware

11-01-2007, 02:47 PM

instead of using integers in cairnswm's N lookup table, couldn't you use bytes and get both a performance and size win?

Robert Kosek

11-01-2007, 02:57 PM

This is a rather interesting and entertaining thread. :)

I should also point out that most games these days are using up to 2048x2048 textures so you actually need a massive 8.4kb of memory...Oh my! :o You mean that my new system with 2mb of memory might not be enough? Oh wait... typo, I meant gig, not meg. :lol:

Though it you are programming for an embedded system that 4-8kb of memory does count, our systems (even embedded) have ever increasing capabilities and RAM.

Cairnswm: I like your method of precalculation. :D

11-01-2007, 05:07 PM

I cant think of any situation where a power of 2 calculation is not fast enough...
I've used this in Perlin noise in real-time. If you can't think on some situation, it doesn't mean that such situation doesn't exist.

Now this is how to do it with a lookup table :)
You'd extend this method up to High(Integer), instead of 2048. You never know. ;)

2nd place was Lifepower's NextPowerOfTwo() at 18.80ms
(which returns the next highest power if given a power, so 1024 returns 2048...)
Have you checked CeilPowerOfTwo? Would be interesting to compare the two. :)

with recent compilers improvements do you think that assembler is still faster?
I mean that compilers can improve execution speed on differents architectures (ie amd,intel...), assembly can't be optimized by compiler (i think). just for know what you think about :)
This argument is common these days, especially in MSDN documentation, where they explain that we don't need assembler on 64-bit systems anymore. Ironical though, their own JIT doesn't optimize much of the code on the fly "to increase startup speed".

Compiler will do a better optimization job than a human, when it will be able to write code instead of yourself (pro genetic programming? :D), but then it won't be a compiler, would it? Until then, humans will still be able to perform the task better.

I should also point out that most games these days are using up to 2048x2048 textures so you actually need a massive 8.4kb of memory...
...then I guess the game will only be able to have 16 different textures on my 256 Mb video card. :cry:

WILL

11-01-2007, 06:05 PM

And the winner is... Cairnswm! with an average of just 6.20ms, calling the PreparePowerTwos function once at the start of every run.

Did I not suggest a look-up table? :lol:

Nice job William.

Though I'm sure that the speed increase would only be marginal via assembly, am I right?

btw how costly would greater than or lesser than evaluations be? I'd assume not too much as they are quite common. A straight equals evaluation would be super-cheap I'd figure am I right?

dmantione

11-01-2007, 06:10 PM

I'm not sure what you you mean. A table lookup would be about as fast in Pascal as in assembler, but that is because it is a micro-operation where not much freedom exists to do things in different ways, which also means there is too little flexibility to do assembler wizardry.

However, for many calculations there are generally enough things assembler coders can do to obtains superior results.

Any comparison, be it less than, greater than, equal or whatever translates to the cmp instruction, which evaluates in one cycle, so is extemely cheap. Because some instructions already set the flags, the cmp is not always necessary and the comparison is even free (the jump after it might still be expensive depending on correct branch prediction). For example:

if a=2 {This one requires a cmp register,2}
if a and $f<>0 {This one does not require a cmp, and therefore is free.}

cairnswm

11-01-2007, 07:38 PM

So how much of the 6ms on my solution is used to call the function when you could instead just do the return value directly :)

I use a lot of similar lookups in my code. For example on a 2D map to calculate direction I'll store it in an array and just be using the characters facing I'll access the array to work out how much the x and y values change.

I use this sort of data caching a lot in business apps I write - even to the point of caching complete tables within oracle packages to make access faster.

instead of using integers in cairnswm's N lookup table, couldn't you use bytes and get both a performance and size win?

No because byte only goes to 256 and these numbers are larger - but maybe SmallInt could save space.

I decided a long time ago that memory space was there for programmers to use and not for users to try and save - so a cahe is always cool :)

dmantione

11-01-2007, 08:05 PM

Well you can store a shift count in the byte, and then to shl, right?

JSoftware

11-01-2007, 08:12 PM

instead of using integers in cairnswm's N lookup table, couldn't you use bytes and get both a performance and size win?

No because byte only goes to 256 and these numbers are larger - but maybe SmallInt could save space.

The "Twos" array only has a range og 1..11 which means a byte will suffice :wink:

Const
Twos : Array[1..11] of LongInt = (1,2,4,8,16,32,64,128,256,512,1024);

var
N : Array[1..1024] of Byte;

just cut of 1024*3 bytes of ram

WILL

11-01-2007, 08:26 PM

I just mean that the difference between the two ways would probably not be too big as far as speed.

Nitrogen

12-01-2007, 06:24 PM

In a quite surprising move, the shl or shl or shl or shl or ... etc. code is actually faster than the assembler! That's one clever function! Unfortunately it just doesnt beat a brute force lookup!

Also interestingly, I tried cairnswm's algorithm as a function returning one integer and also straight as an inline lookup and according to the graph below, the overhead incurred by the function is around 640 millionths of a second (0.00064) per execution.

http://www.nitrogen.za.org/junk/power2.jpg