The "many images on one bitmap" approach is the most obvious and natural way to do it. Whether the images are split up into an array at load time or kept as one big bitmap depends on the API. With 3D acceleration, state changes (e.g. binding textures) are expensive, so you want to minimize them. So with 3D APIs you'd probably keep the image whole and pick out individual frames using texture coordinates (no mipmaps, of course, and probably no filtering, plus the usual 2D-in-3D requirements). It doesn't matter too much what you do with software APIs, though, so you could choose either method (keep it whole or split it into an array).
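
To make the "keep it whole" option concrete, here is a rough sketch of how you might look up one frame on a sheet. It assumes all frames are the same size and laid out in a row-major grid starting at the top-left; the names frame_uv and frame_rect are made up for illustration and don't belong to any particular API. Which corner counts as the texture origin varies between APIs, so you may need to flip v.

```python
def frame_uv(index, columns, rows):
    """Normalized texture coordinates (u0, v0, u1, v1) for one frame.

    Assumes a columns x rows grid of equally sized frames, numbered
    row by row from the top-left corner of the sheet.
    """
    if not 0 <= index < columns * rows:
        raise ValueError("frame index out of range")
    col = index % columns   # position within the row
    row = index // columns  # which row the frame sits in
    return (col / columns, row / rows,
            (col + 1) / columns, (row + 1) / rows)


def frame_rect(index, columns, frame_w, frame_h):
    """Pixel rectangle (x, y, w, h) for one frame, e.g. as the source
    rectangle of a software blit. Same grid layout as frame_uv."""
    if index < 0:
        raise ValueError("frame index out of range")
    col = index % columns
    row = index // columns
    return (col * frame_w, row * frame_h, frame_w, frame_h)
```

With a 3D API you'd feed the frame_uv result into the quad's texture coordinates; with a software API you'd use frame_rect as the source rectangle, or run it once per frame at load time if you'd rather split the sheet into an array.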

Deciding when to change frames is simply a matter of logic (and preference). With frame-rate-dependent timing, you could keep an integer counter and increment it every frame; once it reaches a certain value, change to the next animation frame and reset the counter. With delta time, you'd keep a real number for the time passed and add each frame's delta to it; once it passes a certain amount of time, you'd swap over. Having different rates of animation for different frames is simply a matter of perseverance. You can code any sort of frame-changing logic if you take enough care to use variables instead of constants for everything, and if you have enough patience.
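
As one possible way to write the delta-time version with a different duration for each frame, here is a minimal sketch. The FrameAnimation class and its fields are hypothetical names, and the looping behaviour is an assumption; a one-shot animation would stop at the last frame instead of wrapping.

```python
class FrameAnimation:
    """Looping animation where each frame has its own duration."""

    def __init__(self, durations):
        # durations[i] is how long frame i stays on screen.
        if not durations or any(d <= 0 for d in durations):
            raise ValueError("durations must be non-empty and positive")
        self.durations = list(durations)
        self.frame = 0       # index of the frame currently shown
        self.elapsed = 0.0   # time spent on the current frame so far

    def update(self, dt):
        """Advance by dt and return the index of the frame to draw."""
        self.elapsed += dt
        # A large dt may skip several frames, so loop rather than
        # stepping once. Carrying the remainder over keeps the
        # animation from drifting.
        while self.elapsed >= self.durations[self.frame]:
            self.elapsed -= self.durations[self.frame]
            self.frame = (self.frame + 1) % len(self.durations)
        return self.frame
```

The frame-rate-dependent version falls out of the same code: give the durations as whole numbers of ticks and call update(1) once per rendered frame.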