Picking should absolutely not be used for a voxel engine (please forgive me Sascha, I've got an awesome amount of respect for you). You're hitting the limits of the hardware (or at least eventually will), so the last thing you want is to render an additional pass of the scene for only a single task. Regardless of whether you remove textures or not, that's a hell of a lot of geometry to render twice.

You could alleviate much of the performance hit by using MRTs (multiple render targets): render the picking data to a second render target whilst doing the normal rasterization into your main buffer (or into a further offscreen target if you're aiming for a deferred renderer). If you're looking to do Minecraft-style rendering, i.e. with ambient occlusion, there may be additional data you can store in that second target that would make this method more attractive.

However, you're just not going to beat raycasting when you've got a lot of geometry, especially as you're storing your data in an oct-tree. There are oct-tree optimized raycast algorithms that you really want to use, and not just for casting a ray along the view direction to select stuff. If you want to do oct-tree style ambient occlusion, you'll need rays; if you want to do any form of path finding in a voxel terrain, you'll need rays.

The math isn't too complex. Jink (my soon to be released game engine) has a full range of ray-casting functions and algorithms, optimized for oct-trees, kd-trees etc.

Ray-casting is a requirement for rendering 3D graphics onto a 2D display (I didn't say ray-tracing, before anybody jumps on me). It's all happening in the API even if you don't use it yourself: projecting the 3D vertices into screen-space coordinates.

To do picking you're just casting in the opposite direction: you cast a line outwards from the position where the mouse intersects the camera's near clipping plane. The combination of view and projection matrices associated with the camera determines the two intersecting planes the ray is defined by (or a position and a direction vector, whatever is most useful for your spatial partitioning scheme).
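As a rough sketch of that step (an illustration only, not code from Jink): instead of inverting the view and projection matrices, the Python below builds the ray directly from the camera basis, which gives the same result for a symmetric perspective camera. The name `mouse_to_ray`, its parameters, the top-left pixel origin and the right-handed, y-up convention are all assumptions made for the example.

```python
import math

def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def mouse_to_ray(mouse_x, mouse_y, width, height,
                 eye, forward, up, fov_y_deg, near):
    """Return (origin, direction) of the pick ray through a mouse position.

    Assumes a symmetric perspective camera, mouse coordinates in pixels
    with the origin at the top-left, and a right-handed, y-up world.
    """
    # Mouse position -> normalized device coordinates in [-1, 1], y pointing up.
    ndc_x = 2.0 * mouse_x / width - 1.0
    ndc_y = 1.0 - 2.0 * mouse_y / height

    # Orthonormal camera basis: forward, right, true up.
    f = normalize(forward)
    r = normalize(cross(f, up))
    u = cross(r, f)

    # Half-extents of the near clipping plane in world units.
    half_h = near * math.tan(math.radians(fov_y_deg) / 2.0)
    half_w = half_h * width / height

    # Point where the mouse intersects the near clipping plane.
    origin = tuple(eye[i] + f[i] * near
                   + r[i] * ndc_x * half_w
                   + u[i] * ndc_y * half_h
                   for i in range(3))
    # The ray points from the eye through that point, outwards into the scene.
    direction = normalize(tuple(origin[i] - eye[i] for i in range(3)))
    return origin, direction
```

With the camera at the world origin looking down -z, the centre of an 800x600 window should give a ray that starts on the near plane and points straight along the view direction.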

Once that is done you have a position and a direction in 3D space; for raycasting, think of it as an infinite line.

Then you're either finding the closest 3D object to that line, or doing line/BBox followed by line/triangle intersection tests to determine the 'hit' object.
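To give an idea of what those two tests look like, here is a minimal Python sketch. It uses the common slab method for line/BBox and the Möller-Trumbore algorithm for line/triangle; both are my choice of technique for the example, and the function names are made up. Note that it treats the line as a ray starting at the origin (hits behind the origin are rejected), which is what you want for picking.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_aabb(origin, direction, box_min, box_max):
    """Slab test: return the entry distance along the ray, or None on a miss."""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        o, d = origin[axis], direction[axis]
        lo, hi = box_min[axis], box_max[axis]
        if d == 0.0:
            # Ray is parallel to this slab: it misses unless it starts inside.
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        # Shrink the interval of the ray that lies inside all slabs so far.
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Moller-Trumbore: return the hit distance along the ray, or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle's plane
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t >= 0.0 else None
```

The cheap box test goes first so the more expensive triangle test only runs for objects whose bounding box the ray actually enters.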

Obviously oct-tree optimizations come into play at this stage. You test the intersection of this line against the bounding boxes of your tree nodes, traversing down towards the leaves just like you would when testing your camera frustum against the tree, only line/BBox is a lot faster than frustum/BBox intersection testing.
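A minimal sketch of that traversal in Python, assuming a very simple node layout (`OctreeNode` with a `solid` flag on leaves is hypothetical, and so is the `pick` function; a real engine will store something richer). Children are visited front to back by entry distance, so the first solid leaf found is the nearest one and the rest of the tree is never touched.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class OctreeNode:
    box_min: Vec3
    box_max: Vec3
    children: List["OctreeNode"] = field(default_factory=list)
    solid: bool = False  # leaf payload: does this leaf contain a voxel?

def ray_aabb(origin, direction, box_min, box_max):
    """Slab test: return the entry distance along the ray, or None on a miss."""
    t_near, t_far = 0.0, float("inf")
    for axis in range(3):
        o, d = origin[axis], direction[axis]
        lo, hi = box_min[axis], box_max[axis]
        if d == 0.0:
            if o < lo or o > hi:
                return None
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return None
    return t_near

def _descend(node, t_entry, origin, direction):
    # Leaf: report a hit only if it actually holds a voxel.
    if not node.children:
        return (t_entry, node) if node.solid else None
    # Test the ray against each child's bounding box, keeping only the hits.
    hits = []
    for child in node.children:
        t = ray_aabb(origin, direction, child.box_min, child.box_max)
        if t is not None:
            hits.append((t, child))
    # Nearest child first: octree children don't overlap, so the first
    # solid leaf found this way is the closest one along the ray.
    hits.sort(key=lambda h: h[0])
    for t, child in hits:
        result = _descend(child, t, origin, direction)
        if result is not None:
            return result
    return None

def pick(root, origin, direction):
    """Return (distance, leaf) for the nearest solid leaf hit, or None."""
    t = ray_aabb(origin, direction, root.box_min, root.box_max)
    if t is None:
        return None
    return _descend(root, t, origin, direction)
```

Whole subtrees get rejected with a single box test, which is where the win over testing every voxel (or rendering a second pass) comes from.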

You can further optimize the traversal because you should have stored your visible nodes during the frustum test, so you only need to test against that set of nodes.