Introduction
Sometimes you don't want to supply 5,000 different files with your project. It's untidy and it uses up quite a lot of space (for example, the images\buttons folder for a default Delphi install has 59K of bitmaps that take up over 5 megs on my FAT32 drive!). The obvious solution here is to crunch all those files together into one file and read each into memory as necessary - much cleaner!
How To Do It
The technique is relatively straightforward, though it presents many annoying traps. We separate the packed file into two sections:
1. The data (each file to be packed)
2. Information about that data (name, size)
The first part is used to store each file's contents (e.g. entire bitmaps). The files will be stored one after another without any space in between. As a result, we need some way to figure out the size of each file and where it is in the overall packed file - this is where the second part comes in!
Let's take a look at the first step, which is to splat each file together into one big gloopy mess. If you're not sure of your file handling, it might be time for you to brush up.
This is the procedure we follow. Note that the following is pseudocode rather than real code, and it's only a first draft that I'll be expanding as we go along, so don't use this version. If you want real code, you can download the source code at the end:
Code:
create a new output file
for each file in your list
  if current input file exists then
    get size of current input file
    open current input file
    resize buffer to wanted amount
    read contents of the input file into the buffer
    write the buffer to the output file
    close input file
  end if
end for
close the output file
(Note: do not use the above pseudocode!)
We want to copy each file over into one gigantic packed file, so we create a new file to store everything. For each to-be-packed file, we read the contents and write to the still-opened output file. This has the result of creating a big file with each individual item squashed together. There's a step in between - we have to use a buffer to store the read information first before we can write it. The bigger the buffer here, the better, up to the file's size. If that's too large, though, you can repeatedly read fixed amounts into a buffer until no more is left for the current file.
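Here's a rough sketch of the "repeatedly read fixed amounts" variant, assuming Delphi or Free Pascal with the standard Classes streams. The names AppendFile and ChunkSize are my own inventions for illustration, and the 64 KB chunk is an arbitrary pick. Like the pseudocode above, it doesn't record where each file starts yet, so treat it as a demonstration of the buffer loop only.

[pascal]program PackChunks;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  ChunkSize = 64 * 1024; // arbitrary buffer size; any positive value works

// Append the whole of SourceName to Dest, one fixed-size chunk at a time,
// so the buffer never has to be as large as the file itself.
procedure AppendFile(Dest: TStream; const SourceName: string);
var
  Source: TFileStream;
  Buffer: array of Byte;
  BytesRead: Integer;
begin
  SetLength(Buffer, ChunkSize);
  Source := TFileStream.Create(SourceName, fmOpenRead or fmShareDenyWrite);
  try
    repeat
      BytesRead := Source.Read(Buffer[0], ChunkSize);
      if BytesRead > 0 then
        Dest.WriteBuffer(Buffer[0], BytesRead);
    until BytesRead <= 0; // Read returns 0 once the input is exhausted
  finally
    Source.Free;
  end;
end;

var
  Pack: TFileStream;
  i: Integer;
begin
  if ParamCount < 2 then
  begin
    WriteLn('usage: PackChunks <output> <input1> [<input2> ...]');
    Halt(1);
  end;
  Pack := TFileStream.Create(ParamStr(1), fmCreate);
  try
    for i := 2 to ParamCount do
      if FileExists(ParamStr(i)) then
        AppendFile(Pack, ParamStr(i));
  finally
    Pack.Free;
  end;
end.[/pascal]

If you'd rather not write the loop yourself, TStream.CopyFrom does much the same job in a single call.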
However, you might have noticed a small problem with the above pseudocode: there's no way to tell where one file starts in the overall pack, making it less than useful!
Storing File Info
We want to store information about each file somewhere, which will let us figure out the starting position and sizes. My chosen technique is to store the absolute position of each file (offset in bytes from the start of the file, position 0). This can be figured out as you write each to the pack by keeping an integer total and adding file_size to it for each input file. For example, if the first three files are 100, 250 and 40 bytes long, their offsets are 0, 100 and 350. Uh, did I make that less-than-clear enough?
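The running total is easier to see in code. This is only a sketch with made-up file sizes. ComputeOffsets and its parameters are names I've invented for the example, not anything from the sample source.

[pascal]program Offsets;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Fill Offsets[i] with the start of file i and return the position just
// past the last file. StartPos is 0 for a brand-new pack.
function ComputeOffsets(const Sizes: array of Int64;
  var Offsets: array of Int64; StartPos: Int64): Int64;
var
  i: Integer;
begin
  Result := StartPos;
  for i := 0 to High(Sizes) do
  begin
    Offsets[i] := Result;        // this file starts where the last one ended
    Result := Result + Sizes[i]; // running total
  end;
end;

const
  FileSizes: array[0..2] of Int64 = (100, 250, 40); // made-up sizes in bytes
var
  FileOffsets: array[0..2] of Int64;
  DataEnd: Int64;
  i: Integer;
begin
  DataEnd := ComputeOffsets(FileSizes, FileOffsets, 0);
  for i := 0 to High(FileOffsets) do
    WriteLn('file ', i, ' starts at byte ', FileOffsets[i]);
  WriteLn('data ends at byte ', DataEnd);
end.[/pascal]

With those sizes the three files start at bytes 0, 100 and 350, and the data ends at byte 390.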
I'm going to refer to this file-specific info as a header. It can equally be a footer, of course, depending on where you decide to stick it in the packed file.
The first problem is determining the format of the header. What info needs to be stored for each entry?
* An ID (string)
* The offset in bytes to this file
Anything else is just frills. The ID is necessary so that you can read the file - you may want to say "load somepic.bmp" instead of "load file number 2," because you may not always know the exact index of each file in the pack. The offset is required so that you know where the file starts. Any other details, such as the size, can be calculated later from this info.
The advantage of using a filename as the ID is that the type of the file is easily found - simply check the file extension (e.g. ".bmp" for a bitmap). You can add any other data required as additional fields in the record, though, so bear that in mind. The two fields listed above are the minimum you'd get away with for most cases, but the format is entirely up to you.
Next up, there's a minor question. Should each entry be a fixed size? This is simply for the purposes of reading/writing the header at appropriate times. If each entry is a fixed size then the entire header can be read at once, instead of having a for loop reading each individual entry. It's your call again. I'm choosing to be lazy, so I will use a fixed size record. In fact, I'll go one further and use a shortstring instead of a string for the ID.
[pascal]type
TPackedFileEntry = record
Name: String[32];
Offset: Int64; // offset in *bytes* to this file
end;[/pascal]
The above is info about each file in the pack. You can adjust the length of Name there - I used 32 characters, but you can increase or shorten it if you want. You could use normal, resizable strings instead, but beware: a normal string field is only a pointer to the text, so the record can no longer be written to disk in one go and you'd have to write each name's length and characters yourself. [The sample code uses a slightly smaller string, in case you were wondering, plus a packed record.]
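To show why fixed-size entries are the lazy option, here's a small sketch that writes a whole array of entries with one call and reads it back with another. I'm assuming a packed record here, so each entry is 33 bytes of shortstring plus 8 bytes of Int64 with no padding. The file names and offsets are made up, and a TMemoryStream stands in for the real pack file.

[pascal]program HeaderRoundTrip;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

type
  TPackedFileEntry = packed record
    Name: String[32];
    Offset: Int64; // offset in *bytes* to this file
  end;

var
  Written, Loaded: array of TPackedFileEntry;
  Stream: TMemoryStream;
  HeaderSize: Integer;
begin
  // Two made-up entries
  SetLength(Written, 2);
  Written[0].Name := 'title.bmp';
  Written[0].Offset := 0;
  Written[1].Name := 'button.bmp';
  Written[1].Offset := 1234;

  HeaderSize := Length(Written) * SizeOf(TPackedFileEntry);
  Stream := TMemoryStream.Create;
  try
    // One write for the whole header...
    Stream.WriteBuffer(Written[0], HeaderSize);

    // ...and one read to get it all back.
    Stream.Position := 0;
    SetLength(Loaded, HeaderSize div SizeOf(TPackedFileEntry));
    Stream.ReadBuffer(Loaded[0], HeaderSize);

    WriteLn('entry size: ', SizeOf(TPackedFileEntry));
    WriteLn(Loaded[1].Name, ' at ', Loaded[1].Offset);
  finally
    Stream.Free;
  end;
end.[/pascal]

With the packed record this reports an entry size of 41 bytes. Without "packed" the compiler is free to pad the record, so the size (and therefore the file format) can differ between compilers and settings.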
At this point, it's time for a little update of our wanted format. This is the new format:
1. Original file - whatever was sitting there, optional
2. Packed data - our extra files
3. Header - n entries, one for each file we packed and added [it's a footer actually]
4. Header size - an Int64 giving the header size (n entries * sizeof(each entry))
5. A signature - the string 'pack'
What's that? Yep, the header has now become a footer, plus we have some extra stuff at the start and end! This change makes the system more adaptable.
The original file refers to your executable, or indeed any other file onto which a pack can be added. This is optional (because you may want your pack files separate for clarity) but the format allows exes to have extra stuff embedded afterwards. The exe will still load and run, but with the added bonus that it's self-contained with its data.
Think about what happens if we want to add our packed file to the end of something else (for example, our main executable). We wouldn't have any idea of how large the original file was because we've padded it with our extra crap! The obvious (or not) solution is to check backwards from the end of the file. We always know where the end of the file is.
When reading the file, we have to check whether it contains packed stuff first, since that's not a certainty. The simplest method is to write a fixed number of bytes in a signature pattern (in this case, making the word 'pack', but feel free to use whatever you want) to the end. It's very unlikely that an unpacked file would end in those exact bytes, hopefully, so it serves as a good enough check.
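A minimal sketch of that check, working on any TStream. IsPackedStream is a name I've made up. I'm also assuming that a valid pack is never smaller than the signature plus the Int64 header size that sits before it, which is slightly stricter than just looking at the last four bytes.

[pascal]program CheckSignature;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  Signature: array[0..3] of AnsiChar = 'pack';

// True if the stream is big enough to be a pack and ends in the signature.
function IsPackedStream(Stream: TStream): Boolean;
var
  Tail: array[0..3] of AnsiChar;
begin
  Result := False;
  if Stream.Size < SizeOf(Tail) + SizeOf(Int64) then
    Exit; // too small to hold a header size plus the signature
  Stream.Position := Stream.Size - SizeOf(Tail);
  Stream.ReadBuffer(Tail, SizeOf(Tail));
  Result := CompareMem(@Tail, @Signature, SizeOf(Tail));
end;

var
  Demo: TMemoryStream;
  HeaderSize: Int64;
begin
  Demo := TMemoryStream.Create;
  try
    HeaderSize := 0; // an empty header, just to have something to write
    Demo.WriteBuffer(HeaderSize, SizeOf(HeaderSize));
    WriteLn(IsPackedStream(Demo)); // FALSE: no signature yet
    Demo.WriteBuffer(Signature, SizeOf(Signature));
    WriteLn(IsPackedStream(Demo)); // TRUE
  finally
    Demo.Free;
  end;
end.[/pascal]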
The next step is to read the header info - once we know that, it's plain sailing! The first problem is figuring out where that header starts, though, since it can hold a different number of entries in each pack! This is solved by storing the header size at a known spot (just before the signature) rather than putting the header at a fixed position. We read that size, seek back that many bytes from it (having already skipped the signature at the very end) and then read the header all at once! From that, we know the position of each file and how big the original file was (the offset for the first file entry).
Phew!
Take a breath and re-read the above. The concept is straightforward enough; the only problem is side-stepping possible implementation difficulties (aka, "fiddly file handling").
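If re-reading didn't help, maybe seeing the backwards seeks in code will. This sketch builds a tiny pack in memory (two fake "files" of 4 and 6 bytes) and then finds the header by working back from the end. ReadHeader and TPackHeader are names I've picked for the example, the record is assumed to be packed, and there's no error checking, so it assumes the signature has already been verified.

[pascal]program ReadFooter;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

type
  TPackedFileEntry = packed record
    Name: String[32];
    Offset: Int64;
  end;
  TPackHeader = array of TPackedFileEntry;

const
  Signature: array[0..3] of AnsiChar = 'pack';

// Locate and read the header of a stream already known to be a pack.
procedure ReadHeader(Stream: TStream; var Header: TPackHeader);
var
  HeaderSize: Int64;
  Count: Integer;
begin
  // The header size sits just before the 4-byte signature.
  Stream.Position := Stream.Size - SizeOf(Signature) - SizeOf(HeaderSize);
  Stream.ReadBuffer(HeaderSize, SizeOf(HeaderSize));
  // The header itself sits just before the header size.
  Stream.Position := Stream.Size - SizeOf(Signature) - SizeOf(HeaderSize)
    - HeaderSize;
  Count := HeaderSize div SizeOf(TPackedFileEntry);
  SetLength(Header, Count);
  if Count > 0 then
    Stream.ReadBuffer(Header[0], Count * SizeOf(TPackedFileEntry));
end;

var
  Pack: TMemoryStream;
  Header, Loaded: TPackHeader;
  HeaderSize: Int64;
  Payload: AnsiString;
begin
  Pack := TMemoryStream.Create;
  try
    // Packed data: a 4-byte "file" followed by a 6-byte one
    Payload := 'AAAABBBBBB';
    Pack.WriteBuffer(Payload[1], Length(Payload));
    // Header (really a footer), header size, then the signature
    SetLength(Header, 2);
    Header[0].Name := 'a.txt';
    Header[0].Offset := 0;
    Header[1].Name := 'b.txt';
    Header[1].Offset := 4;
    HeaderSize := Length(Header) * SizeOf(TPackedFileEntry);
    Pack.WriteBuffer(Header[0], Length(Header) * SizeOf(TPackedFileEntry));
    Pack.WriteBuffer(HeaderSize, SizeOf(HeaderSize));
    Pack.WriteBuffer(Signature, SizeOf(Signature));

    ReadHeader(Pack, Loaded);
    WriteLn(Length(Loaded), ' entries; ', Loaded[1].Name,
      ' starts at ', Loaded[1].Offset);
  finally
    Pack.Free;
  end;
end.[/pascal]

This prints "2 entries; b.txt starts at 4".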
Implementation
Here are some pseudocode functions (note the absence of type info, for example):
Generating the packed file
Code:
procedure make pack(filename, files to pack);
begin
  if FileExists(filename) then
  begin
    // append after whatever is already there (e.g. an exe)
    current pos := file size(filename)
    append to file(filename)
  end
  else
  begin
    current pos := 0
    create new file(filename)
  end
  // assume max entries possible - resize later
  set header entry count(header, count(files to pack))
  next entry := 0
  for all files to pack do
  begin
    if file-to-be-packed exists then
    begin
      // step one: dump each file into a big messy gloop
      open current file-to-be-packed
      get file size
      resize buffer
      read data into buffer from file-to-be-packed
      write buffer to output file
      // maintain some info for the header
      header[next entry]'s name := file-to-be-packed name
      header[next entry]'s offset := current pos
      close to-be-packed file
      next entry := next entry + 1
      current pos := current pos + input file size
    end
  end
  // resize to the actual amount used
  set header entry count(header, next entry)
  // step two: write out the header info now
  write file header
  write file header size
  write file signature('pack')
  close output file
end
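For comparison, here's one way the pseudocode above might look as real Pascal. It's a sketch and not the code from the download: MakePack, the packed record and the use of TFileStream/TStringList are my assumptions. File names longer than 32 characters are silently truncated, and nothing stops you appending a second pack onto a file that already contains one.

[pascal]program MakePackDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

type
  TPackedFileEntry = packed record
    Name: String[32];
    Offset: Int64;
  end;

const
  Signature: array[0..3] of AnsiChar = 'pack';

procedure MakePack(const PackName: string; Files: TStrings);
var
  Pack, Source: TFileStream;
  Header: array of TPackedFileEntry;
  HeaderSize: Int64;
  i, Used: Integer;
begin
  if FileExists(PackName) then
    Pack := TFileStream.Create(PackName, fmOpenReadWrite or fmShareExclusive)
  else
    Pack := TFileStream.Create(PackName, fmCreate);
  try
    Pack.Position := Pack.Size;     // append after any original content
    SetLength(Header, Files.Count); // assume max entries - trimmed below
    Used := 0;
    for i := 0 to Files.Count - 1 do
      if FileExists(Files[i]) then
      begin
        // maintain some info for the header
        Header[Used].Name := ShortString(ExtractFileName(Files[i]));
        Header[Used].Offset := Pack.Position;
        // step one: dump the file's contents into the pack
        Source := TFileStream.Create(Files[i], fmOpenRead or fmShareDenyWrite);
        try
          if Source.Size > 0 then // skip empty files; nothing to copy
            Pack.CopyFrom(Source, Source.Size);
        finally
          Source.Free;
        end;
        Inc(Used);
      end;
    // step two: header, header size, signature
    SetLength(Header, Used);
    HeaderSize := Int64(Used) * SizeOf(TPackedFileEntry);
    if Used > 0 then
      Pack.WriteBuffer(Header[0], Used * SizeOf(TPackedFileEntry));
    Pack.WriteBuffer(HeaderSize, SizeOf(HeaderSize));
    Pack.WriteBuffer(Signature, SizeOf(Signature));
  finally
    Pack.Free;
  end;
end;

var
  Files: TStringList;
  i: Integer;
begin
  if ParamCount < 2 then
  begin
    WriteLn('usage: MakePackDemo <pack file> <file1> [<file2> ...]');
    Halt(1);
  end;
  Files := TStringList.Create;
  try
    for i := 2 to ParamCount do
      Files.Add(ParamStr(i));
    MakePack(ParamStr(1), Files);
  finally
    Files.Free;
  end;
end.[/pascal]

Using Pack.Position for the offset means the entries are absolute positions even when the pack is tacked onto an existing file.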
Reading a packed file
Code:
function is packed file(somefile): Boolean
begin
  // a pack always ends with the header size (an Int64) and the signature
  Result := (size(somefile) >= SizeOf(Int64) + 4 bytes) and
            (last four chars(somefile) = 'pack')
end
procedure get header(somefile, header)
var
  header size: Int64
begin
  if not is packed file(somefile) then
    explode horribly
  move to end of file(somefile)
  // find the header size - it sits just before the 4-byte signature
  seek backwards(4 bytes + SizeOf(Int64))
  read header size(somefile, header size)
  seek(end of file - 4 bytes - SizeOf(Int64) - header size) // yeesh!
  set header entry count(header,
    header size div SizeOf(TPackedFileEntry))
  read all header(somefile, header, header size)
end
function packed entry size(header, header size, pack size, which file): Int64
begin
  if which file is not the last entry then
    Result := header[which file + 1].offset - header[which file].offset
  else
    Result := (pack size - 4 - SizeOf(Int64) - header size) - header[which file].offset
end
Getting the size is a little fiddly. If we're not dealing with the last file then we can simply take the difference between two entries, since there's no space in between each file (the next file starts after the previous file's entirety). The last one, though, has extra bits after it for the header, etc., so it's not quite as clean. We can calculate it as "the pack size minus the extra gunk, minus the offset of the last file". The extra gunk in this case is the entire header (however many bytes), the header size (an Int64, so 8 bytes) and 4 bytes for the signature.
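Here's the size calculation as a small, self-contained sketch. EntrySize and SignatureSize are names I've made up, the record is assumed to be packed, and the numbers describe an imaginary pack holding a 4-byte file followed by a 6-byte one.

[pascal]program EntrySizes;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  TPackedFileEntry = packed record
    Name: String[32];
    Offset: Int64;
  end;
  TPackHeader = array of TPackedFileEntry;

const
  SignatureSize = 4; // the bytes of 'pack'

// Size in bytes of entry Index, given the total size of the pack file.
function EntrySize(const Header: TPackHeader; Index: Integer;
  PackSize: Int64): Int64;
var
  HeaderSize, DataEnd: Int64;
begin
  if Index < High(Header) then
    // not the last file: the next file starts right where this one ends
    Result := Header[Index + 1].Offset - Header[Index].Offset
  else
  begin
    // last file: strip the header, the header size field and the signature
    HeaderSize := Length(Header) * SizeOf(TPackedFileEntry);
    DataEnd := PackSize - SignatureSize - SizeOf(Int64) - HeaderSize;
    Result := DataEnd - Header[Index].Offset;
  end;
end;

var
  Header: TPackHeader;
  PackSize: Int64;
begin
  SetLength(Header, 2);
  Header[0].Name := 'a.txt';
  Header[0].Offset := 0;
  Header[1].Name := 'b.txt';
  Header[1].Offset := 4;
  // 10 bytes of data + two entries + header size field + signature
  PackSize := 10 + 2 * SizeOf(TPackedFileEntry) + SizeOf(Int64) + SignatureSize;

  WriteLn(EntrySize(Header, 0, PackSize)); // 4
  WriteLn(EntrySize(Header, 1, PackSize)); // 6
end.[/pascal]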
Reading each file itself is so straightforward that it doesn't even merit pseudocode! You simply seek to the specified offset for that file then do a BlockRead/whatever for the given file size into your chosen output format. I find that streams are pretty handy for this btw, because most objects have a "LoadFromStream" method. You can read the values into a memory buffer of some sort (either a dynamic array or a TMemoryStream).
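OK, a little real code anyway, since there's one trap worth knowing about. This sketch copies one entry out of a pack into a TMemoryStream using TStream.CopyFrom. LoadEntry is a name I've invented, and the "pack" here is just ten bytes in memory with no header.

[pascal]program LoadOne;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes;

// Copy Size bytes starting at Offset out of Pack into a new memory stream.
// The caller owns (and must free) the result.
function LoadEntry(Pack: TStream; Offset, Size: Int64): TMemoryStream;
begin
  Result := TMemoryStream.Create;
  try
    if Size > 0 then // CopyFrom treats a count of 0 as "copy everything"
    begin
      Pack.Position := Offset;
      Result.CopyFrom(Pack, Size);
    end;
    Result.Position := 0; // rewind, ready for a LoadFromStream call
  except
    Result.Free;
    raise;
  end;
end;

var
  Pack, Item: TMemoryStream;
  Payload, Contents: AnsiString;
begin
  Pack := TMemoryStream.Create;
  try
    // A 4-byte "file" at offset 0 and a 6-byte one at offset 4
    Payload := 'AAAABBBBBB';
    Pack.WriteBuffer(Payload[1], Length(Payload));

    Item := LoadEntry(Pack, 4, 6);
    try
      SetLength(Contents, Item.Size);
      Item.ReadBuffer(Contents[1], Length(Contents));
      WriteLn(Contents); // BBBBBB
    finally
      Item.Free;
    end;
  finally
    Pack.Free;
  end;
end.[/pascal]

In a real program you'd hand Item straight to something like a bitmap's LoadFromStream instead of printing it.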
The Code
The above pseudocode will be enough to get you started. You can wrap the functions in a class. Alternatively, you can simply download the source code: packing.zip (11 K) for this tutorial instead, saving you the effort.
You might want to store a CRC32 checksum per-file or to the overall pack, which would allow you to check whether a pack (or a particular file) is corrupt. EFG has CRC32 covered somewhere on his site (I forget exactly where, unfortunately.)
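In case you can't find it either, here's a bare-bones sketch of the standard CRC-32 (the one used by zip and PNG), done bit by bit with no lookup table. This is my own minimal version and not EFG's code. Crc32Of is a made-up name, and a table-driven version would be faster for big files.

[pascal]program CrcDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  Polynomial: LongWord = $EDB88320; // reflected CRC-32 polynomial

// CRC-32 of Size bytes starting at Data, computed one bit at a time.
function Crc32Of(const Data; Size: Integer): LongWord;
var
  P: PAnsiChar;
  i, Bit: Integer;
begin
  Result := $FFFFFFFF;
  P := PAnsiChar(@Data);
  for i := 1 to Size do
  begin
    Result := Result xor Byte(P^);
    for Bit := 1 to 8 do
      if (Result and 1) <> 0 then
        Result := (Result shr 1) xor Polynomial
      else
        Result := Result shr 1;
    Inc(P);
  end;
  Result := not Result;
end;

var
  Check: AnsiString;
begin
  // '123456789' is the usual test input; its CRC-32 is CBF43926
  Check := '123456789';
  WriteLn(IntToHex(Int64(Crc32Of(Check[1], Length(Check))), 8));
end.[/pascal]

You could store the result as an extra LongWord field in each header entry and compare it after loading a file from the pack.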
The above method *should* let you create packed files added onto exe files. I tested this theory with notepad; the exe ran properly and the packed data could be viewed in the example program. There may be some issues to do with sharing (i.e., reading from an exe while it's running). Well, finding out for yourself is half the fun, right?