Pak Files?

**Alimonster** · 06-08-2004, 02:45 PM

Here's the text of the tutorial from my old site. I'm looking forward to Harry Hunt's one because his packing utility is pretty sweet.

I don't have anywhere to put the source code so see my PM and I'll send it your way by email.

Introduction

Sometimes you don't want to supply 5,000 different files with your project. It's untidy and it uses up quite a lot of space (for example, the images\buttons folder for a default Delphi install has 59K of bitmaps that take up over 5 megs on my FAT32 drive, because every small file still occupies at least one whole cluster!). The obvious solution here is to crunch all those files together into one file and read each into memory as necessary - much cleaner!

How To Do It

The technique is relatively straightforward, though it presents many annoying traps. We separate the packed file into two sections:

1. The data (each file to be packed)
2. Information about that data (name, size)

The first part is used to store each file's contents (e.g. entire bitmaps). The files will be stored one after another without any space in between. As a result, we need some way to figure out the size of each file and where it is in the overall packed file - this is where the second part comes in!

Let's take a look at the first step, which is to splat each file together into one big gloopy mess. If you're not sure of your file handling, it might be time for you to brush up.

This is the procedure we follow. Note that the following is pseudocode - I'll be adding to it as we go along, so real code isn't shown yet. If you want code, you can download the source code at the end. This first version is incomplete and will be expanded on, so don't use it as it stands:

Code:

create a new output file
for each file in your list
    if current input file exists then
        get size of current input file
        open current input file
        resize buffer to wanted amount
        read contents of the input file into the buffer
        write the buffer to the output file
        close input file
    end if
end for
close the output file

(Note: do not use the above pseudocode!)

We want to copy each file over into one gigantic packed file, so we create a new file to store everything. For each to-be-packed file, we read the contents and write to the still-opened output file. This has the result of creating a big file with each individual item squashed together. There's a step in between - we have to use a buffer to store the read information first before we can write it. The bigger the buffer here, the better, up to the file's size. If that's too large, though, you can repeatedly read fixed amounts into a buffer until no more is left for the current file.

However, you might have noticed a small problem with the above pseudocode: there's no way to tell where one file starts in the overall pack, making it less than useful!
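Setting that problem aside for a moment, the "read fixed amounts into a buffer" copy described above could be sketched like this. It's only a sketch, not the code from the tutorial download: the name AppendFile and the 64K chunk size are my own assumptions, and it uses TFileStream from the Classes unit rather than BlockRead/BlockWrite.

```pascal
{$IFDEF FPC}{$mode delphi}{$ENDIF}
uses
  SysUtils, Classes;

// Hypothetical helper: appends the whole of InName to the already-open
// Output stream through a fixed-size buffer and returns the bytes copied.
function AppendFile(const InName: string; Output: TStream): Int64;
const
  ChunkSize = 64 * 1024; // arbitrary choice; any reasonable size works
var
  Input: TFileStream;
  Buffer: array of Byte;
  Got: Integer;
begin
  Result := 0;
  SetLength(Buffer, ChunkSize);
  Input := TFileStream.Create(InName, fmOpenRead or fmShareDenyWrite);
  try
    repeat
      // Read returns how many bytes it actually got (0 at end of file)
      Got := Input.Read(Buffer[0], ChunkSize);
      if Got > 0 then
      begin
        Output.WriteBuffer(Buffer[0], Got);
        Inc(Result, Got);
      end;
    until Got <= 0;
  finally
    Input.Free;
  end;
end;
```

The return value is the input file's size, which comes in handy in the next section.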

Storing File Info

We want to store information about each file somewhere, which will let us figure out the starting position and sizes. My chosen technique is to store the absolute position of each file (offset in bytes from the start of the file, position 0). This can be figured out as you write each to the pack by keeping an integer total and adding file_size to it for each input file. Uh, did I make that less-than-clear enough?
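In case that was indeed less than clear, here's the running total on its own as a small sketch. The names OffsetsFromSizes and TInt64Array are made up for illustration, and StartPos is assumed to be wherever the first packed file will be written (0 for a brand new pack).

```pascal
{$IFDEF FPC}{$mode delphi}{$ENDIF}

type
  TInt64Array = array of Int64;

// Given the size of each input file (in packing order) and the position at
// which the first one will be written, returns the absolute offset of each.
function OffsetsFromSizes(const Sizes: array of Int64;
  StartPos: Int64): TInt64Array;
var
  I: Integer;
  Total: Int64;
begin
  SetLength(Result, Length(Sizes));
  Total := StartPos;
  for I := 0 to High(Sizes) do
  begin
    Result[I] := Total;    // this file starts where the previous one ended
    Inc(Total, Sizes[I]);  // the next file starts after this one
  end;
end;
```

For example, three files of 100, 250 and 40 bytes packed from position 0 get the offsets 0, 100 and 350.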

I'm going to refer to this file-specific info as a header. It can equally be a footer, of course, depending on where you decide to stick it in the packed file.

The first problem is determining the format of the header. What info needs to be stored for each entry?

* An ID (string)
* The offset in bytes to this file

Anything else is just frills. The ID is necessary so that you can read the file - you may want to say "load somepic.bmp" instead of "load file number 2," because you may not always know the exact index of each file in the pack. The offset is required so that you know where the file starts. Any other details, such as the size, can be calculated later from this info.

The advantage of using a filename is that the type of the file is easily found - simply check the file extension (e.g. ".bmp" for a bitmap). You can add any other data required as additional fields in the record, though, so bear that in mind. The example record above is the minimum you'd get away with for most cases, but the format is entirely up to you.

Next up, there's a minor question. Should each entry be a fixed size? This is simply for the purposes of reading/writing the header at appropriate times. If each entry is a fixed size then the entire header can be read at once, instead of having a for loop reading each individual entry. It's your call again. I'm choosing to be lazy, so I will use a fixed size record. In fact, I'll go one further and use a shortstring instead of a string for the ID.

[pascal]type
  TPackedFileEntry = record
    Name: String[32];
    Offset: Int64; // offset in *bytes* to this file
  end;[/pascal]

The above is info about each file in the pack. You can adjust the size of "name" there - I used 32, but you can increase or shorten it if you want or use normal, resizable strings instead (which can introduce small complications to the unwary). [The sample code uses a slightly smaller string, in case you were wondering, plus a packed record.]
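To show why the fixed size is the lazy option, here's a rough sketch of writing and reading the whole table in one call each. It assumes a packed record (so each entry is exactly 1 + 32 + 8 = 41 bytes with no padding) and uses TStream from the Classes unit; WriteHeader, ReadHeader and TPackedFileHeader are hypothetical names, not necessarily what the sample code calls them.

```pascal
{$IFDEF FPC}{$mode delphi}{$ENDIF}
uses
  Classes;

type
  TPackedFileEntry = packed record
    Name: String[32];  // shortstring: 1 length byte + 32 characters
    Offset: Int64;     // offset in bytes to this file
  end;
  TPackedFileHeader = array of TPackedFileEntry;

// Writes every entry with a single call - possible only because the
// entries are fixed-size and sit contiguously in the dynamic array.
procedure WriteHeader(Stream: TStream; const Header: TPackedFileHeader);
begin
  if Length(Header) > 0 then
    Stream.WriteBuffer(Header[0], Length(Header) * SizeOf(TPackedFileEntry));
end;

// Reads HeaderSize bytes' worth of entries from the current position.
procedure ReadHeader(Stream: TStream; HeaderSize: Int64;
  var Header: TPackedFileHeader);
var
  Count: Integer;
begin
  Count := Integer(HeaderSize div SizeOf(TPackedFileEntry));
  SetLength(Header, Count);
  if Count > 0 then
    Stream.ReadBuffer(Header[0], Count * SizeOf(TPackedFileEntry));
end;
```

With resizable strings this wouldn't work, because the record would only hold a pointer to the text rather than the text itself - one of those small complications for the unwary.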

At this point, it's time for a little update of our wanted format. This is the new format:

Original file - whatever was sitting there, optional
Packed data - our extra files
Header - n entries, one for each file we packed and added [it's a footer actually]
Header size - an Int64 giving the header size (n entries * sizeof(each entry))
A signature - the string 'pack'

What's that? Yep, the header has now become a footer, plus we have some extra stuff at the start and end! This change makes the system more adaptable.

The original file refers to your executable, or indeed any other file onto which a pack can be added. This is optional (because you may want your pack files separate for clarity) but the format allows exes to have extra stuff embedded afterwards. The exe will still load and run, but with the added bonus that it's self-contained with its data.

Think about what happens if we want to add our packed file to the end of something else (for example, our main executable). We wouldn't have any idea of how large the original file was because we've padded it with our extra crap! The obvious (or not) solution is to check backwards from the end of the file. We always know where the end of the file is.

When reading the file, we have to check whether it contains packed stuff first, since that's not a certainty. The simplest method is to write a fixed number of bytes in a signature pattern (in this case, making the word 'pack', but feel free to use whatever you want) to the end. It's very unlikely that an unpacked file would end in those exact bytes, hopefully, so it serves as a good enough check.
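A minimal sketch of that signature check on a stream might look like the following. IsPackedStream and the Signature constant are illustrative names of my own. Note one assumption: as well as the signature, it insists there's room for the Int64 header size in front of it, which is slightly stricter than just comparing the last four bytes.

```pascal
{$IFDEF FPC}{$mode delphi}{$ENDIF}
uses
  SysUtils, Classes;

const
  Signature: array[0..3] of AnsiChar = 'pack';

// Returns True if the stream ends in the 'pack' signature and is large
// enough to also hold the header-size field that precedes it.
function IsPackedStream(Stream: TStream): Boolean;
var
  Tail: array[0..3] of AnsiChar;
begin
  Result := False;
  if Stream.Size < SizeOf(Tail) + SizeOf(Int64) then
    Exit;
  Stream.Position := Stream.Size - SizeOf(Tail);
  Stream.ReadBuffer(Tail, SizeOf(Tail));
  Result := CompareMem(@Tail, @Signature, SizeOf(Tail));
end;
```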

The next step is to read the header info - once we know that, it's plain sailing! The first problem is figuring out where that header starts, though, since it can hold a different number of entries from one pack to the next! This is solved by storing the header size just before the signature, rather than assuming a fixed position. We seek back from the end (skipping the signature), read the header size, then seek back that many more bytes and read the header all at once! From that, we know the position of each file and how big the original file was (the offset for the first file entry).
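The backwards arithmetic is easy to get wrong by a few bytes, so here's a hedged sketch of just that step. LocateHeader and SignatureSize are names I've invented; the sanity checks are my own addition, and it assumes the header size was written as a native Int64 on the same kind of machine.

```pascal
{$IFDEF FPC}{$mode delphi}{$ENDIF}
uses
  Classes;

const
  SignatureSize = 4; // length of 'pack'

// Works out where the header table starts by reading backwards from the
// end. Returns False if the stored size cannot be right for this stream.
function LocateHeader(Stream: TStream;
  out HeaderPos, HeaderSize: Int64): Boolean;
var
  TrailerPos: Int64;
begin
  Result := False;
  HeaderPos := 0;
  HeaderSize := 0;
  // position of the Int64 that sits just before the signature
  TrailerPos := Stream.Size - SignatureSize - SizeOf(Int64);
  if TrailerPos < 0 then
    Exit;
  Stream.Position := TrailerPos;
  Stream.ReadBuffer(HeaderSize, SizeOf(HeaderSize));
  // a header can't be negative or bigger than what precedes the trailer
  if (HeaderSize < 0) or (HeaderSize > TrailerPos) then
    Exit;
  HeaderPos := TrailerPos - HeaderSize;
  Result := True;
end;
```

As a worked example: a 104-byte pack whose stored header size is 82 has its size field at position 92 and its header at position 10, so everything before byte 10 is the original file plus packed data.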

Phew!

Take a breath and re-read the above. The concept is straightforward enough; the only problem is side-stepping possible implementation difficulties (aka, "fiddly file handling").

Implementation

Here are some pseudocode functions (note the absence of type info, for example):

Generating the packed file

Code:

procedure make pack(filename, files to pack);
begin
  if FileExists(filename) then
  begin
    current pos := file size(filename)
    append to file(filename)
  end
  else
  begin
    current pos := 0
    create new file(filename)
  end if

  // assume max entries possible - resize later
  set header entry count(header, count(files to pack))

  next entry := 0

  for all files to pack do
  begin
    if file-to-be-packed exists then
    begin
       // step one: dump each file into a big messy gloop
       open current file-to-be-packed
       get file size
       resize buffer
       read data into buffer from file-to-be-packed
       write buffer to output file

       // maintain some info for the header
       header[next entry]'s name := file-to-be-packed name
       header[next entry]'s offset := current pos

       close to-be-packed file

       next entry := next entry + 1
       current pos := current pos + input file size
    end if
  end for

  // resize to the actual amount used
  set header entry count(header, next entry)

  // step two: write out the header info now
  write file header
  write file header size
  write file signature('pack')

  close output file
end
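For comparison, here is one way the pseudocode above might translate into real Pascal using streams. Treat it as a sketch under assumptions rather than the code from the download: MakePack and Signature are my own names, the record is assumed to be packed, the path is stripped from each name with ExtractFileName, and names longer than 32 characters are silently truncated.

```pascal
{$IFDEF FPC}{$mode delphi}{$ENDIF}
uses
  SysUtils, Classes;

type
  TPackedFileEntry = packed record
    Name: String[32];
    Offset: Int64;
  end;

const
  Signature: array[0..3] of AnsiChar = 'pack';

procedure MakePack(const PackName: string; const Files: array of string);
var
  Output, Input: TFileStream;
  Header: array of TPackedFileEntry;
  HeaderSize: Int64;
  Used, I: Integer;
begin
  if FileExists(PackName) then
  begin
    // append after whatever is already there (e.g. an exe)
    Output := TFileStream.Create(PackName, fmOpenReadWrite or fmShareExclusive);
    Output.Position := Output.Size;
  end
  else
    Output := TFileStream.Create(PackName, fmCreate);
  try
    SetLength(Header, Length(Files)); // assume max entries - resize later
    Used := 0;
    for I := 0 to High(Files) do
      if FileExists(Files[I]) then
      begin
        Header[Used].Name := ShortString(ExtractFileName(Files[I]));
        Header[Used].Offset := Output.Position; // the running total
        Input := TFileStream.Create(Files[I], fmOpenRead or fmShareDenyWrite);
        try
          if Input.Size > 0 then // CopyFrom treats a count of 0 specially
            Output.CopyFrom(Input, Input.Size);
        finally
          Input.Free;
        end;
        Inc(Used);
      end;
    SetLength(Header, Used); // resize to the actual amount used
    HeaderSize := Int64(Used) * SizeOf(TPackedFileEntry);
    if Used > 0 then
      Output.WriteBuffer(Header[0], Used * SizeOf(TPackedFileEntry));
    Output.WriteBuffer(HeaderSize, SizeOf(HeaderSize));
    Output.WriteBuffer(Signature, SizeOf(Signature));
  finally
    Output.Free;
  end;
end;
```

Using the output stream's own Position as "current pos" saves keeping a separate total. Calling it would look something like MakePack('game.pak', ['title.bmp', 'click.wav']), where the file names are of course just placeholders.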

Reading a packed file

function is packed file(somefile): Boolean
begin
  Result := (size(somefile) > 4 chars) and
              last four chars(somefile) = 'pack'
end

procedure get header(somefile, header)
var
  header_size: Int64
begin
  if not is packed file(somefile) then
    explode horribly

  move to end of file(somefile)

  // find the header size - ignore the last 4 bytes
  seek backwards(4 bytes + SizeOf(Int64))

  read header size(somefile, header size)
  seek(end of file - 4 bytes - SizeOf(Int64) - header size) // yeesh!

  set header entry count(header,
                        header size div SizeOf(TPackedFileEntry))

  read all header(somefile, header, header size)
end

function packed file size(header, which_file): Int64
begin
  // packed_file_size below is the size of the whole pack file on disk
  if which_file is_not_last then
    Result := header[which_file + 1].offset - header[which_file].offset
  else
    Result := (packed_file_size - 4 - sizeof(int64) - header_size) - header[which_file].offset
end

Getting the size is a little fiddly. If we're not dealing with the last file then we can simply take the difference between two entries, since there's no space in between each file (the next file starts after the previous file's entirety). The last one, though, has extra bits after it for the header, etc., so it's not quite as clean. We can calculate it as "the pack's total size with the extra gunk removed, minus the offset for the last file". The extra gunk in this case is the entire header (however many bytes), header size (an Int64) and 4 bytes for the signature.
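Put into Pascal, the size calculation might look like the sketch below. It assumes you've already worked out HeaderPos, the position where the header table starts (that is, the total pack size minus the signature, the Int64 and the header itself), which is the same thing as where the last file's data ends. PackedFileSize and TPackedFileHeader are illustrative names.

```pascal
{$IFDEF FPC}{$mode delphi}{$ENDIF}

type
  TPackedFileEntry = packed record
    Name: String[32];
    Offset: Int64;
  end;
  TPackedFileHeader = array of TPackedFileEntry;

// Size in bytes of entry number Index (0-based). HeaderPos is where the
// header table begins, i.e. one past the final byte of the last file.
function PackedFileSize(const Header: TPackedFileHeader; Index: Integer;
  HeaderPos: Int64): Int64;
begin
  if Index < High(Header) then
    // not the last file: it ends exactly where the next one starts
    Result := Header[Index + 1].Offset - Header[Index].Offset
  else
    // last file: it ends where the header table starts
    Result := HeaderPos - Header[Index].Offset;
end;
```

With offsets of 10, 110 and 360 and a header starting at 400, the three sizes come out as 100, 250 and 40.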

Reading each file itself is so straightforward that it doesn't even merit pseudocode! You simply seek to the specified offset for that file then do a BlockRead/whatever for the given file size into your chosen output format. I find that streams are pretty handy for this btw, because most objects have a "LoadFromStream" method. You can read the values into a memory buffer of some sort (either a dynamic array or a TMemoryStream).
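Even so, here's a tiny sketch of the stream route for anyone who wants it. ExtractToMemory is a made-up name, and the caller is assumed to own (and free) the returned TMemoryStream.

```pascal
{$IFDEF FPC}{$mode delphi}{$ENDIF}
uses
  Classes;

// Copies Size bytes starting at Offset in the pack into a new memory
// stream, rewound to the start so it can go straight to LoadFromStream.
function ExtractToMemory(Pack: TStream; Offset, Size: Int64): TMemoryStream;
begin
  Result := TMemoryStream.Create;
  try
    Pack.Position := Offset;
    // CopyFrom with a count of 0 means "copy the whole source stream",
    // so an empty packed file has to skip the call entirely
    if Size > 0 then
      Result.CopyFrom(Pack, Size);
    Result.Position := 0;
  except
    Result.Free;
    raise;
  end;
end;
```

You'd then do something along the lines of SomeBitmap.LoadFromStream(Mem) followed by Mem.Free, where SomeBitmap and Mem are your own variables.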

The Code

The above pseudocode will be enough to get you started. You can wrap the functions in a class. Alternatively, you can simply download the source code: packing.zip (11 K) for this tutorial instead, saving you the effort.

You might want to store a CRC32 checksum per-file or to the overall pack, which would allow you to check whether a pack (or a particular file) is corrupt. EFG has CRC32 covered somewhere on his site (I forget exactly where, unfortunately.)
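If you can't track down a ready-made routine, a bare-bones CRC32 is short enough to sketch here. This is not EFG's code: it's the plain bit-at-a-time form of the usual reflected CRC-32 (polynomial $EDB88320, as used by zip and PNG), and the name Crc32 is my own. A table-driven version would be much faster for big packs.

```pascal
{$IFDEF FPC}{$mode delphi}{$ENDIF}

// Bit-by-bit CRC-32 over a block of bytes. Slow but tiny.
function Crc32(const Data: array of Byte): Cardinal;
var
  I, Bit: Integer;
begin
  Result := $FFFFFFFF;
  for I := 0 to High(Data) do
  begin
    Result := Result xor Data[I];
    for Bit := 1 to 8 do
      if (Result and 1) <> 0 then
        Result := (Result shr 1) xor $EDB88320
      else
        Result := Result shr 1;
  end;
  Result := not Result;
end;
```

A quick sanity check: the nine ASCII bytes of '123456789' should give $CBF43926, which is the standard check value for this CRC. Where you store the result (an extra field per entry, or one value for the whole pack) is up to you.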

The above method *should* let you create packed files added onto exe files. I tested this theory with notepad; the exe ran properly and the packed data could be viewed in the example program. There may be some issues to do with sharing (i.e., reading from an exe while it's running). Well, finding out for yourself is half the fun, right?
