
Thread: Problems with a chunk-based file

  1. #1

    Problems with a chunk-based file

    Hello everyone.

    After a long time, I decided to finally start coding again. I'm planning to start working on my VFS again; to all who were waiting - the project is not dead!

    Anyway, I wanted to work a little on a side project directly linked with my VFS, and I encountered a problem. Take a look at the following structure:


    For offsets, I reserved some space just after the header. It's not really an offsets array; it consists of fixed-size records:
    [pascal]
    type
      { .: TEntryRecord :. }
      TEntryRecord = packed record
        EntryHash: Cardinal;
        OffsetInFile: Int64;
      end;
    [/pascal]
    But as you can see, the reserved space is limited. What if the array grows larger? If I understand correctly, the array will overwrite all data kept after it.
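
    To make it concrete, something like this (ReservedSize and EntryCount would be fields in my header; the names are made up for illustration):

    [pascal]
    { Illustration only: ReservedSize and EntryCount are hypothetical
      header fields, not from my actual code. }
    MaxEntries := ReservedSize div SizeOf(TEntryRecord);
    CanAddEntry := EntryCount < MaxEntries;
    // Once EntryCount reaches MaxEntries, writing another record
    // would spill into the data stored right after the reserved space.
    [/pascal]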

    How to avoid that?

  2. #2

    Re: Problems with a chunk-based file

    Just put the header and offset array at the end of the file instead. Then you just have to rewrite the header when adding data to the file.

    This is what the package structure looks like in Phoenix:

    Code:
     [File data #1]
       ..
     [File data #n]
       ..
     [Package header]
       Ident
       Version
       Name
       Description
       File count
     [File header #1]
       Name
       Size
       Resource
       Modified
       Checksum
       Offset
     [File header #n]
       Name
       Size
       Resource
       Modified
       Checksum
       Offset
     [Package footer]
       Ident
       HeaderPosition
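    In other words, appending works roughly like this (a sketch, not the actual Phoenix code; TPackageFooter and WritePackageHeader are made up for illustration):

    [pascal]
    { Rough sketch, not the actual Phoenix code; uses Classes. }
    type
      TPackageFooter = packed record
        Ident: array[0..3] of AnsiChar;
        HeaderPosition: Int64; // where the package header starts
      end;

    procedure AppendFileData(Package: TFileStream; const Data; Size: Integer;
      var Footer: TPackageFooter);
    begin
      // The new data overwrites the old header block...
      Package.Seek(Footer.HeaderPosition, soBeginning);
      Package.WriteBuffer(Data, Size);
      // ...and the header is rewritten right after it.
      Footer.HeaderPosition := Package.Position;
      WritePackageHeader(Package); // assumed helper: header + file headers
      Package.WriteBuffer(Footer, SizeOf(Footer)); // footer stays last
    end;
    [/pascal]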
    Amnoxx: "Oh, and this code appears to be an approximate replacement for return(random() & 0x01);"

    Phoenix Wiki
    http://www.phoenixlib.net/

    Phoenix Forum
    http://www.pascalgamedevelopment.com/viewforum.php?f=71

  3. #3

    Re: Problems with a chunk-based file

    Hey Patrick. Haven't seen you for a while. When will you be back on MSN?

    I had to solve similar problems with my VFS. I use the following structure:

    VFS-Header
    Data-Blocks + Block-Headers
    File-Headers

    When the amount of data-blocks grows too large, they will overwrite the file-headers. To prevent this, I've written a routine to allocate "n" blocks for a file.

    > It checks all the blocks to see if any already belong to the file (this information is stored in the FileID of the block header). If more blocks are needed, it proceeds with the next step.
    > It checks whether there are any "free" blocks we could use (blocks whose block headers have FileID = 0). If still more blocks are needed, it proceeds.
    > It reads all file-header data into a backup buffer (TMemoryStream). Then it starts adding blocks by overwriting the previous file-header data (which has been backed up). When enough blocks have been added, the file-header data is rewritten after the last block.

    It's just a matter of making a backup, overwriting, and restoring the backup. It's important that you update your whole system, so it knows about the added blocks and can still find the file-headers although they have moved.
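
    In (rough) code, the three steps look something like this - TBlockHeader and the helper routines are assumptions for illustration, not my actual code:

    [pascal]
    { Rough sketch; BlockCount, ReadBlockHeader, WriteBlockHeader and
      BackupAndExtendBlockArea are assumed helpers. }
    type
      TBlockHeader = packed record
        FileID: Cardinal; // 0 = free block
      end;

    procedure AllocateBlocks(FileID: Cardinal; Needed: Integer);
    var
      I, Owned: Integer;
      Header: TBlockHeader;
    begin
      // Step 1: count the blocks that already belong to this file.
      Owned := 0;
      for I := 0 to BlockCount - 1 do
        if ReadBlockHeader(I).FileID = FileID then
          Inc(Owned);
      if Owned >= Needed then Exit;

      // Step 2: claim "free" blocks (FileID = 0).
      for I := 0 to BlockCount - 1 do
      begin
        Header := ReadBlockHeader(I);
        if Header.FileID = 0 then
        begin
          Header.FileID := FileID;
          WriteBlockHeader(I, Header);
          Inc(Owned);
          if Owned >= Needed then Exit;
        end;
      end;

      // Step 3: back up the file-headers, extend the block area over
      // them, then rewrite the headers after the last new block.
      BackupAndExtendBlockArea(FileID, Needed - Owned);
    end;
    [/pascal]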

    So in your case... when the offsets array grows too large, you have to back up and move ALL the block data to be able to allocate some new space. In most cases, I think the RAW-data part of your file will be MUCH bigger than the offsets array. My advice is to swap these two, so the RAW data goes first. This is better performance-wise, because reading and writing a small piece of data (the offsets array) is always better than having to read/write RAW data that can be a couple of MBs or even more.

    I have one question though. What exactly does "OffsetInFile" contain?

    1. The absolute offset from the beginning?
    2. Is it relative to the end of the header?
    3. Is it relative to the end of the offsets array?

    This choice is quite important. If the first case is true, you need to update ALL offsets in each TEntryRecord whenever your offsets array grows bigger. The last choice would be better, because the offsets would still be correct. In that case you have to store an "OffsetsSize" field in your header. To find a piece of data you'd use:

    VFSHeader.OffsetsSize + EntryRecord.OffsetInFile = location of chunk
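
    Or, in code (the stream handling is assumed):

    [pascal]
    { Illustration of the formula above. }
    ChunkPos := VFSHeader.OffsetsSize + EntryRecord.OffsetInFile;
    Stream.Seek(ChunkPos, soBeginning);
    // Existing entries stay valid when the offsets area grows:
    // only OffsetsSize in the header has to be updated.
    [/pascal]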

    Hope this helps. If you would like to discuss any of this, please get on MSN (in the evening; I won't be online during the day).

    Coders rule nr 1: Face ur bugz.. dont cage them with code, kill'em with ur cursor.

  4. #4

    Re: Problems with a chunk-based file

    Hmm, I don't really get it. See, even if I store raw data before the offsets, there's still a chance for the offsets data to be overwritten by raw data. The only way of preventing that I can think of is to keep the offsets in memory, but I want to avoid that.

    Am I wrong?

  5. #5

    Re: Problems with a chunk-based file

    The actual solution is to make a backup of the part of the file that is about to be overwritten when some other part of the file grows bigger. After that, you write the backup back to the file, but at a new location (just after the newly allocated space). Example:

    HH-BBBBBB-GGG

    If we want more B, we make a backup of all the G data. Now we overwrite some of the G-data with B-data.

    HH-BBBBBBBB-G

    Now we write our backed-up G data after the new B data.

    HH-BBBBBBBB-GGG

    My advice is to make sure that the G part of your file is generally smaller than the B part. Every time you need more B data you have to make a backup of G and restore it later on. If G is very large, it takes more time to read/write.
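
    A minimal sketch of the backup/restore step (all names are assumptions, not code from an actual VFS):

    [pascal]
    { TFileStream/TMemoryStream come from the Classes unit. }
    procedure GrowBlockArea(F: TFileStream; GStart, GSize, Extra: Int64);
    var
      Backup: TMemoryStream;
    begin
      Backup := TMemoryStream.Create;
      try
        // 1. Back up the G part that is about to be overwritten.
        F.Seek(GStart, soBeginning);
        Backup.CopyFrom(F, GSize);
        // 2. The caller may now write Extra bytes of new B data at GStart.
        // 3. Restore G right after the enlarged B part.
        F.Seek(GStart + Extra, soBeginning);
        Backup.Position := 0;
        F.CopyFrom(Backup, Backup.Size);
      finally
        Backup.Free;
      end;
    end;
    [/pascal]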

    Hope this makes sense.
    Coders rule nr 1: Face ur bugz.. dont cage them with code, kill'em with ur cursor.

  6. #6

    Re: Problems with a chunk-based file

    Yep, it does now. But I wonder if it's the same solution used in SQLite. Rewriting the part you want to overwrite is time-consuming. And what if you have 10,000 blocks to back up? It'll surely take some time.

    I'm thinking of some way to ensure that there's always enough space to add new chunks without worrying that something will be overwritten. Take a look at that:


    Maybe something like that could work? But how would one prepare such a "divider"? Maybe move it every time you add new chunks, as the raw data grows "downwards".

  7. #7

    Re: Problems with a chunk-based file

    You are probably talking about a system with free space in the middle and two parts of the file growing towards each other as they get bigger. I don't think it solves the problem; it only makes things more complex. In the end, the offsets array will clash with the raw data and you'd have the same problem.

    What you could do is:
    > Use my trick to make a backup of the part that is about to be overwritten and put it back afterwards (see previous post).
    > Make two separate files, which would be easier: one storing the VFS header, the file information, offsets etc., and the other storing raw data. This allows both parts to grow without interfering (a sketch of this follows).
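
    [pascal]
    { Sketch of the two-file idea; the names are made up for
      illustration. TEntryRecord is the record from the first post;
      uses Classes. }
    procedure AddEntry(IndexFile, DataFile: TFileStream;
      var Entry: TEntryRecord; const Data; Size: Integer);
    begin
      // Raw data is simply appended to the data file...
      DataFile.Seek(0, soEnd);
      Entry.OffsetInFile := DataFile.Position;
      DataFile.WriteBuffer(Data, Size);
      // ...and the entry to the index file. Neither part can ever
      // overwrite the other, because they live in separate files.
      IndexFile.Seek(0, soEnd);
      IndexFile.WriteBuffer(Entry, SizeOf(Entry));
    end;
    [/pascal]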
    Coders rule nr 1: Face ur bugz.. dont cage them with code, kill'em with ur cursor.

  8. #8

    Re: Problems with a chunk-based file

    Yep, I've been considering using two files instead of one, just as it's done in MySQL (e.g. in MyISAM tables -> http://en.wikipedia.org/wiki/MyISAM). I could use SQLite for quick entry management, but it simplifies things too much and I want to have some challenge.

    The idea of making backups sounds quite good. I could cache entries in memory and rewrite them back. But I ran tests and it takes some time. Just like I mentioned in our MSN talk, I made a test that reads/writes a million entries:

    [pascal]
    type
      { .: TRec :. }
      TRec = packed record
        NameHash: Cardinal;
        OffsetInFile: Int64; // an absolute offset from the beginning of the file
      end;
    [/pascal]
    Code:
     Operation | Time
     Reading   | 4.02 seconds
     Writing   | 4.92 seconds
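
    The test itself looked roughly like this (a reconstructed sketch, not my exact code; uses Classes and SysUtils):

    [pascal]
    procedure BenchmarkEntries;
    const
      EntryCount = 1000000;
    var
      Stream: TFileStream;
      Rec: TRec;
      I: Integer;
      Start: TDateTime;
    begin
      Stream := TFileStream.Create('entries.bin', fmCreate);
      try
        // Write a million records, timing the whole loop.
        Start := Now;
        for I := 0 to EntryCount - 1 do
        begin
          Rec.NameHash := I;
          Rec.OffsetInFile := Int64(I) * SizeOf(TRec);
          Stream.WriteBuffer(Rec, SizeOf(Rec));
        end;
        WriteLn('Writing: ', (Now - Start) * SecsPerDay:0:2, ' seconds');

        // Read them all back.
        Start := Now;
        Stream.Position := 0;
        for I := 0 to EntryCount - 1 do
          Stream.ReadBuffer(Rec, SizeOf(Rec));
        WriteLn('Reading: ', (Now - Start) * SecsPerDay:0:2, ' seconds');
      finally
        Stream.Free;
      end;
    end;
    [/pascal]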

    So if you had a lot of entries, you'd have to wait 5 seconds before you can add anything again...

  9. #9

    Re: Problems with a chunk-based file

    Could you not store the record structure inside a stream/file that is also stored in the blocks? E.g. the record structure is just another file inside the archive. No interference that way.
    http://3das.noeska.com - create adventure games without programming

  10. #10

    Re: Problems with a chunk-based file

    Hmm, that's a good idea.

    I wonder, how do databases actually work? You can freely modify records (i.e. change their sizes), add new ones and remove existing ones, and it's all done very fast... I'd like to know how to make such a system. Are B+trees the way to go? How do you use them? It'd be easy if the records were fixed-size. But they just can't be - I cannot limit everyone to strings of at most 255 characters...
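
    One common answer to the variable-size problem is length-prefixed strings, so a record field can be any length without the 255-character ShortString limit. A sketch (not how any particular database does it; uses Classes):

    [pascal]
    procedure WriteVarString(Stream: TStream; const S: AnsiString);
    var
      Len: Cardinal;
    begin
      Len := Length(S);
      Stream.WriteBuffer(Len, SizeOf(Len)); // 4-byte length prefix
      if Len > 0 then
        Stream.WriteBuffer(S[1], Len);      // string bytes, any length
    end;

    function ReadVarString(Stream: TStream): AnsiString;
    var
      Len: Cardinal;
    begin
      Stream.ReadBuffer(Len, SizeOf(Len));
      SetLength(Result, Len);
      if Len > 0 then
        Stream.ReadBuffer(Result[1], Len);
    end;
    [/pascal]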

    Or maybe I'm complicating things while the answer is really nearby? What do you think?

