Page 2 of 3 FirstFirst 123 LastLast
Results 11 to 20 of 25

Thread: Problems with a chunk-based file

  1. #11

    Re: Problems with a chunk-based file

    Just a theory but how about this?

In your header, put a variable that contains the file offset of the "index". As the header is a fixed size, you can update this on every addition/deletion. For each chunk, also store the size of that chunk. Then, at the end of the file after the raw data, you store the index, which contains the number of chunks, the file offset of each chunk and maybe also a flag to show whether it's active or not.
    Your file structure would look like:
    [Header] [Chunk0] .. [Chunkn] [Index]

For adding:
• Go to the index file offset.
• Overwrite the old index with the new chunk's data.
• Add the new chunk's size and file offset to the index.
• Write the updated index to disk after the new chunk.
• Store the new index offset in the header.
• Go to the beginning of the file and re-write the header.


For deleting:
• Change the chunk's "active" flag in the index.
• Go to the index file offset.
• Re-write the index.


I know this would waste a lot of space, but you could write a procedure that deletes all "inactive" chunks from the file (kind of like compacting), which only has to be run every so often. It also gives you the option of restoring deleted data just by changing the flag back.

I don't know if this is how databases work, but if you're looking for speed, this is how I'd do it.
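The header/chunks/index layout described above can be sketched in Python (the thread is about Delphi, but the idea carries over; all names and field sizes here are made up for illustration):

```python
import struct
from io import BytesIO

HEADER = struct.Struct("<Q")    # file offset of the index
ENTRY = struct.Struct("<QQB")   # per chunk: offset, size, active flag

def create(f):
    # Fresh archive: header points just past itself, index is empty.
    f.seek(0)
    f.write(HEADER.pack(HEADER.size))
    f.write(struct.pack("<I", 0))   # entry count

def load_index(f):
    f.seek(0)
    (idx_off,) = HEADER.unpack(f.read(HEADER.size))
    f.seek(idx_off)
    (count,) = struct.unpack("<I", f.read(4))
    return idx_off, [list(ENTRY.unpack(f.read(ENTRY.size))) for _ in range(count)]

def add_chunk(f, data):
    idx_off, entries = load_index(f)
    # The new chunk's data overwrites the old index...
    f.seek(idx_off)
    f.write(data)
    entries.append([idx_off, len(data), 1])
    new_idx = idx_off + len(data)
    # ...and the updated index is rewritten right after the new chunk.
    f.write(struct.pack("<I", len(entries)))
    for e in entries:
        f.write(ENTRY.pack(*e))
    # Finally the header is rewritten to point at the new index.
    f.seek(0)
    f.write(HEADER.pack(new_idx))

def delete_chunk(f, n):
    idx_off, entries = load_index(f)
    entries[n][2] = 0   # just flip the active flag; data stays on disk
    f.seek(idx_off + 4 + n * ENTRY.size)
    f.write(ENTRY.pack(*entries[n]))

# usage
f = BytesIO()
create(f)
add_chunk(f, b"hello")
add_chunk(f, b"world!")
delete_chunk(f, 0)
_, entries = load_index(f)
```

Note that deletion only flips the flag, so the "hello" bytes are still in the file and could be restored, exactly as the post suggests.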

    edit:
Just re-read the post (properly this time). The above should still work if the data shrinks, but not if it grows. How often will the data get larger? I thought databases were fast because they used fixed-size records.
Windows Vista x64 Ultimate | AMD Athlon x64 4000+ | Codegear Delphi 2009, Delphi 2007, Borland Delphi 7

  2. #12

    Re: Problems with a chunk-based file

PS: my virtual file system already stores the file index as 'just another' file inside the archive. It is also growable. Shrinking is a bit harder when it involves a chunk in the middle of the archive; I just mark those chunks as deleted and offer them for reuse as soon as possible. For all of this I use a TVirtualFileStream construct that calls the needed TVirtualFileSystem functions, so the programmer does not need to know anything about the TVirtualFileSystem (in a future release, not even about opening and closing it).
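The mark-as-deleted-and-reuse idea can be sketched as a simple free list (this is my own illustration in Python, not the actual NVFS code):

```python
class ChunkAllocator:
    """Hands out chunk slots in a growing archive, reusing freed slots first."""

    def __init__(self):
        self.next_slot = 0
        self.free = []   # slots marked as deleted, available for reuse

    def alloc(self):
        if self.free:
            return self.free.pop()   # reuse a deleted chunk as soon as possible
        slot = self.next_slot
        self.next_slot += 1          # otherwise grow the archive by one chunk
        return slot

    def delete(self, slot):
        self.free.append(slot)       # just mark as deleted; the file never shrinks

# usage
a = ChunkAllocator()
s0, s1, s2 = a.alloc(), a.alloc(), a.alloc()
a.delete(s1)
s3 = a.alloc()   # gets the freed slot back instead of growing the file
```

The file only ever grows when the free list is empty, which matches the "offer them for reuse asap" behaviour described above.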

    http://www.noeska.net/projects/nvfs/
    http://3das.noeska.com - create adventure games without programming

  3. #13

    Re: Problems with a chunk-based file

    I haven't really read it all properly, ize, but thanks for your ideas and I'll analyze them for sure.

I don't think databases use fixed-size records, because how would you store strings in such records? Of course, if you want to use short strings, it's not a problem at all. But I don't think any user would be satisfied with a database that allows only 255-character strings.

Thanks, noeska, for the link; I'll take a look at that project of yours.

    Another thing I was thinking of is to use the following structure:

Any ideas on how to implement something like that? I can't really understand how it's possible to maintain such a structure - and I DO know it's possible, as it's what SQLite uses, and it does its job really fast...

  4. #14

    Re: Problems with a chunk-based file

Are you sure SQLite uses that structure? An SQLite file seems to grow as more data is put inside. This method can only be used if you fix the total archive file size in advance.

On variable record sizes:

- one record per chunk
A chunk can contain (for example) 600 bytes. When you write a record into a chunk and that record's total size is 400 bytes, you have 200 bytes left that are wasted, i.e. the next record will be put in a new chunk. An 800-byte record would not fit in a 600-byte chunk, so it would be split over 2 chunks, wasting 400 bytes in the second chunk.

| c1     | c2     | c3
| r1 | ws | r2     | r2 | ws
- multiple records per chunk
A chunk can contain (for example) 600 bytes. When you write a 400-byte record into a chunk, you need to put its length in front of it. Now the next record can be written in the first chunk right after the first 400 bytes + size (e.g. 2 bytes), and it continues in subsequent chunks.

| c1       | c2     | c3
| r1   | r2 | r2     | r2 | fs

Also, a list of records, e.g. a table, can be considered one file stream with
(size)(record)(size)(record)
When a record is added, a new chunk is added as needed. When deleting a record, you have to work out which chunks are involved.
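The length-prefixed (size)(record) stream split over fixed-size chunks can be sketched like this (my own Python illustration of the second scheme; the 600-byte chunk size is the example from the post):

```python
import struct

CHUNK = 600   # example chunk size from the post

def pack_records(records):
    """Serialize records as (2-byte size)(payload)... and split into chunks.
    Records are allowed to span chunk boundaries, so no space is wasted."""
    stream = b"".join(struct.pack("<H", len(r)) + r for r in records)
    return [stream[i:i + CHUNK] for i in range(0, len(stream), CHUNK)]

def unpack_records(chunks):
    """Rejoin the chunks and walk the size prefixes to recover the records."""
    stream = b"".join(chunks)
    records, pos = [], 0
    while pos + 2 <= len(stream):
        (size,) = struct.unpack_from("<H", stream, pos)
        pos += 2
        records.append(stream[pos:pos + size])
        pos += size
    return records

# usage: a 400-byte, an 800-byte and a 100-byte record
recs = [b"a" * 400, b"b" * 800, b"c" * 100]
chunks = pack_records(recs)
```

Here the 800-byte record simply continues into the next chunk, so only the tail of the last chunk is short; contrast that with the one-record-per-chunk scheme, which would waste space inside every chunk.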
    http://3das.noeska.com - create adventure games without programming

  5. #15

    Re: Problems with a chunk-based file

    Yep, as you can read in the source:
    Code:
**   |----------------|
**   | file header    |   100 bytes.  Page 1 only.
**   |----------------|
**   | page header    |   8 bytes for leaves.  12 bytes for interior nodes
**   |----------------|
**   | cell pointer   |   |  2 bytes per cell.  Sorted order.
**   | array          |   |  Grows downward
**   |                |   v
**   |----------------|
**   | unallocated    |
**   | space          |
**   |----------------|   ^  Grows upwards
**   | cell content   |   |  Arbitrary order interspersed with freeblocks,
**   | area           |   |  and free space fragments.
**   |----------------|
    Anyway, why do you think it's only possible when the file is fixed-size? How would that make any difference?
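The slotted-page layout quoted above can be sketched as a toy in Python (this is my own simplified illustration, not SQLite's actual implementation; the sizes are arbitrary):

```python
class SlottedPage:
    """Fixed-size page: the slot pointer array grows downward from the
    header, cell content grows upward from the end of the page, and the
    gap in between is the page's free space."""

    def __init__(self, size=4096, header=8):
        self.data = bytearray(size)
        self.header = header
        self.slots = []            # offsets of cells; 2 bytes each on disk
        self.content_start = size  # top of the cell content area

    def free_space(self):
        # gap between the end of the pointer array and the start of content
        return self.content_start - (self.header + 2 * len(self.slots))

    def add_cell(self, cell):
        if len(cell) + 2 > self.free_space():
            raise ValueError("page full")
        self.content_start -= len(cell)              # content grows upward
        self.data[self.content_start:self.content_start + len(cell)] = cell
        self.slots.append(self.content_start)        # pointers grow downward

    def get_cell(self, i, length):
        off = self.slots[i]
        return bytes(self.data[off:off + length])

# usage
p = SlottedPage(size=128, header=8)
p.add_cell(b"hello")
p.add_cell(b"world")
```

Note that the page itself is fixed-size, but the *file* is not: when a page fills up, the database appends another page, which is why an SQLite file still grows.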

  6. #16

    Re: Problems with a chunk-based file

In my opinion, free space in the middle can only exist in a fixed-size file; otherwise it would not be free space, would it?

BTW, I extended my previous post with info on variable record sizes...

  7. #17

    Re: Problems with a chunk-based file

Yep, the thing about having free space in the middle probably isn't right, but there's still a problem: how do you prevent two different kinds of data from overwriting each other as they grow?

Yep, the idea you proposed is good. AFAIK it's not what they use on hard disks, as one sector = one file.

  8. #18

    Re: Problems with a chunk-based file

    Have a read on my second example:

    Quote Originally Posted by noeska
- multiple records per chunk
A chunk can contain (for example) 600 bytes. When you write a 400-byte record into a chunk, you need to put its length in front of it. Now the next record can be written in the first chunk right after the first 400 bytes + size (e.g. 2 bytes), and it continues in subsequent chunks.

| c1       | c2     | c3
| r1   | r2 | r2     | r2 | fs

Also, a list of records, e.g. a table, can be considered one file stream with
(size)(record)(size)(record)
When a record is added, a new chunk is added as needed. When deleting a record, you have to work out which chunks are involved.
    This has multiple records per chunk.

  9. #19

    Re: Problems with a chunk-based file

    Could you not store the record structure inside a stream/file that is also stored in the blocks. E.g. the record structure is just another file inside the archive. No interference that way.
That is an interesting idea, but I would implement it in a slightly different way. If you use the existing file system to store file-system data, you will not be able to retrieve that data, because it's stored in the system itself, and you need the meta-data to tell you in which file the meta-data is stored. It's an endless circle, isn't it?

I think it's better to take another approach. Think of it as a kind of encapsulation. The topmost layer of the system describes a file that contains a header and a certain amount of data blocks. The header tells us how many blocks there are, what kind of blocks they are and which block type has which size.
    You could introduce block-types like: "FileData" or "FileHeader".

    Each block looks like this:

    [BlockIdentifier - 4 bytes]
    [BlockType - 4 bytes]
    [BlockData - Size depends on BlockType]

Now, the next layer contains the actual file system (the file data and file headers). This data is stored inside blocks as described above. So, for example, you could have these two blocks:

    Code:
    //A block containing a fileheader
    
    BlockIdentifier = 2356
    BlockType   = btFileHeader
    BlockData: [
     FileID = 12
     FileName = 'MyFile'
     FileSize = 1254
     .. The rest of your fileheader ... ]
    
    //A block containing file-data
    
    BlockIdentifier = 2357
    BlockType   = btFileData
    BlockData: [
     //DataBlock Header
     FileBlockID = 43
     BelongsToFile = 12
     UsedSize = 1254
     //Data here
     //.... 
     //.... ]
You are simply encapsulating your file system into another block system. Doing it this way allows you to keep adding data to the end of your file without having to move data around to allow parts of the file to grow. This system allows fragmentation and therefore calls for a defragmentation method. You also need to scan all blocks at startup to be able to enumerate all files that exist inside the VFS.
If the added complexity and work are no problem for you, you could try this method.

And if you are that concerned with speed, I would suggest sticking with fixed-size records. That comes at a price, as your system will consume more HDD space, but that's life. You can't have it all, and you just have to walk the right road instead of trying to walk all of them.
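The append-and-scan behaviour of this block system can be sketched in Python (my own illustration of the idea above; unlike the pseudocode, I add an explicit payload length to each block header so the scanner can skip over blocks without knowing every block type):

```python
import struct
from io import BytesIO

# hypothetical block type tags, mirroring btFileHeader/btFileData above
BT_FILEHEADER, BT_FILEDATA = 1, 2

BLOCK_HDR = struct.Struct("<III")   # block id, block type, payload length

def append_block(f, block_id, block_type, payload):
    # New data always goes at the end; nothing is ever moved around.
    f.seek(0, 2)
    f.write(BLOCK_HDR.pack(block_id, block_type, len(payload)))
    f.write(payload)

def scan_blocks(f):
    """Startup scan: walk every block header to enumerate the VFS contents."""
    f.seek(0)
    blocks = []
    while True:
        hdr = f.read(BLOCK_HDR.size)
        if len(hdr) < BLOCK_HDR.size:
            break
        block_id, block_type, size = BLOCK_HDR.unpack(hdr)
        blocks.append((block_id, block_type, f.read(size)))
    return blocks

# usage, echoing the two example blocks above
vfs = BytesIO()
append_block(vfs, 2356, BT_FILEHEADER, b"MyFile")
append_block(vfs, 2357, BT_FILEDATA, b"some file bytes")
found = scan_blocks(vfs)
```

A file's data blocks can end up scattered anywhere in the file, which is exactly the fragmentation the post warns about and why a compacting pass is eventually needed.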
    Coders rule nr 1: Face ur bugz.. dont cage them with code, kill'em with ur cursor.

  10. #20

    Re: Problems with a chunk-based file

    Quote Originally Posted by Brainer
    I don't think databases use fixed-sized records, because how would you store strings in such records? Of course, if you want to use short strings, it's not a problem at all. But I don't think any user would be satisfied with a database that allows only 255-chars long strings.
I know Microsoft Access database records are fixed-size. When designing a table, you define the sizes of the fields. For strings you can have a Text field, which is up to 255 chars, or a Memo, which is up to 65,535 chars, although I'm not sure how it stores the data.

