I have something like the following in mind - 48 bytes on an x86-32 machine with default padding, 72 bytes for x86-64. That means you could keep the main structures in memory for 1 million files and only use 48 megabytes of memory - not bad for speed.
And since it's a fixed-size structure, it would be very easy and efficient to do read caching and only keep part of the structure in memory. Having a stringtable index for both the "real" path/file string and its uppercase version means that if you're searching for a non-wildcard string or checking whether a file already exists, you're only doing an integer comparison instead of a (relatively) slow string compare. Here's my idea of a FileObject structure:
struct FileObject
{
    uint32   strPath;     // stringtable index of pathname (containing folder)
    uint32   strPathUp;   // ditto, uppercase version
    uint32   strName;     // stringtable index of filename
    uint32   strNameUp;   // ditto, uppercase version
    uint64   fileSize;
    uint32   fileFlags;   // flags & attributes
    FILETIME timeCreate;  // file creation time
    FILETIME timeUpdate;  // file last update time
    uint32   blobIndex;   // "blob" table index (plugin data)
};
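Since the records are fixed-size, pulling a single FileObject off disk is just a seek to a computed offset plus one read, which is what makes partial loading and caching cheap. A minimal sketch (C++, plain stdio) under a couple of assumptions: the records are written back-to-back after a header of kHeaderSize bytes (an illustrative constant, not part of the design above), and the struct's on-disk layout matches its in-memory layout (real code would pin the padding or serialize field by field).

#include <cstdio>
#include <cstdint>

const long kHeaderSize = 64;   // assumed size of the file header preceding the record array

bool loadFileObject(FILE* f, uint32_t index, FileObject* out)
{
    // fixed record size => the offset of record i is directly computable
    long offset = kHeaderSize + (long)index * (long)sizeof(FileObject);
    if (fseek(f, offset, SEEK_SET) != 0)
        return false;
    return fread(out, sizeof(FileObject), 1, f) == 1;
}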
The string table requires some more explanation to see why it's a good idea. Each unique string is stored only once, so identically named files share the same index (fast compares, as mentioned earlier). The string data can be cached as well, so you don't have to keep it all in memory. String indexes are static (i.e., adding a new string doesn't cause other string indexes to change) - but at the same time we really want the strings to be sorted so lookups are fast. We want to be able to use binary searches.
So, the string table consists of a "string index table" that maps the string indexes used in the FileObject to the sorted indexes (which will change when a string is added). Other than that, the string table consists of {u32 size, u32 filepos} pairs (or u64 if you want to support really huge systems). So you can keep the entire string-index table in memory, or (again, since it's a fixed-size structure) you can do fast & efficient lookups - the index table as well as the strings themselves can be cached.
Strings and uppercase strings could be in two separate tables, or share one table.
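To make the indirection concrete, here's a rough sketch of the in-memory side with my own (purely illustrative) names: stringIndex maps the stable index stored in a FileObject to the current sorted position, and sortedEntries holds the fixed-size {size, filepos} pairs in sorted order of the strings they point to (treat it as the uppercase table, or one of two such tables).

#include <cstdint>
#include <vector>

struct StringEntry
{
    uint32_t size;     // length of the string data in bytes
    uint32_t filePos;  // file position of the string data (u64 for really huge systems)
};

// stable index (as stored in FileObject) -> current sorted position; only this
// table changes when a new string is inserted, the FileObjects stay untouched
std::vector<uint32_t> stringIndex;
// fixed-size entries, kept in sorted order of the strings they reference
std::vector<StringEntry> sortedEntries;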
So, if you're doing a non-wildcard search, you start by uppercasing the search string, and then check if it's present in the uppercase string table. If not, you can abort your search right away. If it is present, the lookup gives you an integer for your comparisons, and you can then do an ultra-fast binary search for the file. Wildcard searches require somewhat more work, but they're still not that bad if you implement caching properly.
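A rough sketch of that lookup, reusing stringIndex/sortedEntries from the sketch above and assuming two helpers that aren't shown: toUpper() (the same uppercasing used to build the table) and loadString(), which reads - or returns a cached copy of - the string behind a StringEntry. One binary search over the uppercase table, then integer-only compares per file (a linear scan here for brevity; with the files kept sorted by name you'd binary-search them too, as described above).

#include <cstdio>
#include <string>
#include <vector>

bool fileNameExists(FILE* f, const std::string& searchName,
                    const std::vector<FileObject>& files)
{
    std::string nameUp = toUpper(searchName);                        // assumed helper

    // binary search the sorted uppercase string entries
    long lo = 0, hi = (long)sortedEntries.size() - 1, sortedPos = -1;
    while (lo <= hi)
    {
        long mid = lo + (hi - lo) / 2;
        int cmp = loadString(f, sortedEntries[mid]).compare(nameUp); // assumed helper
        if (cmp == 0) { sortedPos = mid; break; }
        if (cmp < 0) lo = mid + 1; else hi = mid - 1;
    }
    if (sortedPos < 0)
        return false;   // search string isn't in the table at all: abort right away

    // from here on it's integer compares only - no per-file string work
    for (const FileObject& fo : files)
        if ((long)stringIndex[fo.strNameUp] == sortedPos)
            return true;
    return false;
}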
The blob table would take some pondering, since it has more variable-length stuff. I'm a big fan of "top-level fixed-size structures" because they make things easier and more efficient, at the cost of some additional indirection...
struct BlobIndex
{
    u32 blobCount;   // *number* of blob items for this index
    u64 fileOffset;  // file position of the BlobItem list
};

struct BlobItem
{
    GUID blobType;   // blob type - I prefer GUIDs to ints
    u32  blobSize;   // size of the blob item
    u64  fileOffset; // file position of the blob data
};
Again, the main structure (the array of BlobIndexes) is small and you can keep a lot of them in memory without problems, and it's easy to cache as well because of the fixed size. It's also fairly fast to get the list of blob items, since it's just a seek to bidx.fileOffset and a read of blobCount*sizeof(BlobItem) bytes.
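A sketch of that read, assuming the BlobItem array is written packed at bidx.fileOffset; real code would pin the struct layout (e.g. pragma pack) and use a 64-bit seek such as _fseeki64/fseeko instead of the narrowing cast below.

#include <cstdio>
#include <vector>

std::vector<BlobItem> loadBlobItems(FILE* f, const BlobIndex& bidx)
{
    std::vector<BlobItem> items(bidx.blobCount);
    // one seek + one read fetches the whole list
    if (fseek(f, (long)bidx.fileOffset, SEEK_SET) != 0 ||
        fread(items.data(), sizeof(BlobItem), bidx.blobCount, f) != bidx.blobCount)
        items.clear();   // treat a short read as failure
    return items;
}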
With this approach, and proper serialization, you can eliminate the "write-seekback-updatewrite-seekend" pattern entirely, and you don't have to do a "read-seek" loop just to get your lists, etc.
Even for hundreds of thousands of files you could keep the entire main structures in memory without being too much of a memory hog (speed often comes at a size penalty, though), but it would still be very easy to do caching if your class layout is decent.
It would be relatively trivial to code as well; I think the "worst" code to write would be updating the string-index table when a new string is inserted into the sorted order, but that's really not too bad. Binary search of strings, plus string comparison by integer check, gives very nice speed properties.
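For what it's worth, a sketch of that update, again reusing the tables and loadString() helper from the string-table sketches. writeStringData() - appending the bytes and returning a StringEntry - is another assumed helper, and a real version would first check whether the string is already present, since strings are deduplicated.

#include <string>

uint32_t addString(FILE* f, const std::string& s)
{
    // binary search for the insertion point in sorted order
    size_t pos = 0, hi = sortedEntries.size();
    while (pos < hi)
    {
        size_t mid = pos + (hi - pos) / 2;
        if (loadString(f, sortedEntries[mid]) < s) pos = mid + 1; else hi = mid;
    }

    // every stable index that points at a sorted position >= pos shifts up by one
    for (uint32_t& sortedPos : stringIndex)
        if (sortedPos >= pos)
            ++sortedPos;

    sortedEntries.insert(sortedEntries.begin() + pos, writeStringData(f, s));  // assumed helper
    stringIndex.push_back((uint32_t)pos);        // new stable index -> its sorted slot
    return (uint32_t)(stringIndex.size() - 1);   // stable index to store in FileObject
}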
Also, with proper class design, it's very easy to do performance/stats collection, which you can use to tune things like cache sizes - either by manual analysis or by adapting at runtime.
If you want full-text searching in files, that could probably be added as a blob item, although I think I'd prefer an additional file for that.