Use video RAM as a swap disk?


Crush:
I know it's faster to write bigger chunks - that's the principle behind write caching. The problem is: what if the structures/objects consist of very many small data types and you have a huge number of them to write?

Example:

struct fileobject
{
  char type;
  char attributes;
  UINT modificationdate;
  UINT flags;
  UINT strlen;             // length of filename
  std::string filename;
} fo;

Normally you define the output in a class and write each member of the structure one after the other. That's also how archive classes serialize objects, similar to this:
void Fileobject::Write(CFile& out)
{
  // one Write call per member - lots of small transfers
  out.Write(&type, sizeof(type));
  out.Write(&attributes, sizeof(attributes));
  out.Write(&modificationdate, sizeof(modificationdate));
  out.Write(&flags, sizeof(flags));
  out.Write(&strlen, sizeof(strlen));
  out.Write(filename.data(), strlen);   // write the characters, not the string object
}

Often I have file structures with several hundred thousand objects or more (especially on HDs). The only way I could imagine saving this through the normal filesystem is to write the fixed-size part of the structure clustered like this:

outputFile.Write((char*)&fo, (char*)&fo.filename - (char*)&fo);   // fixed-size members in one chunk
outputFile.Write(fo.filename.data(), fo.strlen);                  // then the variable-length name

It's possible and much faster than writing byte by byte (that was only an example to show the system's weaknesses), but it's not a very nice way to save data, is it? One problem remains: the number of write calls still depends on the layout of the structure, and caching is still much faster than serializing member by member. I first did it this way and wondered why some of my runs sometimes needed 10 seconds or more just to write a few megabytes of data. Often, building up the structure in memory by reading the directory structure of the partitions took less time (while scanning the directories the internal XP cache was working very fast). That's what got me into benchmarking and thinking about I/O speeds.

As I said before: if you also have to seek() after each object to some other position to write the complete object size, then seek back to the end, it turns a slow donkey into an even slower turtle. Unfortunately, that's exactly what I'm forced to do because of some features that need it!
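
For illustration only, here is a minimal sketch of that length-prefix pattern (not Crush's actual Quickfile code, and using std::fstream instead of CFile): a placeholder size is written, the record follows, and the stream seeks back to patch in the real size before continuing.

#include <cstdint>
#include <fstream>
#include <string>

void writeRecordWithLengthPrefix(std::fstream& out, const std::string& payload)
{
  std::uint32_t size = 0;
  std::streampos sizePos = out.tellp();                            // where the prefix goes
  out.write(reinterpret_cast<const char*>(&size), sizeof(size));   // placeholder
  out.write(payload.data(), payload.size());                       // the record body

  std::streampos endPos = out.tellp();
  std::streamoff bodyBytes = endPos - sizePos - static_cast<std::streamoff>(sizeof(size));
  size = static_cast<std::uint32_t>(bodyBytes);
  out.seekp(sizePos);
  out.write(reinterpret_cast<const char*>(&size), sizeof(size));   // patch the real size
  out.seekp(endPos);                                               // back to the end for the next record
}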

f0dder:
If you only have simple POD types, you can serialize it all in one go if you stuff it in a struct (which comes naturally if you use the pImpl idiom) - of course there are some potential portability issues with doing this, and it won't work for non-POD types...
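
A hedged sketch of what that could look like (type and field names are made up, not from the thread): with only fixed-size members, the whole record goes out in a single write, and the variable-length name is handled separately.

#include <cstdint>
#include <cstdio>

#pragma pack(push, 1)          // make the on-disk layout explicit (no padding)
struct FileRecord              // POD: safe to write byte-for-byte
{
  std::uint8_t  type;
  std::uint8_t  attributes;
  std::uint32_t modificationDate;
  std::uint32_t flags;
  std::uint32_t nameLength;    // the filename itself follows as a separate write
};
#pragma pack(pop)

bool writeRecord(std::FILE* out, const FileRecord& rec, const char* name)
{
  return std::fwrite(&rec, sizeof(rec), 1, out) == 1
      && std::fwrite(name, 1, rec.nameLength, out) == rec.nameLength;
}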

But even if you can write it that way and don't have to resort to member-by-member serialization, you should definitely use a write cache to minimize user<>kernel mode transitions.

Having to seek back and forth sounds bad - can't you rearrange your data structures to avoid it? Like, instead of storing the variable-length strings inline, split those off into a string table and simply use an index or offset integer in the file struct...
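
A rough sketch of that idea (all names hypothetical): the records stay fixed-size and point into one big string blob, so the record array can be read and written in bulk and the names are resolved on demand.

#include <cstdint>
#include <string>

struct IndexedRecord           // fixed size: a whole array can be written in one chunk
{
  std::uint32_t nameOffset;    // offset into StringTable::blob
  std::uint32_t nameLength;
  std::uint32_t modificationDate;
  std::uint32_t flags;
};

struct StringTable
{
  std::string blob;                             // all filenames stored back to back

  std::uint32_t add(const std::string& name)    // returns the offset of the stored name
  {
    std::uint32_t offset = static_cast<std::uint32_t>(blob.size());
    blob += name;
    return offset;
  }

  std::string get(const IndexedRecord& rec) const
  {
    return blob.substr(rec.nameOffset, rec.nameLength);
  }
};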

Crush:
It's not possible to reference strings at another position (that would lead to extensive file seeking) - I want to be able to jump directly to the files and entries by offset, to skip unneeded data and only access it when I want to analyze/visualize it, for various conceptual reasons. My Quickfile class prevents unnecessary slowdowns in the future - that's enough for my needs. I only wondered why this performance problem isn't mentioned more often by others who handle large amounts of data. The seeking is needed for each data block, to write its length in front of it (which also makes skipping sets faster) and to read single blocks directly from the file into memory without useless data. Calculating the size before writing the block is nearly impossible, because additional information can be woven into the blocks by plugins. Writing the block size also helps to rearrange/insert/cut/rebuild files faster when they change. The strings are the main search & sort criteria, so they shouldn't be moved out into additional files or blocks; that avoids lots of open/close/seek calls.

f0dder:
My idea was to keep the full string table in memory; that way you wouldn't have to do any additional seek/read work. This could take a fair amount of memory, but that might be a decent sacrifice for a lot of speed, depending on what the program is for :)

You could also keep the string table on disk and only have {offset, length} pairs in memory, and access the table through a memory-mapped file - that puts you at the mercy of the Windows filesystem cache and is only really suitable when you need read-only access to the strings, but it would use less memory permanently.
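
As an illustration only (minimal error handling, invented names): the string-table file is mapped read-only, and each record's {offset, length} pair then indexes straight into the mapped view, letting the Windows cache do the paging.

#include <windows.h>

// Map the string-table file read-only; returns the base pointer (or nullptr on failure).
const char* MapStringTable(const wchar_t* path, HANDLE& file, HANDLE& mapping)
{
  file = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                     OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
  if (file == INVALID_HANDLE_VALUE)
    return nullptr;

  mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
  if (mapping == nullptr)
    return nullptr;

  // A record's {offset, length} pair indexes directly into this view.
  return static_cast<const char*>(MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0));
}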

Hm, you say plugins can write additional data... unless a plugin can write substantial amounts of data (several megabytes), I would make serialization write to a memory stream before even going to a file class; that way you can easily calculate sizes and write out big chunks without seeking back and forth.
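
Roughly like this (a sketch, not an existing class from the thread): serialization targets a growable memory buffer, so the block size is known before anything touches the file and the length prefix can be written up front, with no seek-back.

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

class MemoryStream
{
public:
  void write(const void* data, std::size_t size)
  {
    const char* p = static_cast<const char*>(data);
    buffer_.insert(buffer_.end(), p, p + size);        // grow the in-memory block
  }

  // Write the finished block to the file in one go, preceded by its size.
  void flushTo(std::ofstream& out)
  {
    std::uint32_t size = static_cast<std::uint32_t>(buffer_.size());
    out.write(reinterpret_cast<const char*>(&size), sizeof(size));
    out.write(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
    buffer_.clear();
  }

private:
  std::vector<char> buffer_;
};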

How often is plugin data used compared to the "main structure" information? Fixed-size records are nice, you could probably keep the entire "basic" file information set completely in memory, and move plugin data chunks to another file. That would make the basic information very fast & easy to deal with...

Are you doing a file indexer, or something else? What's the typical usage scenario? How often are writes/updates done compared to reads? What kind of file numbers are you dealing with?

I really like brainstorming these kinds of optimization scenarios :)

I only wondered why this performance problem isn't mentioned more often by others who handle large amounts of data.
-Crush
--- End quote ---
Some people care, other people don't... I know that some console developers care a lot :), including extensive logging in their bigfile code, so they can trace usage and read patterns, and re-organize data in the bigfiles according to this (makes a lot of difference when you're dealing with the über-slow seeks on optical media). Also means pondering a lot about data structures, finding ways to be able to read them more-or-less directly (with simple post-read fixups) instead of slow & inefficient member-by-member serializing...

There was an article on a gamedev forum by the developers from... I think it was the Commandos game, good read anyway.

Crush:
You're right! I'm working on a high-speed file indexer for extremely large amounts of data, especially for really big and fast media such as the FusionIO! One of its features is the ability to choose how it handles the information: seeking the data in small pieces on disk and remembering their positions for the results, keeping the data cached in memory, or pulling the real data from the references together into memory for further work, so you can decide the trade-off between speed, memory usage and working style yourself. This way it is possible to search and browse through millions of results with rather little memory.

Related to this, I made these two threads:
https://www.donationcoder.com/forum/index.php?topic=7764.0
https://www.donationcoder.com/forum/index.php?topic=7183.0 // info about the cataloger itself. The result list is quite old - I've since more than doubled the speed!

The plugins should be able to write as much data as they like, but only information that can be searched for should go into the main search base; everything else goes into separate datasets (files). They are called only as often as the file type triggers them. Some plugins should also be able to add temporary data from the internet to the results and the database.

I personally want to index CD/DVD/HD/network/FTP and/or HTTP (like a web spider).

The number of files in a single dataset can reach several hundred thousand entries; big networks could deliver several million.

I really like brainstorming these kinds of optimization scenarios
--- End quote ---
I like that too, and I've racked my brain over all the possible optimizations that could be done. Most of the things I found forced me to compromise on something with advantages & disadvantages in several respects - I had to weigh them against each other.
