Interesting questions for some. Non-trivial.
What do you intend to use the data (sums and counts) for? Does it matter how accurate it is?
I quite like the idea of being able to count or "stocktake" things like this.
It seems to be a classic accounting problem, but I don't have any experience or knowledge of how it might be done in this specific case. I suppose estimation could be a pragmatic approach - rather than actual physical counting, I mean.
Off the top of my head (so apologies if this seems a bit rushed):
Document files:
Depending on the accuracy required, I think it might be useful - if not necessary - for document files to have some definition.
- to define what is meant by the unit "page" (e.g., A4, Legal, A2, A3, etc.) - so storage unit size would be defined.
- to establish what languages/alphabets you will have in those documents (different alphabet systems may have different packing densities).
- to define what font and point-size you are assuming is used - so density per page could be a concept.
- to define average word-length.
- to define what max, min and average word density would be estimated for the classification of a page-unit. (e.g., do you want to call something with only 5 words on it "A page"?)
- to establish how to cope with pictures (images) in a document, and whether they cover a part of a page (and how much) or a whole page, have captions, headers, etc.
- to establish how to cope with handwriting in a document.
- to establish how to cope with documents (e.g., PDF or Word files) which have no actual text but only images of pages with words on them (this could imply the need for OCRing the documents).
- to decide whether you need right-to-left or left-to-right reading/parsing, or both.
- to decide whether you have landscape or portrait oriented pages, or both.
- to decide how to treat blank pages (e.g., whether to estimate how often they occur and whether they count as pages).
Then you might need to have (say) a function to define the typical density of words, by page.
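As a rough illustration of that kind of function, here is a minimal sketch in Python. The names (estimate_pages, estimate_total_pages), the default density of 500 words per page and the minimum-words threshold are all my own assumptions for illustration; you would substitute whatever definitions you settle on from the list above.

```python
import math


def estimate_pages(word_count, words_per_page=500, min_words_for_page=1):
    """Estimate the number of page-units for one document.

    words_per_page is an assumed average density, not a measured value.
    min_words_for_page is the assumed cut-off below which a document
    is not counted as a page at all (e.g., a near-blank sheet).
    """
    if words_per_page <= 0:
        raise ValueError("words_per_page must be positive")
    if word_count < min_words_for_page:
        return 0
    # Round up: a partly filled last page still counts as a page.
    return math.ceil(word_count / words_per_page)


def estimate_total_pages(word_counts, words_per_page=500, min_words_for_page=1):
    """Sum the page estimates over a collection of per-document word counts."""
    return sum(
        estimate_pages(n, words_per_page, min_words_for_page)
        for n in word_counts
    )
```

For example, estimate_total_pages([1200, 3, 500], words_per_page=500, min_words_for_page=5) would treat the 3-word document as not being a page and round the 1200-word document up to three pages. The estimate is only as good as the density figure you feed it, so it may be worth sampling a few real documents first.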
Physical paper pages could be various sizes, but I suspect you'd have to define a normative/standard size.
Audio files:
Not really sure about these.
You should be able to use the standard tags of some formats (e.g., MP3) to get duration (time). I'm not sure, but that might even be a file property for audio files - if so, then Windows Explorer would presumably be able to display it as a column in details view, same as file "Comments".
Video files:
Not sure at all about these.
Do they use standard tags for things like duration (time)? (I don't know.)
You might like to ask the question over at Quantified Self, where they have been looking at similarly knotty problems - e.g., "Effect of One-Legged Standing on Sleep".
Mind you, I reckon some of their theories haven't got a leg to stand on.