Author Topic: File Size vs. Size on Disk: Why such a difference?  (Read 30711 times)

Deozaan

  • Charter Member
  • Joined in 2006
  • ***
  • Points: 1
  • Posts: 9,768
File Size vs. Size on Disk: Why such a difference?
« on: April 07, 2010, 01:26 PM »
Last night I learned something new that the geek in me found interesting. I learned the difference between SI prefix names and IEC prefix names. The details are summarized on Ubuntu's Units Policy but really the only thing that has to do with this post is that it made me curious about the discrepancy I see when viewing a file's (or folder's) properties and it shows the size and size on disk to be sometimes quite different.

For example, I have several PortableApps on a 2GiB USB drive and I wanted to see how much space they took up. So viewing the PortableApps' folder properties, it shows:

Size: 602 MB (631,544,356 bytes)
Size on disk: 993 MB (1,041,301,504 bytes)

So my question is: Can anyone explain to me why the files take up roughly 65% more space on disk than their actual size? Are they retaining water? Wearing a girdle? Did they have plastic surgery?
« Last Edit: May 07, 2013, 04:57 PM by Deozaan »

wraith808

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 11,188
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #1 on: April 07, 2010, 01:30 PM »
I think it has to do with block size; at least that's what I always attributed it to.

Update: I thought more about it and decided I should expand on my explanation.  The block size is the smallest unit of storage that can be allocated on a drive.  A file smaller than the block size still occupies a whole block, i.e. if you store a 200 byte file but the block size is 1024 bytes, you lose the other 824 bytes because the file has to take a whole 1024 bytes.  Also, since space is allocated in blocks, any file whose size is not an exact multiple of the block size wastes some space in its last block.  That's what I've always attributed the difference to, and looking on Wikipedia at least, it seems to be borne out by how data is written to NAND drives.
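
The rounding described above can be sketched in a few lines of Python. This is only an illustration, assuming a simple file system in which every non-empty file occupies a whole number of blocks; the function name `size_on_disk` is made up for this example.

```python
def size_on_disk(file_size: int, block_size: int) -> int:
    """Round file_size up to the next multiple of block_size.

    Assumes every non-empty file occupies a whole number of blocks
    and ignores any file-system-specific special cases.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    blocks = -(-file_size // block_size)  # ceiling division on integers
    return blocks * block_size


# The 200 byte file from the post, on a drive with 1024 byte blocks.
allocated = size_on_disk(200, 1024)
print(allocated, allocated - 200)  # 1024 824
```

A file of exactly 1024 bytes would waste nothing under this model, while a 1025 byte file would take two blocks and waste 1023 bytes.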
« Last Edit: April 07, 2010, 01:34 PM by wraith808 »

MilesAhead

  • Supporting Member
  • Joined in 2009
  • **
  • Posts: 7,736
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #2 on: April 07, 2010, 03:35 PM »
wraith808 is right.  The file system has a cluster size that's set when the partition is formatted.  Here's some basic NTFS info on cluster size:
http://www.softwaret.../1/NTFS-Cluster-size

Since the last cluster allocated is rarely perfectly filled, on average you waste 1/2 the cluster size per file on that partition.  It's a trade-off. If you use the minimum cluster size you save space, but you lose performance since you will have to expend more resources tracking a greater number of clusters for the storage space used.

Deozaan

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #3 on: April 07, 2010, 04:12 PM »
Is this something defragging can help with, or is it that multiple files cannot occupy parts of the same block?

Also of note, though I doubt this makes a difference, the USB drive in question is formatted as FAT32.

Carol Haynes

  • Waffles for England (patent pending)
  • Global Moderator
  • Joined in 2005
  • *****
  • Posts: 8,068
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #4 on: April 07, 2010, 04:18 PM »
Quote from: Deozaan
For example, I have several PortableApps on a 2GiB USB drive and I wanted to see how much space they took up.

What format is the USB drive in? FAT/FAT32 etc. If it isn't in one of the FAT formats try copying the files to a hard disk and reformatting the USB drive in FAT32 format and see if the file size is different when you copy it back.

It does seem like a very large discrepancy even given the way block sizes are used.

f0dder

  • Charter Honorary Member
  • Joined in 2005
  • ***
  • Posts: 9,153
  • [Well, THAT escalated quickly!]
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #5 on: April 07, 2010, 05:55 PM »
FAT and NTFS use the term "cluster size", and multiple files cannot share clusters. NTFS has a size-optimization feature, though, where really-really small files can be stored along with the filesystem metadata about the file.

I guess you have a lot of really small files on that USB stick? Or a ridiculously large cluster size :)
- carpe noctem

4wd

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 5,644
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #6 on: April 07, 2010, 05:56 PM »
More relevant questions are:

How many files are on your USB drive?
What is the cluster size?

For cluster size, the quickest way to find out is to do a Properties on a file whose size is between 1 and 511 bytes and look at the Size on Disk value.

Then multiply the number of files by the cluster size, which gives an upper bound on the space wasted by partially filled clusters (on average the waste is closer to half that).

With flash based devices though, I would have thought you could format with a 512 byte cluster size and experience no noticeable performance hit.  This would minimise space lost.

Edit: Dang that f0dder!  He jumped through a temporal rift and beat me again!
« Last Edit: April 07, 2010, 06:01 PM by 4wd »

wraith808

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #7 on: April 07, 2010, 08:32 PM »
Quote from: f0dder
FAT and NTFS use the term "cluster size", and multiple files cannot share clusters. NTFS has a size-optimization feature, though, where really-really small files can be stored along with the filesystem metadata about the file.

I guess you have a lot of really small files on that USB stick? Or a ridiculously large cluster size :)

But does that work on flash drives?  I know that fixed drives use that, but I thought that flash drives didn't allow that for speed considerations?

MilesAhead

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #8 on: April 07, 2010, 08:55 PM »
Just open a DOS prompt, change to the USB drive and run
chkdsk

It will show the bytes per allocation unit.
My USB stick is formatted NTFS and shows 4096
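
For anyone who would rather query this from a script, here is a rough Python sketch. It assumes that on Windows the Win32 call `GetDiskFreeSpaceW` reports sectors per cluster and bytes per sector for the volume, and that elsewhere `os.statvfs` reports the file system's fundamental block size in `f_frsize`. The helper name `allocation_unit_size` is invented for this example.

```python
import ctypes
import os


def allocation_unit_size(path: str) -> int:
    """Return the allocation unit size in bytes for the volume holding path."""
    if os.name == "nt":
        # GetDiskFreeSpaceW expects the volume root, e.g. "E:\\".
        root = os.path.splitdrive(os.path.abspath(path))[0] + "\\"
        sectors_per_cluster = ctypes.c_uint32(0)
        bytes_per_sector = ctypes.c_uint32(0)
        free_clusters = ctypes.c_uint32(0)
        total_clusters = ctypes.c_uint32(0)
        ok = ctypes.windll.kernel32.GetDiskFreeSpaceW(
            ctypes.c_wchar_p(root),
            ctypes.byref(sectors_per_cluster),
            ctypes.byref(bytes_per_sector),
            ctypes.byref(free_clusters),
            ctypes.byref(total_clusters),
        )
        if not ok:
            raise ctypes.WinError()
        return sectors_per_cluster.value * bytes_per_sector.value
    # On POSIX systems, f_frsize is the fundamental block size.
    return os.statvfs(path).f_frsize


print(allocation_unit_size("."))
```

Run against a path on the drive in question, the result should agree with the bytes per allocation unit figure that chkdsk prints.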

xtabber

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 618
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #9 on: April 07, 2010, 09:11 PM »
Don't know about Linux, but in Windows, you can use TreeSize from JAM Software to find out exactly how much space is used/wasted by files on any FAT or NTFS formatted drive and how that would change under different cluster sizes. TreeSize Pro has long been one of my most often used utilities (it does a lot more than that), but I think the free version will give you the information you want.

TreeSize Free is at http://www.jam-softw...freeware/index.shtml .
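
The "what would this folder cost under a different cluster size" idea can also be approximated with a short Python script. This is a simplified sketch and not how TreeSize itself works: it assumes every file is rounded up to whole clusters and ignores directories and file system metadata. The function name `tree_usage` is made up for this example.

```python
import os


def tree_usage(root: str, cluster_size: int) -> tuple[int, int]:
    """Return (total file bytes, estimated bytes allocated) under root.

    Each file is rounded up to a whole number of clusters. Directory
    entries and file system metadata are ignored.
    """
    total = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # skip files that vanish or cannot be read
            total += size
            allocated += -(-size // cluster_size) * cluster_size
    return total, allocated


if __name__ == "__main__":
    # Compare a few hypothetical cluster sizes for the current folder.
    for cluster in (512, 4096, 32768):
        size, on_disk = tree_usage(".", cluster)
        print(f"{cluster:>6} byte clusters: {size:,} bytes -> {on_disk:,} on disk")
```

Pointing `tree_usage` at the root of the USB drive instead of `"."` gives a rough before-and-after picture for a reformat.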

Deozaan

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #10 on: April 07, 2010, 10:59 PM »
Ok. The drive in question is actually a 2 GiB microSD card in a USB adapter.

It's formatted as FAT with a 32 KiB cluster size (32,768 bytes in each allocation unit).

The particular folder I used in my example contains 16,047 files and 2,009 folders.

So 631,544,356 bytes divided by 16,047 files means the average file size is about 39,356 bytes. Which means a full 32 KiB per file plus 6,587 bytes "spill over" into the next 32 KiB cluster.

But actually it's worse than that, since the spill-over would still only add up to approximately 105 MB extra, but the size on disk is actually almost 400 MB extra.

So just doing a quick browse of the contents of the drive, I see lots of files that are only a few hundred bytes or just a couple KiB.
« Last Edit: April 08, 2010, 11:31 PM by Deozaan »

mwb1100

  • Supporting Member
  • Joined in 2006
  • **
  • Posts: 1,645
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #11 on: April 08, 2010, 12:39 AM »
Realize that each file will fill all clusters it uses except the last cluster (which might be full, but generally won't be).  If we assume that each file will use on average only half of the last cluster, we come up with the following guesstimate for 'wasted space'

    32768/2 * 16047 = 262,914,048 wasted bytes

Still not accounting for all the waste that you're apparently seeing, but I'd suspect that the true numbers skew toward more files that use far less than a cluster (I think of the files that use only a single cluster, there are far more that use less than half than those that use more than half).

Also, this doesn't account for the space wasted by directories (which, if I recall, are allocated clusters similarly to files, except for the root directory), which would bring us up to about 300 MB of wasted space.

This analysis hasn't accounted for all the waste you're seeing, but we're less than a factor of 2 away...
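
As a quick check of the guesstimate, here is the same arithmetic in Python using the numbers posted in this thread. The half-cluster-per-folder figure is an assumption added for illustration; the post only says directories are allocated clusters in a similar way to files.

```python
CLUSTER_SIZE = 32 * 1024   # 32 KiB, as reported for this card
FILES = 16_047
FOLDERS = 2_009

# Heuristic from the post: each file wastes half a cluster on average.
file_waste = CLUSTER_SIZE // 2 * FILES

# Assumption (not from the post): each folder also wastes half a cluster.
folder_waste = CLUSTER_SIZE // 2 * FOLDERS

estimated = file_waste + folder_waste
observed = 1_041_301_504 - 631_544_356  # size on disk minus size

print(f"{file_waste:,}")   # 262,914,048
print(f"{estimated:,}")    # 295,829,504
print(f"{observed:,}")     # 409,757,148
```

The gap between the estimate and the observed figure is consistent with the suspicion that most files on the card are much smaller than half a cluster.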

Stoic Joker

  • Honorary Member
  • Joined in 2008
  • **
  • Posts: 6,649
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #12 on: April 08, 2010, 05:54 AM »
If you use the Oformat utility from a Win2k/XP setup disk (should be in tools - iirc), you can format the disk as FAT/FAT32 with (almost) any cluster size you want/need.

f0dder

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #13 on: April 08, 2010, 10:16 AM »
32kb is a pretty large cluster size - but I guess it might have been set to the flash erase-block size?
- carpe noctem

J-Mac

  • Supporting Member
  • Joined in 2007
  • **
  • Posts: 2,918
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #14 on: April 08, 2010, 01:22 PM »
I always wondered the same thing as the original post and finally learned it here, like most of my computer education. Anyway, I looked at very small files on my internal drives and also on a 1 TB external USB drive: all have a "Size on disk" of 4.00 KB, so I guess that means everything on my system is currently formatted with a cluster size of 4096 bytes. (All NTFS).

Thank you.

Jim

Deozaan

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #15 on: April 08, 2010, 11:25 PM »
Quote from: f0dder
32kb is a pretty large cluster size - but I guess it might have been set to the flash erase-block size?

Yeah, you're right. Somehow I had been thinking that the 4096 bytes I've seen for NTFS was actually 4 MiB, so I was thinking that 32 KiB was relatively small.

I may have used some Panasonic SD card formatting software on the thing as I'd heard that using Windows formatter could cause problems with microSD cards.

I honestly don't know much about microSD and flash memory. Do you think it would be wise to reformat the card to FAT32 with a 4096 byte cluster size? Or could that cause problems?

Deozaan

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #16 on: April 08, 2010, 11:44 PM »
Quote from: mwb1100
Realize that each file will fill all clusters it uses except the last cluster (which might be full, but generally won't be).  If we assume that each file will use on average only half of the last cluster, we come up with the following guesstimate for 'wasted space'

    32768/2 * 16047 = 262,914,048 wasted bytes

Still not accounting for all the waste that you're apparently seeing, but I'd suspect that the true numbers skew toward more files that use far less than a cluster (I think of the files that use only a single cluster, there are far more that use less than half than those that use more than half).

Also, this doesn't account for the space wasted by directories (which, if I recall, are allocated clusters similarly to files, except for the root directory), which would bring us up to about 300 MB of wasted space.

This analysis hasn't accounted for all the waste you're seeing, but we're less than a factor of 2 away...

I had taken this into consideration with my initial calculations (though my math may have been wrong), but failed to explain it properly. I've reworded it in the post to hopefully make that more clear.

Basically with the average file size being about 39,356 bytes, that means that each file will fill one cluster and the remaining 6,587 bytes "spill over" into the next 32 KiB cluster.

Actually, calculating it again as 16,047 * (1024 * 32) - (16,047 * 6,587) gives me 420,126,507 wasted bytes, which is 400.66 MiB, which is pretty close to the 391 MB wasted.

EDIT: It's only by chance that those calculations came out so close, since assuming every file size is 39,356 bytes would by nature not calculate exactly. The real (estimated) math goes something like this:

Since every file (assuming every file is of the average size) takes up 2 clusters, the total space on disk should be displaying as 1,051,656,192 bytes. We know that at least half of that is actually being used since each file is more than 32 KiB, so cut it in half to get 525,828,096 bytes. Of that, 105,701,589 is also actually used, which means that the remaining 420,126,507 bytes are "wasted." But even so, if this were an accurate calculation then the size on disk would display as approximately 1 GB rather than the 993 MB actually shown.
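
The average-file model above can be re-run in Python as a sanity check. This sketch keeps the same simplifying assumption (every file is exactly the average size) and uses only the figures posted in this thread. The waste it prints differs slightly from the 420,126,507 above because that figure uses the spill-over truncated to 6,587 bytes.

```python
import math

CLUSTER_SIZE = 32 * 1024
FILES = 16_047
TOTAL_BYTES = 631_544_356
REPORTED_ON_DISK = 1_041_301_504

# Simplifying assumption: every file is exactly the average size.
average_size = TOTAL_BYTES / FILES
clusters_per_file = math.ceil(average_size / CLUSTER_SIZE)
modelled_on_disk = FILES * clusters_per_file * CLUSTER_SIZE
modelled_waste = modelled_on_disk - TOTAL_BYTES

print(clusters_per_file)                      # 2
print(f"{modelled_on_disk:,}")                # 1,051,656,192
print(f"{modelled_waste:,}")                  # 420,111,836
print(f"{REPORTED_ON_DISK - TOTAL_BYTES:,}")  # 409,757,148
```

The model lands within about 3% of the reported waste, but as noted in the EDIT that is largely luck, since real file sizes are far from uniform.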
« Last Edit: April 08, 2010, 11:52 PM by Deozaan »

Deozaan

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #17 on: April 09, 2010, 01:41 AM »
Quote from: f0dder
32kb is a pretty large cluster size - but I guess it might have been set to the flash erase-block size?

I decided to format it and when I went to do that I found out that the default cluster size for FAT is 32 KiB. So I decided to format it to NTFS with 4096 byte cluster size and now there is a significant decrease in wasted bytes.

It went from just under 400 MB of wasted space to just under 40 MB.
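
That drop is roughly what the half-a-cluster rule of thumb from earlier in the thread would predict. A small Python sketch follows, with two stated assumptions: folders are counted like files, and NTFS is treated as if it rounded every entry up to whole clusters, which it does not strictly do for very small files.

```python
def expected_waste(entries: int, cluster_size: int) -> int:
    """Half-a-cluster-per-entry rule of thumb for slack space."""
    return entries * cluster_size // 2


# Files plus folders; treating folders like files is an assumption.
ENTRIES = 16_047 + 2_009

for cluster_size in (32 * 1024, 4096):
    print(cluster_size, f"{expected_waste(ENTRIES, cluster_size):,}")
# 32768 295,829,504
# 4096 36,978,688
```

The rule of thumb scales linearly with cluster size, so going from 32 KiB to 4 KiB clusters predicts an 8x reduction, and the 4096 byte estimate of about 37 MB sits close to the "just under 40 MB" seen after reformatting.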

Thanks for the info and help, everyone! :Thmbsup:

Stoic Joker

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #18 on: April 09, 2010, 06:26 AM »
Quote from: Deozaan
I decided to format it and when I went to do that I found out that the default cluster size for FAT is 32 KiB.
...Which is why I mentioned the MS Oformat utility earlier - That's what it's for - creating low cluster size FAT/FAT32 partitions that can easily/safely be converted to NTFS (with Win2k era convert.exe). It was a required procedure back then if you wanted/needed the NTFS compression to work...and were going to be starting with a FAT drive (for whatever reason).

app103

  • That scary taskbar girl
  • Global Moderator
  • Joined in 2006
  • *****
  • Posts: 5,885
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #19 on: May 04, 2013, 04:20 PM »
I love when I go googling for info and one of the first results is the answer, right here on this forum.  :-*

kyrathaba

  • N.A.N.Y. Organizer
  • Honorary Member
  • Joined in 2006
  • **
  • Posts: 3,200
Re: File Size vs. Size on Disk: Why such a difference?
« Reply #20 on: May 06, 2013, 07:13 AM »
On a related note, when various companies sell you storage media, they use this formula:

1 MB = 1000 KB

I read somewhere that the proper use today, to distinguish this mishmash of units, is to say:

1 MB = 1000 KB

versus

1 MiB = 1024 KiB

I wish everyone would stick to the base-2 way of expressing file sizes.
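
To make the two conventions concrete, here is a small Python sketch that converts the "Size" figure from the first post both ways. The function names are made up for this example.

```python
def to_decimal_mb(n_bytes: int) -> float:
    """Megabytes, SI style: 1 MB = 1000 * 1000 bytes."""
    return n_bytes / 1000**2


def to_mib(n_bytes: int) -> float:
    """Mebibytes, IEC style: 1 MiB = 1024 * 1024 bytes."""
    return n_bytes / 1024**2


size = 631_544_356  # "Size" from the first post
print(f"{to_decimal_mb(size):.1f} MB")  # 631.5 MB
print(f"{to_mib(size):.1f} MiB")        # 602.3 MiB
```

So the "602 MB" in the Windows properties dialog is really 602 MiB in IEC terms, while a storage vendor would describe the same data as about 632 MB.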
« Last Edit: May 06, 2013, 11:25 AM by kyrathaba »

wraith808

Re: File Size vs. Size on Disk: Why such a difference?
« Reply #21 on: May 06, 2013, 09:42 AM »
Quote from: kyrathaba
I wish everyone would stick to the base-2 way of expressing file sizes.

Unfortunately, that way is a bit more than the populace is actually comfortable with, especially considering other things in regards to memory usage.