ELI5: What are compressed and uncompressed files, how does it all work, and why do compressed files take less storage?


File compression saves hard drive space by removing redundant data. For example, take a 500-page book and scan through it to find the 3 most commonly used words. Then replace those words with placeholders, so ‘the’ becomes $, etc. Put an index at the front of the book that translates those symbols back to words. Now the book contains exactly the same information as before, but it’s a couple dozen pages shorter. These are the basics of how file compression works: you find duplicate data in a file and replace it with pointers. The upside is reduced space usage; the downside is that your processor has to work harder to *inflate* the file when it’s needed.
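
To make the book example concrete, here is a minimal Python sketch of the same idea. The function names (`compress`, `inflate`) and the placeholder symbols are made up for illustration, and it assumes the symbols never appear in the original text. Real compressors work on bytes rather than words, so treat this as a toy.

```python
from collections import Counter

def compress(text, symbols="$#@"):
    """Replace the most common words with one-character placeholders."""
    # Assumption: the placeholder symbols never occur in the original text.
    if any(s in text for s in symbols):
        raise ValueError("text already contains a placeholder symbol")
    words = text.split(" ")
    common = [w for w, _ in Counter(words).most_common(len(symbols))]
    index = dict(zip(symbols, common))          # symbol -> word (the "index at the front")
    reverse = {w: s for s, w in index.items()}  # word -> symbol
    body = " ".join(reverse.get(w, w) for w in words)
    return index, body

def inflate(index, body):
    """Put the original words back using the index."""
    return " ".join(index.get(w, w) for w in body.split(" "))

text = "the cat and the dog and the bird saw the cat"
index, body = compress(text)
print(index)
print(body)
print(len(text), "->", len(body), "characters (plus the index)")
```

Note that the index itself takes space, which is why this only pays off when the replaced words show up many times.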


Let’s say I have a file that contains the following:

aaaaaaaaaaaaaaaaaaaa

I could compress that like this:

a20

Obviously it is now smaller. Real compression comes from redundancy and from the fact that most data is wasteful in the first place. A byte is 8 bits, and that’s basically the smallest amount of data that can be moved or stored. However, if you type a message like this one, it only contains 26 different letters and some numbers and punctuation. With 5 bits you can encode 32 different characters, enough for the letters plus a handful of punctuation marks, so we could already compress the data a lot. The next level is to count the letters and notice that some are way more common than others, so let’s give those shorter bit lengths per character. You can look into Huffman coding for more detailed info.

Another form of compression is lossy compression, which is used for images, videos and sound. You can often reduce the number of colors used in an image and it would still look nearly the same to humans. You could also merge similar pixels into the same color and say that “this 6×6 block is white”.
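
Here is a rough Python sketch of two of the ideas above: run-length encoding (the `a20` trick) and Huffman-style variable bit lengths. The function names and the sample message are my own, and the Huffman part only computes how many bits each character would get, not the actual bit patterns.

```python
import heapq
from collections import Counter
from itertools import groupby

def rle_encode(text):
    """Run-length encode: 'aaaa' becomes [('a', 4)]."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    return "".join(ch * count for ch, count in pairs)

def huffman_code_lengths(text):
    """Return {character: code length in bits} for a Huffman code."""
    counts = Counter(text)
    if len(counts) == 1:                       # a lone symbol still needs 1 bit
        return {ch: 1 for ch in counts}
    # Heap entries: (total count, unique tie-breaker, {char: depth so far})
    heap = [(n, i, {ch: 0}) for i, (ch, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)        # the two rarest groups...
        n2, _, d2 = heapq.heappop(heap)
        merged = {ch: depth + 1 for ch, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))  # ...get merged one level deeper
        tie += 1
    return heap[0][2]

print(rle_encode("a" * 20))

message = "this is a message with some letters more common than others"
counts = Counter(message)
lengths = huffman_code_lengths(message)
huffman_bits = sum(counts[ch] * lengths[ch] for ch in counts)
print("8 bits per character:", 8 * len(message))
print("5 bits per character:", 5 * len(message))
print("Huffman code lengths:", huffman_bits)
```

A real format would also have to store the code table next to the data, which this sketch leaves out.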


Suppose you’re writing a grocery list. Your list initially says this:

I need to get 6 eggs
I need to get 2 liters of soy milk
I need to get 2 liters of almond milk
I need to get 1 pound of ground beef

There’s a lot of repetition in there, right? A smart compression algorithm would recognize that, and might render it like this:

I need to get
    6 eggs
    2 liters of soy milk
    2 liters of almond milk
    1 pound of ground beef

An even better compression algorithm might be able to further improve things:

I need to get
    6 eggs
    2 liters of
        soy milk
        almond milk
    1 pound of ground beef

This is basically what compressing a file does. You take information that’s repeated multiple times and remove that repetition, replacing it with instructions on how to put it back when you need to reconstruct the original content.
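
The first step of the grocery list example can be sketched in a few lines of Python. The helper names `factor_prefix` and `restore` are invented for this illustration; `os.path.commonprefix` is a standard-library function that compares strings character by character. This only pulls out a prefix shared by every line, so it is far simpler than what a real compressor does.

```python
import os

def factor_prefix(lines):
    """Store a prefix shared by every line once, followed by the remainders."""
    prefix = os.path.commonprefix(lines)  # longest common leading string
    return prefix, [line[len(prefix):] for line in lines]

def restore(prefix, rests):
    """Rebuild the original lines from the shared prefix and the remainders."""
    return [prefix + rest for rest in rests]

groceries = [
    "I need to get 6 eggs",
    "I need to get 2 liters of soy milk",
    "I need to get 2 liters of almond milk",
    "I need to get 1 pound of ground beef",
]
prefix, rests = factor_prefix(groceries)
print(repr(prefix))
for rest in rests:
    print("   ", rest)
```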


I like the text examples. One for movies or animations is that they only save what changes between frames. So if you have 100 frames that are all black, you change them to one black frame and set it so that it lasts as long as the 100 frames did. If you have a shot with blue sky that doesn’t change because all the action is going on in the lower half of the frame, you save the blue part of the frame once and stretch it out the same way as with the black frames. Only once something moves do you have something new you need to keep. This can be done for 10000 frames in a row, or for just 2 frames where only 10% of the screen is the same as in the frame before.
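
A toy Python version of "only save what changes between frames" could look like this. It assumes each frame is a flat list of pixel values and that all frames have the same length; the function names are made up. Real video codecs are much more involved (motion compensation, lossy transforms), so this is only the core idea.

```python
def encode_frames(frames):
    """Keep the first frame whole, then only (pixel index, new value) per later frame."""
    first = list(frames[0])
    deltas = []
    previous = first
    for frame in frames[1:]:
        changes = [(i, new) for i, (old, new) in enumerate(zip(previous, frame)) if old != new]
        deltas.append(changes)      # an empty list means "same as the frame before"
        previous = frame
    return first, deltas

def decode_frames(first, deltas):
    """Replay the changes on top of the first frame to rebuild every frame."""
    frames = [list(first)]
    for changes in deltas:
        frame = list(frames[-1])
        for i, value in changes:
            frame[i] = value
        frames.append(frame)
    return frames

black = [0] * 8
frames = [list(black) for _ in range(100)]   # 100 identical black frames
frames[50][3] = 255                          # one pixel lights up at frame 50...
for later in frames[51:]:
    later[3] = 255                           # ...and stays lit afterwards

first, deltas = encode_frames(frames)
stored_changes = sum(len(changes) for changes in deltas)
print("pixels in all frames:", sum(len(f) for f in frames))
print("pixels actually stored:", len(first) + stored_changes)
```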


Compression works by finding patterns in the data and then storing those patterns instead of the data itself. There are lots of different ways to do this and a lot of theory involved, but that is the basic principle. Compression works better when the algorithm is built for the specific file type, so a generic compression algorithm that is made to work on any file does not work as well on, say, image files as a dedicated image compression algorithm. Some algorithms might even opt to lose some information that is not important and does not fit into an easy pattern. This is most common in images and video, where the exact value of each pixel is not that important. Compression algorithms also do not work if there are no patterns in the data. So random data, encrypted data or already compressed data cannot be compressed much further, if at all.
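
You can check the "no patterns, no compression" point yourself with Python's built-in `zlib` module. This sketch compares very repetitive data against pseudo-random bytes; the sample data and the `ratio` helper are my own choices, and the seeded generator is only there to make the run repeatable. It only covers the random-data case, not encrypted or already compressed files.

```python
import random
import zlib

n = 10_000
repetitive = (b"I need to get " * 1000)[:n]             # lots of obvious patterns
rng = random.Random(0)                                   # seeded so the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(n))      # no patterns to find

def ratio(data):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)

print("repetitive data:", round(ratio(repetitive), 4))
print("random data:    ", round(ratio(noise), 4))
```

The compressor can still return the data exactly either way, because `zlib.decompress` reverses it; the only difference is how much space was saved.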