Data Compression 101
Demystifying compression for VFX and Animation workflows.
What is data compression and how can it help you in VFX and Animation? In this post we’ll look at the basics of data compression, both lossless and lossy. What are the options, and what are the tradeoffs of using compression versus not using it?
I recently had a conversation with an experienced colleague that made me realize just how much confusion there can be about data compression, even among seasoned folks. If someone with that much experience could be confused, I assume less experienced folks might be equally so. Let's demystify compression for good.
The first thing that seems to confuse most people about compression is that there are two types: LOSSLESS and LOSSY. Each is useful for its intended purpose. However, they have slightly different use cases that only intersect in the goal of reducing data footprint. I think that’s what makes the distinction between them confusing for some people. They both compress data but use wildly different strategies to achieve their goal.
Lossy compression, which we will dive into in more detail later, uses techniques that discard some amount of data in order to achieve higher compression ratios. When decoded, lossy compression schemes can only create an approximation of the original data rather than reproducing the original faithfully. They can only be used for types of data that can tolerate loss without destroying the intent of the data. Examples are images and sound, which can both tolerate some loss of precision without a large perceptible drop in quality (at least not one perceptible by humans).
Lossless compression, on the other hand, is used for reducing the footprint of data when no loss of data is tolerable. Data integrity is just as important as data compaction in this use case, so certain approaches, like the various schemes of approximation used by lossy techniques for sound and images, are off the table.
You may also have heard the phrase "Perceptually Lossless". Well, I hate to break it to you but strictly speaking that means lossy. Almost any lossy algorithm can be tuned to create a perceptually lossless result. But "Perceptually Lossless" is not the same as lossless, especially when absolute data fidelity is important.
This is where the confusion seems to creep in. For whatever reason, some people seem to assume that compression always requires some kind of loss. Perhaps this is because it’s a somewhat intuitive conclusion. There must be some kind of equivalent exchange in order to shrink the size of a data set, right? How can a file be made smaller without throwing data away? As we will see, throwing away data to reduce the size of a data set is not a requirement. There is however an equivalent exchange, which we will dive into now.
As its name implies, lossless compression is perfectly mathematically reversible. It is lossless. What you get when you unpack the data is IDENTICAL to the original data, down to the bit. If not, the algorithm cannot be called lossless. Lossless compression algorithms exploit the fact that most data has redundancy in it. They identify redundancy in a dataset and create shorthand representations of the redundant data which can be stored in a more compact form. Examples of common lossless compression tools are zip, gzip, 7zip, xz, bzip2 and the newest and coolest kid on the block, zstd. They are based on the plethora of lossless data compression algorithms available in the wild. This type of tool is commonly used to compress data that can’t tolerate loss of any kind. Examples might be text files like logs, scene descriptions like .ma, .mb and .hip files, and geometry formats like .bgeo, .obj, .gltf, etc. It can also be used to compress image data when any amount of quality loss is unacceptable. How compressible a particular data set is will depend on the content of the data and the algorithm. Certain algorithms perform better on specific types of data. There is no perfect algorithm, though there are several very good general purpose compression algorithms. Highly compressible data (that is, data with a lot of redundancy in it) might shrink by 50% or even more, whereas data with practically no redundancy might not compress at all and might in fact become larger if run through a compression algorithm. Truly random data cannot be compressed.
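If you want to see the redundancy principle for yourself, here is a quick sketch using /dev/zero and /dev/urandom as stand-ins for highly redundant and essentially incompressible data (the -k flag tells gzip to keep the originals; exact ratios will vary with your gzip version):
head -c 10M /dev/zero > redundant.bin
head -c 10M /dev/urandom > random.bin
gzip -kv redundant.bin random.bin
ls -lh redundant.bin.gz random.bin.gz
The all-zeros file should shrink to a tiny fraction of its original size, while the random file will barely change and may even grow slightly once the container overhead is added.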
Sounds great! So, what is the trade off? With lossless compression, the trade off for smaller file sizes is increased computation time and RAM usage during compression and decompression. The computational cost will depend on the specific algorithm being used. A good rule of thumb is: the more compute and RAM intensive an algorithm, the better the compression it will provide. (Though not always. A well-written implementation can outperform a poor implementation of the very same algorithm, and more modern algorithms like zstd, which are designed for speed, can produce fairly compact files quickly compared to older ones. It's tricky to do straight apples-to-apples comparisons, and the state of the art is always improving!)
Because modern CPUs have become so fast, an unexpected benefit of using compression can often be faster overall file access. This can be true even for fast storage devices like SSDs, but it is especially true for slower devices like hard drives and network attached storage. Assuming the cost of on-the-fly decompression is less than the speed-up gained by transferring the more compact compressed file from disk or over a network, the end result will be overall faster file access. Given just how much CPU power modern computers have to spare, this is almost always true these days. This benefit is something many people don’t consider. There still seems to be a prevailing belief (which has been outdated for years) that compression is slow. It’s simply not true anymore. We often have many CPU cycles to spare, so we might as well use them for something useful.
As an added benefit, if compression trims overall network usage by some amount, that freed-up bandwidth lets correspondingly more files move over a given network in the same amount of time. It might not seem like such a big deal if you are working alone on a fast network, but if there are tens or hundreds of other people working on the same network, plus automated processes like a render farm hitting the same pool of storage, the bandwidth being consumed adds up fast. In that case compression can be a huge win. The benefits are obviously apparent when network bandwidth is highly constrained, for example when sending data over the Internet. Most people seem to understand intuitively why it works on the web but fail to generalize the same concept to local area networks. Network bandwidth is never infinite, no matter how fast the network.
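If you want to test that claim on your own storage, a rough sketch looks like this (the paths are hypothetical; use a file you haven't read recently so the filesystem cache doesn't skew the numbers):
time cat /mnt/nas/show/big_cache.obj > /dev/null
time gunzip -c /mnt/nas/show/big_cache.obj.gz > /dev/null
If the second command finishes first, on-the-fly decompression is costing you less than the extra bytes you would otherwise have to pull off the disk or across the wire.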
In addition to fitting more info over a network connection, compressing files also allows for more files to fit on disk, so it’s a double win in most cases. Consider the value of space on an SSD, which is still priced at a premium compared to hard drive space. The only trade off is that there is a potential computational cost, but as we will see, that cost can be balanced against the upside of the other factors in play.
Let’s look at a simple example where we losslessly compress some text files with gzip, a very common compression tool available in the base install of pretty much every Linux distro. As a point of reference, gzip uses the same algorithm as the common zip file format on Windows and Mac.
First we will run each file through md5sum to generate a checksum for each. A checksum is like the fingerprint for the data in a file. If the data changes even the tiniest amount, the checksum will change.
aaron@minty:~/project_gutenberg$ md5sum *.txt | tee MD5SUMS
022cb6af4d7c84b4043ece93e071d8ef Frankenstein_by_Mary_Wollstonecraft_Shelley_utf8.txt
424c7f50d156aa6c021ee616cfba6e31 Moby_Dick_by_Herman_Melville_utf8.txt
5f2319239819dfa7ff89ef847b08aff0 Pride_and_Prejudice_by_Jane_Austen_utf8.txt
8676b5095efce2e900de83ab339ac753 The_Art_of_War_by_Sun_Tzu_utf8.txt
2c89aeaa17956a955d789fb393934b9a War_and_Peace_by_Leo_Tolstoy_utf8.txt
We used ‘tee’ to also redirect the output to the file MD5SUMS so we can keep that info around for later.
Now let’s look at the sizes of each file.
aaron@minty:~/project_gutenberg$ ls -lh *.txt
-rw-rw-r-- 1 aaron aaron 439K Apr 15 21:11 Frankenstein_by_Mary_Wollstonecraft_Shelley_utf8.txt
-rw-rw-r-- 1 aaron aaron 1.3M Apr 15 21:13 Moby_Dick_by_Herman_Melville_utf8.txt
-rw-rw-r-- 1 aaron aaron 710K Apr 15 21:12 Pride_and_Prejudice_by_Jane_Austen_utf8.txt
-rw-rw-r-- 1 aaron aaron 336K Apr 15 21:09 The_Art_of_War_by_Sun_Tzu_utf8.txt
-rw-rw-r-- 1 aaron aaron 3.3M Apr 15 21:06 War_and_Peace_by_Leo_Tolstoy_utf8.txt
Even as plain UTF-8, War and Peace takes 3.3 megs of disk space. Now that we know how big the files started out and what their md5sums are, let’s compress them with good old gzip. I’ll time each compression so we can get a sense of how much time it’s costing us to compress each file.
aaron@minty:~/project_gutenberg$ for file in $(ls *.txt); do time gzip -v $file; done
Frankenstein_by_Mary_Wollstonecraft_Shelley_utf8.txt: 62.3% -- replaced with Frankenstein_by_Mary_Wollstonecraft_Shelley_utf8.txt.gz
real 0m0.045s
user 0m0.044s
sys 0m0.000s
Moby_Dick_by_Herman_Melville_utf8.txt: 59.9% -- replaced with Moby_Dick_by_Herman_Melville_utf8.txt.gz
real 0m0.125s
user 0m0.116s
sys 0m0.008s
Pride_and_Prejudice_by_Jane_Austen_utf8.txt: 64.1% -- replaced with Pride_and_Prejudice_by_Jane_Austen_utf8.txt.gz
real 0m0.079s
user 0m0.079s
sys 0m0.000s
The_Art_of_War_by_Sun_Tzu_utf8.txt: 62.0% -- replaced with The_Art_of_War_by_Sun_Tzu_utf8.txt.gz
real 0m0.055s
user 0m0.043s
sys 0m0.012s
War_and_Peace_by_Leo_Tolstoy_utf8.txt: 63.5% -- replaced with War_and_Peace_by_Leo_Tolstoy_utf8.txt.gz
real 0m0.354s
user 0m0.321s
sys 0m0.018s
We’ll use the “real” time, which is how much actual clock time we had to wait for each of these files to compress, inclusive of all factors. Nice! Every single one of these text files compressed by more than 50% and took less than half a second (on one core... gzip is only single threaded!). War and Peace, our largest file, actually got the second highest compression ratio. (Perhaps its sheer size increased the chances of there being redundancies in it that gzip could compress away.)
One of the things I like about gzip and similar archiving tools on Linux like zstd and xz is that they are typically able to compress files in place. As you can see in my example, all the txt files have been replaced by their .gz compressed counterpart. This is great if you need to free up space but don’t have a lot of disk space to work with since gzip can go through all the files, file by file, and compress them one at a time, cleaning up the old uncompressed files for you as it goes. (Even if the tool itself wasn’t able to do this you can easily script a simple one-liner in bash to do it, which I will demonstrate later.)
Let’s check the sizes of the compressed files.
aaron@minty:~/project_gutenberg$ ls -lh *.txt.gz
-rw-rw-r-- 1 aaron aaron 166K Apr 15 21:11 Frankenstein_by_Mary_Wollstonecraft_Shelley_utf8.txt.gz
-rw-rw-r-- 1 aaron aaron 501K Apr 15 21:13 Moby_Dick_by_Herman_Melville_utf8.txt.gz
-rw-rw-r-- 1 aaron aaron 255K Apr 15 21:12 Pride_and_Prejudice_by_Jane_Austen_utf8.txt.gz
-rw-rw-r-- 1 aaron aaron 128K Apr 15 21:09 The_Art_of_War_by_Sun_Tzu_utf8.txt.gz
-rw-rw-r-- 1 aaron aaron 1.2M Apr 15 21:06 War_and_Peace_by_Leo_Tolstoy_utf8.txt.gz
Sweet! They are all certainly much smaller than they were. But was the data altered in any way? Let’s decompress the files and verify the md5sums. A difference of even a single bit will cause the md5sum to change, so we’ll be able to tell whether the output files are identical to the originals.
The following command uses gunzip to decompress all the files and, if that succeeds, runs md5sum against the list of checksums we saved earlier to compare each file’s current checksum with the one we recorded.
aaron@minty:~/project_gutenberg$ gunzip *.gz && md5sum -c MD5SUMS
Frankenstein_by_Mary_Wollstonecraft_Shelley_utf8.txt: OK
Moby_Dick_by_Herman_Melville_utf8.txt: OK
Pride_and_Prejudice_by_Jane_Austen_utf8.txt: OK
The_Art_of_War_by_Sun_Tzu_utf8.txt: OK
War_and_Peace_by_Leo_Tolstoy_utf8.txt: OK
“OK” means the file matched the md5sum in the file MD5SUMS that we checked it against. The files we round-tripped through gzip are identical to the originals, which is what we expect with a lossless compression tool like this. The output SHOULD be bit perfect compared to the original. If it’s not, something went terribly wrong. If you are an old hand with archive tools like zip and gzip this won't be a surprise to you. If not, you might have just learned something new.
OK, so we proved the output is identical and that gzip can compress text, but how about something more like production data? How about some 3D models? We’ll skip checking the md5sums since hopefully I’ve sufficiently demonstrated that lossless compression is in fact lossless.
First let’s check the file sizes.
aaron@minty:~/3d_models$ ls -lh
total 83M
-rw-rw-r-- 1 aaron aaron 11M Jul 13 2010 Advanced_Crew_Escape_Suit.obj
-rw-rw-r-- 1 aaron aaron 43M Jul 13 2010 Extravehicular_Mobility_Unit.obj
-rw------- 1 aaron aaron 468K Oct 29 2008 Shuttle.3ds
-rw------- 1 aaron aaron 677K Sep 5 2008 skylab_carbajal.3ds
-rw-rw-r-- 1 aaron aaron 28M Jun 9 2015 Space_Exploration_Vehicle.obj
These are some non-trivial file sizes. Plus we have some binary files to work with (the .3ds files). Let’s compress them and see how well gzip does. We’ll time each one again so we know what it’s costing us in CPU time.
aaron@minty:~/3d_models$ for file in $(ls *.?{b,d}?); do time gzip -v $file; done
Advanced_Crew_Escape_Suit.obj: 78.8% -- replaced with Advanced_Crew_Escape_Suit.obj.gz
real 0m0.920s
user 0m0.891s
sys 0m0.027s
Extravehicular_Mobility_Unit.obj: 80.4% -- replaced with Extravehicular_Mobility_Unit.obj.gz
real 0m3.421s
user 0m3.287s
sys 0m0.119s
Shuttle.3ds: 59.0% -- replaced with Shuttle.3ds.gz
real 0m0.045s
user 0m0.036s
sys 0m0.007s
skylab_carbajal.3ds: 62.4% -- replaced with skylab_carbajal.3ds.gz
real 0m0.027s
user 0m0.026s
sys 0m0.005s
Space_Exploration_Vehicle.obj: 75.9% -- replaced with Space_Exploration_Vehicle.obj.gz
real 0m2.939s
user 0m2.813s
sys 0m0.114s
Ok. Now that we are compressing some hefty files, the time it takes to compress them has gone up a bit. It’s pretty apparent compression isn’t free. How much disk space did we save, though? And were the savings worth the computational cost?
aaron@minty:~/3d_models$ ls -lh
total 18M
-rw-rw-r-- 1 aaron aaron 2.3M Jul 13 2010 Advanced_Crew_Escape_Suit.obj.gz
-rw-rw-r-- 1 aaron aaron 8.4M Jul 13 2010 Extravehicular_Mobility_Unit.obj.gz
-rw------- 1 aaron aaron 192K Oct 29 2008 Shuttle.3ds.gz
-rw------- 1 aaron aaron 255K Sep 5 2008 skylab_carbajal.3ds.gz
-rw-rw-r-- 1 aaron aaron 6.7M Jun 9 2015 Space_Exploration_Vehicle.obj.gz
It took 7.352 seconds, but we were able to pack 83M of data into 18M. We actually got better compression ratios with the production-like data than we got with English-language text! If we were to use a faster setting on gzip or an alternate algorithm like lz4, perhaps we could balance this compute/size trade off so the CPU cost is nominal yet we still gain the benefit of the smaller file sizes. lz4 is a newer, faster algorithm than gzip (zlib). It’s designed for speed rather than maximum compression. The goal the authors of lz4 had was to reduce the computational cost of compression enough that we gain all the benefits of compression with very little of the computational expense. It’s meant to be very high-throughput. As we will see, they've succeeded. lz4 is available in pretty much every Linux distro nowadays. Let’s give it a try. As with the other examples, I will time each run, then list the files to look at the compressed file sizes.
aaron@minty:~/3d_models$ for file in $(ls *.?{b,d}?); do time lz4 $file && rm $file ; done
Compressed filename will be : Advanced_Crew_Escape_Suit.obj.lz4
Compressed 11299355 bytes into 4110010 bytes ==> 36.37%
real 0m0.043s
user 0m0.037s
sys 0m0.008s
Compressed filename will be : Extravehicular_Mobility_Unit.obj.lz4
Compressed 44645513 bytes into 15566248 bytes ==> 34.87%
real 0m0.184s
user 0m0.119s
sys 0m0.064s
Compressed filename will be : Shuttle.3ds.lz4
Compressed 478597 bytes into 273672 bytes ==> 57.18%
real 0m0.004s
user 0m0.000s
sys 0m0.003s
Compressed filename will be : skylab_carbajal.3ds.lz4
Compressed 692377 bytes into 360167 bytes ==> 52.02%
real 0m0.004s
user 0m0.003s
sys 0m0.000s
Compressed filename will be : Space_Exploration_Vehicle.obj.lz4
Compressed 29032555 bytes into 12957232 bytes ==> 44.63%
real 0m0.127s
user 0m0.094s
sys 0m0.031s
aaron@minty:~/3d_models$ ls -lh
total 32M
-rw-rw-r-- 1 aaron aaron 4.0M Apr 15 22:36 Advanced_Crew_Escape_Suit.obj.lz4
-rw-rw-r-- 1 aaron aaron 15M Apr 15 22:36 Extravehicular_Mobility_Unit.obj.lz4
-rw-rw-r-- 1 aaron aaron 268K Apr 15 22:36 Shuttle.3ds.lz4
-rw-rw-r-- 1 aaron aaron 352K Apr 15 22:36 skylab_carbajal.3ds.lz4
-rw-rw-r-- 1 aaron aaron 13M Apr 15 22:36 Space_Exploration_Vehicle.obj.lz4
Looking good! I would practically call this “free” compression. It would likely take nearly as long simply to copy these files as it did to compress them. (Note that lz4 reports the compressed size as a percentage of the original, whereas gzip -v reports the percentage saved, so the two tools’ numbers read “backwards” relative to each other.) As you can see, it’s possible to balance compression time against file size. lz4 typically can't produce as high a compression ratio as gzip and friends, but the tradeoff is that it’s substantially faster.
I’d hesitate to consider the time measurements on this example very scientifically valid since they are so short. It only took 0.362 seconds to compress every file, and yet we still cut the total footprint by better than 50%. By default lz4 is tuned to be as fast as possible, so we might be able to afford to give it a bit more time for compression. Let’s give it a try with the -4 flag (higher compression than the default -1).
aaron@minty:~/3d_models$ for file in $(ls *.?{b,d}?); do time lz4 -4 $file && rm $file ; done
Compressed filename will be : Advanced_Crew_Escape_Suit.obj.lz4
Compressed 11299355 bytes into 3211153 bytes ==> 28.42%
real 0m0.235s
user 0m0.198s
sys 0m0.031s
Compressed filename will be : Extravehicular_Mobility_Unit.obj.lz4
Compressed 44645513 bytes into 12460628 bytes ==> 27.91%
real 0m0.945s
user 0m0.877s
sys 0m0.071s
Compressed filename will be : Shuttle.3ds.lz4
Compressed 478597 bytes into 255729 bytes ==> 53.43%
real 0m0.011s
user 0m0.011s
sys 0m0.000s
Compressed filename will be : skylab_carbajal.3ds.lz4
Compressed 692377 bytes into 348951 bytes ==> 50.40%
real 0m0.016s
user 0m0.011s
sys 0m0.004s
Compressed filename will be : Space_Exploration_Vehicle.obj.lz4
Compressed 29032555 bytes into 9917149 bytes ==> 34.16%
real 0m0.697s
user 0m0.644s
sys 0m0.045s
aaron@minty:~/3d_models$ ls -lh
total 25M
-rw-rw-r-- 1 aaron aaron 3.1M Apr 15 22:50 Advanced_Crew_Escape_Suit.obj.lz4
-rw-rw-r-- 1 aaron aaron 12M Apr 15 22:50 Extravehicular_Mobility_Unit.obj.lz4
-rw-rw-r-- 1 aaron aaron 250K Apr 15 22:50 Shuttle.3ds.lz4
-rw-rw-r-- 1 aaron aaron 341K Apr 15 22:50 skylab_carbajal.3ds.lz4
-rw-rw-r-- 1 aaron aaron 9.5M Apr 15 22:50 Space_Exploration_Vehicle.obj.lz4
Not bad! We have improved our overall compression ratio, yet no file takes more than a second to compress. One nice thing about lz4 is that it’s also extremely fast to decompress, regardless of what setting was used for compression. If you feel like throwing more compute at it, you can increase the compression ratios provided by lz4 without any real impact on the later decompression of the files. Considering files typically only need to be compressed once but will likely be read and decompressed many times, that's a huge feature of the algorithm. It's very production friendly that way.
So that was a run-through of how you might compress files with lossless compression utilities in the OS, but what about applications? How do you make them use compressed files in DCC workflows?
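You can check the decompression side of that claim with something like the following, which times a decompress to /dev/null so disk writes don't muddy the measurement:
for file in *.lz4; do time lz4 -d -c "$file" > /dev/null; done
Whether the archives were made with the default -1 or a heavier setting, the decode times should stay in the same ballpark.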
Because so much of the data we use in VFX and animation can benefit from compression, many of the modern file formats we use have some provision for transparent lossless compression built in. OpenEXR has several options for lossless compression (as well as lossy, which we’ll look at later), Alembic has data deduplication to help keep files compact, and OpenVDB has built-in compression. Many applications can also transparently deal with files compressed with operating system tools like gzip. For example, Houdini can transparently deal with .gz files (actually any compression type, if you have a handler set up for it). Many other programs have such a capability, and if they don’t have it built in, you can often extend them by scripting support for it yourself.
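As a concrete sketch (the file name here is hypothetical), you can gzip a geometry cache after the fact and a gzip-aware application will load it as if nothing happened:
gzip -v sim_cache.bgeo
That produces sim_cache.bgeo.gz, which Houdini, for example, will read directly. Check your own DCC's documentation for which compressed extensions it recognizes before committing a whole cache directory to this approach.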
When using the native lossless compression in formats like OpenEXR and PNG, you must often make the same trade off of computational cost vs compression. I personally tend to leave some kind of compression on full time, since I know that at some point in a file's life cycle it will hit some kind of IO bottleneck, either by needing to be copied over a network or to a bandwidth constrained storage device like a USB 3.0 drive. File size is ALWAYS going to become an issue at some point in a project's lifecycle, due to either storage or bandwidth constraints, so it makes sense to protect against a "crunch" by using lossless compression from the beginning. Just do some tests for yourself to make sure it is not too computationally expensive to live with and that the trade off is worth it to you. You’ll most likely find that it is.
One last note on lossless compression. Many filesystems, like ZFS and even NTFS, offer on-the-fly data compression. ZFS in particular is quite good at it, and the common recommendation is to simply leave lz4 compression on full time for all filesystems. On-the-fly filesystem compression is great since it reduces your disk footprint and increases disk IO speed. However, if you are using a NAS it may not improve performance much when moving files over the network, since the network will likely be your bottleneck. It is for this reason that I still advocate using application level compression on files when possible. Compressing at the application level also means your files are already staged for archiving when the time comes.
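One way to run that kind of test, assuming you have OpenImageIO's oiiotool installed and a render called render.exr (a hypothetical name), is to rewrite the same frame with each of the lossless EXR options and compare the times and resulting sizes:
for comp in none rle zip zips piz; do time oiiotool render.exr --compression $comp -o render_$comp.exr; done
ls -lh render_*.exr
Pick whichever setting gives you a size you like at a cost you can live with; check your oiiotool version's documentation for the exact list of supported compression names.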
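If you do run storage on ZFS, enabling this is a one-liner (the pool and dataset names here are hypothetical), and you can check how much it is saving you afterwards:
zfs set compression=lz4 tank/projects
zfs get compression,compressratio tank/projects
Both are standard ZFS administration commands; they typically require root or appropriately delegated permissions.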
Now let’s look at lossy compression. If you’ve ever done any photo editing or used a digital camera, you’ve almost certainly encountered the JPEG image format. JPEG uses the Discrete Cosine Transform (DCT) to decompose an image into a frequency domain representation of itself. When you change the quality slider on a JPEG saver, what you are doing is telling the codec how much frequency data to throw away. JPEG’s DCT quantization approach is tuned to roughly match human perception. It selectively removes detail starting from the high frequency detail that a human is not likely to notice missing. The lower you turn the quality slider, the more detail it removes, until it begins to eat into the lower frequencies as well. Eventually the loss in quality becomes apparent, but it’s possible to get reductions of 6x to even 10x before the quality reduction becomes visible to most humans. With a lossy compression scheme like JPEG, it’s impossible to fully reverse the algorithm to reproduce the original data. It is only possible to create an approximation. There is no going back to the original, since the JPEG algorithm literally threw away data to achieve its high compression rate.
Because lossy formats can discard additional data every time they encode, it’s possible to get compounding loss with each generation of re-compression. This is why it’s important to avoid recompressing material stored in lossy codecs whenever possible.
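To get a feel for the quality slider, here is a quick sketch with ImageMagick (assuming it is installed and that you have a frame called plate.png, a hypothetical name): write the same image at a few quality settings and eyeball the results next to the file sizes.
for q in 95 85 75 50; do convert plate.png -quality $q plate_q$q.jpg; done
ls -lh plate_q*.jpg
On ImageMagick 7 you may need to call magick instead of convert.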
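A small experiment makes generation loss visible (again using ImageMagick and hypothetical file names, starting from one of the JPEGs in the previous sketch): re-encode the same JPEG a few dozen times, then compare the final generation against the original.
cp plate_q75.jpg gen.jpg
for i in $(seq 1 30); do convert gen.jpg -quality 75 tmp.jpg && mv tmp.jpg gen.jpg; done
How quickly the damage accumulates depends on the codec, the settings and whether you transform the image between generations, but the trend only goes one way.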
In version 2.2, OpenEXR gained a lossy codec contributed by DreamWorks Animation. It’s JPEG-like, but allows for lossy compression of floating point data. Historically I have been very judicious with my use of lossy codecs for VFX work, but there are cases where they can be useful. It boils down to the same balancing of concerns I mentioned earlier: quality, compute, network bandwidth, disk footprint, etc. For example, perhaps you'd prefer to keep a lot of versions of your work on disk rather than have perfect quality in every version. Perhaps you are only making draft versions and perfect quality isn’t even important, or you have a real-time playback requirement that only a lossy codec can satisfy. Maybe your network is very slow and the only way you can tame the pain is by compressing the heck out of your images. Having to live with a little bit of lossiness in your images might be a better compromise than not delivering a job at all! For years, DreamWorks saved all their rendered images into their own proprietary format which used a 12 bit JPEG-like codec. JPEGing every frame never seemed to be the detail that hurt their box office numbers! The intellectual children of that codec, DWAA and DWAB, are DreamWorks’ contribution to the industry standards in OpenEXR, extended to support floating point image data. At high quality settings there is no visible loss, and at moderate compression rates it introduces only minor visible loss. Many studios use it extensively these days, so you should experiment with it yourself and see if it benefits your workflows. It has certainly found a place in mine.
When it comes to lossy compression, the trade off is usually about how much visual loss you are willing to accept vs how much disk space you save in exchange. One caveat that must be mentioned regarding lossy compression for production images is that even though the loss might be invisible to the human eye, many image processing algorithms are still sensitive to it and may accentuate the artifacts introduced by lossy compression. A good example would be matte extraction tools. You should evaluate any lossy codec you intend to use at the top of your pipeline (for example for input plates) and determine the best settings so as to not introduce unexpected difficulties or generation loss in subsequent processing steps. The DWAA and DWAB codecs are very high quality at the lower compression levels, but if you find they are not working for you, you can fall back to one of the lossless codecs in EXR and still gain the benefit of some compression.
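If you would like to try it yourself, one sketch (again assuming oiiotool and a hypothetical render.exr; the number after the colon is the DWA compression level, with 45 being the commonly cited default) looks like this:
oiiotool render.exr --compression dwaa:45 -o render_dwaa45.exr
ls -lh render.exr render_dwaa45.exr
Raise or lower the level and compare both the file sizes and the images until you land on a setting you trust.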
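A simple way to start that evaluation (sticking with oiiotool; the file names are hypothetical) is to round-trip a representative plate through the lossy setting you are considering, look at the difference statistics, and then run your actual keyer or other downstream tools on both versions:
oiiotool plate.exr --compression dwaa:45 -o plate_dwaa.exr
oiiotool plate.exr plate_dwaa.exr --diff
The --diff report is a starting point, not a substitute for testing the operations you actually care about.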
Most lossy codecs are computationally intense, as their pipelines include the very same algorithms used in lossless compression (like Huffman coding) in addition to others. Let’s look at a very common type of lossy codec: video codecs. Video codecs are a pretty complicated topic, and rather than go into too much detail I will stick to a 10,000-foot overview.
When it comes to video codecs, there are two basic types: I-frame only and Long GOP. I-frame only codecs compress video on a frame by frame basis; that is to say, there are no dependencies between frames. Examples are DNxHD, ProRes, MJPEG and frame based formats like JPEG2000, JPEG, PNG, and OpenEXR. These frame based codecs can be “packed” into a single giant movie file in a wrapper like .mp4, .avi or .mov. It’s not as common as using video-specific codecs that way, but you do encounter it from time to time in the wild.
Depending on semantics, “uncompressed video” could also mean “losslessly compressed” if such an option is available. These would usually be included in the family of I-frame codecs. There are several lossless codecs available for movie-type formats, including FFV1 and HUFFYUV. Lossless video codecs are great since they straddle the fence between using no compression at all and the lossy options. They don’t provide the same level of compression as lossy codecs, but they don’t damage the image at all either.
Long Group of Pictures (GOP) codecs push compression further by leveraging the fact that there are usually similarities between neighboring frames of video. In a Long GOP codec, each frame may depend on frames that come before or after it. As a result, these codecs tend to be quite computationally intensive to both compress and decompress. They are also tricky to decode when shuttling forward, or especially backward, due to the dependence on surrounding frames. Examples of some modern Long GOP codecs are H.264, H.265 and AV1. The benefit is that Long GOP codecs are able to produce significantly higher compression ratios than I-frame only codecs (or superior quality for the same bandwidth).
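If you want to experiment with a lossless movie, ffmpeg can wrap an image sequence in FFV1 inside a Matroska container (the frame names and rate here are hypothetical; adjust them to your sequence):
ffmpeg -framerate 24 -i frame.%04d.png -c:v ffv1 -level 3 review_lossless.mkv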
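For comparison, a Long GOP encode of the same hypothetical sequence with x264 might look like this (-crf 18 is a commonly used “visually very good” setting, and yuv420p keeps the file playable in most players):
ffmpeg -framerate 24 -i frame.%04d.png -c:v libx264 -crf 18 -pix_fmt yuv420p dailies_h264.mp4
Comparing this file's size with the FFV1 one above gives a concrete sense of how much inter-frame prediction buys you.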
For VFX and animation work, both types of formats have their place. For example, lossless I-frame only movie files might be used for internal review, while lossy Long GOP or I-frame only movie files might be sent over the Internet to the client as dailies.
The choice of whether to use lossy or lossless compression for output frames and preview movies depends on the resources available and the goals of the studio. If the studio has a lot of resources and a purist approach to the process, they can stick with losslessly compressed formats for everything. If saving disk space and network bandwidth is a higher priority, then it’s possible that lossy formats would take a more prominent role in parts of the pipeline.
I hope this blog post gives you a good place to start thinking about how you deploy compression in your workflows.