Lossless Audio Compression
By Amir Majidimehr
This article is about the basics of lossless audio compression. Before we get into this topic, let’s first look at what uncompressed audio looks like. Here, we are talking about digital samples that are N number of bits wide, X number of channels, running at a certain sampling frequency. For CD audio, each sample is 16 bits or two bytes, stereo (two channels), and 44,100 samples/sec. Multiply all of this and you get 1.4 Megabits per second. Converted to bytes by dividing by 8, we get about 176 Kilobytes/sec. So a typical 3-minute song takes up about 32 Megabytes of storage without any form of compression. This may not seem like much, but multiply this by three to get six channels and make it a movie soundtrack at 90 minutes and you require an entire DVD just for a single audio track! So some form of compression is handy to have.
Our first tool for dealing with such large amounts of data is lossy compression. “Perceptual coding” is the technique used in lossy compression: we model the human ear and, based on its characteristics, are able to sharply reduce the file size with comparatively little loss of fidelity. For example, 128 kbps MP3/AAC/WMA represents an 11:1 compression (1.4/0.128). Inverted, only 9% of the data is kept and the rest is discarded! No matter how much you may be put off by the notion of lossy compression, you have to admire how well it operates given so little data.
I am hoping that you are already familiar with the concept of lossless compression of data. Just about any program you download from the Internet uses lossless compression to reduce its size and hence speed delivery to your computer. The most common utility that performs this function is “zip.” How does it work? Simply put, instead of having every piece of data represented as a fixed number of bits (e.g. 8 bits for each one of the characters in this article), fewer bits are allocated to the most common values (e.g. vowels in the English language) than to less often used values (e.g. the letter X in English). Since by definition there are more of the common values, representing them with fewer bits gains us a significant level of compression. Actual algorithms are more complex than this but you get the idea.
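The idea can be sketched with a tiny Huffman coder, the classic algorithm behind this kind of variable-length coding. This is a simplified illustration of the principle, not literally what zip does (zip’s DEFLATE combines Huffman coding with other techniques):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # Heap entries: (frequency, unique tie-breaker, node).
    # A node is either a symbol (leaf) or a (left, right) pair.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # merge the two rarest nodes;
        f2, _, right = heapq.heappop(heap)  # rare symbols sink deeper in the tree
        heapq.heappush(heap, (f1 + f2, i, (left, right)))
        i += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("lossless audio compression")
# The frequent letter 's' receives a code no longer than the rare 'd'.
```

Because each symbol’s code ends where another can begin (no code is a prefix of another), the decoder can split the bit stream back into symbols without any separators.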
Alas, zipping techniques do not work with audio. Indeed, if you attempt to zip an audio file, its size may even grow rather than shrink! The reason is that audio, at first blush, is a highly incompressible type of content. The digital samples from a single instrument represent a complex waveform. Combine multiple instruments and vocals and you have something that appears to be a totally random set of numbers with little to no redundancy to remove.
Fortunately, by applying some simple mathematics we can extract redundancy that is hidden in the samples. Take the following audio samples for example: 3, 7, 11, 15. If you feed this to the zip program, it sees that the numbers are all different and gives up on compressing them. But if we look carefully, we see that each sample is made up of the previous sample plus 4. So instead of storing four numbers, we could simply store the first one and tell the decoder to keep adding 4 to each sample to get the next one in the sequence. In this sense, we would need only two numbers: the initial number “3” and the differential of “4.” The decoder can synthesize the rest of the numbers, giving us a 2:1 compression ratio (four numbers becoming two).
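In code, this differential trick looks like the following toy sketch (real codecs work on blocks of thousands of samples, but the principle is the same):

```python
def delta_encode(samples):
    """Store the first sample, then only the change from one sample to the next."""
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(deltas):
    """Rebuild the original samples by accumulating the stored differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

print(delta_encode([3, 7, 11, 15]))  # [3, 4, 4, 4]
```

Notice that the output is the initial “3” followed by a run of identical “4”s, which a simple coder can then squeeze down to almost nothing.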
The above is a trivial example of what we call “Linear Predictive Coding” or LPC for short. A pretty fancy signal processing term to be sure but fortunately for us, it has its roots in simple algebra. Linear means a set of samples that follow a line. Prediction means that we use the past samples to predict the future ones. You can see how I used both of these aspects in my above example. I assumed the samples were on a line and that the only thing that separated them was an offset. If you remember your college math, this is a form of “curve fitting.” I am trying to find a curve (a line in this scenario) that matches the sequence of numbers.
Of course I cheated in my example by assuming the decoder already knew the shape of the line and that the samples kept going that way forever. Real life is much more complex than that. Numbers may follow a line more or less but not precisely. In the above example, the samples could be 3, 9, 13, and 15. In this case, the lossless compressor still pretends that they line up perfectly. But it also keeps track of the “error” from the perfect line. It would still generate the initial value “3” and increment of “4” but it also has to transmit the error that would result from following the straight line. The “residual error” in this example would be “2” for the second sample, “2” for the third, and “0” for the last.
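Here is the same idea as a sketch: predict along the line, keep the leftover errors, and reconstruct the original exactly (the function names are mine for illustration, not from any real codec):

```python
def lpc_residuals(samples, start, step):
    """Residual = actual sample minus the straight-line prediction."""
    predicted = [start + step * n for n in range(len(samples))]
    return [s - p for s, p in zip(samples, predicted)]

def lpc_reconstruct(residuals, start, step):
    """Decoder side: walk the line and add each stored residual back in."""
    return [start + step * n + r for n, r in enumerate(residuals)]

res = lpc_residuals([3, 9, 13, 15], start=3, step=4)
print(res)  # [0, 2, 2, 0]
# The round trip is bit-exact -- that is what makes it lossless.
assert lpc_reconstruct(res, start=3, step=4) == [3, 9, 13, 15]
```

The residuals are much smaller than the samples themselves, and small numbers take fewer bits to transmit.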
The residual error values must also be transmitted to the receiver efficiently. Fortunately, a technique called “Rice coding” (honest, that is what it is called!) is used to compress those error values. The reason it works is that we can show mathematically that the error values have a favorable distribution, with small values far more common than large ones, and hence they can be coded in very few bits.
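A minimal sketch of the scheme follows. The zigzag step first maps signed residuals to non-negative integers; Rice coding then writes the quotient in unary and the remainder in k binary bits. Real codecs choose the parameter k per block of samples; here it is fixed at 1 purely for illustration:

```python
def zigzag(n):
    # Interleave signed values so small magnitudes become small non-negatives:
    # 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
    return 2 * n if n >= 0 else -2 * n - 1

def rice_encode(n, k):
    """Encode non-negative n: quotient in unary ('1's then a '0'), then k remainder bits."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

residuals = [0, 2, 2, 0]
bits = "".join(rice_encode(zigzag(r), k=1) for r in residuals)
print(bits)  # '001100110000' -- 12 bits for four residuals
```

Four 16-bit samples would have cost 64 bits raw; the four residuals above fit in 12. Small residuals produce short codes, which is exactly the distribution LPC hands us.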
Predicting the shape of the line (i.e. the formula that tells us the approximate value of the next sample) is more complex yet. Different lossless schemes use varying techniques for arriving at what formula best represents future samples. The encoder may try multiple permutations, analyzing each iteration to see if it was more or less efficient. This makes the encoder slower but fortunately, computers have gotten so fast these days that the computational complexity is not a major concern. And outside of live broadcast, we can encode once and be done with it.
But wait, there is more! We have another powerful tool to apply in cases where there is more than one channel of audio. Listen to a typical audio track and you notice the same frequencies often coming out of both speakers (1960s Beatles music excepted). A lossless encoder can divide the spectrum into two or more bands and isolate the mid and low frequency components that tend to be shared between the two channels. After this division, it can apply different techniques to reduce the data rate, such as subtracting the common signal from both channels.
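A simple form of this channel decorrelation is mid/side coding, sketched below. This keeps the full sum so the round trip is exact; real codecs such as FLAC use a slightly different integer-exact variant, so treat this as an illustration of the idea only:

```python
def mid_side_encode(left, right):
    """Store the sum (mid) and difference (side). When the channels are
    correlated, the side signal is small and compresses very well."""
    mid  = [l + r for l, r in zip(left, right)]
    side = [l - r for l, r in zip(left, right)]
    return mid, side

def mid_side_decode(mid, side):
    """Invert exactly: (mid + side) is always even, so // 2 loses nothing."""
    left  = [(m + s) // 2 for m, s in zip(mid, side)]
    right = [(m - s) // 2 for m, s in zip(mid, side)]
    return left, right

# Two nearly identical channels: the side signal hovers near zero.
L = [100, 102, 98, 101]
R = [101, 103, 97, 101]
mid, side = mid_side_encode(L, R)
print(side)  # [-1, -1, 1, 0]
```

The side channel is now a stream of tiny numbers, which the same Rice coding described above can shrink dramatically.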
Movie tracks provide even more opportunity for elimination of redundancy as the rear channels are often quiet without much sound in them. For this reason, lossless 5.1 channel codecs can achieve compression efficiencies that are much higher than stereo. For example, Dolby TrueHD, used in the Blu-ray Disc format, can achieve better than 3:1 compression in 6-channel, 16-bit/48 kHz mode, whereas the best 2-channel lossless codec can rarely exceed 2:1 compression.
Of note, while there are a number of different lossless codecs, what separates them is only a single digit difference in compression efficiency. The best may shrink an audio file to 55% of its original size while the worst leaves it at 60%. Unfortunately, that extra 5% may require more work on the part of the encoder, making it slower. Again, on a PC this is not material as computers are plenty fast at either task. So your choice is more determined by the hardware or software that supports the specific codec than by which codec compresses best (unlike lossy codecs, which do sound different from each other).
By the way, despite folklore on the Internet, lossless audio codecs do not change the sound as they are proven to mathematically reproduce the original data stream. Nothing is gained or lost. That said, playing a lossless track can sound different on a PC than the original. Why? Well, that is the topic for another article.
Last but not least, note that the data rate for a lossless audio (or video) codec is NEVER fixed. A lossy codec like MP3 or Dolby AC-3 can force the data rate to be fixed by varying quality. A lossless codec by definition cannot vary quality. So as a result, it has no choice but to let the data rate spike as it wants when it sees a complex waveform it cannot shrink. As a rule, the spikes never exceed the rate of the uncompressed stream, as the codec can simply choose to pass the original data through rather than “compress” it into a larger data set (this is how the zip problem of files growing is avoided). So if you are streaming audio around your home and are using lossless compression, you need to plan for the full data rate of the source even though you are benefitting from lossless compression in the actual amount of data transferred. For CD audio, for example, this will always be 1.4 Mbits/sec.