Happy New Year dear fellow geeks!
Many New Year celebrations around the world involve, at some point, listening to the national anthem or some other background music. On New Year’s Eve I had an intriguing discussion with friends about music and data compression, and I decided to take the opportunity and turn part of that discussion into this post.
We are constantly surrounded by sounds, and it is our brain’s task to distinguish between various tones, keep what is important to us and filter out what it decides is useless. The first part of this post gives a brief overview of the psychoacoustic phenomenon named “auditory masking”, while the second elaborates on the perceptual coding schemes used for lossy music data compression. I will keep formulas and complicated modelling to a minimum, since I want to focus on the principle and draw some conclusions.
We all know that the human ear can normally hear sounds with frequencies from about 20 Hz up to around 20 kHz, and that the upper limit drops with age. To illustrate, here is what a “Bode” plot of the human ear looks like:
An equal loudness human ear sensitivity plot.
The ear’s sensitivity and perceived frequency resolution vary within the audible range; the most sensitive region appears to be roughly 1-4 kHz, which also covers much of the frequency content that matters for understanding human speech. This property, together with the “masking” phenomenon, helps us understand what the human ear cannot perceive and what can therefore be excluded from the music information we store.
So what is auditory masking? Imagine being at a bus station, talking to a friend, when suddenly a noisy bus arrives at the stop. You are no longer able to hear your friend, but you can clearly see his lips moving, and he is in fact still producing audible speech. His speech has been masked by the noise from the bus engine. Here is an example, narrowed down to just a few tones:
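For readers who want to play with the shape of such a sensitivity curve, here is a minimal sketch. It uses Terhardt's well-known closed-form approximation of the threshold of hearing in quiet, which is my choice for illustration and not necessarily the curve shown in the plot above. The function name `threshold_in_quiet_db` is made up for this example.

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate threshold of hearing in quiet, in dB SPL.

    Terhardt's curve fit; an approximation for an average young
    listener, not a measurement of any particular ear.
    """
    f = freq_hz / 1000.0  # the formula expects kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)

if __name__ == "__main__":
    # Lower values mean the ear is more sensitive at that frequency.
    for freq in (50, 100, 500, 1000, 3300, 10000, 16000):
        print(f"{freq:>6} Hz: {threshold_in_quiet_db(freq):6.1f} dB SPL")
```

The curve dips in the low kHz range, where the ear is most sensitive, and rises steeply towards both ends of the audible band.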
Auditory masking phenomenon example.
Once the masker exceeds a certain amplitude (marked as the mask threshold), we can no longer perceive the weaker tone of interest. This is known as the auditory masking phenomenon:
Auditory masking phenomenon, a more general case.
The mask threshold varies with the frequencies of the masker tones and of the tone of interest: the further away in frequency the masker tones are from the tone of interest, the higher the mask threshold, i.e. the louder a masker has to be before it hides that tone.
Building on this phenomenon, researchers at the Fraunhofer Institute for Integrated Circuits in Erlangen, among others, developed a clever idea known as perceptual coding. It was standardized as MPEG-1 Audio Layer III in the early 1990s, and the “.mp3” file name extension followed in 1995. Perceptual coding is a lossy audio encoding scheme that uses fewer quantization steps (coarser quantization) for those sounds contained in the music/sound data which are inaudible. The now popular mp3 encoding scheme uses perceptual coding and the masking phenomenon for lossy audio compression.
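The masking behaviour described above can be sketched in a few lines. This is a toy model under loud assumptions: the two slopes and the offset are illustrative numbers I picked, not values from any standard, and real encoders use far more elaborate models. Only the Hz-to-Bark conversion (the Zwicker-Terhardt approximation of the critical-band scale) is a commonly used formula. All function names are hypothetical.

```python
import math

def hz_to_bark(freq_hz):
    """Zwicker-Terhardt approximation of the critical-band (Bark) scale."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

# Illustrative assumptions only, chosen to give a plausible triangle shape.
LOWER_SLOPE_DB_PER_BARK = 27.0  # masking falls off steeply below the masker
UPPER_SLOPE_DB_PER_BARK = 10.0  # and more gently above it
MASKER_OFFSET_DB = 10.0         # peak of the masked region sits below the masker

def _attenuation_db(tone_hz, masker_hz):
    """How much the masking effect has decayed at the tone's frequency."""
    dz = hz_to_bark(tone_hz) - hz_to_bark(masker_hz)
    slope = UPPER_SLOPE_DB_PER_BARK if dz >= 0 else LOWER_SLOPE_DB_PER_BARK
    return MASKER_OFFSET_DB + slope * abs(dz)

def masker_level_needed_db(tone_hz, tone_level_db, masker_hz):
    """Masker level at which the tone of interest just becomes inaudible."""
    return tone_level_db + _attenuation_db(tone_hz, masker_hz)

def is_masked(tone_hz, tone_level_db, masker_hz, masker_level_db):
    """True if the masker is loud enough to hide the tone in this toy model."""
    return masker_level_db > masker_level_needed_db(tone_hz, tone_level_db, masker_hz)

if __name__ == "__main__":
    # A loud 1 kHz masker hides a nearby quiet tone but not a distant one.
    print(is_masked(1100, 50, masker_hz=1000, masker_level_db=80))
    print(is_masked(4000, 50, masker_hz=1000, masker_level_db=80))
```

In this sketch the required masker level grows with the distance, in Bark, between the masker and the tone of interest, which is the trend described above.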
Because I like drawing, here is a simple block diagram of the perceptual coding scheme:
A principle diagram of the perceptual coding scheme.
The core of the encoder is the perceptual model of the human ear. Based on this model, a bank of band-pass filters is tuned according to the ear’s critical bandwidth at various frequencies. We need these filters so that, at a later step, we can reduce the level of detail in specific bands. The frequency spacing used for the filters determines the masking threshold.
The perceptual model steers a variable quantization and sampling rate control block. In practice, a tone that the model deems inaudible is quantized with far fewer steps within its band of the filter bank. Rounding or truncation of the digital values is normally used for this information reduction, although variable sampling rate control can sometimes be used as well. After information reduction, some additional noiseless (lossless) encoding, e.g. Huffman coding, is performed before the final bit-stream packing.
Ideally, if the perceptual model of the human ear had infinite precision, such coding schemes would sound perfect, i.e. indistinguishable from lossless. Unfortunately this is not the case in reality, but hey, mp3 has changed the world nevertheless. Isn’t this an incredibly elegant invention of the 20th century? Here is my example for an application of this clever lossy compression scheme 🙂
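To make the "fewer quantization steps for inaudible content" idea a bit more tangible, here is a minimal sketch of per-band bit allocation followed by uniform quantization. It is not how an actual mp3 encoder is implemented. It assumes that each band comes with a signal level and a masking threshold in dB (the numbers below are invented), and it uses the rule of thumb that one bit of resolution buys roughly 6 dB of signal-to-noise ratio. The names `bits_for_band` and `quantize` are hypothetical.

```python
import math

def bits_for_band(signal_db, mask_db, max_bits=16):
    """Spend about one bit per 6.02 dB by which the band exceeds its mask.

    A band at or below its masking threshold is inaudible and gets 0 bits.
    """
    smr = signal_db - mask_db  # signal-to-mask ratio
    if smr <= 0:
        return 0
    return min(max_bits, math.ceil(smr / 6.02))

def quantize(samples, bits):
    """Uniform mid-rise quantizer for samples in [-1, 1] using 2**bits levels."""
    if bits == 0:
        return [0.0 for _ in samples]  # band dropped entirely
    levels = 2 ** bits
    step = 2.0 / levels
    out = []
    for x in samples:
        index = math.floor(x / step)
        # Clamp so that x == 1.0 still maps to the top level.
        index = max(-(levels // 2), min(levels // 2 - 1, index))
        out.append((index + 0.5) * step)
    return out

if __name__ == "__main__":
    # Invented (signal level, masking threshold) pairs in dB, one per band.
    bands = [(70.0, 20.0), (55.0, 50.0), (30.0, 45.0)]
    samples = [-0.8, -0.25, 0.0, 0.33, 0.9]
    for signal_db, mask_db in bands:
        bits = bits_for_band(signal_db, mask_db)
        print(bits, quantize(samples, bits))
```

The quantized values would then go through a lossless coder such as Huffman coding; that step is left out here to keep the sketch short.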