- Jul 19, 2022
In 1999 the world of music was changed forever with the launch of Napster. Napster was a peer-to-peer (P2P) file sharing system biased towards music in MP3 format. Considering the internet was still largely on dial-up at the time, network speeds were favourable for music file sharing, as a typical four-minute MP3 file could be as small as 3 MB at a bitrate of 96 kbit/s. To access the P2P network, one just had to download and install the Napster app and start choosing files. As soon as your app had stored a file or even part of it, the P2P functionality would begin sharing that file with other users. It didn’t take long for the music industry to react, however, and by 2001 Napster was shut down, buried under an avalanche of lawsuits. Even though it ultimately failed, Napster woke the music business to the realisation that the internet was a threat to its existing business model. Over at Apple, however, Steve Jobs saw the coupling of music and the internet as an opportunity. Prior to the demise of Napster, Apple had launched iTunes in January 2001, which was aimed at allowing users to create a personal music library on their computer, and followed that up with the launch of the iPod in October 2001. The portable MP3 player was nothing new but Steve Jobs thought that the existing units were either cheap and nasty or too big and clunky. The iPod was an instant success and by the time the product was discontinued in 2022, Apple had sold around 450 million units worldwide. With the launch of the iTunes Store in 2003, Apple had a tightly integrated software/service/hardware environment that for a while was the dominant player in the digital music market. The next big thing would be launched in 2007 in the form of the first iPhone.
Devices and networks
As iPhones and other smartphones proliferated, the mobile network became more capable and ubiquitous. It wasn’t until the 2G digital networks started coming online that there was any useful data bandwidth available to mobile phone users, and for practical purposes the speeds available were nothing special (around 40 kbit/s) until 3G superseded it. 3G offered data speeds of around 144 kbit/s and with ongoing revisions eventually reached over 14 Mbit/s. 3G networks started coming online in the early 2000s and were widespread by the end of the decade, which boosted the utility of smartphones enormously.
The birth of streaming
Even with this expanded bandwidth, music streaming was still not a major force until well into the 2000s. One of the pioneers, Spotify, had started up in 2006 but didn’t make much of a splash until it opened up registrations for UK subscribers in 2009. The demand was so overwhelming that the open registration model was switched to invitation only to cope with the traffic. Spotify launched in the United States in July 2011. Apple’s answer to Spotify was Apple Music, which launched in 2015. There are now numerous other streaming services, such as Pandora and Deezer, with some, like Tidal, promising audio quality to “Master” level.
Formats, algorithms and compression
MP3 is the granddaddy of music compression. Its full name is MPEG-1 Audio Layer III or MPEG-2 Audio Layer III. MPEG stands for the Moving Picture Experts Group. MPEG was established in 1988 on the initiative of Dr. Hiroshi Yasuda and Dr. Leonardo Chiariglione. The first MPEG meeting was in May 1988 in Ottawa, Canada. By the late 1990s and continuing to the present, MPEG had grown to include approximately 300–500 members per meeting from various industries, universities, and research institutions combining their expertise for the express purpose of developing methods and standards for digital file compression. Video was the primary goal but audio compression was a natural development in the process. The outcome of the group’s efforts at compressing audio was impressively good. Compared to the file size of an uncompressed CD sample, the same audio as an MP3 can commonly achieve a 75 to 95% reduction in file size. For example, an MP3 encoded at a constant bitrate of 128 kbit/s would result in a file approximately 9% of the size of the original CD audio. As a result of this and the popularity of the format, compact disc players increasingly adopted support for playback of MP3 files on data CDs by the early 2000s. Naturally, as bitrates are reduced, compression artefacts become more and more noticeable, often manifesting as an audible harshness which can be quite fatiguing to listen to. However, criticising MP3 for lack of audio quality is missing the point. Its purpose was to enable the transfer of audio files over moderately fast computer networks.
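The 9% figure can be checked with a little arithmetic. The sketch below assumes standard CD audio parameters (44.1 kHz, 16-bit, stereo) and constant-bitrate MP3 with no container overhead; the function names are just illustrative.

```python
# CD audio: 44,100 samples/s x 16 bits x 2 channels = 1,411.2 kbit/s.
CD_KBPS = 44_100 * 16 * 2 / 1000


def size_ratio(mp3_kbps):
    """Size of a constant-bitrate MP3 as a fraction of the CD original."""
    return mp3_kbps / CD_KBPS


def file_size_mb(kbps, seconds):
    """Approximate file size in megabytes (10**6 bytes), ignoring metadata."""
    return kbps * 1000 * seconds / 8 / 1_000_000


if __name__ == "__main__":
    print(f"128 kbit/s MP3 vs CD: {size_ratio(128):.1%}")
    print(f"4 min at 96 kbit/s:  {file_size_mb(96, 240):.2f} MB")
    print(f"4 min of CD audio:   {file_size_mb(CD_KBPS, 240):.2f} MB")
```

A four-minute track at 96 kbit/s comes out just under 3 MB, against roughly 42 MB for the same track as uncompressed CD audio.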
So how is data compression achieved? Compression is achieved by editing out parts of the audio that the encoder determines are inaudible or masked by louder components of the audio. Because of this process of editing, the compression is classified as “lossy” as opposed to “lossless”. This process is the outcome of the study of psychoacoustics which is the branch of psychophysics involving the scientific study of sound perception and audiology—how humans perceive various sounds. More specifically, it is the branch of science studying the psychological responses associated with sound (including noise, speech, and music). Psychoacoustics is an interdisciplinary field of many areas, including psychology, acoustics, electronic engineering, physics, biology, physiology, and computer science. In audio file compression, the primary focus is on masking which is important for all types of lossy encoding. To explain masking I’m going to quote a Wikipedia article:
“Suppose a listener can hear a given acoustical signal under silent conditions. When a signal is playing while another sound is being played (a masker), the signal has to be stronger for the listener to hear it. The masker does not need to have the frequency components of the original signal for masking to happen. A masked signal can be heard even though it is weaker than the masker. Masking happens when a signal and a masker are played together—for instance, when one person whispers while another person shouts—and the listener doesn't hear the weaker signal as it has been masked by the louder masker. Masking can also happen to a signal before a masker starts or after a masker stops. For example, a single sudden loud clap sound can make sounds that immediately precede or follow inaudible. The effects of backward masking is weaker than forward masking. The masking effect has been widely studied in psychoacoustical research. One can change the level of the masker and measure the threshold, then create a diagram of a psychophysical tuning curve that will reveal similar features. Masking effects are also used in lossy audio encoding, such as MP3.”
Consequently, all lossy music encoders perform similar functions. Encoding software and the accompanying decoding software are collectively known as a “codec”, and anyone who has made use of different file compression formats will be familiar with the term and maybe also the universe of options available. Among the more commonly known and used formats are AAC (Advanced Audio Coding) and Vorbis, which are lossy, FLAC, which is lossless, and WAV, which usually holds uncompressed audio, plus a bunch of others.
Sample rate, bit depth and bit rate
- Sample rate is the number of audio samples recorded per unit of time.
- Bit depth measures how precisely the samples were encoded, the resolution.
- Bit rate is the number of bits that are processed per unit of time and relates to the needed bandwidth on a network.
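For uncompressed (PCM) audio the three quantities are tied together by simple multiplication. This is a rough sketch with illustrative function names; the dynamic range figure is the usual theoretical estimate of about 6 dB per bit, not a measurement of any real device.

```python
import math


def pcm_bit_rate(sample_rate_hz, bit_depth, channels=2):
    """Bits per second of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels


def quantisation_levels(bit_depth):
    """Number of distinct values each sample can take."""
    return 2 ** bit_depth


def theoretical_dynamic_range_db(bit_depth):
    """Ratio of largest to smallest representable step, in decibels."""
    return 20 * math.log10(2 ** bit_depth)


if __name__ == "__main__":
    for rate, depth in [(44_100, 16), (96_000, 24)]:
        print(
            f"{rate} Hz / {depth}-bit stereo: "
            f"{pcm_bit_rate(rate, depth) / 1000:.1f} kbit/s, "
            f"{quantisation_levels(depth)} levels, "
            f"{theoretical_dynamic_range_db(depth):.1f} dB"
        )
```

Compressed formats break this direct relationship: the encoder picks the bit rate, which is why a 128 kbit/s MP3 can still carry 44.1 kHz stereo audio.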
The mechanics of streaming
So what is the actual structure of a streaming audio service? Well, it’s pretty simple. The digital files are stored on the host system and exist there in their encoded form, AAC in the case of Apple Music and Vorbis in the case of Spotify. The subscriber uses an app or player that requests the chosen file and streams it off the host system. The host system is usually a combination of file servers with intermediate streaming servers. The client is actually connected to the streaming server. The app or website will have various features built into it to create playlists and libraries and other user-centric functions, but really it’s no different or more complex than you playing a file on your notebook PC that’s stored on your home server or NAS via your home router/modem. Whether lossy or lossless files are chosen, the mechanics are identical. The encoded audio streams are assembled in a container “bitstream” such as MP4 or FLV. The bitstream is delivered from a streaming server to a subscriber using a transport protocol, such as RTMP (Real-Time Messaging Protocol) or RTP (Real-time Transport Protocol), which are both audio/video-specific protocols. Music streaming employs the “unicast” method of server-to-node connection: the server supplies a separate connection to each destination, so multiple destinations can select and receive the same content. This means that thousands of subscribers can request and receive the same file at the same time and it will play for them individually. This method is more demanding on servers and, more importantly, on network infrastructure than “multicasting”, where a single stream from one source is shared by multiple subscribers. With multicasting, a subscriber who joins after the beginning of the stream won’t be able to see the start. Multicasting is common for live streamed events.
Streaming at a network level
The common thread in all streaming services is the network which forms a backbone upon which the stream can be transmitted. This is the same for your home network or accessing a file from Tidal or Spotify. The streaming data is contained in “packets” of digital data with the most common being the IPv4 form. Internet Protocol version 4 is the fourth version of the Internet Protocol (IP). It is one of the core protocols of standards-based internetworking methods in the Internet and other packet-switched networks. This graphic shows the structure of a single “packet” of IPv4 data.
Now I’m not going to pull this apart in detail because………eyes……. glazing…………over…… However, one thing needs to be illuminated, particularly in the scope of music streaming, and that is the error detection built into the protocol.
IPv4 has a header checksum to detect errors in the layer-3 IPv4 packet header, and it discards any packets not matching the header checksum, so the payload never reaches the transport layer. Routers only check the IPv4 header checksum. If the header is corrupted the packet is dropped. Payload or higher-layer errors are not detected at the router. The loss of packets is not terminal for the stream, as most systems will have buffered a cache of packets that are being decoded by the receiving system. The buffer also buys time for corrupt or missing packets to be re-sent (retransmission is handled by higher-layer protocols such as TCP, not by IPv4 itself). Packet losses are only significant when the buffer runs dry and incoming packets are corrupted or missing. In this situation, audible artefacts can be heard on the system decoding the stream. Ethernet networks are largely immune to any form of analogue noise or interference. Ethernet cabling employs the same techniques as balanced audio cabling inasmuch as the cable pairs are twisted to take advantage of “common mode rejection”, where hum and noise electromagnetically cancel in the cable and subsequent buffer amplifier circuits. Regardless of this, any kind of “noise” would have to cause packet corruption or loss to impinge on the Ethernet signal at all.
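For the curious, the header check a router performs is small enough to sketch in a few lines. This is a minimal illustration of the one's-complement checksum defined for the IPv4 header, not production networking code; the sample header bytes and function names are just an example chosen for the demonstration.

```python
import struct


def ipv4_header_checksum(header: bytes) -> int:
    """One's-complement of the one's-complement sum of all 16-bit words."""
    if len(header) % 2:
        header += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    # Fold any carry bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def header_is_valid(header: bytes) -> bool:
    """A header with its checksum field filled in correctly sums to zero."""
    return ipv4_header_checksum(header) == 0


# Example 20-byte header with the checksum field (bytes 10-11) set to zero.
EXAMPLE_HEADER_ZEROED = bytes.fromhex(
    "45000073" "00004000" "40110000" "c0a80001" "c0a800c7"
)

if __name__ == "__main__":
    checksum = ipv4_header_checksum(EXAMPLE_HEADER_ZEROED)
    filled = (
        EXAMPLE_HEADER_ZEROED[:10]
        + checksum.to_bytes(2, "big")
        + EXAMPLE_HEADER_ZEROED[12:]
    )
    print(f"checksum: {checksum:#06x}, valid: {header_is_valid(filled)}")
    # Flip one bit in the first byte to simulate corruption in transit.
    corrupted = bytes([filled[0] ^ 0x01]) + filled[1:]
    print(f"after corruption, valid: {header_is_valid(corrupted)}")
```

Note that this covers the header only. As described above, the payload carrying the audio is not checked at this layer.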
A to D, D to A
If there’s one area where audiophile fantasies can run wild it’s in the analogue-to-digital and the digital-to-analogue part of the process. In truth, there isn’t a lot of variety in the hardware involved in this part of the operation. There might be a slew of manufacturers and models but inside, there is a limited range of integrated circuits that perform the lion’s share of the A-D/D-A work. ESS, Analog Devices, Texas Instruments and AKM are a few of the leading DAC manufacturers but when you dip into the specs, you find very similar approaches to the job of decoding a digital audio signal. Indeed there has to be a relatively uniform method, as the job of conversion relates to the standard that the audio was encoded to in the first place. Much of the variation at the manufacturing level comes down to the choice of outboard components and critical circuit board design. In a practical sense, the subscriber hanging off a streaming server connection has zero influence over how the stream is encoded. The only choice that can be made is the equipment on the decoding end.
The “High End” DAC
Let’s pick the Weiss Engineering DAC502. This unit retails for something in the order of US$10,000. It seems to have garnered very positive reviews from such sage institutions as Stereophile. However, it’s just a shiny box that contains a circuit board with a bunch of op-amps, resistors and caps surrounding two ESS Sabre D/A chips.
This manufacturer has chosen to run one decoder per channel and parallel the outputs to improve dynamic range (DNR) and signal-to-noise ratio (SNR). Nothing special, it’s a configuration suggested on the chip manufacturer’s data sheet. If you read the manufacturer’s blurb on the unit you’ll find a lot of stuff that seems to suggest that they’re doing something special with the clock and “room EQing” etc., all of which is built into the chip and has nothing to do with Weiss Engineering. In fact, I would suggest that the only “custom” part of this unit is the daughter board which runs the front panel display. So how much is an ESS Sabre chip? Fifty bucks will get you one, so this Weiss 502 has $100 worth of DAC chips in it. ESS chips also feature in DAC units from US$500 to US$1,200, and even in portable units for thirty bucks, as the supply voltage (VCC) can be as low as 3.3V.
Audiophile network equipment
So what do we know about music streaming? We know that we have no control over the encoding and streaming of the music we listen to. We also know that a group of up to 500 experts formulated the compression algorithms that form the basis of all audio and video file compression techniques. We also know that these files are transmitted over networks in ‘packets’, using protocols that network engineers have designed to be robust and reliable. They have error checking built into them and their only enemy is a slow/intermittent connection that can cause lost, late or corrupted packets. Moreover, analogue noise is not an issue on digital networks unless it is causing corrupt/lost packets. Another bogeyman in the audiophile universe is “jitter”. Jitter is a product of signal degradation and can be caused by external interference or simple cable attenuation. Buffering is the primary anti-jitter tactic and any modern digital electronic system will have some capacity for buffering. If this proves insufficient for the prevailing conditions, jitter would present itself as audible artefacts like clicks, pops or dropouts.
Therefore, it is absolute fantasy to represent any type of computer network hardware as being special from the perspective of audio streaming. This is particularly true of modern equipment and networks, which now far exceed the capacity and speeds that a simple audio stream requires. If you’re a fan of Tidal, their (lossy) MQA format streams at 1.4 Mbit/s. If you’re regularly watching HD video on your TV, that’s streaming at 5 Mbit/s and 4K is running at 20 Mbit/s. Not even a thirty buck dumb switch or a poverty spec home router will have a problem with a 1.4 Mbit/s stream! But there’s plenty of breathless BS out there supporting manufacturers’ claims of improved performance/sound with their overpriced, repackaged junk products.
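To put those numbers in perspective, here is a back-of-envelope sketch of how many such streams a link could carry at once. The stream rates are the ones quoted above; the 80% usable-capacity figure is my own assumption to allow for protocol overhead, not a measured value, and the function name is illustrative.

```python
import math

# Stream rates in Mbit/s, as quoted in the text.
STREAM_MBPS = {"audio": 1.4, "HD video": 5.0, "4K video": 20.0}


def concurrent_streams(link_mbps, stream_mbps, usable_fraction=0.8):
    """Whole streams that fit in the usable part of a link (assumed 80%)."""
    return math.floor(link_mbps * usable_fraction / stream_mbps)


if __name__ == "__main__":
    for link in (100, 1000):
        for name, rate in STREAM_MBPS.items():
            count = concurrent_streams(link, rate)
            print(f"{link} Mbit/s link: {count} x {name} stream(s)")
```

Even a basic 100 Mbit/s port has room for dozens of simultaneous audio streams on this estimate.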
My assertion is that none of this junk is purpose built for audio. Unfortunately, network terminology has opened up another avenue for makers of bogus products to dazzle suggestible punters with. They can now spout their drivel, peppering it with acronyms and technical terminology to sound as plausible as ever to the uneducated. Quality issues with music playback are almost exclusively related to file compression and in this era of terabyte disk drives and gigabit network hardware, there’s almost no reason to compress audio to the point it begins to sound bad.
I just HAVE to include this excerpt from the comments section on this review:
"All I can say is... wow. This is an IMPRESSIVE amount of nonsense. Absolutely anyone who knows how the Ethernet system actually works will agree. Ethernet data is packetised and error checked at every stage. Each data packet arrives either wholly intact or is discarded and re-sent as many times as needed for an intact delivery. The final data assembled from packets can only be 100% perfect or is rejected entirely. So you'll have complete data dropouts or perfect data. Nothing in between whatsoever." Amen brother...
As always, I don’t represent my posts as the last word. I’m happy to be corrected or educated by other contributors.