
The Truth About HiFi Network Devices

Punter

Active Member
Joined
Jul 19, 2022
Messages
184
Likes
1,002
Audio recording has come a long way since the first wax drum was cut in the late 1800s. The search for ever higher quality and more realistic recording and reproduction has driven numerous technologies forward. From the LP record to the Compact Disc, engineers and designers have poured their talents into improving specifications and eliminating distortion in recorded audio. There is no doubt that modern forms of digital recording and reproduction offer a solid improvement over previous technologies, despite the moaning of various factions who, consumed with nostalgia, claim that vinyl LPs are still superior to any other form of recorded music. These individuals are deaf to any technical criticism or glaring examples of the format's limitations. However, as time goes by, they are being coaxed into the present. The superiority of digital over analogue is beyond dispute, but just like the CD before it, streamed digital music also has a convenience unmatched by any other format. So the Luddites are cautiously accepting this technology and taking it up.

Here on ASR we are well aware of the “Audiophool”, an individual who has been caught up in the “High-End” audio pseudoscience that pervades the HiFi space. The vendors and manufacturers who populate the High-End have, for decades, relied heavily on the deficiencies of recorded audio to push their products. Without problems, you can’t have solutions. We have all witnessed some of the solutions, from crystals to magic wooden blocks to directional wire, and on and on it goes. Digitally recorded and reproduced music causes a problem in this ecosystem. Digital music recording was designed to be superior to any technology that preceded it. There were some restrictions built into the early systems by the technology available at the time, but even so, the two-channel 16-bit PCM encoding at a 44.1 kHz sampling rate per channel in the Red Book standard allowed for audio bandwidth and dynamic range unmatched by any analogue format. Other common distortions were banished as well: hiss, disc surface noise, wow & flutter, all gone.

How can you make money out of tweaks when there’s nothing to tweak? We all know the answer to that question. You simply invent problems and then present the solution to them! An example of this would be the CD “treatments” that began to appear. Magic pens to mark the edge of the disc because the music was supposedly being compromised by “internal reflections”, which was utter balderdash, but many fell for it. Demagnetising CDs, for God’s sake! I had a go at debunking this one for the denizens of the Stereophile forums many years ago. In spite of the fact that I could show two files, one ripped from the disc before demagnetisation and the other afterwards, that were identical, the Audiophools wouldn’t concede in any way. They had been told by an equipment manufacturer that demagnetising worked, this was backed up by a few reviewer opinions, and that was that. What this illustrates is that the “snake oil” vendors and their cheerleaders latched onto this new technology as a “green field” ripe for the plunder. So it’s no surprise that we now find ourselves assailed with newly manufactured problems connected with digital music streaming.
[Image: CD demagnetiser]

The streaming and playback of digital music is now well established, and even the most ardent CD and turntable fan has been seduced by its convenience and very obvious quality as a source of recorded music. One device that’s been thoroughly whacked with the “High-End” stick is the DAC (Digital to Analogue Converter). Despite the fact that the DAC chips in the equipment are manufactured by only half a dozen companies, it’s a fertile field for upselling and price inflation based on perceived value. There's no question that DAC manufacturers have taken up the challenge of designing equipment that extracts optimal analogue sound from the available DAC chips, and there are accepted quality differences between budget and higher-spec units. This is no different to any other component in a sound system. As always, it’s up to the buyer to decide if they’re getting value for money when purchasing their equipment. To a large extent, the DAC market has fully shaken out now. In external DACs there are units ranging from the Schiit Modi 3 at $100 to a slew of ridiculous, overpackaged “High-End” units with prices in the stratosphere, so the market is fully populated. This being the case, it’s obvious where the next step is for the snake oil end of things. Network equipment.

In a previous post, “The Truth About Music Streaming”, I outlined the basic flow of music streaming with some details regarding network protocols and codecs. However, there have recently been a couple of the “golden ear” brigade on the interwebs holding forth on how devices like network switches “sound different”. This is how it starts: the blind leading the blind, and a pack of foxes pricking up their ears as a new potential market reveals itself. The next thing we see is the high-end network switch and the HiFi router being championed by the usual promoters. Groan!

The antidote to this should be facts and realities but, as we know, the Audiophool is immune to both. Regardless, we here on ASR deserve the truth, and knowledge is power, so I am going to spell out the digital streaming audio path from the server to the speaker as best I can, with enough detail that we can move into the future with some understanding. Computer networking is not voodoo or black magic; it’s a robust mechanism, developed by brilliant minds and put to the test every day moving trillions of bytes of data globally. It is a system unfazed by transmitting a footling little audio stream from a server in Sweden to your DAC in the USA, Iceland or Tasmania.

Computer Networking for Audio 101

Looked at in its simplest form, a computer network serving audio has a relatively simple structure:

  • A digital audio file on a storage device
  • A PC or NAS (Network Attached Storage) to act as a server
  • An Ethernet adapter NIC (Network Interface Card)
  • A network hub or switch
  • A network router
  • A suitable LAN (Local Area Network) or WAN (Wide Area Network)
  • A mirror of this chain on the client’s side of the connection
  • A client device requesting the file
  • A DAC (Digital to Analogue Converter) to make listenable audio from the digital file
Without getting too deep into digital audio codecs and file types, let’s assume that the audio file is an accepted “High Resolution” lossless file, the kind an enthusiast would want to listen to. So let’s make it a 192 kHz sample rate with 24-bit resolution. A ten-minute file of this specification would be 691.2 MB and stream at 9,216 kbps (about 9.2 Mbps), which is roughly the bandwidth needed for an HD-quality movie. I’ve chosen a ten-minute duration to cover a few different potential music genres. With this information, we can determine the load on the server to supply the file to the client. In the case we’re investigating, the server is a music streaming service. Be aware that this is a “worst case” scenario; a “normal” audio stream runs at around 256 kbps. Now, before anyone jumps down my throat, I’m going to use Spotify as my example even though they don’t currently have a Hi-Res tier on their service like Tidal or some of the other services. Fact is, there’s lots of info out there regarding the actual architecture of Spotify from a network/equipment perspective. It would appear that many institutions use Spotify as an example for students to investigate and analyse.
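For anyone who wants to check the arithmetic, here it is as a few lines of Python; nothing is assumed beyond the format stated above:

```python
# Bit rate and file size for 192 kHz / 24-bit / stereo PCM over ten minutes.
sample_rate = 192_000          # samples per second, per channel
bit_depth = 24                 # bits per sample
channels = 2
duration_s = 10 * 60           # ten minutes

bits_per_second = sample_rate * bit_depth * channels
print(f"bit rate : {bits_per_second / 1000:.0f} kbps "
      f"({bits_per_second / 1_000_000:.2f} Mbps)")    # 9216 kbps (9.22 Mbps)

total_bytes = bits_per_second * duration_s / 8
print(f"file size: {total_bytes / 1_000_000:.1f} MB")  # 691.2 MB
```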

The Streaming Service
[Image: streaming services]

The job of the streaming service is quite simple on the surface. First, store the music files; second, make a searchable catalogue of them; third, allow the client to access them. The reality is very different and involves a vast array of equipment and software. So as not to overwhelm the reader, I’ll break the process down with a minimum of jargon and without reference to all of the software that works in concert to produce a song on your device of choice.

As a service, Spotify needs to fulfil the following user requirements:
  • Account creation and authorisation
  • Audio processing
  • Recommendations
  • Fast searching
  • Low-latency streaming
For system requirements, Spotify must expect to handle:
  • Billions of API requests internationally
  • Store several hundred terabytes of audio across 100+ million tracks
  • Store several petabytes of metadata for 500+ million users
For data alone, Spotify needs to store both user data and business-related data, an ever-increasing amount, with current estimates around 5 petabytes.
[Image: Spotify microservices system diagram]

The standard way of building a service like this is the “Monolithic” approach, meaning that the service has two main components, the “Frontend” and the “Backend”. The Frontend is also referred to as the “Presentation Layer”; it implements the application UI elements and client-side API requests, and it is what the client sees and interacts with. The Backend consists of the Controller Layer (all software integrations through HTTP or other communication methods happen here), the Service Layer (the business logic of the application lives in this layer) and the Database Access Layer (all database accesses, both SQL and NoSQL, happen in this layer). This simplifies the site software to communication between two parties: a frontend (client) talking to a backend (server).

The only problem with this model is that it doesn’t scale easily. Consequently, Spotify (and other subscriber services) have moved forward with a “Microservices” model. Microservices build on the monolithic architecture: instead of defining the software as a single executable unit, it is divided into multiple executable units that interoperate with one another. Rather than having one complex client and one complex server communicating with one another, microservices split clients and servers into smaller units, with many simple clients communicating with many simple servers.

One of the prime directives of the service, however, is low latency. All of the internal workings of the site are geared towards this. Accordingly, a key feature of the client-side interaction is the “front loading” of files. As soon as you choose a playlist, album or song, the site starts downloading it to your device. For example, if you have a playlist loaded on your device and you lose the network connection, you will probably still be able to play five or six songs. So in truth, services like Spotify are caching the files on your device rather than buffering a stream. The upshot of this is that your device is not relying on a live buffer, so many of the issues caused by a slow or intermittent connection are nullified.
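To make “front loading” concrete, here is a minimal sketch of the idea in Python. This illustrates the principle only; it is not Spotify’s actual client code, and the cache directory and URL are made-up placeholders:

```python
# Sketch of client-side "front loading": fetch the whole track into a local
# cache *before* playback starts, so the network can drop out mid-song
# without interrupting the audio. Illustrative only; the URL is hypothetical.
import os
import urllib.request

CACHE_DIR = os.path.expanduser("~/.music_cache")   # assumed cache location

def fetch_to_cache(track_url: str) -> str:
    """Download the track once; later plays come from local disk, not the network."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    local_path = os.path.join(CACHE_DIR, os.path.basename(track_url))
    if not os.path.exists(local_path):             # already cached? skip the network
        urllib.request.urlretrieve(track_url, local_path)
    return local_path

# path = fetch_to_cache("https://example.com/tracks/song.flac")  # placeholder URL
# ...hand `path` to the audio player; the connection can now drop harmlessly.
```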

This fact, however, is thoroughly inconvenient to the snake oil brigade because it removes a slew of invented problems that can be “caused” by the network equipment between you and the service. So let’s see if we can explore that, so we have a more complete understanding of the mechanics. This understanding should equip us sceptics with the knowledge we need to see through the spurious claims of audiophools and dishonest equipment manufacturers.

Ethernet
[Image: Ethernet cable]

Without Ethernet, there would be no internet. Ethernet is the backbone of almost all computer networking globally. It is a system developed by brilliant minds to be robust and scalable. To understand Ethernet, one must first delve into the OSI model. OSI stands for the Open Systems Interconnection model, which conceptualizes the seven layers that computer systems use to communicate over a network. The model places computing functions into a set of rules to support interoperability between products and software. While the modern internet is based on a simpler model called TCP/IP, and not on OSI, the OSI 7-layer model is still widely used today to describe network architecture.
  • 7. Application Layer: Human-computer interaction layer, where applications can access the network services
  • 6. Presentation Layer: Ensures that data is in a usable format and is where data encryption occurs
  • 5. Session Layer: Maintains connections and is responsible for controlling ports and sessions
  • 4. Transport Layer: Transmits data using transmission protocols including TCP and UDP
  • 3. Network Layer: Decides which physical path the data will take
  • 2. Data Link Layer: Defines the format of data on the network
  • 1. Physical Layer: Transmits raw bit stream over the physical medium
The layer model is best understood top down, as layer 7 is the point at which you, the human, interact with elements provided on the network in the form of graphical user interfaces and so forth.
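To see where the layers live in practice, here is a plain Python socket fetch annotated with the layer each step touches. A sketch for illustration; example.com is just a convenient public test host:

```python
# Where the OSI layers show up in one ordinary request. Everything below
# layer 4 is handled by the OS kernel, the NIC and the network gear.
import socket

HOST = "example.com"

sock = socket.create_connection((HOST, 80))       # L4 transport: open a TCP connection
# (L3 routing, L2 framing and L1 signalling all happen inside the kernel,
#  the NIC, and the switches/routers along the path; no code needed here.)

request = (f"GET / HTTP/1.1\r\nHost: {HOST}\r\n"
           "Connection: close\r\n\r\n")           # L7 application: HTTP itself
sock.sendall(request.encode("ascii"))             # L6 presentation: text -> bytes
print(sock.recv(200).decode("ascii", "replace"))  # first bytes of the reply
sock.close()                                      # L5 session: tear the session down
```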

Frames vs Packets
[Image: relationship between packets and frames]

One of the important details to understand about Ethernet is the distinction between frames and packets. A frame is the data unit of a local area network (LAN); it includes header and trailer information in addition to the payload or data being transmitted. The header contains information such as the source and destination MAC addresses, while the trailer includes error-checking data such as a cyclic redundancy check (CRC). A packet, on the other hand, is the data unit used for routing across a wide area network (WAN) or the Internet. It includes header information such as source and destination IP addresses and protocol-specific information, as well as the payload. On any given link, a packet travels encapsulated inside a frame.

A frame in a LAN typically has the following structure:
  • Preamble: A 7-byte sequence used to synchronize the clock speed of the sender and receiver.
  • Start of Frame Delimiter (SFD): A 1-byte sequence that indicates the beginning of the frame.
  • Destination MAC address: A 6-byte field that contains the MAC address of the intended recipient of the frame.
  • Source MAC address: A 6-byte field that contains the MAC address of the sender.
  • Type/Length field: A 2-byte field that holds either the type of the payload (if the field’s value is 1536 or greater) or the length of the payload (if the value is 1500 or less).
  • Payload: The actual data being transmitted, which can range from 46 to 1500 bytes in length.
  • Frame Check Sequence (FCS): A 4-byte field that contains a cyclic redundancy check (CRC) value used to detect errors in the transmission.
A packet in a WAN or the Internet typically has the following structure:
  • Header: Contains various fields such as:
  • Source IP address: A 4-byte field that contains the IP address of the sender.
  • Destination IP address: A 4-byte field that contains the IP address of the intended recipient.
  • Protocol: A 1-byte field that indicates the higher-level protocol being used, such as TCP or UDP.
  • Time to Live (TTL): An 8-bit field that indicates the maximum number of hops a packet can make before it is discarded.
  • Options (if present): Fields that carry additional information about the packet.
  • Header checksum: A 2-byte field that protects the header itself against corruption.
  • Payload: The actual data being transmitted, up to 65,515 bytes (the 65,535-byte maximum IP packet size minus the header).
Note that, unlike an Ethernet frame, an IP packet has no trailer; error checking of the payload is left to the transport layer and to the link-layer FCS.

In short, a frame is used for communication within a LAN, while a packet is used for communication across WANs or the Internet, with packets riding inside frames on each hop along the way.
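To make the two structures concrete, here is a short Python sketch that parses the headers described above from raw bytes, using the field sizes in the lists. The frame below is hand-made test data, not a real capture:

```python
# Parse a (synthetic) Ethernet frame header and peek into the IP packet it carries.
import struct

# 6-byte dst MAC + 6-byte src MAC + 2-byte Type/Length (0x0800 = IPv4 EtherType),
# followed by a blank 20-byte IPv4 header standing in for the payload.
frame = (bytes.fromhex("aabbccddeeff") +
         bytes.fromhex("112233445566") +
         struct.pack("!H", 0x0800) +
         bytes(20))

dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
print("dst MAC:", dst.hex(":"))                   # aa:bb:cc:dd:ee:ff
print("src MAC:", src.hex(":"))                   # 11:22:33:44:55:66
print("type   :", "IPv4" if ethertype == 0x0800 else f"length {ethertype}")

# Inside the payload sits the IP packet: e.g. the TTL is the byte at offset 8
# of the IPv4 header, and the source address is the 4 bytes at offset 12.
ip_header = frame[14:34]
ttl = ip_header[8]
src_ip = ".".join(str(b) for b in ip_header[12:16])
print("TTL    :", ttl, "| src IP:", src_ip)       # zeros here: blank test data
```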

Just as an aside, every Ethernet-capable device needs a unique MAC address. These addresses are administered by the IEEE (Institute of Electrical and Electronics Engineers). Unlike IPv4 (layer 3) addresses, which are running low, there are some 281 trillion possibilities (2^48) in the 48-bit string that makes up a MAC address. That’s roughly 35,000 for every living human. Even if they do start running out, there are already mechanisms in place to expand the system to 64 bits.

Error Correction and Jitter

If there’s one element of Ethernet that the Audiophools like to pounce on, it’s this one. As I stated before, Ethernet was designed by brilliant minds to be robust and scalable. It would be almost impossible to calculate the number of frames and packets successfully transmitted over Ethernet networks every day. I emphasise the word “successfully” because the error detection built into Ethernet, and the retransmission layered on top of it by protocols like TCP, is very efficient and reliable. Desperate for a source of problems, the Audiophools and Snake Oil salesmen have error correction/jitter near the top of their list of boogeymen to scare the living daylights out of the golden-ear brigade.

So what is “jitter”? Jitter refers to variations in the arrival time of packets of audio data over a network connection. In other words, jitter is the difference between the expected time of arrival of a packet and its actual time of arrival. In a real-time audio streaming scenario, jitter can cause noticeable problems such as audio distortions, skips, or delays. This is because the timing of the audio samples is critical in maintaining the quality of the audio stream. If packets arrive too early or too late, the audio playback may not be in sync, causing audible artefacts. Jitter can be caused by a variety of factors such as network congestion, routing changes, unequal processing times at intermediate routers, or queuing delays. To mitigate the effects of jitter, most audio streaming protocols employ jitter buffering techniques that allow the receiving end to temporarily store incoming packets and play them back at a constant rate. Jitter buffering can help smooth out the variations in packet arrival time, reducing the effects of jitter on the quality of the audio stream.
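Here is a toy jitter buffer in Python to illustrate the principle: packets arrive at irregular, bursty intervals, but playout consumes them at a strictly constant rate, and nothing glitches unless the buffer empties. Illustrative only; real players do this inside the audio stack, not in application code:

```python
# Toy jitter buffer: bursty arrivals in, constant-rate playout.
import collections
import random

random.seed(1)
buffer = collections.deque()
PREBUFFER = 50                  # packets queued before playback begins
started = False
underruns = 0

for tick in range(1_000):       # one tick = one packet interval of audio
    # Jittery network: about one packet per tick on average, but bursty.
    for _ in range(random.choice([0, 0, 1, 2, 3])):
        buffer.append(tick)
    if not started and len(buffer) >= PREBUFFER:
        started = True          # enough margin in hand; begin playout
    if started:
        if buffer:
            buffer.popleft()    # playout clock: exactly one packet per tick
        else:
            underruns += 1      # buffer ran dry -> an audible glitch

print("underruns:", underruns)  # 0 here: the prebuffer absorbed the jitter
```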

The fact is that lost/late/corrupted packets or frames only matter once the jitter buffer that stores the packets, before clocking them out, runs dry. As I pointed out earlier, streaming sites actively “front load” the client with data so that smooth, error-free, low-latency playback can occur. Indeed, some audio applications, like JRiver, don’t start playing until they have downloaded the entire file from the location where it’s stored.

So a range of mitigations is employed to eliminate jitter and, provided the buffer has sufficient data stored in it, playback proceeds untroubled. Network switches do not have a jitter buffer. Jitter buffering is a feature implemented in the software or hardware of the receiving end of a network connection, such as a computer or an audio codec. Ultimately, switches are hardware devices that are responsible for forwarding network frames based on their destination MAC address. They do not interpret the contents of the packets they carry or manipulate their timing; beyond the momentary queueing needed to move a frame between ports, they do not buffer anything. The primary function of a switch is to forward frames as quickly as possible and to minimise network congestion by managing the flow of data. So to claim that a network switch can have a perceptible effect on the quality of a stream of frames or packets containing digital music data is preposterous.
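For the curious, the whole job of a switch can be sketched in a few lines. Note what is absent: the code never looks at the payload and keeps no jitter buffer; frames are forwarded by destination MAC and nothing more. A toy model, of course, not firmware from any real switch:

```python
# A learning switch in miniature: remember which port each source MAC was
# seen on, then forward frames by destination MAC (flooding when unknown).
class LearningSwitch:
    def __init__(self, num_ports: int):
        self.mac_table = {}                     # MAC address -> port number
        self.num_ports = num_ports

    def handle_frame(self, in_port: int, src_mac: str, dst_mac: str):
        self.mac_table[src_mac] = in_port       # learn where the sender lives
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]    # known: forward out one port
        return [p for p in range(self.num_ports) if p != in_port]  # else flood

switch = LearningSwitch(num_ports=4)
print(switch.handle_frame(0, "aa:aa", "bb:bb"))  # bb:bb unknown -> [1, 2, 3]
print(switch.handle_frame(1, "bb:bb", "aa:aa"))  # aa:aa learned -> [0]
```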

Network Timing


Recently, in the “Extreme Snake Oil” thread, a member posted a YouTube video of a gentleman named Hans Beekhuyzen who put forward a set of ideas meant to convince the viewer that internal clocks and network timing could cause some sort of distortion on the waveform decoded by the DAC. His explanation of how this effect occurs merely displays his lack of understanding rather than some sort of amazing revelation. Timing on networks is done by synchronising devices to one of the timing protocols available to Ethernet devices. The most common is NTP (Network Time Protocol), but there are also Precision Time Protocol (PTP) and Synchronous Ethernet (SyncE) to provide highly accurate timing over the network. For an average home router/modem, NTP is the one in use: the router/modem will periodically query one or more NTP servers on the internet to determine the correct time and adjust its internal clock accordingly. This helps to ensure that the device's clock is accurate and that it can accurately timestamp network traffic.
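For the curious, here is roughly what a router’s NTP query amounts to, done with nothing but a UDP socket against the public pool.ntp.org servers. A bare-bones sketch: real NTP clients sample several servers and adjust the clock gradually rather than just printing an offset:

```python
# Minimal NTP client: one UDP request, parse the server's transmit timestamp.
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2_208_988_800    # seconds between 1900 (NTP) and 1970 (Unix)

packet = b"\x1b" + 47 * b"\0"       # LI=0, version 3, mode 3 (client request)
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
    sock.settimeout(5)
    sock.sendto(packet, ("pool.ntp.org", 123))
    response, _ = sock.recvfrom(48)

# The transmit timestamp is a 32.32 fixed-point value at byte offset 40.
seconds, fraction = struct.unpack("!II", response[40:48])
server_time = seconds - NTP_EPOCH_OFFSET + fraction / 2**32
print("offset vs local clock: %+.3f s" % (server_time - time.time()))
```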

Routers
[Image: home router]

I notice nobody has seriously proposed a HiFi router yet, but there are plenty of dreamers out there who claim that the router makes a difference to the sound of a music file. This is just as spurious as the claims made for data switches. The purpose of a router in a computer network is to forward network packets from one network to another. A router operates at the network layer (layer 3) of the OSI model and performs routing functions to determine the best path for a network packet to take based on its destination IP address. The router acts as a gateway for communication over a WAN, which we generally call the internet. Once again, routers do not jitter-buffer packets; beyond momentary forwarding queues, they simply move packets between networks.

DACs

The DAC represents the best possible target for a whack with the “High End” stick. I went into the subject of DACs in my music streaming post earlier and pointed out that there are only a handful of manufacturers making quality DAC chips. Texas Instruments, ESS Technology, Cirrus Logic and AKM Semiconductor are a few of the makers, and some would consider these near the top of the quality currently available. Of course, a DAC is a DAC, and there are numerous support circuits and devices needed to create a functional HiFi component. However, as can be observed in many cases in this era, the manufacturers don’t necessarily make everything inside the case. In many cases it’s the case that’s had the “High End” treatment and the internals are generic modules. This is true of amplifiers and DACs alike. In the case of the “LessLoss” DAC featured in another thread, the makers show some very janky-looking internals, obviously not designed or manufactured by the brand company. Based on this type of activity, I’m fully prepared to believe that there could be differences in the quality of sound from one DAC to another, however subtle.
[Image: LessLoss DAC internals]

Cables
[Image: Nordost Ethernet cable]

OHHH PUHLEEEZZE! I know it’s just become an extension of the analogue cable miasma, but it’s even more ridiculous (if that’s possible!). Ethernet can work effectively on standard, unshielded CAT5 cable over 100 m. Yes, that is one hundred metres, with no intermediate equipment, boosters or whatever. The data will be available at the end of that 100 m cable, reliably and without serious error. USB is less effective from a distance perspective, being limited to around 5 m, but until that threshold is reached, the data will be retrievable and largely error-free. To claim that any digital network connection can somehow influence the contents of the data frames or packets being transmitted on it is beyond ridiculous. No amount of pseudoscience, phase, RFI, EMI, Gophers or magic spells can make it true.

Conclusion

Ethernet is a demonstrably effective method of networking computers and other devices. It contains within its structure robust methods of error detection and recovery that have only become more effective as network speeds have increased. Network timing is achieved by the use of standardised protocols and is largely automatic. Music streaming services make it their business to cache or front load content on the client device to ensure low latency. Data/jitter buffering is automatic and, as network speeds have increased, has also become more effective. Lost, late or corrupt packets or frames can be resent before the buffer runs out and processed in the order intended. No cable carrying a digital signal can influence the content of the data unless there are gross deficiencies in the cable or its environment. Even under those circumstances, the outcome will be lost, late or corrupt packets that will only be noticeable once the buffer has been depleted. Let’s be clear: the reason that routers, switches and cables do not change the contents of the frames and packets they transmit is that if they did, the whole network system would be utterly useless!!
 

MaxwellsEq

Major Contributor
Joined
Aug 18, 2020
Messages
1,628
Likes
2,426
This is great.

But you don't need any network timing to make large Ethernets work. Your section on NTP/PTP isn't quite right. NTP can be used to make host devices (and the "host" that runs the switch) have almost identical time. But that's not the way standard (asynchronous) Ethernet avoids timing errors. Each network chipset has a clock and, after negotiation regarding duplex and transmission speeds, both ends of the local link run their own clocks, which are roughly about right but are not synchronised. The receiving chip is responsible for reading in the frame and then "mapping" the signals to bits. The encoding on the wire from the transmitter effectively "encodes" a clock into the data of the frame. Because the frames are so short, (relatively) minor differences between clocks never matter in practice unless a clock is very badly wrong. Once the packet is read, it's finished, and a new packet effectively starts the receive clock from zero. If (e.g.) the receive clock is fast and the transmit clock slow, the last few bits may not align with bit boundaries and so the packet will be errored (through the checksum), but so will all the packets! Either the switch port will be replaced or, more likely, the receive device will be accused and replaced.

You can prove all this is true by blocking NTP at the edge of your network and disabling NTP on your switch. Even over months the Ethernet packet error rate won't creep up, but many of the hosts will begin to have the wrong time!

I used to design and debug large Ethernet networks. The only reason we ran NTP was to make sure host and switch logs were co-timed to be able to identify cause and effect.

Now, large ST2110 networks are different beasts, where I feel the standards bodies sort of went the wrong way and have moved to using PTP to synchronise a network. PTP is genuinely useful if you are using Ethernet networks and PTP profiles to run large pan-continental power-switching systems or do financial trading. The media standards bodies went down the same route in order to control end-to-end time and so be able to recreate a frame/sample broadcast model with minimal latency, which I feel could have been avoided (I've sat on standards bodies for Ethernet, IP and media standards). https://www.cambridgewireless.co.uk/media/uploads/files/DigitalSIG28.11.18_MarkPatrick.pdf (Slide 19)
 
OP
Punter

Active Member
Joined
Jul 19, 2022
Messages
184
Likes
1,002
Thanks for your input, MaxwellsEq :). I always hope to get this sort of interaction going with my posts.
 

Hayabusa

Addicted to Fun and Learning
Joined
Oct 12, 2019
Messages
787
Likes
519
Location
Abu Dhabi
Great write-up! The only case I would like to add is the broadcasting of live streams (internet radio/TV). In this case the rendering device needs to reconstruct the sample clock with some kind of (digital) PLL or do sample rate conversion. This process is well understood but could potentially be degraded by packet jitter/delays if not implemented perfectly.
 

MaxwellsEq

Major Contributor
Joined
Aug 18, 2020
Messages
1,628
Likes
2,426
Great write-up! The only case I would like to add is the broadcasting of live streams (internet radio/TV). In this case the rendering device needs to reconstruct the sample clock with some kind of (digital) PLL or do sample rate conversion. This process is well understood but could potentially be degraded by packet jitter/delays if not implemented perfectly.
The mechanism is similar no matter how the "receiver" converts the files back. The way all these devices work is that the user's device pulls blocks of data (usually 64kByte chunks) into a buffer and the audio frames are reconstructed by the software. You can see this with a network analyser.

There's no need for a PLL, because it's all asynchronous. There's no locking onto the source.
 

bothu

Member
Joined
Mar 7, 2021
Messages
88
Likes
335
Location
Linköping, Sweden
Excellently written!
Thanks!


Bo Thunér / Linköping / Sweden
 

Hayabusa

Addicted to Fun and Learning
Joined
Oct 12, 2019
Messages
787
Likes
519
Location
Abu Dhabi
The mechanism is similar no matter how the "receiver" converts the files back. The way all these devices work is that the user's device pulls blocks of data (usually 64kByte chunks) into a buffer and the audio frames are reconstructed by the software. You can see this with a network analyser.

There's no need for a PLL, because it's all asynchronous. There's no locking onto the source.
With live internet radio this is not the case: the source 'pushes' the data at a fixed rate.
 

venquessa

Member
Joined
Feb 6, 2023
Messages
59
Likes
66
Only two things know or care about your audio: the hardware that captured it and the hardware that plays it. Everything else in between just sees buckets of bits. It couldn't care less what is in them.

As to timing? Most CPUs have a timing accuracy of 1 ms +/- 1 ms. They can't play audio. The only way to get realtime, synchronous audio down an Ethernet cable is to send I2S over the twisted pairs instead of Ethernet.

I have experience in both low-level, near-realtime, single-digit-microsecond latency on the "Big Iron" enterprise side and also in the tiny MCU or DSP scale "realtime" space. The two couldn't be more different. Usually the interface point between those two domains (which is what you could consider the Spotify platform to be) highlights audio for what it is... old-fashioned tech that was figured out years ago and mostly still uses the same technologies invented by the likes of Philips in the 70s.

That said, I think one thing IS true around things like cables et al., especially if you want to play with high-rate square waves and master clocks: noise.

Simple digital circuits (like audio) are very insensitive to noise. They operate not on the shape of the wave but on the transitions. However, digital circuits are a HUGE source of noise. Swinging voltage rails from 0-5 V and back again does awful things to wires, which then in turn do awful things to the EM fields around them. If you don't treat high-rate digital audio with respect when you plumb it around your home movie theater, you will just create an RF mess and have "bit noise" on everything. Consider that some multichannel Dolby transmission formats are well into the FM radio band, and you are running it through cables how many metres long? Are you nuts? If you jam that band with DSD/I2S noise you may end up getting a call from Ofcom at the door.

If you want a noise demo... try wrapping your analogue "line" cable around a really, really cheap HDMI cable. Now you can hear your mouse moving. Or just plug your line-in into a cheap onboard sound card in an office PC and turn the gain up. Now you can hear the computer think as well!

I think this is where Linus from LTT went on the switch idea. Is it better insulated to contain all that wideband switching noise? I don't think the conclusion was clear.

A further point. The Internet is not consequence-free. Downloading and streaming 10 Mbit/s audio is a criminal waste of electricity and bandwidth. Not unless your pet bat likes listening to whatever it is up there above 22 kHz that these golden ears are obsessed with. Even desktop audio devices at 16-bit/48 kHz pull 20 or 30 mA. However, the "high-end" 24-bit/192 kHz system pulls 200-300 mA. Similar effects are seen in video too. A 1080p action-style video camera runs on its battery for 4-6 hours and fills a 16 GB SD card. A 4K action camera runs out of battery in under an hour and half-fills a 256 GB SD card. Editing and rendering that video will consume about 10 times the processor time and 10 times the electricity. Uploading it will take 10 times the bandwidth and 10 times the electricity. At least in that case you "can" see the difference... at least within 6 feet of a decent TV you can.

So blame global warming on audiophiles!
 
Last edited:

venquessa

Member
Joined
Feb 6, 2023
Messages
59
Likes
66
With a live internet radio this is not the case, the source 'pushes' the data at a fixed rate.
No. Not quite. The only difference in "live" versus "not live" is completely subjective and just a discussion about how long the buffers are.

When you start the audio hardware and give it its bucket of bits, it starts clocking out samples and reading the next ones from the bucket.

All "internet radio" is, is an HTTP web connection transferring an mp3 file. Nothing more. The "file" may be virtual, ie, another buffer, but it's just a application/octet-stream aka raw binary as far as the network is concerned. The vast majority of "Internet Radio" is just an MPA playlist file and a 200 line python or similar script to pay it out a network pipe. Literally something someone could write in an afternoon in Uni.

So the server end writes audio into a buffer at a normal audio rate. That buffer is probably several megabytes in size. At the same time, web clients are consuming from that buffer. They can all be listening at various points in the same buffer or many buffers.

When you connect to the stream, the HTTP connection will start receiving data at a FAR, FAR faster rate than realtime audio, and your local copy of the buffer will download 30 seconds of audio in a few milliseconds... then playback starts from the buffer.

The connection remains open. There are no push/pull mechanics: you already requested the file, and it's already being transferred; packets just keep arriving, occasionally popping another chunk into your bucket of bits. If the bucket runs out, the audio stops and you get a "Buffering..." warning.
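
To make the "afternoon at Uni" point concrete, here's roughly that script, sketched in Python (the file name, port and bitrate are made up for illustration): it just pays a file out of an HTTP connection at approximately the encoded bitrate, and the client's buffer smooths out any jitter in when the chunks actually arrive.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

SOURCE_FILE = "station.mp3"   # hypothetical source material
BITRATE = 192_000             # bits per second, as in the 192 kbit/s example
CHUNK = 4096                  # bytes per write
DELAY = CHUNK * 8 / BITRATE   # seconds of audio per chunk (~0.17 s)

class StreamHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.end_headers()
        # Pay the file out at roughly realtime rate.
        with open(SOURCE_FILE, "rb") as f:
            while chunk := f.read(CHUNK):
                try:
                    self.wfile.write(chunk)
                except BrokenPipeError:
                    return            # listener went away
                time.sleep(DELAY)

HTTPServer(("", 8000), StreamHandler).serve_forever()
```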
 

Hayabusa

Addicted to Fun and Learning
Joined
Oct 12, 2019
Messages
787
Likes
519
Location
Abu Dhabi
No. Not quite. The only difference in "live" versus "not live" is completely subjective and just a discussion about how long the buffers are. ...
Nope! The radio station has, at some point, a digital stream of its audio at a certain sample rate. Let's say 48000 Hz, with an accuracy of, say, 50 ppm, so the actual sample rate could be 48002 Hz. Take your example: encoded in MP3 frames at 192 kbit/s, that's 1152 samples per frame, or 23.999 ms per frame. If someone at the other end receives this at that rate but plays it at, for instance, 48000 Hz minus 50 ppm -> 47998 Hz, at some point in time your buffer will overflow with samples. You have to do a sample rate conversion from 48002 Hz to 47998 Hz. It's even worse: the sending side can slowly change its clock, as it's not perfectly stable. You can indeed play for a long time at the wrong rate if your buffer is large enough, but at some point you have to take action. A 1024-sample buffer would overflow in 256 seconds in the example above.
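As a quick sanity check of those numbers:

```python
# Using the rounded figures above: the sender runs ~50 ppm fast,
# the receiver ~50 ppm slow.
sender_rate = 48_002      # Hz
receiver_rate = 47_998    # Hz
surplus = sender_rate - receiver_rate   # 4 extra samples arrive per second
buffer_samples = 1024
print(f"{surplus} surplus samples per second")
print(f"a {buffer_samples}-sample buffer overflows in "
      f"{buffer_samples // surplus} s")  # 256 s
```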
Edit: In this case I mean a real live radio station...
 
Last edited:

nerdstrike

Active Member
Joined
Mar 1, 2021
Messages
257
Likes
309
Location
Cambs, UK
Well written @Punter. Just as with power cables it's important to remember that the house wiring and substation are a (nearly) immutable part of the mix, so too we should remember just how much networking and software sit behind the scenes in the Internet backbone and data centre.

The truth about network-attached devices is that their interface, their ease of setup and the presence of a reasonable DAC (onboard or downstream) are all that matters. It could well be worth ££££ if the company has put in the legwork to make all the service integrations work effortlessly and for it to be a joy to use for years without a drop in support.
 

venquessa

Member
Joined
Feb 6, 2023
Messages
59
Likes
66
Edit: In this case I mean a real live radio station...

No, not just in this case. IN ALL CASES.

That is how this works. It's how it has always worked. It can work no other way. There is no way, within the bounds of physics, to synchronise audio clocks across any such distance. None. It is a physical impossibility in terms of relativity and spacetime.

The thing which receives the bucket of bits is responsible for dishing them out to the actual voltage-scale hardware, the DAC or whatever. It has to fabricate that clock.

But the thing is, if that didn't work, if that wasn't perfectly acceptable, the whole thing, the whole modern world, would not work.

The entire digital world that we know is ASYNCHRONOUS. There is NO tightly coupled timing.

There are many protocols laid over the top which do bring some semblance of synchrony to it. For example, those buffers are active; they emit events when they get too full or too empty. Again, coming back to the fact that audio at sensible rates is child's play in terms of data size and rate: the CPU will spend a microsecond setting up the audio for the next few milliseconds and go off to do other things.

So, in your example (and I have experience of this), you are playing the audio back at a different rate to the sender. It will drift. Yes. That happens. It happens in everything, from the text on the webpage to the pixels on your monitor. Sometimes the PC isn't ready with a new frame of video, so the display just draws the last one again. Sometimes the PC is able to render 100 frames of video while the display can only draw one, so 99 of those frames are overwritten.

This is how the modern world works. Thing is, it's not a big deal. It's not a big deal because 48K or 192K is child's play. It's really quite sad how slow and rudimentary even the highest of high-end audio data is compared to, say, what is going down your HDMI cable. Just go and look at the asynchronous transfer protocols used there: they are keeping pixels in sync with a 4K display at 120 frames per second, and every single pixel ends up where it was supposed to... and you think it will have trouble with a mere 48K audio stream?
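To put numbers on that comparison (raw uncompressed payload rates, ignoring link encoding and compression):

```python
# Back-of-envelope payload rates, bits per second.
audio = 48_000 * 2 * 16          # 48 kHz stereo, 16-bit PCM
hifi  = 192_000 * 2 * 24         # the "high-end" 24/192 stereo case
video = 3840 * 2160 * 120 * 24   # 4K, 120 fps, 24 bits per pixel

print(f"48k/16 stereo : {audio / 1e6:8.2f} Mbit/s")   # ~1.54
print(f"192k/24 stereo: {hifi  / 1e6:8.2f} Mbit/s")   # ~9.22
print(f"4K120 video   : {video / 1e9:8.2f} Gbit/s")   # ~23.89
```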
 

Hayabusa

Addicted to Fun and Learning
Joined
Oct 12, 2019
Messages
787
Likes
519
Location
Abu Dhabi
No, not just in this case. IN ALL CASES. ...
OK, it could be that real implementations are dropping/duplicating audio samples... But it's not the way to do it. We've been doing this sample rate conversion for years already in ATSC TVs, set-top boxes, etc...

Addition: given the earlier example this would mean a dropped or replicated sample 4 times per second!
 

venquessa

Member
Joined
Feb 6, 2023
Messages
59
Likes
66
But it's not the way to do it. We've been doing this sample rate conversion for years already in ATSC TVs, set-top boxes, etc...

Interesting. Other than sample and hold / hold and sample, what other way is there to resample temporally?

Oh, ah, sorry. Buffers. If you have a buffer of 192K audio and a player at 48K then you have the samples already; it's not a streaming problem, it's a pre-processing problem. You just loop over the buffer, perform whatever filtering maths you want, and write the result to the 48K output buffer.
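A toy example of that pre-processing view, in Python with numpy (a real implementation would use a properly designed polyphase filter; the windowed sinc here is just to show the shape of it):

```python
import numpy as np

def decimate_4x(x):
    # 192 kHz -> 48 kHz is an exact factor of 4: low-pass below the new
    # Nyquist (24 kHz), then keep every 4th sample.
    taps = 127
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / 4) * np.hamming(taps)   # cutoff ~24 kHz at fs = 192k
    h /= h.sum()                            # unity gain at DC
    filtered = np.convolve(x, h, mode="same")
    return filtered[::4]

fs_in = 192_000
t = np.arange(fs_in) / fs_in                # one second of input
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz test tone
y = decimate_4x(x)
print(len(x), "->", len(y))                 # 192000 -> 48000
```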

Set-top TV boxes and other digital broadcast (DVB, DAB, et al.) apparatus are notorious for HUGE buffers and massive latency. I had an early DVB box when Freeview came out; when you switched channels it buffered for 5 seconds before displaying a picture. This was to give it enough time to do the bit error correction on the transport stream, which is lossy.

If you have a long enough buffer, as stated in the OP's article, you could in theory just download the whole track or the whole movie (or entire TV series) locally, process the entire thing to resample and re-resolution it before rendering it. It's still not synchronous.

As an experiment. Try this. Play back a CD or a DVD on several players. Record how long each takes to play the entire disc. You will be surprised by the result. You can even do this without multiple players. Just play a DVD. Don't pause it. Let it play through. At the last second note the playback timecode of the movie itself and compare it to how much REAL time has passed. Again you might be surprised by how far apart they get.

Even on a brand-new 2022 PC, the clock displaying normal time will drift by a few seconds a day; the quartz crystal driving it is only so accurate, so it has to be periodically re-synced against an external time reference (e.g. NTP).

Addition: given the earlier example this would mean a dropped or replicated sample 4 times per second!

That's not a lot. Most people would not hear it. It would even be hard to spot on a scope. It will be most audible at very high frequencies, where most of what we hear up there is fairly random in nature anyway.

4 ... I ran a DSP with some new code a week back. Realtime resample, mix, EQ. It was Friday evening. I was having a few beers. I did not check my work; instead I threw the headphones on and listened to tunes through the DSP all evening. Sounded great... except there was something odd about it. I couldn't quite put my finger on it, and I was putting it down to my rubbish IIR filters causing HF distortion, especially as that distortion increased if I boosted the treble. However, it sounded "fine" after a few beers anyway.

Next morning, I was sure there was something off with it. So I scoped it. 32 kHz. LOL. Not 48k, not 96k (as intended), but 32 kHz. That's not even a round multiple, and it's something like 64 thousand samples a second being dropped. An audiophile would have noticed straight away. I had a vague irritation, an unsettled feeling it wasn't right, but I still listened to it all evening at some pretty loud volumes and I enjoyed it.
 

Hayabusa

Addicted to Fun and Learning
Joined
Oct 12, 2019
Messages
787
Likes
519
Location
Abu Dhabi
>> Interesting. Other than sample and hold / hold and sample, what other way is there to resample temporally?

this gives a nice overview:

https://www.analog.com/media/en/technical-documentation/application-notes/ee268v01.pdf
 

venquessa

Member
Joined
Feb 6, 2023
Messages
59
Likes
66
The talk about master clock slew on proper synchronous audio paths is where the real insanity lies.

Take 24.576 MHz, for example. The maximum theoretical phase offset caused by jitter is 180°, or half the period: 1/(49.152 MHz), about 20 nanoseconds.

At these time scales, mere propagation delay matters beyond a few metres. Light travels in a vacuum at around 30 cm per nanosecond, so beyond about 20 feet your two digital audio devices effectively live in different "slices" of time and can no longer agree on what "now" is, even with reference to the clock.

You also get into transmission-line territory, where more than one "pulse" or "state" exists on the wire at the same time, usually resulting in complete destruction of the clock signal unless the line is properly terminated.

All of these things are orders of magnitude below any human perception of time or sound.
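The same sums, for anyone who wants to check:

```python
f_clk = 24.576e6               # master clock, Hz
half_period = 1 / (2 * f_clk)  # 180 degrees of phase, in seconds
print(f"worst-case phase offset: {half_period * 1e9:.1f} ns")  # ~20.3 ns

c = 0.30                       # light: ~30 cm per nanosecond
print(f"light covers about {half_period * 1e9 * c:.1f} m "
      f"in that time")         # ~6 m, i.e. about 20 feet
```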
 

Hayabusa

Addicted to Fun and Learning
Joined
Oct 12, 2019
Messages
787
Likes
519
Location
Abu Dhabi
Appreciate the effort, but there is so much superfluous information in this. You can get the same message across in about 1/4 of the words. Then more will read it, understand it, and apply it. Which is the goal I assume.

OK. The short explanation is that the receiving side creates a sort of software PLL to follow the sender's sample rate as closely as possible. The incoming samples then land somewhere "in between" the ticks of the receiver's sample clock. The next step is to interpolate these in-between samples with, for example, a polyphase FIR filter.
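A minimal sketch of that idea, with plain linear interpolation standing in for the polyphase FIR, and with `ratio` standing in for whatever the software PLL estimates (e.g. from watching the buffer fill level drift):

```python
import numpy as np

def asrc(x, ratio):
    """Resample x by `ratio` (estimated output_rate / input_rate)."""
    # fractional read positions into the input buffer
    pos = np.arange(0, len(x) - 1, 1 / ratio)
    i = pos.astype(int)     # integer sample index
    frac = pos - i          # fractional distance to the next sample
    # linear interpolation between neighbouring input samples
    return x[i] * (1 - frac) + x[i + 1] * frac

# e.g. the receiver's clock runs 50 ppm slower than the sender's:
x = np.sin(2 * np.pi * 1000 * np.arange(48_000) / 48_000)
y = asrc(x, 47_998 / 48_002)   # shrink ever so slightly
print(len(x), "->", len(y))    # a few samples fewer, no drops or dupes
```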
 

venquessa

Member
Joined
Feb 6, 2023
Messages
59
Likes
66
I suppose the fuel that audiophools have is that, no, streamed, downloaded or really any "remote" audio is asynchronous and... it's not perfect.

So many of these technologies are based on human perception. They are imperfect, and in many cases lossy, because those imperfections are outside of our perception.

But of course audiophools have ears which us mere mortals do not.
 