This is an article I wrote for Widescreen Review in 2015 to show the progress we had made in audio/video since the inception of that magazine. It also covers the part of my career that related to this evolution. Hope you don't hold that against it.
----
History of Audio and Video Streaming and Digital Distribution
In one of my favorite movies, Good Will Hunting, Matt Damon’s character tells a story of being on an airplane. The captain, not knowing the microphone is on, says he would love to have a cup of coffee and something else that is not fit for print. After he finishes the punch line, Robin Williams’ character, who is playing his therapist, challenges him as to whether he has ever been on an airplane. Matt’s character answers that the joke works better when he tells it in first person. So that is how this article is going to unfold. I will tell you the story of the evolution of audio and video with part of my own career intertwined in it.
The year is 1995 or so. I am working for a company called Abekas Video Systems. That name probably isn’t familiar to you unless you have worked in television broadcast or post production. If you have, you would have known them as one of the leading hardware companies in video effects and processing. They were to the video industry what Dell and HP are to the computer field. Our customers were the highest of the high-end: the network broadcasters and major post production (editing) houses. Equipment had to perform to the highest specification of fidelity, way beyond what we do in calibrating our equipment and such. The equipment was large and rack mounted, with an average retail price of $20,000 or more. Despite its massive scale and complexity, I had a dream that someday personal computing horsepower would be sufficient to replace it all. I didn’t know how or when, but thought about the possibility all the time.
The parent company of Abekas (Carlton Television, a major owner of networks in the UK) wanted it sold and we proceeded to do exactly that. Just before that happened, I heard about Netscape and the initial explosion of interest in what was becoming the “World Wide Web.” I had been fortunate to have been part of the Internet revolution back in the early 1980s, working on the UNIX (parent of Linux) operating system and one of the first Ethernet implementations, running the TCP/IP protocol. That was yesterday and this was now. The revolution was continuing with the birth of the web and browsers, but I was no longer in it. I felt a sense of regret and was looking for a way to get back into mainstream computing in general, and the web revolution in particular.
Back to the sale of Abekas, as soon as that went through, I got an offer to run engineering at another video company called Pinnacle Systems (now part of Avid). I accepted the position as their products were PC based and I thought it brought me a step closer to the dream that computers could be powering high-end video one day. Alas, Pinnacle was only a tiny step toward that as it still employed tons of hardware in the form of dense cards that plugged into the PC. The computer was just running the user interface and no more. I was looking for the computer to play the central role, not a side job.
One day I decided to connect with my old boss, the CEO of Abekas, and we got together over lunch. I ask him what he is doing and he says he is running a start-up that is doing video on the web. Video on the web? Was he kidding? Surely that was not possible. Why? Because access to the Internet at that time was over dial-up modems. Broadband was still in trials and people were dubious about the prospects of it becoming mainstream. Heck, even on dial-up we had 28 Kbits/sec modems, and “new” 56 Kbits/sec modems were just coming to market.
Let’s do a bit of math to see why I was so incredulous that anyone would attempt to push video through the web at that time. Uncompressed broadcast quality video has a resolution of 720x480. That means we have 345,600 pixels per frame of progressive video. If we wanted to send 30 frames per second, this number would balloon to 10,368,000 pixels per second. In the highest-end video encoding mode of 4:4:4, each pixel would require 24 to 30 bits (8 and 10 bits per component respectively). The common mode though is 4:2:2, which cuts the color bandwidth in half. If we assume 8 bits per component, we get 16 bits. Consumer video broadcast halves this yet again to 4:2:0, so we are now left with 8 bits for black and white (luma) and 4 bits on average per pixel for color (chroma). This means each pixel needs 12 bits to describe it. Multiplying this by our total number of pixels gives us 124,416 Kbits/second.
Traffic on the web uses the TCP/IP protocol, which adds its own layer of overhead. As a result our “28 Kbit/sec” modem speed shrinks down to 22 to 24 Kbits/second. And we are trying to do what with that? Push 124,416 Kbits/second through it??? Taking the ratio of those two, our channel has a capacity of just 0.02% of what we need. And that is forgetting about audio, which is much peskier in the amount of compression it can withstand. Yes, we can apply video compression, but this degree of shrinkage requires miracles and prayer.
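For those who want to check my arithmetic, the whole back-of-the-envelope calculation fits in a few lines of Python (the 24 Kbits/sec figure is the effective modem throughput after protocol overhead):

```python
# Sanity-check the bandwidth math: uncompressed 720x480 video at 30 fps
# with 4:2:0 sampling, versus a dial-up modem after TCP/IP overhead.

width, height = 720, 480      # broadcast-resolution frame
fps = 30                      # progressive frames per second
bits_per_pixel = 12           # 4:2:0 sampling: 8 bits luma + 4 bits chroma

pixels_per_sec = width * height * fps                  # 10,368,000 pixels/sec
video_kbps = pixels_per_sec * bits_per_pixel / 1000    # 124,416 Kbits/sec

modem_kbps = 24               # a "28K" modem after protocol overhead
ratio_pct = modem_kbps / video_kbps * 100              # channel capacity, in %

print(f"Uncompressed video: {video_kbps:,.0f} Kbits/sec")
print(f"Modem capacity: {ratio_pct:.2f}% of what we need")
```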
The “video” demo that my ex-boss showed me did not have a resolution of 720x480, or even one fourth of that at 180x120. No sir. It was a tiny little postage stamp window that updated a few frames per second here and there. To any lay person the experience was totally underwhelming. And to a broadcaster, a joke. But not for me. I stood there in awe as I watched a video of a Stanford University class lecture. The video was blurry and not much more than a slide show. But it provided an experience that was simply not possible before. Clicking on a link in a browser and instantly watching the lecture gave you a sense of being in that classroom, the poor quality notwithstanding. For me, this was it. It was my way to get back into the computer world while still utilizing my hard-earned video experience. Two months later I was there running engineering. Then 18 months later I was at Microsoft, which acquired the company.
When I got to Microsoft my first question was why they had acquired us. I mean, this technology did not remotely make money so why would a major corporation have any interest in it? The answer I got back was this: “Bill Gates thinks one day television will be software.” Software? All of it? OK, I thought some of it would be software but all of it? Oh well, I am getting paid to work on something I absolutely loved so why question it.
While video was struggling to establish roots on the Internet, the audio revolution there was in full swing. Not as a streaming solution, but for downloading and exchanging music. Any music file could be downloaded as an MP3 and played within a few minutes on your PC. Most everyone thinks the key enabler for this was the MP3 codec. But in reality it was another unsung hero. The MP3 codec, just like its MPEG-2 video counterpart, was designed for implementation in hardware. With no hardware at the time to play it, no one was going to use it to make this transformation occur.
Music compression works in the “frequency domain,” meaning we take the audio samples in time and decompose them into the fundamental frequency bands that represent them. Using the science of psychoacoustics, we take out data which has minimal impact on audibility. This is the lossy portion of compression. A lossless step then compresses these so-called “frequency coefficients” into a compact “bit stream.” To play the file, the losslessly compressed bit stream is expanded, followed by conversion back to the time domain, i.e. PCM audio samples ready to play. No, there is no exam and you don’t have to understand this detail. The key to note is that a lot of mathematical operations are required. This is why I mentioned the original expectation of the standard being a hardware implementation.
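To make those stages concrete, here is a toy sketch in Python. It is emphatically not the MP3 algorithm (real MP3 uses a polyphase filter bank, an MDCT, psychoacoustic models and Huffman coding); it only illustrates the three steps above: transform to the frequency domain, a lossy step that discards small coefficients, and a lossless step that packs what remains. The 8-sample block and the threshold are arbitrary assumptions for illustration.

```python
import json
import math
import zlib

def dct(block):
    """Naive DCT-II: time-domain samples -> frequency coefficients."""
    N = len(block)
    return [sum(x * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n, x in enumerate(block))
            for k in range(N)]

def idct(coeffs):
    """Naive inverse (DCT-III, scaled): coefficients -> time-domain samples."""
    N = len(coeffs)
    return [(coeffs[0] / 2
             + sum(c * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                   for k, c in enumerate(coeffs[1:], start=1))) * 2 / N
            for n in range(N)]

def encode(samples, threshold=1.0):
    coeffs = dct(samples)
    # Lossy step: a crude stand-in for psychoacoustics -- round the
    # coefficients and zero out the ones too small to matter audibly.
    quantized = [round(c) if abs(c) >= threshold else 0 for c in coeffs]
    # Lossless step: pack the surviving coefficients into a compact bit stream.
    return zlib.compress(json.dumps(quantized).encode())

def decode(bitstream):
    coeffs = json.loads(zlib.decompress(bitstream))  # expand the lossless layer
    return idct(coeffs)                              # back to "PCM" samples

# Round-trip a tiny block of "PCM" samples:
samples = [10.0, 20.0, 15.0, 5.0, 0.0, -5.0, -10.0, -2.0]
restored = decode(encode(samples))   # close to, but not exactly, the input
```

Notice how many multiplies and cosines even this toy version needs per block; the real codec does far more, which is why a hardware decoder was the original expectation.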
Something changed unexpectedly. Computers, namely their CPUs, suddenly became fast enough to decode MP3 at 128 Kbps. The moment this happened, we had millions of computers becoming music players. Without this having happened, there would have been no Napster, and quite possibly no iPod or iPhone!
The hardware version of MP3 did play a pivotal role. In 1998 Diamond Multimedia, which was a major supplier of computer graphics cards, decided to build a portable music player around the MP3 codec. It used flash memory, so it could only hold about a dozen tracks of music (32 Megabytes). At the heart of the player was a piece of silicon from a German company by the name of Micronas. The chip was actually developed for deployment of European digital FM broadcasting. It was repurposed by Diamond Multimedia to decode MP3s from flash memory. And with this development, a new category of consumer electronics was born: the small portable music player.
Both Micronas and Diamond eventually disappeared from the scene due to competition. One of those competitors was a little-known start-up called Portal Player. They had built a programmable single-chip music player. We worked with them while I was at Microsoft to implement our WMA audio codec on their chip. These programmable parts could handle multiple codecs, hence our work with them to add that functionality in addition to MP3.
Another unsung hero was the small hard disks being produced at the time by companies such as IBM, Hitachi and Toshiba. These were much smaller and lighter than laptop drives. Unfortunately the capacity was also sharply reduced, which heavily limited their application. They were on the way out when a remarkable thing happened. Apple chose to build a portable music player by combining the Portal Player chip and the smallest of these drives from Toshiba. Translation? The Apple iPod was born in 2001 and, with a massive marketing budget, became an overnight success.
A key aspect of Apple’s success was an exclusive on the Toshiba drive. No one else was making drives in such a small form factor and as a result, Apple succeeded in keeping competitors at bay quite effectively. This is not to undermine all the great things that Apple did in user interface and industrial design. But without that exclusive use of the Toshiba drive, it is entirely possible that competitors could have caught up with Apple, much as they have done with Android phones and tablets.
In case you are curious why we developed WMA, let me explain the motivation behind it. Just like video, one way to reduce the bandwidth of audio is to chop it down. Convert it to mono, keep bringing down the sampling rate and the bandwidth it provides, and eventually you can stuff it through a dial-up modem. Those degradations are noticeable of course. On a 56 Kbits/sec modem, you had audio fidelity that was a bit above AM: mono and muffled. The goal with the WMA codec was to produce FM quality on dial-up modems. It also halved the required storage for “CD quality” on portable music players, which used expensive flash memory. We achieved both of those goals, more or less.
One of the major challenges for delivery of audio and video on the Internet is its variable throughput. Even with high-bandwidth broadband links, I am sure you all have hit a YouTube link only to see it buffer. The Internet is a “best effort” type of channel, meaning it provides no guarantee of throughput whatsoever. Your ISP oversells its capacity, hoping that not everyone demands peak traffic at the same time. But should that happen, things get slow. Video and audio, on the other hand, are “real-time” events. You can’t pause the audio constantly or the user gets frustrated. So somehow we have to marry these two opposing systems.
At the start-up I worked at, VXtreme, we came up with a clever solution to this problem. Instead of encoding at just one bit rate, we encoded audio and video at multiple rates. The player would make an educated guess as to the starting bit rate based on past history. One of two things would then happen. If the link ran too slow, say 200 Kbits/sec versus an encoding rate of 300 Kbits/sec, the player would switch down to a lower fidelity layer at, for example, 150 Kbits/sec. The audio and video quality would go down but the stream would keep playing. Conversely, if the link speed was faster, it would select a higher fidelity layer. Both of these led to a much more satisfying experience than pausing and buffering.
To avoid oscillation, the system doesn’t instantly switch up and down. You don’t want a situation where you jump to a higher fidelity layer only to find insufficient bandwidth and have to switch back down to a lower fidelity a second or two later. This is why you may see your streaming movie service switch to a lower fidelity and stay there for a while.
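As a rough illustration of this switching logic, here is a sketch in Python. The layer ladder, headroom factor and hold-off period are made-up numbers purely for illustration; the heuristics in a real player are considerably more involved.

```python
# Multi-bitrate layer selection with hysteresis: switch down immediately
# when the link can't keep up, switch up cautiously and only after a delay.

LAYERS_KBPS = [150, 300, 500, 1000]   # bit rates the content was encoded at
HEADROOM = 1.2                        # require 20% spare capacity to step up
HOLD_OFF_SECS = 10                    # minimum time between upward switches

def pick_layer(current_kbps, measured_kbps, secs_since_switch):
    """Return the encoded layer (Kbits/sec) to stream next."""
    if measured_kbps < current_kbps:
        # Switch down right away: highest layer the link can sustain.
        fits = [r for r in LAYERS_KBPS if r <= measured_kbps]
        return fits[-1] if fits else LAYERS_KBPS[0]
    if secs_since_switch >= HOLD_OFF_SECS:
        # Switch up only after the hold-off, and only with headroom to spare.
        ups = [r for r in LAYERS_KBPS
               if current_kbps < r <= measured_kbps / HEADROOM]
        if ups:
            return ups[0]   # one notch at a time, to avoid oscillation
    return current_kbps
```

With the example from above: streaming at 300 Kbits/sec on a link measured at 200 Kbits/sec, `pick_layer(300, 200, 0)` drops to the 150 Kbits/sec layer and keeps the stream playing.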
This technology was called MBR or multi-bitrate, but its common name today is “smart” or “intelligent” streaming. It is in play whether you watch YouTube or Netflix. Yes, you can still see buffering messages. If the lowest encoded bit rate is still too high for the actual throughput you have to the server, there is no choice but to pause playback and read ahead. Once there is enough read into memory (the “buffer”), the system starts to play again. If there is still insufficient bandwidth, you will keep getting these alternating periods of pauses and playback.
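That pause-and-read-ahead behavior can be sketched the same way. The buffer threshold and the one-second time step below are illustrative assumptions, not how any real player is tuned.

```python
# Simulate a player that pauses when its buffer runs dry and resumes
# once it has read ahead enough media.

def simulate(stream_kbps, link_kbps, duration_secs, startup_buffer_secs=5):
    """Play a stream over a fixed-rate link; return the number of stalls."""
    buffered_secs = 0.0   # seconds of media currently held in the buffer
    playing = False
    stalls = 0
    for _ in range(duration_secs):
        # Each wall-clock second, the link delivers link_kbps/stream_kbps
        # seconds' worth of media into the buffer.
        buffered_secs += link_kbps / stream_kbps
        if playing:
            buffered_secs -= 1.0          # playback drains a second of media
            if buffered_secs <= 0:        # ran dry: pause and read ahead
                buffered_secs = 0.0
                playing = False
                stalls += 1
        elif buffered_secs >= startup_buffer_secs:
            playing = True                # enough buffered: (re)start playback
    return stalls
```

A 300 Kbits/sec stream over a 300 Kbits/sec link plays without a hiccup; push the same stream over a 150 Kbits/sec link and you get exactly the alternating pauses and playback described above.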
Another interesting aspect of streaming was that for years and years, playback occurred through extensions to the browser and stand-alone players. Implementation in the browser in the form of “HTML 5,” which is very common today, came much later. Audio/video on the web simply was not taken seriously for quite a long time. I remember running into my pro video colleagues and having them ask me why I was wasting my time with Internet video and wouldn’t go back to working for video companies. The tide changed when Google paid $1.65 billion for YouTube. Now it was no longer a curiosity on the side of the web. All of a sudden I am getting email notification that we are nominated for, and eventually win, a (technical) Emmy award from no less than the National Academy of Television Arts and Sciences. The industry was finally seeing Internet streaming as real.
From left to right: Will Poole, myself and Anthony Bay (my bosses during my time at Microsoft), celebrating our Emmy Award graciously granted to us by the National Academy of Television Arts and Sciences in 2006 for innovation in streaming technologies.
When I stood up to take the Emmy award at the ceremony, the story I told was what I heard when I got to Microsoft: “Bill Gates said one day television would become software.” And boy, has he been right. The entire system of video delivery, end to end, is now software.
Please excuse me now as I go to get a cup of coffee with this story and five dollars...
Amir Majidimehr is the founder of Madrona Digital (www.madronadigital.com) which specializes in custom home electronics. He started Madrona after he left Microsoft where he was the Vice President in charge of the division developing audio/video technologies. With more than 30 years in the technology industry, he brings a fresh perspective to the world of home electronics.