The "how many bits at which sample rate" questions etc. have been completely addressed by Bob Stuart in
this paper. It does not deal with MQA directly but with the foundations for it -- the room available in a PCM coding channel for additional information, plus a wealth of other interesting things.
I have studied the BS paper, and I am left with questions I cannot answer.
It is clear to me that the MQA approach tries to achieve a perceptually lossless result in a smaller data stream by removing unnecessary data.
So let's take this approach as the basic starting point.
Furthermore BS gives some more explanations:
and
The four listening scenarios listed at the start: a) No Decoder, b) A Core Decoder, c) A Full Decoder or d) when the stream is limited to 16b in playback, can each be previewed by Mastering tools, and also (in the supply chain) facilitated by a Pro-MQA Decoder. This enables a rights-holder to be certain of the sound arriving in all four scenarios from the same 24b distribution file and Authenticate it.
OK, in his paper BS argues that 20 bits at 58 kHz are sufficient for all playback situations, given the hearing threshold and a maximum level of 120 dB. It is convenient to use one of today's standard rates, so 88.2 kHz or 96 kHz is fine.
Furthermore, BS discusses the power of properly noise-shaped dither and the use of pre- and de-emphasis.
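To put rough numbers on the "20 bits for 120 dB" argument, here is a minimal sketch. It uses the textbook figure for a full-scale sine against uniform quantization noise (about 6.02 dB per bit plus 1.76 dB). This formula is my assumption for illustration, not taken from the BS paper, and it ignores any gain from noise shaping. The function names are made up for this post.

```python
import math

def pcm_dynamic_range_db(bits: int) -> float:
    """Textbook SNR of a full-scale sine vs. uniform quantization noise.

    Assumption: plain (flat) quantization noise, no noise shaping.
    """
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)

def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

if __name__ == "__main__":
    for bits in (16, 20, 24):
        print(f"{bits} bits: {pcm_dynamic_range_db(bits):.2f} dB")
    for rate in (44_100, 58_000, 96_000):
        print(f"{rate} Hz: audio bandwidth up to {nyquist_hz(rate):.0f} Hz")
```

Under these assumptions 20 bits give about 122.2 dB and 16 bits about 98.1 dB, so the 120 dB target fits into 20 bits but not into 16 bits without relying on noise shaping.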
All this sounds reasonable.
Now let's think about the origami folding.
A is the range covered by the usual sample rates of 44.1 kHz (or 48 kHz) at 16 bits. So I expect A to be correct for playback cases a) or d), compared with e.g. a CD created by downsampling (and dithering) from a master track.
I do not know of any method to extract the 16-bit data up to 22.05 kHz or 24 kHz without a proper anti-aliasing filter. But I also know that such a filter is either linear-phase (which introduces pre-ringing or "blur") or minimum-phase (which introduces phase changes). There are many discussions about the audibility of brickwall filters.
Proper dithering is fine, but it changes or influences at least the least significant bit of the available 16 bits.
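To illustrate what I mean by dither touching the least significant bit, here is a minimal sketch of a 24-bit to 16-bit requantization with plain TPDF dither. This is a generic textbook method and my own assumption for illustration. I do not know which dither (if any) an MQA encoder actually uses, and noise shaping is left out. The function name is hypothetical.

```python
import math
import random

def requantize_24_to_16(sample_24: int, rng: random.Random) -> int:
    """Reduce one signed 24-bit sample to 16 bits using TPDF dither.

    Assumption: flat TPDF dither of +/- 1 LSB (16-bit), no noise shaping.
    """
    step = 1 << 8  # one 16-bit LSB expressed in 24-bit units
    # TPDF dither: sum of two independent uniform values, range -1..+1 LSB
    dither = rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)
    quantized = math.floor(sample_24 / step + dither + 0.5)
    # Clip to the signed 16-bit range
    return max(-32768, min(32767, quantized))

if __name__ == "__main__":
    rng = random.Random(1)
    sample = 0x123456  # equals 4660.336... in 16-bit LSB units
    outputs = [requantize_24_to_16(sample, rng) for _ in range(10)]
    print(outputs)  # values scatter around 4660, never more than 1 LSB away
```

The same input sample does not always produce the same 16-bit value: the last bit is randomized, and only the long-term average follows the original 24-bit value. That is the price for turning distortion into noise.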
I do not really expect pre-emphasis to be applied, as there is no de-emphasis in these playback situations. But if pre-emphasis is applied, there is definitely a loss in transparency.
So in any case an original MQA track of just 16 bits depth seems impossible. It would throw away too much information: at least 4 of the 20 bits BS demands in his paper. Pre-emphasis without de-emphasis does not make sense. In the best case we would get the same as with a CD, which is claimed to be lossy.
But: as MQA also has to handle cases a) and d), it must apply a brickwall filter and add dither (= noise) at the 16-bit level. Otherwise MQA would accept playback with distortion (see BS's arguments about dither in his paper).
So when we now expand the MQA track to 24 bits, it must contain noise at the 16-bit level even though more bits are available. This may be transparent, but it is not lossless.
Now B is added to the lower bits of the 24 bits. It contains 2 bands, B1 and B2, created by a "lossless" band split. If this is possible, it would also make sense to split A and B losslessly. But with a brickwall filter there are two options. Either we have an ideal filter, so bands A and B1 do not overlap, which means infinitely steep, which means endless ringing, which means blur. Or the two bands A and B1 overlap. Because A and B1 are processed further (e.g. bit reduction of A to 16 bits), the sum of the processed parts cannot be expected to match the original data in the transition area. Thus some real information will get lost. It is possible that the result is considered transparent rather than lossy because the listener cannot detect the difference.
Adding B1 + B2 to the lower 4 bits of the 24 bits means that we have 20 bits left for coding the music data. This would be OK (also according to the BS paper).
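Just to make the bit budget concrete, here is a minimal sketch of how a 24-bit word could carry 20 bits of music data plus a 4-bit payload in the lowest bits. The layout and the function names are purely hypothetical and only reflect my reading of the diagram. The real MQA container format is not public as far as I know, and the values are treated as unsigned for simplicity.

```python
def pack_word(audio_20: int, payload_4: int) -> int:
    """Combine 20 audio bits (upper) and 4 payload bits (lower) into 24 bits.

    Hypothetical layout for illustration, not the actual MQA format.
    """
    if not 0 <= audio_20 < (1 << 20):
        raise ValueError("audio value must fit into 20 bits")
    if not 0 <= payload_4 < (1 << 4):
        raise ValueError("payload must fit into 4 bits")
    return (audio_20 << 4) | payload_4

def unpack_word(word_24: int) -> tuple[int, int]:
    """Split a 24-bit word back into (20-bit audio, 4-bit payload)."""
    return word_24 >> 4, word_24 & 0xF

if __name__ == "__main__":
    word = pack_word(0xABCDE, 0x5)
    print(hex(word))          # 0xabcde5
    print(unpack_word(word))  # (703710, 5)
```

A player that knows nothing about the payload simply treats the lower 4 bits as part of the sample, so for that player the payload is low-level noise below the 20-bit audio.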
But of course there is the green line of MQA information coded into the stream. It looks like a higher bit is used for this, maybe bit 20. OK, it is also said that the decoder will remove this completely.
It seems I must correct myself, because "it is there so we can hear more of the music when playback is limited to a 16-bit stream". According to this statement, the MQA info sits at the level of the least significant bit, i.e. it is noise again. This would mean that there is no dither such as TPDF or noise shaping, just information noise.
Summary:
There are some mysterious "lossless" operations in the MQA statements. But in any case the original music content is changed. Playback of the track without an MQA decoder, as in d), would show more distortion if no dither were applied. Added dither would also show up as noise at a higher level during 24-bit playback. If the added dither consists of MQA confirmation data, it is not random, and since it is said to be removed by the decoder, the original data have to be stored somewhere else for recovery (e.g. by shifting bits one bit level lower).