Why did you choose to go with height gen. instead of matrixing?
Upmixing by itself is a mathematically impossible problem. Let's say your front left has a value of 4, your front height is 2, and added together (downmixed), it's 6. In math, your task would sound like this: the result is 6, what numbers did I add together? There are infinitely many solutions, and only one is right. Matrix encoding is a hack around this, but it requires content to be made specifically for your solution. It means the equations are already known: for example, your left front source after mixing would play from the left front at 50% volume, 25% from the center, and 25% from the left side. This is a bad example, but it illustrates the problem nicely. This can be done in reverse, but other channels are also mixed into your outputs, so huge crosstalk will always be present. My guess is that Auro chose height speakers instead of top ones not because they're better, but because this hides the fact that speech can't be contained in the center with their method: a considerable chunk of it will end up in the center height. Since I have no say in the authoring tools, the Auro equations are unknown (they could be reverse engineered very easily, but I don't have any kind of decoder), and Dolby varies these equations on the fly (through metadata), matrixing is a no-go.
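To make the two points above concrete, here is a minimal sketch. The first part reuses the numbers from the example (4 + 2 = 6) to show that several inputs share one downmix. The second part uses a hypothetical passive matrix that folds left, center, and right into two channels; the coefficient `G` and the transposed decoder are assumptions chosen for illustration, not the equations of Auro, Dolby, or any other real system.

```python
import math

# Part 1: several (front, height) pairs produce the same downmixed value,
# so the downmix alone cannot tell which pair was the original.
candidates = [(4.0, 2.0), (5.0, 1.0), (3.0, 3.0), (6.0, 0.0)]
downmixes = [front + height for front, height in candidates]

# Part 2: a hypothetical passive matrix (illustrative coefficients only).
G = math.sqrt(0.5)  # about -3 dB
ENCODE = [
    [1.0, G, 0.0],  # Lt = L + G * C
    [0.0, G, 1.0],  # Rt = G * C + R
]


def encode(sources):
    """Fold [left, center, right] into the two transmitted channels."""
    return [sum(c * s for c, s in zip(row, sources)) for row in ENCODE]


def decode(transmitted):
    """Passive decode: apply the transposed encode matrix."""
    return [
        sum(ENCODE[ch][out] * transmitted[ch] for ch in range(2))
        for out in range(3)
    ]


# A source that was only in the left channel also appears in the center.
left_only = decode(encode([1.0, 0.0, 0.0]))
# A source that was only in the center also appears in left and right.
center_only = decode(encode([0.0, 1.0, 0.0]))
```

In this sketch the decoded center carries the left-only source at gain `G`, which is the kind of crosstalk the known equations cannot remove on their own.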
Matrixing is a great solution for transmitting 5.1 in 2.0, but that's the most you can fit with good enough crosstalk. The LFE can obviously be included straight in the other channels after gain matching (IMAX does this in practice: they have no LFE channel, just a crossover), the center is usually done with this method, and the surrounds are phase-matrixed with a Hilbert transform. There are only 4 ways for a signal to be contained in 2 channels with + or - signs, and all of them are in use. Since there are no other variations, all possibilities are exhausted, and other channels can only be added with way larger crosstalk.
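The four combinations can be sketched with phasors, where a complex number stands for the amplitude and phase of one sinusoid and multiplying by `1j` stands in for an ideal 90-degree Hilbert phase shift. The coefficients and the sum/difference decoder below are assumptions picked for illustration, not the exact equations of any particular format.

```python
import math

G = math.sqrt(0.5)  # about -3 dB, illustrative coefficient


def encode(left, right, center, surround):
    """Fold four sources into two channels (Lt, Rt).

    left: Lt only, right: Rt only, center: both in phase,
    surround: both, phase shifted in opposite directions.
    """
    lt = left + G * center + 1j * G * surround
    rt = right + G * center - 1j * G * surround
    return lt, rt


def decode(lt, rt):
    """Passive decode: pass-through, sum, and difference of Lt and Rt."""
    return {
        "left": lt,
        "right": rt,
        "center": G * (lt + rt),
        "surround": G * (lt - rt),
    }


# Center and surround separate cleanly from each other...
surround_only = decode(*encode(0.0, 0.0, 0.0, 1.0))
# ...but a left-only source leaks into both center and surround.
left_only = decode(*encode(1.0, 0.0, 0.0, 0.0))

surround_into_center = abs(surround_only["center"])
left_into_surround = abs(left_only["surround"])
```

Under these assumptions the sum and difference outputs are independent of each other, while any source placed in a single channel shows up in both of them at gain `G`. With the pass-through, sum, and difference all taken, a further channel has no clean slot left.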
QMFB, I've tried to read about it but can't find anything entry-level. Can you recommend something, or maybe give some examples of subband criteria?
It's simply a band selection algorithm. It has a formula that separates out the part of a signal defined by some characteristics (not just by frequency). The metadata contains how much of each band of each input channel is used for one output channel, so it's basically a dynamic matrix applied after the source signal has been disassembled in a known way.
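As a rough sketch of the "dynamic matrix" idea: split every input channel into bands, then build one output as a gain-weighted sum of all bands of all inputs. The two-band split below is a hypothetical placeholder (a 2-tap average and difference whose bands sum back to the input), not a real QMF bank, and the gain table is a made-up stand-in for what the metadata would carry.

```python
def split_bands(samples):
    """Placeholder split: low band plus high band equals the input signal."""
    low, high = [], []
    previous = 0.0
    for sample in samples:
        low.append((sample + previous) / 2)
        high.append((sample - previous) / 2)
        previous = sample
    return [low, high]


def render_output(band_signals, gains):
    """Mix one output channel.

    band_signals[input][band] is a list of samples,
    gains[input][band] is the gain applied to that band (the "metadata").
    """
    length = len(band_signals[0][0])
    output = [0.0] * length
    for input_bands, input_gains in zip(band_signals, gains):
        for band, gain in zip(input_bands, input_gains):
            for i, value in enumerate(band):
                output[i] += gain * value
    return output


inputs = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
bands = [split_bands(channel) for channel in inputs]

# All bands of input 0, nothing from input 1: the output equals input 0.
passthrough = render_output(bands, [[1.0, 1.0], [0.0, 0.0]])
# Only the low band of input 0.
low_only = render_output(bands, [[1.0, 0.0], [0.0, 0.0]])
```

Swapping the gain table from one block of audio to the next is what would make the matrix dynamic in this sketch.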