
ZMF Caldera Headphone Review

Rate this headphone:

  • 1. Poor (headless panther)

    Votes: 48 27.0%
  • 2. Not terrible (postman panther)

    Votes: 84 47.2%
  • 3. Fine (happy panther)

    Votes: 29 16.3%
  • 4. Great (golfing panther)

    Votes: 17 9.6%

  • Total voters
    178

Mulder

Addicted to Fun and Learning
Forum Donor
Joined
Sep 2, 2020
Messages
646
Likes
896
Location
Gothenburg, Sweden
Gotcha, I didn't know if you were friends or some such and wanted to ask.
I trust @amirm's independence, transparency, and candor on these matters. I can't see any indication of bias due to ties to particular manufacturers. In that respect, ASR is like a waterhole in a desert. There are a few more sites online that I trust, such as @solderdude's website, but overall the lack of transparency in the HiFi industry is mind-numbing. Given the financial interests and other ties between manufacturers and so-called reviewers, one can only conclude that the readers or viewers are the product and the manufacturers are the customer. Sorry if this is off-topic.
 
Last edited:

Godataloss

Senior Member
Joined
Aug 16, 2021
Messages
473
Likes
518
Location
Northern Ohio
The subjective listening tests were correlated to objective measurements. This is what Dr. Toole and Olive have done for nearly 40 years. My frequency response measurements are 100% objective. So nothing should melt your head. This is research that is of incredible value to us as consumers and we are putting it to work.
I get it, but to me it's a bit like saying- "We did a poll and everyone prefers chocolate ice cream- therefore all ice cream should be chocolate!".
 

Soria Moria

Senior Member
Joined
Mar 17, 2022
Messages
417
Likes
865
Location
Norway
I get it, but to me it's a bit like saying- "We did a poll and everyone prefers chocolate ice cream- therefore all ice cream should be chocolate!".
Bad comparison, as most food analogies are. The Harman target is the best starting point we have for making a headphone sound tonally correct, and most people will prefer it as-is. But it's important to remember that EQ should also always be available to personalise it for the best result, should you find it sounds better with a little more or less of this and that. I personally adjust the bass a little bit, but I otherwise find it tonally perfect. Having every headphone in the world tuned to Harman for stock would be way better than the wild west we have now, as Amir calls it. Isn't the point of making a good headphone to make one that as many people as possible will agree sounds correct?
 

solderdude

Grand Contributor
Joined
Jul 21, 2018
Messages
16,146
Likes
36,797
Location
The Neitherlands
I get it, but to me it's a bit like saying- "We did a poll and everyone prefers chocolate ice cream- therefore all ice cream should be chocolate!".
What's up with the lousy analogies ?
 
OP
amirm

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,785
Likes
242,553
Location
Seattle Area
I get it, but to me it's a bit like saying- "We did a poll and everyone prefers chocolate ice cream- therefore all ice cream should be chocolate!".
You are close. If we were to pick a flavor of ice cream that most people would like, wouldn't chocolate be the one instead of a random one? Pretty sure durian-flavored ice cream wouldn't make it.....
 
OP
amirm

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,785
Likes
242,553
Location
Seattle Area
So yes, model is not perfect, the paper says that much as well, but if you are doing it by eye, you are not doing it objectively or based on scientific research. It is your opinion.
No. This is not "science:"

[attached image: table of predicted preference scores]


That level of variation in preference cannot be proven. I can breathe on the test fixture and those numbers change. I can change the music and preference changes. I can change listeners and preference changes. No way you can defend numbers with that level of accuracy, in the model or in real preference tests. As you said, the degree of error is "± 6.7" which means you have to allow 14 points of ambiguity. Tables like above that you post violate that. It is the difference between precision and resolution.

If you give a school ruler to two people to measure a coin and one says it is about 0.5 inches and the other says it is 0.3965 inches according to science, which do you believe? You better say the former or we won't be friends. :) We are all taught the concept of "significant digits" in school. Let's not forget.

If you listen to Sean, he will tell you that the score of 0 to 100 was for marketing reasons, not technical or scientific ones. When I got into speaker testing, folks started to compute the preference score down to two decimal places. I immediately objected and we settled on one decimal place. But even that bothers me, although we kind of need it as the range there is so compressed in the middle.

As far as I am concerned, the only defensible preference rating is from 1 to 5 with 1 being awful and 5 being great. The system simply does not lend itself to higher precision. This is why you see only the few ranking categories I use in my reviews.
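A minimal sketch of that point, assuming the quoted RMSE of ±6.7 and treating roughly two RMSEs as the band within which two predicted scores are indistinguishable. The threshold choice and the example scores are my own illustration, not anything from the paper.

```python
def distinguishable(score_a: float, score_b: float, rmse: float = 6.7) -> bool:
    """True only if two predicted preference scores differ by more than the
    ambiguity band implied by the model error (roughly two RMSEs)."""
    return abs(score_a - score_b) > 2 * rmse

# Hypothetical scores quoted to two decimal places:
print(distinguishable(87.43, 81.95))  # False: a ~5.5 point gap sits well inside the ~13 point band
print(distinguishable(92.10, 71.40))  # True: a ~21 point gap is large enough to mean something
```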
 
OP
amirm

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,785
Likes
242,553
Location
Seattle Area
Paper explains the outlier that sits at the bottom and provides a potential reason as to why they think it does not fit the predictor formula.
Yeh and those reasons are absolutely applicable to us:

"Two outliers were found in the model (HP6 and
HP30) that produced higher predicted
preference ratings than observed. Both
headphones have audible medium Q resonances
that we believe are underestimated in the model.
"

See? That says the model is wrong as it can't predict those preferences correctly. Not that the model is correct and something is wrong with the measurements. The Caldera suffers from the problem they talk about:

[attached image: ZMF Caldera frequency response measurement]


I consider the trough between 800 Hz and 2 kHz "medium Q." In my equalization, I picked 4 as the Q.

They go on to say: "An updated version of the model will address this issue in the future."

Until you have that model, you are dealing with a rather inexact model.

What I do by eye is not only look at the deviation but apply my knowledge and experience of psychoacoustics to what I am seeing. This is precisely what Dr. Toole, Olive, etc. would do if you showed them a measurement of a headphone or speaker. They would NOT run to compute the model to give you a good/bad answer.

As backup to this: Harman never used the preference score to design speakers, even though that score is more robust than this model for headphones.

The purpose of the model is to say, "look, frequency response matters and we can even model it using linear prediction."

To be sure, bad numbers likely mean a bad headphone, and good numbers likely mean a good headphone. But that is it. So please don't go computing numbers to two decimal places like that and defending them against an experienced eye analyzing the response measurements.
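As a side note for readers who want to try that kind of correction themselves, here is a minimal sketch of a single peaking (bell) filter with Q = 4, using the standard Audio EQ Cookbook biquad formulas. Only the Q of 4 comes from the post above; the centre frequency (~1.25 kHz, roughly the middle of the 800 Hz to 2 kHz trough), the +3 dB gain, and the 48 kHz sample rate are my own illustrative assumptions.

```python
import numpy as np
from scipy.signal import freqz

def peaking_biquad(f0: float, gain_db: float, q: float, fs: float = 48000.0):
    """Biquad coefficients for a peaking EQ (RBJ Audio EQ Cookbook)."""
    A = 10 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]

# One bell filter aimed at the 800 Hz - 2 kHz trough (values are illustrative).
b, a = peaking_biquad(f0=1250.0, gain_db=3.0, q=4.0)
w, h = freqz(b, a, worN=2048, fs=48000.0)
peak = np.argmax(np.abs(h))
print(f"max boost: {20 * np.log10(np.abs(h[peak])):.2f} dB at {w[peak]:.0f} Hz")
```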
 

isostasy

Senior Member
Joined
Oct 19, 2022
Messages
354
Likes
637
What's up with the lousy analogies ?

It's a form of informal fallacy called false analogy. A comparison is made based on perceived superficial similarities, while the more significant differences that undermine the comparison are ignored.

Confirmation bias is also at work because everyone making up these rubbish analogies has a strong conviction and they're actively looking for an analogy that backs up what they believe. Therefore as soon as they come up with something that fits their belief, they jump to "aha, that's it!" rather than critically questioning their belief and the analogy.

I get it, but to me it's a bit like saying- "We did a poll and everyone prefers chocolate ice cream- therefore all ice cream should be chocolate!".

I agree. To me it's a bit like saying "we did a survey of the country on how to punish murderers and most people advocate capital punishment - therefore all murderers should be executed!"
 

DenverW

Member
Joined
Dec 11, 2023
Messages
34
Likes
23
I'm speaking in absolutes because it's a fact that headphones are for listening to music! It's not just my opinion :D

If you think otherwise, for example that they are primarily to look good, then that's "gear porn". Incidentally, I'm fine with you having the opinion that headphones aren't for reproducing a signal (in this case music) but are for looking at, but in that case I wonder why you've come to a forum called Audio Science Review, which is primarily concerned with just that.

The analogy worked well because it told you what you wanted to hear rather than being viewed critically.
Ouch! So not only do you completely change what you said, you do it in a way that makes it seem like you're still correct. First, headphones are only for reproducing a signal; now they're for listening to music. That's quite different in itself. I still disagree with the absolute, as headphones have other purposes aside from and in addition to listening to music. For example, gaming headphones, aviation headphones, office headphones, etc.

Then you invent a hypothetical scenario regarding how I view headphones to condescend to me about my preferences and draw conclusions as if I were the one who said it. Was it only a few posts above I mentioned 'sound' and then 'comfort' are my two most important factors? I can tell you I DON'T come here for this.

You explaining why an analogy worked for me is....what? It's an analogy, not a scientific paper, let it go. And stop telling me what I think.
 

IAtaman

Major Contributor
Forum Donor
Joined
Mar 29, 2021
Messages
2,428
Likes
4,228
I get it, but to me it's a bit like saying- "We did a poll and everyone prefers chocolate ice cream- therefore all ice cream should be chocolate!".
No, preference research on headphones is not like saying the majority prefers chocolate ice cream. That would not be an ice-cream preference study; it would be an ice-cream flavor preference study. In the research, other factors that might be in play, such as the type of music, are controlled for.

If you wanna go for an ice cream analogy, I think a better one would be to say "most people prefer a 20-40-40 sugar to milk to fat ratio by volume in their ice cream" (I made the numbers and contents up, as I have no idea about ice cream manufacturing), meaning that if you have people taste ice cream unsighted, with no influence from flavor or brand, you'd find that the closer the ratio is to 20-40-40, the higher the preference an ice cream gets. Sure, some might like more sugar, some might prefer it milkier, and preference scores might not have any meaning to you personally. But if you are looking for a way to objectively evaluate ice creams, the 20-40-40 ratio should be the target against which you evaluate.
 

IAtaman

Major Contributor
Forum Donor
Joined
Mar 29, 2021
Messages
2,428
Likes
4,228
That level of variation in preference cannot be proven. I can breathe on the test fixture and those numbers change. I can change the music and preference changes. I can change listeners and preference changes. No way you can defend numbers with that level of accuracy, in the model or in real preference tests. As you said, the degree of error is "± 6.7" which means you have to allow 14 points of ambiguity. Tables like above that you post violate that. It is the difference between precision and resolution.

If you give a school ruler to two people to measure a coin and one says it is about 0.5 inches and the other says it is 0.3965 inches according to science, which do you believe? You better say the former or we won't be friends. :) We are all taught the concept of "significant digits" in school. Let's not forget.
No disagreements here, but I would be a bit careful saying all that. After all, you do rank DACs based on a number that one can argue is equally arbitrary, and we don't even know what the RMSE on that is :)

As far as I am concerned, the only defensible preference rating is from 1 to 5 with 1 being awful and 5 being great. The system simply does not lend itself to higher precision. This is why you see only the few ranking categories I use in my reviews.
I agree. I did share the numbers, so I understand how it might come across as if I am trying to say numbers are the way to go, but that was not my intention. Absolute numbers are not the way to go. And, more importantly, I think the paper agrees as well. That's why they created the four categories I mentioned, Excellent, Good, Fair and Bad, to put the headphones in, and I think that level of granularity is the best we can expect.

Going back to the perception that DCA is the most compliant with the Harman research: even if we disregard the absolute numbers and stick to categories, we find that their current line-up has only 50% Excellent headphones. That is not much better than Meze, and in absolute numbers Hifiman has 5 Excellent headphones. One might agree or disagree; this is what the current science says.
 
Last edited:
OP
amirm

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,785
Likes
242,553
Location
Seattle Area
After all, you do rank DACs based on a number one can argue is equally inaccurate, and we don't even know what the RMSE on that is :)
I don't rank DACs based on one number. I put them in a color coded bar graph with four categories -- precisely what I said we should do for headphones.
 

IAtaman

Major Contributor
Forum Donor
Joined
Mar 29, 2021
Messages
2,428
Likes
4,228
Yeh and those reasons are absolutely applicable to us:

"Two outliers were found in the model (HP6 and HP30) that produced higher predicted preference ratings than observed. Both headphones have audible medium Q resonances that we believe are underestimated in the model."

See? That says the model is wrong as it can't predict those preferences correctly. Not that the model is correct and something is wrong with the measurements. The Caldera suffers from the problem they talk about:

[attached image: ZMF Caldera frequency response measurement]

I consider the trough between 800 Hz and 2 kHz "medium Q." In my equalization, I picked 4 as the Q.

They go on to say: "An updated version of the model will address this issue in the future."

Until you have that model, you are dealing with a rather inexact model.

What I do by eye is not only look at the deviation but apply my knowledge and experience of psychoacoustics to what I am seeing. This is precisely what Dr. Toole, Olive, etc. would do if you showed them a measurement of a headphone or speaker. They would NOT run to compute the model to give you a good/bad answer.

As backup to this: Harman never used the preference score to design speakers, even though that score is more robust than this model for headphones.

The purpose of the model is to say, "look, frequency response matters and we can even model it using linear prediction."

To be sure, bad numbers likely mean a bad headphone, and good numbers likely mean a good headphone. But that is it. So please don't go computing numbers to two decimal places like that and defending them against an experienced eye analyzing the response measurements.
Allow me to try to put it this way, all wrapped in an "according to my understanding" disclaimer, please. (And, for the sake of clarity, this is only a criticism of your "compliance to science" evaluation; I have nothing to say about your personal assessment of the sound quality or the EQ settings you come up with.)

According to the research, there are two parameters that affect preference: the deviation from the target curve and the slope of that deviation, both of which are negatively correlated with preference and carry almost equal weight.

If we take two hypothetical headphones, HP1 with deviations in the upper treble only and HP2 with deviations both in the upper treble and in the bass, I think with your current evaluation method you'd most likely conclude HP2 is less compliant with the research than HP1. In reality, since HP2 maintains a flatter slope in the error curve, it will most likely score higher and have higher preference. Depending on the tilt, it might be the difference between Excellent and Good, or Fair and Bad.

So my point is, by not taking into account whether the "tilt" of the error still maintains the balance of bass and highs, you are deviating from the research, and your assessment of compliance with the Harman research is not correct.

Hope this makes sense.
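To make the two-parameter idea concrete, here is a minimal sketch of that HP1/HP2 thought experiment. The error metrics (standard deviation of the error curve, and its tilt from a fit against log frequency) follow my reading of the research as described above, but the intercept and weights are placeholders chosen only to show the trade-off, not the published coefficients, so only the relative ordering of the two toy scores matters.

```python
import numpy as np

def error_metrics(freqs_hz, error_db):
    """Standard deviation of the error curve (dB) and its tilt (dB/octave),
    the latter from a least-squares fit of error against log2(frequency)."""
    slope, _ = np.polyfit(np.log2(freqs_hz), error_db, 1)
    return np.std(error_db), slope

def toy_score(sd, slope, intercept=100.0, w_sd=12.0, w_slope=15.0):
    """Placeholder linear predictor: the score drops with both the size of the
    error and its absolute tilt (intercept and weights are made up)."""
    return intercept - w_sd * sd - w_slope * abs(slope)

freqs = np.logspace(np.log10(20), np.log10(20_000), 200)  # 20 Hz - 20 kHz, log spaced

# HP1: error confined to the upper treble (+5 dB above 8 kHz) -> strongly tilted error curve.
hp1 = np.where(freqs > 8_000, 5.0, 0.0)
# HP2: errors at both ends (+3 dB below 200 Hz, +3 dB above 8 kHz) -> deviations over a
# wider part of the spectrum, but the low and high errors balance, so the tilt stays small.
hp2 = np.where(freqs < 200, 3.0, 0.0) + np.where(freqs > 8_000, 3.0, 0.0)

for name, err in (("HP1", hp1), ("HP2", hp2)):
    sd, slope = error_metrics(freqs, err)
    print(f"{name}: sd = {sd:.2f} dB, tilt = {slope:+.2f} dB/oct, toy score = {toy_score(sd, slope):.1f}")
```

With these made-up weights HP2 ends up ahead, which is exactly the point: a headphone that looks "less compliant" by eye can still come out better once the tilt of the error is taken into account.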
 
Last edited:

IAtaman

Major Contributor
Forum Donor
Joined
Mar 29, 2021
Messages
2,428
Likes
4,228
I don't rank DACs based on one number. I put them in a color coded bar graph with four categories -- precisely what I said we should do for headphones.
True, but you still keep the number, to the first significant digit I presume, to be able to rank them among themselves as well. Why not do a similar thing for headphones, based on the actual methods the research suggests? You can drop the number if you want. Then we could actually see objectively which headphones and manufacturers are the most scientifically minded, and which are not.
 
Last edited:

solderdude

Grand Contributor
Joined
Jul 21, 2018
Messages
16,146
Likes
36,797
Location
The Neitherlands
Because of the measurement tolerance ?

I mean.... you can connect a DAC 10 times in a row and run the same test 10 times in a row and the generated number will always be the same.
Only the top ranking numbers might deviate 1 or so due to noise becoming the variable.

If you ever measured 1 headphone in your life, you may have noticed that, depending on seating and seal, the measured response can easily be many dBs off, which can affect the generated number by much more than 1 point.
Hell... just move a headphone around on your head and some of them can change tonality quite audibly.
Also production spread can lead to generated number differences well over a few points alone.

So in this case I am with Dr. Olive and Amir and would have preferred a 1-10 scale or possibly even better a 1-5 hearts, kisses or stars or something (one could even do half stars and still have a 1-10 scale).
It is the 'list people' that seem to want the decimals, even though those decimals literally say nothing about reality due to the tolerances.

You can still have 2 headphones with the same preference score that nevertheless have a substantially different tonal balance (tilt), even in near-ideal and averaged circumstances.
So a single number does not even show tonal balance, which we all agree is the most important aspect. You need at least 2 numbers, and to know how the 'tilt' number will sound.

People will blindly buy on numbers like this and may end up with a (to them) not great sounding headphone: seal issues, poor fit, high clamping force or weight, or a tonal balance they do not prefer. But hey... the numbers say this one is 'better' as it is 1 or 2 points higher and cheaper (or more expensive, whatever the reasoning is).

It's a nonsensical list that is abused just like the SINAD list is abused. People should not replace a DAC simply because its SINAD is a few points higher either.
One should look at all the measurements and make a more informed decision based on those. The same goes for headphones.
All measurements should be looked at, not a silly ranked number, regardless of how sciency that list appears to be.

One should listen to headphones (or at least be able to return them) before one buys (DACs not so much though :)).
Measurements can be an indicator just like subjective reviews can be (depends on the reviewer of course). Those lists.... not so much.
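To put a rough number on the seating/seal point above, here is a minimal sketch using the same kind of error metrics as the headphone preference research (standard deviation and tilt of the error curve), with made-up weights and a made-up leak of 2 dB below 150 Hz. None of these values come from an actual measurement; the sketch only illustrates that a couple of dB of seal variation can swing such a score by far more than a point.

```python
import numpy as np

def error_metrics(freqs_hz, error_db):
    """Std-dev of the error curve (dB) and its tilt (dB/octave) vs log2(frequency)."""
    slope, _ = np.polyfit(np.log2(freqs_hz), error_db, 1)
    return np.std(error_db), slope

def toy_score(sd, slope):
    """Placeholder predictor (made-up intercept and weights), used only to show sensitivity."""
    return 100.0 - 12.0 * sd - 15.0 * abs(slope)

freqs = np.logspace(np.log10(20), np.log10(20_000), 200)

good_seal = np.zeros_like(freqs)               # ideal seating: error ~ 0 dB everywhere
leaky_seal = np.where(freqs < 150, -2.0, 0.0)  # re-seated with a small leak: -2 dB below 150 Hz

for name, err in (("good seal", good_seal), ("leaky seal", leaky_seal)):
    sd, slope = error_metrics(freqs, err)
    print(f"{name}: toy score = {toy_score(sd, slope):.1f}")
# The small leak alone moves the toy score by well over ten points here,
# far more than the one-point differences people compare on ranking lists.
```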
 
Last edited:

IAtaman

Major Contributor
Forum Donor
Joined
Mar 29, 2021
Messages
2,428
Likes
4,228
Because of the measurement tolerance ?

I mean.... you can connect a DAC 10 times in a row and run the same test 10 times in a row and the generated number will always be the same.
Only the top ranking numbers might deviate 1 or so due to noise becoming the variable.

If you ever measured 1 headphone in your life, you may have noticed that, depending on seating and seal, the measured response can easily be many dBs off, which can affect the generated number by much more than 1 point.
Also production spread can lead to generated number differences well over a few points alone.

So in this case I am with Dr. Olive and Amir and would have preferred a 1-10 scale or possibly even better a 1-5 stars or something (one could even do half stars and still have a 1-10 scale).
It is the 'list people' that seem to want the decimals even when they literally say nothing about the reality due to tolerances.

Personally, I did measure at least 1 headphone in my life. But even those who have not measured any headphones could still very well be aware of the variations caused in measurements by seating, seal, unit variation etc. After all you don't need to lay eggs to know about eggs.

In any case, it's hard to say what you are responding to since you are not quoting it, but I think we agreed that absolute numbers are not the way to go, and that ranges and categories are more reliable, as the paper also suggests. More pertinent is what affects preference according to the research. In my understanding, it is not just compliance with the curve, it is also the overall balance: one headphone might deviate from the target more than another, but if the deviation is balanced, it might still be preferred. Do you have any take on that?

On the DAC part, that is not true. The numbers do move around; you can make a DAC go up or down a few places in the rankings if you want. Probably not across categories, but in the rankings for sure. Or you can choose a different voltage or a different bandwidth and the whole ranking might change. The categories probably would not change, but the rankings could. Anyway, that is off-topic on a slightly less off-topic topic. Maybe we can do a study and find out what the mean error on that is.
 
Last edited:

solderdude

Grand Contributor
Joined
Jul 21, 2018
Messages
16,146
Likes
36,797
Location
The Neitherlands
Personally, I did measure at least 1 headphone in my life. But even those who have not measured any headphones could still very well be aware of the variations caused in measurements by seating, seal, unit variation etc. After all you don't need to lay eggs to know about eggs.
Most people, I am sure, just assume a single plot shown by someone is the absolute truth.
This, however, is absolutely not the case.

You do need to understand the science behind the measurements and the pitfalls to really understand them. And this is excluding the whole 'target' stuff.

In any case, it's hard to say what you are responding to since you are not quoting it, but I think we agreed that absolute numbers are not the way to go, and that ranges and categories are more reliable, as the paper also suggests. More pertinent is what affects preference according to the research. In my understanding, it is not just compliance with the curve, it is also the overall balance: one headphone might deviate from the target more than another, but if the deviation is balanced, it might still be preferred. Do you have any take on that?
Yep, I do. The better headphones usually have a gradual or no slope and should not have wide bandwidth peaks/humps. Dips are usually more benign but too much of that also is not good.
Sharp peaks may or may not be very audible and this depends on the frequency (and is personal as well).
The 'smoother' the measured response the better the sound quality in general and the easier it is to correct using EQ.

When comparing the DCA and the Caldera, it is obvious that the range between 1 and 6 kHz is compromised by the pads used on the Caldera, whereas the DCA response can be called exemplary, almost IEM-like.
Such dips usually lead to a perceived lower sound quality, even when the tonal balance is close to the preferred one (which does not mean it has to follow Harman).

On the DAC part, that is not true. Numbers do play around, you can make a DAC go up or down a few places in the rankings if you want. Not on the categories probably, but on the rankings for sure. Or you can chose a different voltage or different bandwidth and whole ranking might change. Categories would not probably, but rankings could. Anyway, that is off-topic on a slightly less off-topic topic. Maybe we can do a study and find out what is the mean error on that.
On the DAC part what I said is very true.
One can manipulate the measurements by choosing a different measurement BW or reference voltage of course and in that way get different numbers.
That, however, is not the case in Amir's measurements. The 0 dB reference (unless specified otherwise) is always 2 V (and of course 4 V balanced), so there is no manipulation going on there. The measurement BW is also always the same, so there is no variable there.

@amirm can tell you exactly what the delta is per measurement/run of the same device and of runs in general.

A variable seems to be hum but this, again, should not change between measurement runs unless conditions around that hum changed.

Nah... the variance between measurements of headphones (less so with speakers when measured gated) is larger than for any electronic component, and by a large factor.
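For readers wondering how the analysis bandwidth can move a SINAD number at all, here is a minimal sketch with an invented signal: a 1 kHz tone at 2 Vrms plus a flat noise floor, with noise-plus-distortion integrated over two different bandwidths. The levels are made up and this is not the actual measurement chain; it only shows why keeping the bandwidth (and the reference level) fixed matters when comparing numbers.

```python
import numpy as np

fs = 192_000                  # sample rate (Hz)
n = fs                        # one second of data -> 1 Hz bin spacing, 1 kHz lands exactly on a bin
t = np.arange(n) / fs

rng = np.random.default_rng(0)
signal = np.sqrt(2) * 2.0 * np.sin(2 * np.pi * 1000 * t)  # 1 kHz sine at 2 Vrms
noise = rng.normal(0.0, 20e-6, n)                          # flat noise floor (invented level)
x = signal + noise

power = np.abs(np.fft.rfft(x)) ** 2                        # single-sided power spectrum (relative units)
freqs = np.fft.rfftfreq(n, 1 / fs)

def sinad_db(bandwidth_hz: float) -> float:
    """SINAD with noise+distortion integrated only up to the chosen bandwidth."""
    sig_bins = (freqs >= 995) & (freqs <= 1005)            # the tone plus a small guard band
    nd_bins = (freqs > 0) & (freqs <= bandwidth_hz) & ~sig_bins
    return 10 * np.log10(power[sig_bins].sum() / power[nd_bins].sum())

print(f"SINAD over 22.4 kHz: {sinad_db(22_400):.1f} dB")
print(f"SINAD over 90 kHz:   {sinad_db(90_000):.1f} dB")   # lower, because more noise is integrated
```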
 
Last edited:

isostasy

Senior Member
Joined
Oct 19, 2022
Messages
354
Likes
637
Ouch! So not only do you completely change what you said, you do it in a way that makes it seem like you're still correct. First, headphones are only for reproducing a signal; now they're for listening to music. That's quite different in itself. I still disagree with the absolute, as headphones have other purposes aside from and in addition to listening to music. For example, gaming headphones, aviation headphones, office headphones, etc.

Then you invent a hypothetical scenario regarding how I view headphones to condescend to me about my preferences and draw conclusions as if I were the one who said it. Was it only a few posts above I mentioned 'sound' and then 'comfort' are my two most important factors? I can tell you I DON'T come here for this.

You explaining why an analogy worked for me is....what? It's an analogy, not a scientific paper, let it go. And stop telling me what I think.

I didn't change what I said. Headphones are for reproducing a signal. Music is one such signal, and what most of us are primarily concerned with here. So no, it's not "quite different in itself", I said the same thing using different words, which I hope you'll permit me to do? If you can't get on board with basic definitions like what an audio signal is then this is my final attempt to engage with you. This isn't a matter of opinion.

Show me the last time a gaming, aviation, or office headphone was reviewed here.

I'm sorry my made-up example upset you; I missed your earlier post, so I could only guess why you were disagreeing with my statement that headphones are for reproducing a signal. I'm sorry if I come across as blunt, and I'm not trying to condescend to you about your preferences. You registered literally this week and very quickly ended up agreeing profusely with an analogy which seemed to confirm your feelings. Moreover, you began with "I know some people are having trouble with the analogy" as if we simply didn't understand it. I tried to explain how, on the contrary, the analogy just doesn't work, regardless of whether it seems to support your viewpoint (it's a fallacy, see above). You then admitted that it doesn't even matter to you how much I demonstrate the inadequacy of the analogy; not only will it not change your opinion, it will reinforce it.

So no, it's not a scientific paper, but this is Audio Science Review and if 'sound' is really most important to you then I feel you could get more out of this forum than agreeing with the first analogy that reinforces your preconceived opinion.

I like this post by @_thelaughingman: "Engage in a topic from the perspective of wanting to learn something new about the hobby of being an audiophile." If you've joined, immediately found an analogy that supports what you already think, and then, in the face of being told the analogy itself is flawed, responded that you don't care, that suggests to me you haven't learnt anything.

I'm honestly trying to engage you in the spirit of this forum, which I believe to be critical thinking and objective analysis. As petty as disagreeing with an analogy seems to be, I believed pointing out a false analogy based on flawed logic was a good way to do that, and easier than getting into the fine details of frequency responses, statistical analysis, etc. I'm sorry if this came across as blunt, and I'm certainly not trying to tell you what you think.
 

fredristair

Member
Joined
Mar 8, 2021
Messages
50
Likes
49
Location
Missouri
It always seemed to me that you couldn't really tune a headphone to Harmon without compromising the sound quality. Sennheiser would've done it by now, which is why the HD650 (or 600) remain so near-perfect to this day and they haven't been able to top it. A lot of people here like the Stealth, but it doesn't get much praise in comparison to many other TOTL headphones. It seems like you lose out on a lot of technicalities, and that is why headphones that are really amazing are so hard to tune to Harmon - see the Susvara or Utopia.
 

IAtaman

Major Contributor
Forum Donor
Joined
Mar 29, 2021
Messages
2,428
Likes
4,228
It always seemed to me that you couldn't really tune a headphone to Harmon without compromising the sound quality. Sennheiser would've done it by now, which is why the HD650 (or 600) remain so near-perfect to this day and they haven't been able to top it. A lot of people here like the Stealth, but it doesn't get much praise in comparison to many other TOTL headphones. It seems like you lose out on a lot of technicalities, and that is why headphones that are really amazing are so hard to tune to Harmon - see the Susvara or Utopia.
Just to clarify, because this has been a point of contention previously, when you say Harmon, do you mean Mark Harmon of NCIS?
 