
Summarization of threads using LLMs (AI)

MaxwellsEq

Major Contributor
Joined
Aug 18, 2020
Messages
1,822
Likes
2,780
Interesting. Where has the training data come from? If it's based on general internet audio discussions, it's likely to be more biased towards subjective opinion than objective measurement. If the training data came exclusively from academia, ASR and competent engineering companies it would be less likely to focus on sound and more likely to pick up on reproducibility of analysis.
 

MatrixRave

Member
Joined
Apr 17, 2024
Messages
35
Likes
27
Interesting. Where has the training data come from? If it's based on general internet audio discussions, it's likely to be more biased towards subjective opinion than objective measurement. If the training data came exclusively from academia, ASR and competent engineering companies it would be less likely to focus on sound and more likely to pick up on reproducibility of analysis.
We didn't train the models ourselves. I don't know what Meta uses as data to train Llama but you may be right that training the model yourself could improve results.
Not sure if we have enough time to gather data that can be used for training, then train the model and implement a first version for summaries.
 

MaxwellsEq

Major Contributor
Joined
Aug 18, 2020
Messages
1,822
Likes
2,780
We didn't train the models ourselves. I don't know what Meta uses as data to train Llama but you may be right that training the model yourself could improve results.
Not sure if we have enough time to gather data that can be used for training, then train the model and implement a first version for summaries.
That's the experience of my colleagues (in a rather huge example). The internet training data turns out to have lots of nasty systemic bias in it, as you would expect.

I would think that a system trained on internet data is likely to have even more cognitive bias than a troll turning up on ASR insisting they can hear a night and day difference between DACs, cables and op-amps and their wife in the kitchen agrees.
 

RandomEar

Senior Member
Joined
Feb 14, 2022
Messages
344
Likes
826
My feedback:

Mistral 7B
  • Falsely states that PA5 II is "Class A"
  • Makes up stuff about "warm and detailed sound" which wasn't stated by the reviewer
  • Incorrectly talks about issues with durability, which do not relate to the amplifier discussed in the review, but the predecessor
  • Well structured but very superficial and broadly generalized summary with very little substance

Llama3
  • Uses annoying platitudes like "a popular choice among audiophiles"
  • Makes up stuff about "smooth midrange, and tight bass"
  • "Summary" and "Key Points" are very repetitive
  • States "I did not lose much information when summarizing the thread." ???
  • Summary of the "Tower vs Bookshelf Speakers" thread looks broadly correct, but also contains many repetitions

Overall, I don't think the current state gives usable results. If "forced to choose", I would prefer Mistral 7B, but with shorter outputs and some corrections to the obvious errors. Overall, the summaries appear very superficial and disregard most of the relevant "hard facts" from the main (first) review post of each thread. I think apart from integrating data from the images, there needs to be some directive to summarize the review post first and then add a summary of the discussion in a second step.

As already mentioned by others in the thread, one important limitation seems to be the context size. Could something like YaRN be helpful here?
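The "review post first, discussion second" idea could be sketched roughly as below. This is only an illustration, not what the project actually does: `build_prompts`, the prompt wording, and the character cut-off are all assumptions made up for the example, and the crude truncation is merely a stand-in for whatever context handling ends up being used.

```python
def build_prompts(posts, max_discussion_chars=2000):
    """Build two separate prompts: one for the review post, one for the replies.

    posts: list of (author, text) tuples; posts[0] is assumed to be the review.
    max_discussion_chars: hypothetical stand-in for the model's context limit.
    """
    if not posts:
        raise ValueError("thread has no posts")

    review_author, review_text = posts[0]
    review_prompt = (
        "Summarize the following review post. "
        "Only state facts that appear in it.\n\n"
        f"{review_author}: {review_text}"
    )

    # Everything after the first post counts as discussion.
    discussion = "\n".join(f"{author}: {text}" for author, text in posts[1:])
    discussion = discussion[:max_discussion_chars]  # crude cut-off, assumption
    discussion_prompt = (
        "Summarize the discussion that followed the review. "
        "Only state facts that appear in it.\n\n"
        f"{discussion}"
    )
    return review_prompt, discussion_prompt
```

The two prompts would then be sent to the model separately and the two answers shown one after the other, so the review facts cannot be crowded out by the discussion.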
 

Wicky

Active Member
Joined
Jun 8, 2021
Messages
194
Likes
220
Location
London
We didn't train the models ourselves. I don't know what Meta uses as data to train Llama but you may be right that training the model yourself could improve results.
Not sure if we have enough time to gather data that can be used for training, then train the model and implement a first version for summaries.
Have you considered fine-tuning these models? I've heard this gives big improvements, especially if you are using the smaller 7B models.
 

MatrixRave

Member
Joined
Apr 17, 2024
Messages
35
Likes
27
That's the experience of my colleagues (in a rather huge example). The internet training data turns out to have lots of nasty systemic bias in it, as you would expect.

I would think that a system trained on internet data is likely to have even more cognitive bias than a troll turning up on ASR insisting they can hear a night and day difference between DACs, cables and op-amps and their wife in the kitchen agrees.
The problem with training a model ourselves is time and computing power/money. Training a model is something we won't be able to do.
 

MatrixRave

Member
Joined
Apr 17, 2024
Messages
35
Likes
27
My feedback:

Mistral 7B
  • Falsely states that PA5 II is "Class A"
  • Makes up stuff about "warm and detailed sound" which wasn't stated by the reviewer
  • Incorrectly talks about issues with durability, which do not relate to the amplifier discussed in the review, but the predecessor
  • Well structured but very superficial and broadly generalized summary with very little substance

Llama3
  • Uses annoying platitudes like "a popular choice among audiophiles"
  • Makes up stuff about "smooth midrange, and tight bass"
  • "Summary" and "Key Points" are very repetitive
  • States "I did not lose much information when summarizing the thread." ???
  • Summary of the "Tower vs Bookshelf Speakers" thread looks broadly correct, but also contains many repetitions

Overall, I don't think the current state gives usable results. If "forced to choose", I would prefer Mistral 7B, but with shorter outputs and some corrections to the obvious errors. Overall, the summaries appear very superficial and disregard most of the relevant "hard facts" from the main (first) review post of each thread. I think apart from integrating data from the images, there needs to be some directive to summarize the review post first and then add a summary of the discussion in a second step.

As already mentioned by others in the thread, one important limitation seems to be the context size. Could something like YaRN be helpful here?
So if I got this right, summaries should be shorter, image data should be included and obviously mistakes need to be reduced. Thanks a lot, that's something we can work with.

We are already looking at models with an improved context window, so YaRN could come in handy. Will definitely give it a closer look.
I hope to be able to run a first test with a tuned model by Friday, but since we are both working on other classes as well, time is the limiting factor.
 

MatrixRave

Member
Joined
Apr 17, 2024
Messages
35
Likes
27
Have you considered fine-tuning these models? I've heard this gives big improvements, especially if you are using the smaller 7B models.
We didn't plan on fine-tuning one ourselves yet since we want to fix some other things first. Next step will be to run tests with improved context windows.
But we keep fine-tuning in mind, thanks for the input.
 

kemmler3D

Major Contributor
Forum Donor
Joined
Aug 25, 2022
Messages
3,594
Likes
7,336
Location
San Francisco
So if I got this right, summaries should be shorter, image data should be included and obviously mistakes need to be reduced. Thanks a lot, that's something we can work with.

We are already looking at models with an improved context window, so YaRN could come in handy. Will definitely give it a closer look.
I hope to be able to run a first test with a tuned model by Friday, but since we are both working on other classes as well, time is the limiting factor.
IMO summaries don't need to be super short, but they should avoid repetition, and should have low/no mistakes - the most important one to avoid is including ideas that weren't actually present in the threads. The threads that need summarization are often 100s of posts long, so even a full-page summary would be OK as long as it caught the key points at a useful level of detail. If someone could read an LLM thread summary, and the original post, and then participate meaningfully in the ongoing discussion based on that, it would be a real success for this use case IMO.
 

MatrixRave

Member
Joined
Apr 17, 2024
Messages
35
Likes
27
IMO summaries don't need to be super short, but they should avoid repetition, and should have low/no mistakes - the most important one to avoid is including ideas that weren't actually present in the threads. The threads that need summarization are often 100s of posts long, so even a full-page summary would be OK as long as it caught the key points at a useful level of detail. If someone could read an LLM thread summary, and the original post, and then participate meaningfully in the ongoing discussion based on that, it would be a real success for this use case IMO.
Avoiding repetition and hallucination/mistakes is definitely on the agenda. We tried something with usernames and had massive problems with hallucination there so this definitely needs to be fixed.
If I understand you correctly, by a full-page summary you mean summarizing each page of a thread on its own and then showing the summary for each page, right?
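One cheap guard against the username hallucinations mentioned above might look like the sketch below. It is an assumption-laden illustration: `unknown_usernames` is a made-up helper, and it assumes the summary introduces people as "User <name>" and that names contain no spaces, which will not hold for every forum member.

```python
import re

def unknown_usernames(summary, participants):
    """Return names the summary attributes statements to but who never posted.

    summary: generated summary text.
    participants: usernames that actually posted in the thread.
    """
    known = {name.lower() for name in participants}
    # Assumption: the summary writes "User <name>"; names with spaces are
    # only matched up to the first space.
    mentioned = re.findall(r"\bUser\s+([A-Za-z0-9_\-]+)", summary)
    return sorted({name for name in mentioned if name.lower() not in known})
```

A non-empty result would not prove the rest of the summary is wrong, but it is a simple signal that a summary should be regenerated or checked by hand.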
 

kemmler3D

Major Contributor
Forum Donor
Joined
Aug 25, 2022
Messages
3,594
Likes
7,336
Location
San Francisco
If I understand you correctly, by a full-page summary you mean summarizing each page of a thread on its own and then showing the summary for each page, right?
No, sorry, I meant that a summary could be a full page in length (say 500+ words) and still be useful, since the underlying threads can be so long.

However, I guess if it helped with the token / context window difficulties, doing a very brief summary of each page of a thread could also be useful.
 

MatrixRave

Member
Joined
Apr 17, 2024
Messages
35
Likes
27
No, sorry, I meant that a summary could be a full page in length (say 500+ words) and still be useful, since the underlying threads can be so long.

However, I guess if it helped with the token / context window difficulties, doing a very brief summary of each page of a thread could also be useful.
Ahh okay now I get what you mean, thanks for clarifying.
Yeah, the one-page-at-a-time approach is one option if there is no better way to fix the context window issue. Maybe not a single page every time, but a group of pages every time, and these get summarized. We will see if this is necessary or a possible solution in the next few days, I think.
 

kemmler3D

Major Contributor
Forum Donor
Joined
Aug 25, 2022
Messages
3,594
Likes
7,336
Location
San Francisco
Ahh okay now I get what you mean, thanks for clarifying.
Yeah, the one-page-at-a-time approach is one option if there is no better way to fix the context window issue. Maybe not a single page every time, but a group of pages every time, and these get summarized. We will see if this is necessary or a possible solution in the next few days, I think.
A combined summary of X sub-summaries could still be pretty handy.

"The discussion on pages 1-5 focused on these topics: X, Y, Z with a few comments that generated multiple replies: A, B, C. The discussion on pages 6-10 focused on these topics..." still much faster than reading all 10 pages.
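As a rough illustration of the page-group idea, the bookkeeping could be as simple as the sketch below. The function names and the output wording are assumptions for the example; the per-group summary texts would come from the model and are just plain strings here.

```python
def group_pages(num_pages, group_size):
    """Split pages 1..num_pages into consecutive (first, last) ranges."""
    groups = []
    for start in range(1, num_pages + 1, group_size):
        end = min(start + group_size - 1, num_pages)
        groups.append((start, end))
    return groups

def combine(sub_summaries):
    """Join per-group summaries into one text.

    sub_summaries: list of ((first_page, last_page), summary_text).
    """
    parts = []
    for (first, last), text in sub_summaries:
        label = f"page {first}" if first == last else f"pages {first}-{last}"
        parts.append(f"The discussion on {label} focused on: {text}")
    return "\n".join(parts)
```

For a 12-page thread with groups of five, `group_pages(12, 5)` gives the ranges 1-5, 6-10 and 11-12, each of which would be summarized on its own before being joined.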
 

amirm

Founder/Admin
Staff Member
CFO (Chief Fun Officer)
Joined
Feb 13, 2016
Messages
44,844
Likes
243,341
Location
Seattle Area
So if I got this right, summaries should be shorter, image data should be included and obviously mistakes need to be reduced. Thanks a lot, that's something we can work with.
Amazon's summary of customer reviews does that and it works well. It is usually 3 to 4 sentences. The first few are praise and the last are issues some have found (if any).
 

voodooless

Grand Contributor
Forum Donor
Joined
Jun 16, 2020
Messages
10,528
Likes
18,597
Location
Netherlands
I think one of the major issues here, besides the making up of facts as usual, is that these models seem to be tainted by audiophile press, reviews, and ad copies. Consequently, it tries to mimic those things when summarizing forum topics.

I also think it may be important to weigh posts based on who posted (likes/postcount of the poster), and especially the likes on the posts themselves. These things give context about what is deemed important, and therefore worthwhile to note in a summary. Obviously next to a slew of other things :)
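A minimal sketch of such a weighting, purely as an illustration: the formula, the log scaling, and the field names (`likes`, `author_likes`, `author_posts`) are all assumptions, not something the forum software or the project provides in this form.

```python
import math

def post_weight(post_likes, author_likes, author_posts):
    """Score a post from its own likes and the poster's standing."""
    # Log scaling so a member with tens of thousands of messages
    # does not drown out everyone else (an arbitrary design choice).
    reputation = math.log1p(author_likes) / (1.0 + math.log1p(author_posts))
    # Likes on the post itself dominate; reputation only nudges the score.
    return (1 + post_likes) * (1.0 + reputation)

def top_posts(posts, k):
    """Return the k highest-weighted posts (dicts with the assumed fields)."""
    ranked = sorted(
        posts,
        key=lambda p: post_weight(p["likes"], p["author_likes"], p["author_posts"]),
        reverse=True,
    )
    return ranked[:k]
```

The selected posts could then be passed to the model first, or marked as important in the prompt, so that well-received posts are more likely to show up in the summary.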
 

Billy Budapest

Major Contributor
Forum Donor
Joined
Oct 11, 2019
Messages
1,867
Likes
2,808
Lastly, they both just summarize what type of discussion happened in the thread, not what the prevailing opinions or conclusions might be. While this is valid, it's not very useful. What you'd want to know in a thread summary is where the balance of opinions landed, not only what topics the thread happened to touch on.
This is the main issue with the review, and demonstrates that the AI/LLMs did not understand the assignment. The hallucinations just magnified this problem.
 
OP
NicoVtt

Member
Joined
Apr 13, 2024
Messages
22
Likes
8
Thank you for your feedback! We really appreciate it! :)

We have reworked our approach and tried to address some of the mentioned issues. We upgraded to the 70 billion parameter model of Llama3 (llama3:70b-instruct) to support the new approach. To solve the problem of generating summaries for a larger thread, despite the limitations of the model context window (input length), we generate smaller summaries and use them to iteratively refine an overall summary in several steps ("refine on map-reduce"). We were able to reduce hallucinations by adapting the model parameters and using llama3:70b over models like Mixtral.

Currently, we are working on improvements regarding a larger summary length and how much attention the model pays to the initial post of a thread. We have postponed the inclusion of images for now in order to reduce complexity and to focus on optimizing the summaries.
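For readers unfamiliar with the pattern, the chunk-then-refine idea described above might be sketched as follows. This is not the project's code: `chunk_posts`, `summarize_thread`, the prompt texts, and the character budget are assumptions, and `llm` is any callable that takes a prompt string and returns text (for example a wrapper around a locally served model).

```python
from typing import Callable, List

def chunk_posts(posts: List[str], max_chars: int) -> List[str]:
    """Pack consecutive posts into chunks of roughly max_chars characters."""
    chunks, current = [], ""
    for post in posts:
        # Start a new chunk when adding this post would exceed the budget.
        if current and len(current) + len(post) + 1 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n{post}" if current else post
    if current:
        chunks.append(current)
    return chunks

def summarize_thread(posts: List[str], llm: Callable[[str], str],
                     max_chars: int = 4000) -> str:
    """Summarize each chunk, then fold the partial summaries into one."""
    # Map step: one small summary per chunk.
    partials = [llm(f"Summarize these forum posts:\n{chunk}")
                for chunk in chunk_posts(posts, max_chars)]
    # Refine step: update the overall summary one partial at a time.
    overall = ""
    for partial in partials:
        if not overall:
            overall = partial
        else:
            overall = llm(
                "Refine the existing summary with the new information. "
                "Do not add facts that appear in neither text.\n"
                f"Existing summary:\n{overall}\nNew information:\n{partial}"
            )
    return overall
```

Note that a single post longer than `max_chars` still ends up as its own oversized chunk in this sketch, so very long review posts would need separate handling.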
These are the current intermediate results we are working on:

Topping PA5 II Stereo Amplifier Review (47362)

Here is the extended summary:

The forum posts discuss the TP RA3 Rackmount Amplifier, a new product from Topping's sub-brand, which costs $229 and is available on Aliexpress. The amplifier has a remote control and SOTA resistor-based attenuation.

User amirm shares their review of the amplifier, including measurements, and notes that it looks great with a large and highly visible volume level. Another user, Music1969, asks about channel imbalance, but amirm responds that they only have this version and don't measure channel imbalance for power amps as it's not typically an issue.

Paco De Lucia shares their positive experience with the amplifier, using it in a bi-amp setup without any problems. BeerBear suggests that an electronic/digital volume control would be nice to achieve perfect channel balance and avoid scratching.

Antcollinet asks about pairing the amplifier with a DAC that has remote control. Music1969 brings up the issue of channel imbalance again, referencing a previous review of the Topping PA3S, which showed channel balance issues at low volumes. They ask amirm to measure the channel imbalance of the RA3.

Howard416 identifies the amplifier's chip as the TPA3251. Talisman is cautious about voting for the product due to concerns about reliability, citing past issues with Topping products. Yoku-San suggests that if the amplifier proves reliable, it could be part of an interesting entry-level combination with other affordable components.

Additionally, there was a discussion about the slew rate of the amplifier, with Antcollinet mentioning that it's important for a power amp to have a high slew rate to handle dynamic music. He calculates the slew rate based on the chip's specifications and concludes that it should be able to handle most music without issues.

Here is the extended summary:

The forum discussion revolves around two main topics: the ability of audio equipment to reproduce signals accurately and a specific amplifier product review.

In the first part of the discussion, users debate about the speed at which an audio device can reproduce signals within the human audible range (20Hz to 20kHz). User Julf suggests calculating the slew rate by multiplying the full power output voltage with the upper frequency limit. However, User Maxon insists on knowing the specific parameter of slew rate. The discussion continues with corrections and clarifications on how to calculate the slew rate accurately.

In the second part of the discussion, users review and discuss the TP RA3 Rackmount Amplifier, a new product from Topping's sub-brand. The amplifier has a remote control and SOTA resistor-based attenuation. Users share their positive experiences with the amplifier, including measurements and usage in bi-amp setups. Some users raise concerns about channel imbalance and reliability issues, referencing previous reviews of similar products. Others suggest potential pairings with DACs that have remote control and discuss the importance of perfect channel balance.

Overall, the discussion covers technical aspects of audio equipment performance and user experiences with a specific amplifier product.

Tower vs bookshelf speakers (16822)

Here is the extended summary:

The forum discussion revolves around audio equipment and speaker systems. The conversation starts with User cathodeb Wall sharing an unusual recording experience using a dummy head mic and seeking opinions on the matter.

User Emlin humorously remarks that it might depend on how much time one is willing to devote to listening to car Doppler effects, implying that the topic may be niche or require significant dedication.

The conversation then shifts to a discussion about speakers and subwoofers. User sigbergaudio mentions that there are few speakers that can rival a good subwoofer in terms of performance, citing the Revel Salon 2 as an example. They argue that even high-end speakers like the Salon 2 may not be able to produce meaningful output at low frequencies (20Hz), and that adding a subwoofer can enhance the overall sound quality.

User simple6 asks about the importance of listening distance in choosing between tower speakers and bookcase speakers, wondering if their bookcase speakers would be sufficient for a longer listening distance. They express concern about potential distortion at high SPL (sound pressure levels) and whether their two subs could provide enough power to mitigate this issue.

User tuga chimes in with several comments. Firstly, they suggest that the baffle design of floorstanders might affect vertical dispersion, potentially reducing perceived spaciousness, but note that this effect would likely be minimal. They also mention that visual cues, such as the physical presence of floorstanders, could influence one's perception of the sound.

In another comment, User tuga clarifies that the Revel Salon 2 is a rare 4-way speaker with dedicated low- and sub-bass sections, which allows it to produce full-range sound with quality.

Finally, User tuga shares their personal experience with reference speakers, including the B&W F801 and TAD Reference 1, stating that even these high-end speakers could benefit from the addition of a pair of subs or a "swarm" (a large number of smaller subwoofers).

Additionally, Grandeur suggests using a "swarm" of small subwoofers to enhance sound quality. User tuga also recommends considering room acoustics and speaker placement when setting up a home audio system.

Furthermore, the discussion touches on the importance of proper subwoofer setup and calibration, as well as the potential benefits of using multiple subs in a home audio system. Grandeur shares their experience with using two subs to create a more immersive listening experience.

The conversation also explores the topic of room acoustics and how it affects sound quality. User tuga emphasizes the importance of considering room dimensions, speaker placement, and acoustic treatment when setting up a home audio system.

Lastly, the discussion delves into the world of subwoofer setup and calibration, with Grandeur sharing their experience with using the "swarm" technique to enhance sound quality.

Here is the extended summary:

The conversation revolves around the importance of subwoofers in enhancing sound quality and the role they play in speaker systems. The discussion starts with an unusual recording experience using a dummy head mic, but quickly shifts to a debate about speakers and subwoofers. Users argue that even high-end speakers may not be able to produce meaningful output at low frequencies, and that adding a subwoofer can significantly enhance the overall sound quality.

The importance of listening distance is also discussed, with users wondering if bookcase speakers would be sufficient for longer listening distances and whether two subwoofers could mitigate potential distortion at high SPL. The baffle design of floorstander speakers and its effect on vertical dispersion are also mentioned, as well as the influence of visual cues on one's perception of sound.

Several users share their personal experiences with high-end speakers, including the Revel Salon 2, B&W F801, and TAD Reference 1, and believe that even these exceptional speakers could benefit from the addition of a pair of subwoofers or a "swarm" system. Additionally, users discuss the benefits of adding a subwoofer to their systems, citing examples such as the Revel Salon 2, which can produce full-range sound with quality.

The conversation also touches on the topic of audio equipment and speaker systems more broadly, with users sharing their experiences and opinions on various aspects of sound reproduction.
 

RandomEar

Senior Member
Joined
Feb 14, 2022
Messages
344
Likes
826
Very cool to see this progress! Some feedback on the current state:

Summary 1
Here is the extended summary:

The forum posts discuss the TP RA3 Rackmount Amplifier, a new product from Topping's sub-brand, which costs $229 and is available on Aliexpress. The amplifier has a remote control and SOTA resistor-based attenuation.
Clearly, the thread is not about the RA3. The price is also incorrect and seems to be taken from the PA3 review. The stuff about the remote and the resistors is equally unrelated/wrong.

User amirm shares their review of the amplifier, including measurements, and notes that it looks great with a large and highly visible volume level. Another user, Music1969, asks about channel imbalance, but amirm responds that they only have this version and don't measure channel imbalance for power amps as it's not typically an issue.
There is nothing about the volume level (display) in the review, just about a large volume knob. The "only have this version" is not related to the channel imbalance post, but the rest of the sentence is correct.

Paco De Lucia shares their positive experience with the amplifier, using it in a bi-amp setup without any problems. BeerBear suggests that an electronic/digital volume control would be nice to achieve perfect channel balance and avoid scratching.

Antcollinet asks about pairing the amplifier with a DAC that has remote control. Music1969 brings up the issue of channel imbalance again, referencing a previous review of the Topping PA3S, which showed channel balance issues at low volumes. They ask amirm to measure the channel imbalance of the RA3.
I don't think anybody asked amir to measure the channel imbalance of the RA3 in that thread. The rest looks broadly correct.

Howard416 identifies the amplifier's chip as the TPA3251. Talisman is cautious about voting for the product due to concerns about reliability, citing past issues with Topping products. Yoku-San suggests that if the amplifier proves reliable, it could be part of an interesting entry-level combination with other affordable components.
Looks broadly correct.

Additionally, there was a discussion about the slew rate of the amplifier, with Antcollinet mentioning that it's important for a power amp to have a high slew rate to handle dynamic music. He calculates the slew rate based on the chip's specifications and concludes that it should be able to handle most music without issues.
Somewhat mixed up, but still close.


Summary 2
Here is the extended summary:

The forum discussion revolves around two main topics: the ability of audio equipment to reproduce signals accurately and a specific amplifier product review.
I don't think that this is a fitting summary of those 45+ pages.

In the first part of the discussion, users debate about the speed at which an audio device can reproduce signals within the human audible range (20Hz to 20kHz). User Julf suggests calculating the slew rate by multiplying the full power output voltage with the upper frequency limit. However, User Maxon insists on knowing the specific parameter of slew rate. The discussion continues with corrections and clarifications on how to calculate the slew rate accurately.
The slew rate stuff is really only present on the last two pages, as far as I could see. In a three paragraph summary, it probably shouldn't be mentioned at all.

In the second part of the discussion, users review and discuss the TP RA3 Rackmount Amplifier, a new product from Topping's sub-brand. The amplifier has a remote control and SOTA resistor-based attenuation. Users share their positive experiences with the amplifier, including measurements and usage in bi-amp setups. Some users raise concerns about channel imbalance and reliability issues, referencing previous reviews of similar products. Others suggest potential pairings with DACs that have remote control and discuss the importance of perfect channel balance.

Overall, the discussion covers technical aspects of audio equipment performance and user experiences with a specific amplifier product.
There is a lot of discussion around the RA3 and what parts are similar or different, but the thread overall is still a review of the PA5 II and the summary fails to address this. It also doesn't mention amir's main review post.

---

I don't know what the difference is between the first and the second summary (different models or just different parameters?), but in this case, the first one is clearly superior. It does still contain a lot of falsely attributed information, but it's at least words/info chunks that were present somewhere in the thread and have simply been taken out of context by the model. It's way less "made up shit" than with the original, smaller models. I think this is going in the right direction.
 