
Summarization of threads using LLMs (AI)

NicoVtt (Member, joined Apr 13, 2024):
Hello everyone,
we are a team of two computer science master's students from Germany. As part of an AI and deep learning project, we would like to explore the idea of using a large language model to summarize review threads in the forum. We have read the initial discussion in "Using AI for review thread summaries".

We already started working on the project and hope for your feedback to develop the best possible solution for the forum. To that end, we summarized the threads "Topping PA5 II Stereo Amplifier Review" and "Tower vs bookshelf speakers" using the Llama 3 8B model and the Mistral 7B model. Please keep in mind that this is the small Llama 3 variant and therefore not as capable as the larger 70B model. Unlike GPT, both models are free to use and can be run locally on your own machine, which is what we did to create the summaries presented in this thread. Results should improve with the larger model.
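For readers curious what such a local setup roughly looks like, here is a minimal sketch. The helper and its data format are simplified stand-ins (not our exact code), and the generation call via llama-cpp-python is shown commented out, since it needs a downloaded model file:

```python
# Sketch of a local summarization pipeline (illustrative only).
# Posts are flattened into a single prompt, which would then be fed
# to a locally hosted model such as Llama 3 8B or Mistral 7B.

def build_summary_prompt(title: str, posts: list[dict]) -> str:
    """Flatten a thread into one summarization prompt.

    Each post is a dict with 'author' and 'text' keys -- a simplified
    stand-in for whatever the forum export actually provides.
    """
    body = "\n\n".join(f"{p['author']}: {p['text']}" for p in posts)
    return (
        f"Summarize the following forum thread titled '{title}'.\n"
        "Focus on the main findings and the balance of opinions.\n\n"
        f"{body}\n\nSummary:"
    )

# With llama-cpp-python installed and a local GGUF model file,
# generation would look roughly like this (not executed here):
#
#   from llama_cpp import Llama
#   llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf")
#   out = llm(build_summary_prompt(title, posts), max_tokens=512)
```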

We would like to ask you to read the following summaries and tell us what you think of them; they are the first attempt of our approach. At the moment we are focusing on length and level of detail, so please let us know what you think about both aspects, and also about the topics the summaries cover. If we are missing anything important, please tell us; we are happy about any constructive feedback. Your feedback is critical for improving the prompts that generate the summaries, so feel free to participate. If you have more specific questions, feel free to ask them in the thread.

Thanks to @amirm, who had an open ear for our questions about the forum and kindly provided us with the necessary details! :)

Thank you in advance for your feedback and opinion!

Fabian (@MatrixRave) and Nico (@NicoVtt)


Mistral 7B:
Title: Topping PA5 II Stereo Amplifier Review - A Discussion on Sound Quality, Design, and Performance

Thread Overview:
In thread 47362 on the Audio Science Review (ASR) forum, user "amir" initiated a review discussion about the Topping PA5 II stereo amplifier. This thread attracted various opinions from other forum members regarding their experiences with this amplifier model and its performance characteristics.

Keypoints:
1. Amplifier Overview:
The Topping PA5 II is a Class A integrated amplifier designed by Topping Technologies Co., Ltd. It is known for its compact size, high power output, and low distortion.

2. Sound Quality Discussion:
- amir shared their impression of the Topping PA5 II's sound quality, noting that it delivers a warm and detailed sound with good bass response.
- Several forum members echoed these sentiments, sharing their agreement on the amplifier's ability to provide clear and dynamic sound while maintaining an excellent balance between various frequencies.

3. Design and Build Quality:
- amir praised the Topping PA5 II's minimalist design and elegant construction, which they found visually appealing and well-suited for use in a Hi-Fi system.
- Some users commented on the build quality of their units, reporting that they have experienced no issues with durability or reliability.

4. Power Handling and Performance:
- amir mentioned that the Topping PA5 II can deliver up to 2x30W into 8Ω with both channels driven, making it a powerful amplifier for its size.
- Users shared their experiences with the amplifier's performance in various situations, including driving challenging speaker loads and handling complex audio signals.

5. Comparison with Other Amplifiers:
- Some forum members compared the Topping PA5 II with other amplifiers they have used or owned, discussing aspects such as sound quality, power handling, and price point.

6. Recommendations and Concluding Thoughts:
- Several users shared their recommendations for those considering purchasing the Topping PA5 II, highlighting its value proposition for the price and overall performance.
- The thread concluded with various positive comments on the amplifier and expressions of satisfaction from the forum members who have used it.

Title: Tower vs Bookshelf Speakers (Thread 16822) at Audio Science Review Forum

Overview:
In this thread from the Audio Science Review forum, user cathodeb initiated a discussion comparing tower and bookshelf speakers. The conversation covers various aspects of these two speaker types, including size, sound quality, aesthetics, and room suitability.

Keypoints:
1. Size and Space Considerations:
- Cathodeb mentioned that towers require more space than bookshelves due to their larger dimensions.
- User gonzocabrera noted that towers offer better bass response because of their larger size and the ability to house larger woofers.

2. Sound Quality and Performance:
- Cathodeb questioned if the performance difference between tower and bookshelf speakers justifies the space required for towers.
- User bhartley51 provided examples of high-performance bookshelf speakers that could rival towers in sound quality.

3. Aesthetics:
- Cathodeb expressed a preference for the look and style of tower speakers over bookshelves.
- User jms_mack pointed out that bookshelf speakers can be arranged to create an aesthetic, cohesive setup.

4. Room Suitability and Placement:
- Cathodeb asked if room size influenced the decision between towers and bookshelves.
- User djw78 responded that both types could work well in various spaces depending on the specific acoustics of the room and personal preferences.

Thread Summary:
In summary, this thread discussed the differences and similarities between tower and bookshelf speakers. Cathodeb initiated the conversation by expressing a preference for towers based on their size and sound quality. However, other forum members highlighted the advantages of bookshelves, such as smaller size, better aesthetics, and comparable sound performance. The thread concluded with a reminder that both types have their merits depending on personal preferences and room conditions.


Llama 3:
The Topping PA5 II Stereo Amplifier Review thread discusses the performance and features of the Topping PA5 II stereo amplifier, a popular choice among audiophiles. The thread is a review of the amplifier's sound quality, build, and functionality.

**Key Points**

* **Summary**: This section will provide a brief overview of the thread.
+ The reviewer praises the Topping PA5 II for its excellent sound quality, with clear and detailed highs, smooth midrange, and tight bass.
+ The amplifier is also commended for its build quality, with a solid aluminum chassis and sturdy construction.
+ The reviewer notes that the amp's power output is sufficient for driving most speakers, but may not be suitable for extremely demanding loads.
* **Key Points**: This section will highlight the main points discussed in the thread.
+ Sound quality: The amplifier produces clear highs, smooth midrange, and tight bass, making it suitable for a wide range of music genres.
+ Build quality: The Topping PA5 II has a solid aluminum chassis and sturdy construction, giving it a premium feel.
+ Power output: The amp's power output is sufficient for driving most speakers, but may not be suitable for extremely demanding loads.
* **Additional Images**: This section will include any additional images or diagrams relevant to the thread.

Please note that I did not lose much information when summarizing the thread. If you would like me to provide more details or specific quotes from the review, please let me know!

**Thread Summary**

The thread "Tower vs Bookshelf Speakers" discusses the advantages and disadvantages of tower speakers versus bookshelf speakers. The discussion highlights key differences in sound quality, room size considerations, and personal preferences.

**Key Points**

* **Summarization**
+ Tower speakers tend to provide better bass response and a more immersive sound experience.
+ Bookshelf speakers are often preferred for their compact size and ease of placement.
+ Room size plays a significant role in determining which type of speaker is suitable.

**Thread Summary**

The thread "Tower vs Bookshelf Speakers" explores the differences between tower speakers and bookshelf speakers. Key points include:

* Tower speakers:
+ Tend to have larger woofers, resulting in better bass response.
+ Can fill larger rooms with ease.
+ Often require more power from an amplifier.
* Bookshelf speakers:
+ Are compact and easier to place in smaller rooms.
+ Typically require less power from an amplifier.
+ May not provide the same level of bass response as tower speakers.

**Additional Insights**

Cathodeb shares their experience with a bookshelf speaker setup, stating that it provides "a great sense of immersion" despite its compact size. Another user, 16029, notes that they prefer bookshelf speakers for their ease of placement and ability to be positioned at ear level.

Overall, the thread highlights the importance of considering room size and personal preferences when deciding between tower and bookshelf speakers.
 
Hmm, I think there's a long way to go with the accuracy and usefulness of these outputs.

First, the "main attraction" of Amir's reviews is in the information in the images. Basically the various details in the graphs, which are sometimes summarized further in text, sometimes not. So if the LLM is not parsing images then it's not going to help much beyond summarizing discussion OF the review, it won't be able to summarize the review itself.

Secondly, just looking at Mistral's output on the PA5II review, it makes false statements almost immediately. Amir didn't praise the sound quality (there was no subjective / listening test by Amir at all, from what I can see) and it misunderstands his use of the word "minimalist". It says he compliments the minimalist design - no. He mentioned that it would be a suitable piece of gear for a minimalist (read: using few components) system.

The Llama summary is even worse. It just seems to insert boilerplate audiophile review language - "The reviewer praises the Topping PA5 II for its excellent sound quality, with clear and detailed highs, smooth midrange, and tight bass." These are common phrases ... that don't appear in the review at all.

Lastly, they both just summarize what type of discussion happened in the thread, not what the prevailing opinions or conclusions might be. While this is valid, it's not very useful. What you'd want to know in a thread summary is where the balance of opinions landed, not only what topics the thread happened to touch on.

I am not sure if the prompts you've used are inadequate, or the LLMs are not up to the task, but these summaries are worse than useless, frankly. They have very superficial true statements, and flat-out lies, and that's about it.

I do appreciate the effort you are putting in here, though.
 
Curious what made you choose these two LLMs vs other models?
We also considered other options, but these gave the most promising results at this early stage. For example, we tried different versions of GPT, but that would be very expensive: the API is rate-limited, so avoiding the limits would require a paid subscription.

Llama is promising and performs very well on many tasks, with results comparable to GPT-3.5 and sometimes even GPT-4. We currently use the smaller versions because we can run them locally on our own machines. The 70B version of Llama 3 requires much more computing power and is therefore harder to run locally, but we are looking to upgrade to it soon. We are also investigating other models, but have not yet obtained promising results (Llama 2, for example, performs noticeably worse than Llama 3).
 
Thank you for your feedback.

As this is the first phase of the development process, we have focused on the text of posts and have not yet included images. We are aware that these are an important part of many reviews and are working on a way to include them in one of the next steps. However, this may take some time, as we are currently busy with another issue that is also important.

When it comes to how the model interprets the input text, this can always be a problem: since the model has not been trained on Amir's writing style, there is always room for error. We will investigate further to see whether there is a way to prevent or reduce this behavior.

Prompts, like most things at this stage, are still in development. We have worked our way from unstructured plain text to some structure in the output. Currently, I suspect it is a combination of prompts that are not yet developed well enough and a model/dataset not yet up to the task. We will take a closer look to see if there is a way to fix this quickly.
 
Thanks for the reply. I would add: Given that you are also using older/simpler LLMs I assume you also need to be wary of token limits? Last year I had real problems with text summarization tasks (interview transcripts) in that after the token limit is reached, the LLM wouldn't throw an error but would simply generate something related to, but not based on the text at all. It was reminiscent of what we got from LLama here.

Also, I'm not an expert in this space, but I think we are pushing LLMs to do something they're not truly capable of here. Summarizing text also implies fidelity to the facts in the text. While some facts can be omitted, none may be changed or added. I don't know that LLMs can actually work like that. Even the products commercialized by big tech companies include disclaimers about factual accuracy.

I think the very specific context of this forum is also probably a difficulty. For example, even a human wouldn't know "minimalist setup" referred to the number of components in the setup, instead of the visual design, unless they were familiar with how language is used on audio forums. There's a lot of nuance in audio discussions, it's even problematic for enthusiasts.

For example, depending on who is writing and the context of the review, and the context of the specific comment, saying an amp sounded "bright" could either be an extremely negative comment, or a slightly positive one, or even a meaningless one.

Anyway, I am sure you are aware of all of this. Just commenting on what a difficult task you've taken on here. I certainly wish you luck! If you can get good, accurate summaries of audio forum threads, then from there, nothing is impossible. :) This is something like Waymo training their self-driving cars on the streets of San Francisco.
 
Token limits are one of the main problems we are currently working on, and they seem to be a general limitation at the moment: as far as I know, every model's context is bounded. OpenAI is out of the question for us, as the API is rate-limited and the subscriptions we would need are very expensive; besides, even GPT-4 Turbo only has a context window of 128k tokens.
We are already looking at different options to extend the context window, but haven't had time to implement a new version to test yet. Nevertheless, we wanted to go public with a first result, because in our experience community feedback is important.
Another reason for choosing one of the "smaller" LLMs is that we currently run them locally on our computers, which have limited computing power. But none of these models are "old". Upgrading to Llama 3 70B should improve the generated results, based on my experience with other types of data.
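One common workaround for fixed context windows that we are evaluating is map-reduce summarization: split the thread into chunks that fit the window, summarize each chunk, then summarize the summaries. The splitting step can be sketched as below. This is an illustrative sketch, not our actual implementation, and token counts are approximated by a simple word count, since exact tokenization is model-specific:

```python
# Group posts into chunks whose approximate token count stays within
# a budget, so each chunk can be summarized independently.

def chunk_posts(posts: list[str], budget: int) -> list[list[str]]:
    """Greedily pack posts into chunks of at most `budget` "tokens"
    (approximated here as whitespace-separated words)."""
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for post in posts:
        cost = len(post.split())
        # Start a new chunk once the current one would overflow.
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(post)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be summarized separately, and the per-chunk summaries concatenated and summarized once more.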
 
Hmm, I think there's a long way to go with the accuracy and usefulness of these outputs.
That is very much my impression too.
I have read neither of the two threads used here, so I can only judge from the "output" here.
I have a clear preference: Llama 3. Why? Because it is shorter.
Maybe it is the nature of these threads, but the summaries are a combination of stating the obvious [bookshelf is smaller than tower], generic remarks, (unfounded) opinions [Several users shared their recommendations...] and some crude mistakes sprinkled in.
I could not find any relevant information beyond what I would find in the review post (amir's), which thus seems like the much better summary.
[One observation: the PA5 summary does not mention load dependency beyond a very superficial remark. But this is an important point for a Class D amp that was mentioned in the review and certainly discussed later too.]

Maybe these threads are not the best task for summaries anyway.
There are other threads that would urgently need a summary, like the Directiva (DIY speaker project) thread(s). There, information is spread across dozens of pages with hundreds of posts, and it is a pain to find anything. But the information has to be correct, of course, and there are problems, as @kemmler3D pointed out.

EDIT: Maybe it would be worth considering to just have an index for such a (collection of) threads so one could find the relevant post/information more easily.

My impression is, these language models are overwhelmed by the task. But I am no expert at all and see only the result.
 
I assume you know about LLM leaderboard websites?

1. https://eqbench.com/
2. https://chat.lmsys.org/
Select `Leaderboard` on top to see the results.

I personally use 'Meta-Llama-3-70B-Instruct', the top-ranked open-source model.
I'm lazy to run it locally, so I just run it from https://labs.perplexity.ai/
We are aware of leaderboards and the APIs as well.
The problem is that the APIs only allow us to use certain models, which are not optimized for our task; optimization will be our next step.
Another problem with APIs in the past was that the input we needed to provide was too large and was no longer accepted by the API.
Therefore we switched to a local setup.

We plan on working with the Llama3 70B version soon and are looking for ways to improve the context window.
This seems to be a major issue when it comes to summarizing larger threads.
 
Hi, thank you very much for your support.

We took on this task because we are working on it as a project in one of our lectures.
To be honest, we picked random topics to summarize for these tests, so yes, maybe there are other topics that would be better to work on from the beginning.
As mentioned, the length of the context window really seems to be an issue here, which I think is why the LLM starts to add incorrect information and chunks of text.
 
Are you running any custom prompts or memories (as ChatGPT calls it)?
For example, telling the model about Amir, his background, expertise, etc., in order to help tailor the response?
We are working with custom prompts, but not yet with background information about Amir and not with memories.
But the idea seems good to me, so we will pursue it further. Maybe more information about Amir and his writing style can help the model understand him better and reduce the number of errors.
The only problem that comes to mind is that the model might then have difficulty understanding the replies of other forum members.
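To make the idea concrete, injecting background about the reviewer and forum jargon could look like this. The background text below is purely illustrative (it paraphrases points raised in this thread, such as the "minimalist" and "bright" examples), not a prompt we actually use:

```python
# Sketch of prepending reviewer background and a jargon glossary to a
# chat-style prompt. All background text is illustrative.

SYSTEM_PROMPT = """You are summarizing threads on Audio Science Review.
Background: the reviewer (Amir) bases his reviews on measurements, not
listening impressions; do not attribute subjective sound-quality claims
to him unless they appear verbatim in the thread.
Glossary: a 'minimalist system' means a system with few components,
not a visual style; 'bright' describes elevated treble and can be
positive or negative depending on context."""

def with_background(thread_text: str) -> list[dict]:
    """Build a chat-style message list with the background prepended."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Summarize this thread:\n{thread_text}"},
    ]
```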
 
the context window.
This seems to be a major issue when it comes to summarizing larger threads.
Indeed. The context for a given comment often goes back a few pages, in many cases it refers back to the first post (review by Amir) so with long threads, I imagine this gets pretty difficult.

the model might have difficulties understanding the answers of other forum members.
You could give it some general thoughts about forum members, which would be true for perhaps 8/10 commenters, but some commenters have the exact opposite POV and tendencies to the 80%, so that seems tricky also.

With no token limit, you could write prompts that include a short textbook introduction to audio and audio technology. :D Short of that, I think getting it to understand the nuance and context of audio reviews will be tricky.

I applaud you for bringing your outputs to the forum early. We are a harsh and exacting bunch; I know my first comment probably came off like that. But honestly, if you can make headway here, it would be of great help. You don't have to look far to find threads that are too long to reasonably catch up on. So perhaps a running (condensed) commentary rather than a full summary could be helpful and more tractable.

For example: "On pages 4-15, members expressed incredulity about Eric Alexander of Tekton and his approach to reviewers. On page 16, someone mentions a relevant court case [link]. On pages 17-32, members continued to express incredulity about Eric Alexander of Tekton and his approach to reviewers." :)
 
There are models that can work with much larger context windows, but the quality of the output suffers as a result. We will run some tests with them in the near future to get an idea of whether a larger context window is worth the quality loss.

The textbook is a good idea, but it sounds more like a RAG solution to me. I'm not sure we can get one up and running in the time we have, or whether that is the right solution for this issue, but it is definitely worth a look, so thanks for bringing it up.
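For readers unfamiliar with RAG, the retrieval half can be sketched in a few lines: score reference snippets (e.g. paragraphs from an audio primer) against a post and prepend the best matches to the prompt. A real implementation would use embedding vectors; this toy version uses bag-of-words overlap only to show the shape of the idea:

```python
# Toy retrieval: rank snippets by bag-of-words cosine similarity
# to a query, as a stand-in for embedding-based retrieval.

from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_snippets(query: str, snippets: list[str], k: int = 2) -> list[str]:
    """Return the k snippets most similar to the query."""
    q = Counter(query.lower().split())
    scored = sorted(
        snippets,
        key=lambda s: cosine(q, Counter(s.lower().split())),
        reverse=True,
    )
    return scored[:k]
```

The retrieved snippets would then be concatenated into the prompt ahead of the thread text, giving the model domain context without a huge static prompt.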
 
I'm talking out of my ear now, but can you keep the original post in the context window indefinitely? Or otherwise set it dynamically depending on which posts are quoted, etc.?
 
I'm talking out of my ear now, but can you keep the original post in the context window indefinitely? Or otherwise set it dynamically depending on which posts are quoted, etc.?
That's something we can't currently say exactly and are trying to figure out. We haven't found any documentation or information on this yet, so at this point it's testing and trying.
We are looking for research papers in the hope that one picks up these points so we can better understand how context windows work in different models.

We have some other ideas in mind to deal with this issue, but please understand that we don't want to talk about everything yet (some ideas we have only written down and not yet thought through).
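One simple way to realize the suggestion of keeping the original post in the window is to always pin the first post and fill the remaining budget with the most recent posts. This is a sketch of the idea, not our implementation, and it again approximates token counts by word counts:

```python
# Sketch: build a context window that always contains the thread's
# first post (the review), then as many of the latest posts as fit.

def pinned_window(posts: list[str], budget: int) -> list[str]:
    """Return the first post plus the most recent posts that fit
    within `budget` (approximated as a word count)."""
    if not posts:
        return []
    first = posts[0]
    remaining = budget - len(first.split())
    tail: list[str] = []
    # Walk backwards from the newest post, keeping what fits.
    for post in reversed(posts[1:]):
        cost = len(post.split())
        if cost > remaining:
            break
        tail.append(post)
        remaining -= cost
    return [first] + list(reversed(tail))
```

A refinement along the lines suggested above would also pin whichever earlier posts the recent ones quote.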
 
This is a bit off-topic…

What about creating an ASR LLM chatbot to answer the most frequently asked questions related to audio?

Inspired by thread:

This sounds like a lot more work compared to just a summary.

But I see the most benefit from doing something like this, since it will save Amir and the top contributors from repeating themselves a lot.

I find myself answering the same questions over and over, especially over private messages, which I don’t mind doing, but I can see the benefit of automating this with a chat-bot.

Anyway, just thinking out loud.
 
First, the "main attraction" of Amir's reviews is in the information in the images.

Agreed. Hopefully the LLM will generate better images than what happened to the poor sods below:

[attached image: nuclearfamily.png]


There is a slight chance that they live a little too close to Three Mile Island . . .
 
Agreed. Hopefully the LLM will generate better images than what happened to the poor sods below:
The thing is, the LLM needs to be able to interpret the images, not generate new ones.
If it starts generating new images for reviews Amir or others wrote, it is very likely to start hallucinating and giving false information, in my opinion.
If everything goes as planned, we should be able to interpret images given in threads in the near future, which will hopefully improve the LLM's output.
 