So I've been hearing a lot that Deepseek's open source AI uses reinforcement learning.
I've been watching some short videos to help explain it, but I'm struggling to understand.
So I understand it's a trial-and-error method with a reward system - but how does it know it has reached the correct answer?
How do we know, and how does it know, that it hasn't given us gibberish as an answer?
Apparently it doesn't use human feedback in its training, since that's costly and inefficient.
Seems a bit mysterious to me.
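For what it's worth, here's my rough guess at what replaces human feedback - a "verifiable reward" where the answer can be checked automatically. The questions, numbers, and function names below are all made up just to illustrate the idea, and I have no idea if this is actually how Deepseek does it:

```python
# Toy sketch of a "verifiable reward": no human grades each answer, because
# for math/code problems the final answer can be checked automatically.
# Everything here is my guess at the general idea, not Deepseek's actual code.

def reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the known solution, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

# Pretend training data: questions whose correct answers are already known.
dataset = [
    {"question": "What is 12 * 7?", "answer": "84"},
    {"question": "What is 15 + 27?", "answer": "42"},
]

def fake_model(question: str) -> str:
    """Stand-in for the language model; a real one would generate an answer as text."""
    return "84"  # sometimes right, sometimes gibberish

# The trial-and-error loop: sample answers, score them with the automatic checker,
# and (in real training) nudge the model toward outputs that earned higher reward.
for example in dataset:
    attempt = fake_model(example["question"])
    r = reward(attempt, example["answer"])
    print(example["question"], "->", attempt, "reward:", r)
```

If that's roughly right, then "gibberish" just scores zero reward and gets trained away - but I'd love someone to confirm or correct this.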
Can anyone dumb it down? Specifically for Deepseek.
This guy says he reverse-engineered it, so I assume the answer is in here and I just didn't understand the key part.