From my one voluntary interaction with ChatGPT, I was impressed by how its programmers/data feeders/caretakers had integrated hard data into its repertoire (I asked it about yield strains for different grades of steel bolts), but also by how easily it confused itself about the meaning it had assigned to numbers/variables. Like, in one moment it would pick an arbitrary number as an example of a yield stress, just to illustrate the concept, and two sentences later it was mistaking that number for the actual yield stress in a computation. What impressed me again was that it correctly parsed my comment pointing out this mistake, redid its calculations, and spat out a corrected table of yield strains.
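For context, the calculation itself is trivial: yield strain is just yield stress divided by Young's modulus. A quick sketch of what that table roughly looks like is below; the yield stresses are the usual nominal values for metric bolt grades from memory, and E = 200 GPa is a typical round figure for steel, so treat the numbers as illustrative rather than a substitute for the actual spec sheet.

```python
# Illustrative yield-strain table for common metric bolt grades.
# Nominal yield stresses in MPa; E for steel taken as ~200 GPa.
# Values are approximate, not from a standards document.

E_STEEL_MPA = 200_000  # Young's modulus of steel, in MPa (~200 GPa)

nominal_yield_mpa = {
    "4.6": 240,    # ~0.6 * 400 MPa
    "8.8": 640,    # ~0.8 * 800 MPa
    "10.9": 900,   # ~0.9 * 1000 MPa
    "12.9": 1080,  # ~0.9 * 1200 MPa
}

for grade, sigma_y in nominal_yield_mpa.items():
    strain = sigma_y / E_STEEL_MPA  # yield strain = yield stress / E
    print(f"grade {grade}: {sigma_y} MPa -> yield strain {strain:.4f} ({strain * 100:.2f} %)")
```

The point is that the arithmetic is a one-liner; the failure mode was ChatGPT plugging its own made-up example number into it instead of the grade's actual yield stress.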
But unless a particular instance of this program has been configured to put a very high emphasis on not making things up, and a much lower emphasis on spitting out results just to keep the 'conversation' going, I don't see how you can use a tool like this without cross-checking every bit of output. From my interaction with one particular instance of one of these programs, I would fully expect it to make up correlations between, say, speaker sensitivity and predicted ranking if it 'got the impression' that was something you might like to hear.