Amir, you're into photography, right?
If I asked you, "What is the correct shutter speed?", wouldn't you answer, "It depends"? (...on lighting conditions, subject motion, aperture, film/sensor speed/sensitivity...)
Switching time and sample time in a listening test depend on what you are testing and how you are testing it. Your blanket statement doesn't work, just as "I've taken tons of photos and 1/500 sec or shorter works best; anything longer and all my daylight race photos would be blurry" doesn't work for all photos.
Taking a picture is about creating art. The job of audio reproduction is to preserve art. They are different things.
In photography, people will take a single picture at 1/500 sec, examine each and every pixel, and determine lens and camera sensor quality from it. And from the point of view of examining a test chart, any shutter speed that doesn't cause blur would be fine (with the mirror locked up).
Now take that same 1/500 sec shot, put it in a video of fast-moving action, and ask people to tell the difference between two lenses of identical focal length (i.e. identical "look"), and they won't be able to ascertain the above differences. Indeed, lossy compression of video creates tons of poor-quality frames, but because they change rapidly, we can't tell that is happening. Of course we can freeze the video and then examine the fidelity -- something we can't do with audio.
Analogies aside, let's look at the task at hand. Say I want to tell the difference between 320 kbps lossy compression and the original. If we had a tool to objectively analyze the lossy file, we would see that the fidelity is all over the place. If I have pure, absolute silence, I don't need even a fraction of the bandwidth that CD provides to represent it. On the other hand, if I have tons of transients, throwing away 75% of the bits will cause large degradations (objectively).
Lossy compression works on a frame-by-frame basis, and the frames are measured in milliseconds. A problem, say a distorted transient, may be present in one of those frames but not in the others. By capturing that transient and listening to it over and over again, you take advantage of your much higher-fidelity short-term memory to hear the difference between it and the alternative.
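To make the "fidelity is all over the place" point concrete, here is a rough sketch of a per-frame error analysis. It assumes you have already decoded the lossy file back to PCM and sample-aligned it with the original; the array names, the simulated codec noise, and the 20 ms frame length are all illustrative, not any codec's actual framing:

```python
import numpy as np

def per_frame_rms_error(original, decoded, sample_rate=44100, frame_ms=20):
    """Split the aligned signals into short frames and measure the RMS
    difference in each one. Because lossy codecs work on frames of a few
    milliseconds, the error varies wildly from frame to frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = min(len(original), len(decoded)) // frame_len
    errors = []
    for i in range(n_frames):
        a = original[i * frame_len:(i + 1) * frame_len]
        b = decoded[i * frame_len:(i + 1) * frame_len]
        errors.append(np.sqrt(np.mean((a - b) ** 2)))
    return np.array(errors)

# One second of silence followed by one second of tone; the "decoded"
# version has simulated codec noise added only where there is signal.
rate = 44100
t = np.arange(rate) / rate
original = np.concatenate([np.zeros(rate), np.sin(2 * np.pi * 440 * t)])
decoded = original + np.concatenate([np.zeros(rate),
                                     0.01 * np.random.randn(rate)])
errors = per_frame_rms_error(original, decoded, rate)
print(errors[:5])   # near zero for the silent half
print(errors[-5:])  # clearly nonzero where the signal (and noise) is
```

The frames with the highest error are exactly the snippets worth extracting and looping in a listening test.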
In a larger context, "masking" is the enemy of hearing distortions. A lot of sins can be buried in the power of the music itself. So the trick to hearing those artifacts is to find components -- which may be awfully short -- that have sufficient silence before or after them, or that consist of maybe a single note, where the distortion becomes audible. A pluck of an acoustic guitar is my favorite example here.
Stepping back, we have to consider why we use listening tests instead of measurements. If I have equipment that truncates everything above 10 kHz, then an instrument telling us that is sufficient to know there is a problem. If, however, a transient is distorted ever so slightly, our classic measurements don't reveal that -- not easily, anyway. Hence the reason we don't use instrumentation for determining lossy compression artifacts: we use humans, and the weapon of choice there is a short switching time.
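For gross problems like that 10 kHz truncation, the "instrument" can be as simple as an FFT. A minimal sketch, using synthetic test tones at an assumed 44.1 kHz sample rate:

```python
import numpy as np

def band_energy_ratio(signal, sample_rate, cutoff_hz=10_000):
    """Fraction of total spectral energy above the cutoff. Gross
    bandwidth truncation shows up immediately as a near-zero ratio."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return spectrum[freqs >= cutoff_hz].sum() / spectrum.sum()

rate = 44100
t = np.arange(rate) / rate
# Equal-amplitude tones at 1 kHz and 15 kHz vs. the same signal with
# everything above 10 kHz removed.
full_band = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 15000 * t)
truncated = np.sin(2 * np.pi * 1000 * t)
ratio_full = band_energy_ratio(full_band, rate)
ratio_trunc = band_energy_ratio(truncated, rate)
print(ratio_full, ratio_trunc)  # roughly 0.5 and 0.0
```

No such simple number exists for a slightly smeared transient, which is the whole point of using trained listeners instead.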
Indeed, the fastest way to "lie" about the fidelity of lossy compression is to not give listeners that ability and simply ask them whether two files are different. You will get far more votes of "it is the same as CD" than if you provided that critical snippet and the ability to loop it.
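The looping itself is trivial; mechanically it amounts to nothing more than cutting the critical snippet out and tiling it back to back. A sketch, with a sine wave as a stand-in for a real track and hypothetical start/end times:

```python
import numpy as np

def loop_snippet(signal, sample_rate, start_s, end_s, repeats=8):
    """Cut the revealing moment (the guitar pluck, the transient) out of
    the track and repeat it, so short-term auditory memory can compare
    two versions on exactly that snippet."""
    snippet = signal[int(start_s * sample_rate):int(end_s * sample_rate)]
    return np.tile(snippet, repeats)

rate = 44100
track = np.sin(2 * np.pi * 440 * np.arange(rate * 10) / rate)  # 10 s stand-in
looped = loop_snippet(track, rate, start_s=2.0, end_s=2.5, repeats=8)
print(len(looped) / rate)  # 4.0 seconds of looped material
```

Any ABX tool worth using (foobar2000's ABX comparator, for instance) gives you this as a loop-range control plus instant switching.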
I say all this from training and from doing such work, and from having others in the industry practice exactly the same thing.
Take Harman speaker testing. Their switching time is about 4 seconds. I have to tell you, even though speakers have large sonic differences, 4 seconds was excruciatingly long. During that pause you keep trying to refresh your memory of what you heard and watch it fade away before the next speaker plays. As such, for hearing small differences like distortion, 4 seconds is completely unacceptable. It is OK, though, for hearing large differences in the timbre/overall sound of speakers, which is what Harman uses those tests for.