The multistable perception of speech refers to the perceptual changes experienced while listening to a speech form cycled in rapid and continuous repetition, the so-called Verbal Transformation Effect. Because distinct interpretations of the same repeated stimulus alternate spontaneously, this effect provides an invaluable tool for examining how speech percepts are formed in the listener's mind. To track the temporal dynamics of brain activity specifically linked to perceptual changes, intracerebral EEG activity was recorded from two epileptic patients with implanted electrodes while they performed a verbal transformation task. To this end, they were asked to listen carefully to a speech sequence played repeatedly and to press a button whenever they perceived a change in the repeated utterance. For both patients, high-frequency activity in the gamma band (>40 Hz) was observed within the left inferior frontal and supramarginal gyri 300-800 ms prior to the reported perceptual transitions. An additional auditory decision task was used to rule out the possibility that the increased gamma-band activity was due to the patients' motor responses. These results suggest that articulatory-based representations play a key part in the endogenously driven emergence of auditory speech percepts. The findings are interpreted in relation to theories assuming a link between perception and action in the human speech processing system.