Echoic memory is also very useful when someone says something and you don't understand them right away. There is also iconic memory, which stores what you see and lasts less than a second.
I have echoic memory for audio, and that's indeed very useful.
But I don't have 'playback' memory for anything else, definitely not for visuals or touch. So if something is said around me and I wasn't listening, I can replay the last couple of seconds or so, and that's usually enough. It helps with languages you're not fluent in, too. But if I suddenly notice that I'm touching something I shouldn't, say with my arm in a crowded pub, there's just no way I can 'replay' history, not even the last few moments, to figure out how that happened, unless I was actually paying attention when it happened. Same with visuals: if I didn't recognize what passed before my eyes, there's no way to replay it and take a better look. With audio, I can do precisely that.