The historic picture makes a little more sense (though this is not something a 5yo would understand).
We call these things embeddings because you start with a very high dimensional space (imagine a space with one dimension per word type, where each word is a unit vector along its own dimension) and then approximate distances between sentences / documents / n-grams in that space using a space with much smaller dimensionality. So we "embed" the high dimensional space in a manifold in the lower dimensional space.
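To make that concrete, here's a minimal sketch in plain NumPy (toy vocabulary, made-up numbers) of what "one dimension per word type" vs. a low-dimensional embedding looks like:

```python
import numpy as np

# Toy vocabulary: one dimension per word type in the "big" space.
vocab = ["dog", "canine", "banana"]
V, d = len(vocab), 2                  # real models: V ~ 10^5+, d ~ 100-300

one_hot_dog = np.eye(V)[vocab.index("dog")]   # unit vector in the V-dimensional space

# The embedding is just a V x d matrix: multiplying a one-hot vector by it
# picks out one row, i.e. that word's coordinates in the small space.
# (Values here are invented for illustration; a real model learns them from data.)
E = np.array([
    [0.9, 0.1],    # "dog"
    [0.85, 0.15],  # "canine"
    [0.1, 0.9],    # "banana"
])

dog_vec = one_hot_dog @ E             # == E[0], a point in the low-dimensional space
print(dog_vec)                        # [0.9 0.1]
```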
It turns out, though, that these low dimensional representations satisfy all sorts of properties that we like, which is why embeddings are so popular.
A word embedding transforms a word into a series of numbers, with the property that similar words (e.g. "dog" and "canine") produce similar numbers.
You can have embeddings for other things, such as pictures, where you would want the property that e.g. two pictures of dogs produce more similar numbers than a picture of a dog and a picture of a cat.
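"Similar numbers" usually means a high cosine similarity between the vectors. Reusing the invented 2-D vectors from the sketch above:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: near 1.0 means 'pointing the same way', near 0 means unrelated."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Made-up vectors, continuing the toy example above.
dog    = np.array([0.9, 0.1])
canine = np.array([0.85, 0.15])
banana = np.array([0.1, 0.9])

print(cosine(dog, canine))   # ~1.0  -- "dog" and "canine" produce similar numbers
print(cosine(dog, banana))   # ~0.22 -- unrelated words land far apart
```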
It is indeed a vector space. You don't really choose a basis; an ML tool like word2vec [1] does that for you. And like most advanced applications of ML, exactly how it works is a mystery.
> The reasons for successful word embedding learning in the word2vec framework are poorly understood. Goldberg and Levy point out that the word2vec objective function causes words that occur in similar contexts to have similar embeddings (as measured by cosine similarity) and note that this is in line with J. R. Firth's distributional hypothesis. However, they note that this explanation is "very hand-wavy" and argue that a more formal explanation would be preferable.
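If you want to poke at this yourself, something like the sketch below trains word vectors on a toy corpus. It uses the gensim library (my choice here, not something the above implies; word2vec itself is just the algorithm), and the corpus is far too small to give meaningful similarities, so treat it purely as showing the shape of the API:

```python
# pip install gensim  -- one common implementation of word2vec, used here as an example
from gensim.models import Word2Vec

# Tiny toy corpus: lists of tokens. Real embeddings need millions of sentences,
# so don't expect the numbers below to be meaningful.
corpus = [
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "canine", "barked", "at", "the", "mailman"],
    ["i", "ate", "a", "banana", "for", "breakfast"],
]

model = Word2Vec(sentences=corpus, vector_size=25, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["dog"])                       # the word's embedding: a plain vector of 25 floats
print(model.wv.similarity("dog", "canine"))  # cosine similarity between two words
print(model.wv.most_similar("dog", topn=3))  # nearest neighbours in the embedded space
```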