Here's a (kinda) ELI5: you would use a language model to create "embeddings" of the text, which you can think of as lists of numbers representing the "meaning" of a piece of text.
These numbers can be plotted as points in a space, and embeddings of things with similar meanings end up close to each other. So the embedding of "exam preparation" would sit near the embedding of "top study tips".
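To make that concrete, here's a rough sketch in Python using the sentence-transformers library (the model name is just one common choice, not a recommendation):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed two related phrases and one unrelated phrase.
vecs = model.encode(["exam preparation", "top study tips", "banana bread recipe"])

# Cosine similarity: higher means closer together in embedding space.
print(util.cos_sim(vecs[0], vecs[1]))  # related pair -> relatively high score
print(util.cos_sim(vecs[0], vecs[2]))  # unrelated pair -> lower score
```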
Say you have created embeddings for a large corpus of text (in this case, all YouTube captions) once. If you then create an embedding for a user query, you can search for corpus embeddings close to it, and the matches will be "semantically" similar to the query.
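The search part might look something like this (again just a sketch with made-up caption snippets; a real system would use a vector index like FAISS instead of a brute-force dot product over the whole corpus):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Done once, offline: embed the whole corpus.
corpus = [
    "how to prepare for your final exams",
    "baking sourdough at home",
    "study techniques that actually work",
]
corpus_vecs = model.encode(corpus, normalize_embeddings=True)

# At query time: embed the query and rank the corpus by similarity.
# With normalized vectors, cosine similarity is just a dot product.
query_vec = model.encode("exam preparation", normalize_embeddings=True)
scores = corpus_vecs @ query_vec
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {corpus[i]}")
```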
The advantage over traditional full-text search is that the user's query doesn't have to contain any of the words that actually appear in the text.
Yes, in theory, although they are pretty expensive. I am doing something like this at work, as I wanted to unlock the wealth of information we have in our tutorials, webinars, etc.