
Interesting concept, but how will it work with more dynamic content? You can train the model on a fairly static corpus such as Wikipedia, but what if your content changes more frequently?

Since MapReduce is used, perhaps the model is already being trained in small batches, making incremental updates possible.
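As a very rough sketch of what incremental updates could look like, here's a hypothetical loop using gensim's Word2Vec API (gensim is purely my assumption; nothing here suggests the project actually uses it):

    # Hypothetical sketch: incremental vocabulary/vector updates with gensim.
    # gensim is an assumption, not the project's actual stack.
    from gensim.models import Word2Vec

    initial_corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]
    model = Word2Vec(sentences=initial_corpus, vector_size=100, min_count=1)

    # Later, when a fresh batch of documents arrives:
    new_docs = [["a", "newly", "published", "article"]]
    model.build_vocab(new_docs, update=True)  # grow the vocabulary in place
    model.train(new_docs, total_examples=len(new_docs), epochs=model.epochs)

Whether vectors trained this way stay comparable to the originals is a separate question; the older embeddings are only nudged where the new text (and its negative samples) touches them.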




Hi, creator here (Chris Moody). Great question. The underlying algorithm, word2vec (https://code.google.com/p/word2vec/), isn't built for streaming data, which means that at the moment it assumes a fixed vocabulary from the very start of the calculation. Unfortunately, until the state of the art advances to accept streaming data, the whole corpus will have to be rescanned to pick up dynamic content. Furthermore, word2vec doesn't scale beyond OpenMP on a single, shared-memory node. So while I did use MapReduce, it was only for cleaning and preprocessing the text, not for training the vectors, which is the hard part.
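To make the fixed-vocabulary constraint concrete, here's a toy numpy sketch of one skip-gram negative-sampling step (my own illustration, not the word2vec C code): both weight matrices are allocated with one row per known word before training starts, so a word that arrives later simply has no slot.

    # Toy illustration of why word2vec needs the vocabulary up front.
    import numpy as np

    vocab = {"the": 0, "cat": 1, "sat": 2}    # frozen before training
    V, D = len(vocab), 100
    rng = np.random.default_rng(0)
    W_in = (rng.random((V, D)) - 0.5) / D     # input embeddings, one row/word
    W_out = np.zeros((V, D))                  # output embeddings, one row/word

    def sgns_update(center, context, negatives, lr=0.025):
        """One skip-gram negative-sampling step; an unseen word has no row."""
        grad_in = np.zeros(D)
        for target, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            score = 1.0 / (1.0 + np.exp(-W_in[center] @ W_out[target]))
            g = lr * (label - score)
            grad_in += g * W_out[target]      # accumulate before updating
            W_out[target] += g * W_in[center]
        W_in[center] += grad_in

    sgns_update(vocab["cat"], vocab["sat"], negatives=[vocab["the"]])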

So there's some exciting work to be done in parallelizing and streaming the word2vec algorithm!
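On the parallelization side: within one node, the reference implementation's threads update the shared weight matrices without locking (Hogwild-style asynchronous SGD). A toy Python sketch of that structure, for illustration only, since Python threads won't give real speedups the way C threads do:

    # Toy illustration of Hogwild-style lock-free parallel updates.
    # The gradient step is a placeholder, not word2vec's actual update.
    import threading
    import numpy as np

    V, D = 1000, 100
    W = np.random.default_rng(0).standard_normal((V, D)) * 0.01

    def worker(pairs, lr=0.025):
        # Each thread writes to the shared matrix with no locks; occasional
        # stale reads are tolerated because the updates are sparse.
        for i, j in pairs:
            W[i] += lr * (W[j] - W[i])        # placeholder gradient step

    rng = np.random.default_rng(1)
    shards = [list(zip(rng.integers(0, V, 500), rng.integers(0, V, 500)))
              for _ in range(4)]
    threads = [threading.Thread(target=worker, args=(s,)) for s in shards]
    for t in threads: t.start()
    for t in threads: t.join()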



