I'm mostly an ML/NLP engineer, and I've been interviewing a lot lately, which means I've been doing a lot of NLP take-home tests, some more complex than others. I certainly prefer them over any LeetCode test, though some of these take-homes are so complex they verge on pro bono work, and I often feel I should be compensated.
One organization recently gave me a take-home with the following demands:
---
Here are 150k long-form (6000+ words) documents, and a list of labels.
Please use a recent transformer model as the vectorization/representation layer to train a multilabel classifier on this data set.
You can use Colab and its free GPU tier, but we won't pay for any GPU/TPU time.
Also, please compare this solution to other algorithms (linear SVM, XGBoost) and write 1000 words about the performance tradeoffs.
---
I'm not a deep learning expert, and I'd assumed that transformer models are basically limited to short-form text, with the exception of the Longformer and BigBird architectures, which I was under the impression are pretty memory-intensive. Other models only look at the first 512 tokens or so. And I'm not even sure Colab's free tier can handle this.
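To make that limitation concrete, here's roughly what "only the first 512 tokens" looks like with a standard Hugging Face setup, plus the usual chunking workaround. The model name, mean pooling, and stride are illustrative choices on my part, not anything the company specified:

    # Sketch of the 512-token ceiling and one chunking workaround, using
    # the Hugging Face transformers library. Model, pooling, and stride
    # are illustrative assumptions, not part of the take-home.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased")
    model.eval()

    def embed_truncated(text: str) -> torch.Tensor:
        # Default behavior: everything past max_length is silently dropped,
        # so a 6000-word document is represented by its first ~512 tokens.
        enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state      # (1, seq_len, hidden_dim)
        return hidden.mean(dim=1).squeeze(0)             # mean-pooled document vector

    def embed_chunked(text: str, stride: int = 128) -> torch.Tensor:
        # Common workaround: split the document into overlapping 512-token
        # windows, embed each window, and average the results.
        enc = tokenizer(text, truncation=True, max_length=512, stride=stride,
                        padding="max_length", return_overflowing_tokens=True,
                        return_tensors="pt")
        with torch.no_grad():
            hidden = model(input_ids=enc["input_ids"],
                           attention_mask=enc["attention_mask"]).last_hidden_state
        # Averages over chunks and tokens (padding included, for brevity).
        return hidden.mean(dim=(0, 1))

Even the chunked version means 150k documents times a dozen or so forward passes each, which is exactly the kind of compute bill I'm worried about on a free tier.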
Is this too much? Part of me is really excited to try it, but another part of me is already imagining the compute time and memory required to run this thing.
They want someone who can knock this class of problems out before lunch on a Tuesday when the boss asks at 10am. That means plugging the data into the tool they use all the time, adjusting a few parameters, and writing the 1000 words while waiting for the results… and that's possible because they know the tradeoffs they made as they made them.
If it looks like a hard and interesting challenge, that desired candidate isn't you. Nothing wrong with that.
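For a sense of scale, the "tool they use all the time" could be as small as a TF-IDF + linear SVM baseline like the sketch below. The data and label names are placeholders, and the real tool could just as easily be a fine-tuning script they already have lying around:

    # Toy version of a quick multilabel baseline: TF-IDF features into a
    # one-vs-rest linear SVM. `texts` and `labels` are placeholders, not
    # the actual 150k-document dataset.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.svm import LinearSVC

    texts = [
        "first long-form document about markets and regulation ...",
        "second long-form document about the championship game ...",
    ]
    labels = [["finance", "legal"], ["sports"]]

    # Turn the label lists into a binary indicator matrix for multilabel training.
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(labels)

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=100_000),
        OneVsRestClassifier(LinearSVC()),
    )
    clf.fit(texts, Y)

    # Predicted label sets for a new document.
    pred = clf.predict(["an unseen document about playoff scores ..."])
    print(mlb.inverse_transform(pred))

Note this baseline has no 512-token limit at all, which is presumably half the point of the requested write-up on tradeoffs.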