The readme says "We are working only with properly licensed speech recordings and all the code is Open Source so the model will be always safe to use for commercial applications."
We are working hard to uphold all the licensing rules but nobody can absolve you from all legal risks.
A court ruling or new law could require special permission from the original author for any training, and then even a CC-BY license wouldn't cover this.
Laypeople value the aesthetics of statements like these. It's very Discord energy.
Everyone using learned weights from other models, especially ones released by OpenAI, Stability and Google, such as text and audio encoders, is tainted by training materials that were not expressly licensed for the purpose of AI model training or unlimitedly licensed for any use.
That's true but you make it sound like it's totally obvious where the line of fair use should be drawn for AI training.
Until courts or lawmakers make it clearer I personally believe non-generative models (Whisper, ResNet, DINOv2) should be legally trainable on publicly released data. Generative models (image or video generation, TTS, LLMs?) should be held to much higher scrutiny since their outputs can potentially compete with the creators who put a lot of creativity into their art. That's not true for an ImageNet-trained classification model or Whisper ASR.
I believe your use should be protected. This is not meant to be a takedown, but better you hear it from me, because you'll never hear it from Discord.
> "We are working only with properly licensed"
...versus:
> "fair use"
You're smart enough to know these are very different things - saying you believe you are protected by fair use, and claiming that the data is "properly licensed." Legally there is a colossal difference: you went from saying you were authorized to use something to saying you believe you are unauthorized but still permitted under fair use.
Yeah, thanks. I'd love to try to clarify this (I'll put this into our documentation as well ASAP) for anyone that may be reading this in the future:
Our model is not a derivative of Whisper but we do use the transcripts (and encoder outputs) in our data preprocessing. I am convinced this data cannot be regarded as a derivative of the (proprietary) Whisper training set. Whisper itself is open-source (MIT) and has no special TOS (unlike, say, ChatGPT).
WhisperSpeech itself is trained on mostly Public Domain and some CC-BY-SA recordings. We are working on adding more data and we will provide clearer documentation on all the licenses.
Is that less certain than the quote implies?