
AHhhhhhhhhhhhhhhhhhhhhhhhhhhhh. That's me screaming. I constantly wonder why this stuff does not exist in LLMs. But my technical depth and competence is quite low. Way lower than the people implementing the models and samplers. So I just assume: there must be a good reason, right. Right?

But recently I threw just a bit of similar-ish stuff, like you describe there, into a TTS model, barely knowing anything, and yeah, it totally works and is fun and cool. The stuff that doesn't work fails in interesting and strange ways, so it almost STILL works. (Well, it gives people really bizarre speech impediments, at least...)

I was just working on prompt editing, actually. Which is weird to imagine in a TTS model. It makes sense for the future tokens of course, for words the model has not said yet. But I think it even makes sense for the past, right? You can rewrite the past context, and it still changes the model's future audio output. In Bark it's two different things: one is the text prompt, and one is the generated audio tokens/context, which is not the same. (The text and the past audio are concatted in the Bark prompt, so this idea makes sense in Bark but maybe not in other models. You could change either the text OR 'what was generated from the text' independently.)

As long as you don't rewrite the region touching the most recent token, at 0 seconds back - if it's a segment 2 to 4 seconds in the past, it should influence future output without causing a discontinuity in the audio. I think?

BTW an easy and fun thing - just let generation parameters be dependent variables. Of anything.

A trivial example: why is temperature just a number, why not a function? Like, the temp varies according to how far along in the prompt you are. For music, just that is already a fun tool. Now as a music segment starts or ends, the style transitions. Or: spike the temperature at regular intervals - use a sine wave for temp, with the current token position as input. You can probably imagine that works great in a music model.

Even in a TTS model this gets you weird and diverse speech patterns.
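To make the idea concrete, here's a minimal sketch of temperature-as-a-function. The function name, the constants, and the period are all made up for illustration, not from any real model's API:

```python
import math

def temperature_schedule(pos, base=0.7, amp=0.3, period=48):
    """Sinusoidal temperature: oscillates around `base` with amplitude
    `amp`, completing one full cycle every `period` tokens.
    `pos` is the current token position in the generation."""
    return base + amp * math.sin(2 * math.pi * pos / period)

# In a typical sampling loop you'd use it roughly like:
#   logits = model(tokens)
#   t = temperature_schedule(len(tokens))
#   probs = softmax(logits / t)
```

Swap the sine for a ramp, a step function, or anything else keyed to position, and you get smooth style transitions or periodic "wild" bursts for free.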

The thing is: I have a really very low level of competence. Total monkey hitting keys and googling, and even I can make it work, easily. Sampling is just a loop, okay, so: what if I copy logits from sample A and subtract them from sample B? What if I take the last generation, save the tokens, and ban them in the next? Really, just do anything and you end up in interesting places in the model you didn't know existed, and they're often cool. (Recently: TTS output with overlapping speech, for example.)
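The "ban the last generation's tokens" trick really is just a few lines in the sampling loop. A sketch, with a hypothetical helper name (the `penalty` default of infinity is a hard ban; a smaller finite value merely discourages repeats):

```python
def ban_previous_tokens(logits, previous_tokens, penalty=float("inf")):
    """Push down the logit of every token id that appeared in the
    previous generation, forcing the next run somewhere else."""
    out = list(logits)
    for tok in set(previous_tokens):
        out[tok] -= penalty  # inf => hard ban
    return out

# Each sampling step: logits = ban_previous_tokens(logits, last_run_tokens)
```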

Like, I recently generated French accents from any voice in the Bark TTS model, with no fine-tuning, no training, actually not even really any AI. Just by counting token frequencies in the French voices, and having the sampler loop go, "Okay, let's bump those logits up a bit, and the others down," and it just somehow works. No LoRAs, no fine-tuning, no stats, it's like middle-school-level math, but it sounded great.
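The counting trick could look something like this. Everything here is an assumption-laden sketch: the function name, the `strength` constant, and the small downward nudge are invented for illustration, and the real values were apparently tuned by ear:

```python
from collections import Counter

def accent_bias(logits, accent_tokens, strength=2.0, down=0.1):
    """Bump up the logits of token ids that occur often in a reference
    corpus (e.g. audio tokens sampled from French voices), proportional
    to their relative frequency, and nudge everything else down.
    Pure counting; no training involved."""
    counts = Counter(accent_tokens)
    total = len(accent_tokens)
    out = []
    for tok, logit in enumerate(logits):
        if tok in counts:
            out.append(logit + strength * counts[tok] / total)
        else:
            out.append(logit - down)
    return out
```

Apply it to the logits at every step of the sampling loop and the output drifts toward the reference corpus's token statistics.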

(I'm in a bit of a stream of consciousness ramble mode from lack of sleep, but I'll keep going on this message anyway so I don't forget to come back to your post when I'm back at normal capacity. And just hope I don't cringe too hard reading this when better rested.)

Oh I'd love to hear your thoughts on negative prompts in LLMs.

1) What does 'working correctly' look like?

For an audio LLM, I'm thinking something like: a negative prompt of "I'm screaming and I hate you!!!" makes the model more inclined to generate quieter, friendlier speech from your positive prompt. Something like that?

2) How to make it work.

This is probably very model-dependent and fiddly. My first thought: generate two samples in sequence. The first sample is the negative prompt. Save all the logits and tokens. Use them as a negative influence on the second prompt. At least in Bark you can't just flat-out subtract them, or what you actually get is more like 'the opposite of speech' than 'the opposite of your prompt', but when I did French accents I basically just fiddled with a bunch of constant values and weights and eventually it worked. So I'm hoping the same applies. I can imagine a more complicated version where you do some more math to figure out what's unique about a text prompt, versus 'a generic sentence from that language', and only push on those logits. I suppose that might be necessary.
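One cheap way to sketch the "push only on what's unique" idea: center the negative run's logits on their own mean before subtracting, so the generic "this is normal speech" mass roughly cancels and only the distinctive part pushes. The centering trick, the function name, and the 0.3 weight are all guesses to be tuned per model, not anything Bark actually provides:

```python
def apply_negative_prompt(pos_logits, neg_logits, weight=0.3):
    """Subtract a scaled, mean-centered copy of the negative run's
    logits from the positive run's logits. Full-strength subtraction
    tends to give 'the opposite of speech', so keep `weight` small."""
    mean = sum(neg_logits) / len(neg_logits)
    return [p - weight * (n - mean)
            for p, n in zip(pos_logits, neg_logits)]
```

A fancier variant would subtract the logits of a "generic sentence" run instead of the mean, which is closer to the unique-versus-generic idea above.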
