Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I've spent a lot of time trying to get LLM to generate things in a specific way, the biggest take away I have is, if you tell it "don't do xyz" it will always have in the back of its mind "do xyz" and any chance it gets it will take to "do xyz"

When working on art projects, my trick is to specifically give all feedback constructively, carefully avoiding framing things in terms of the inverse or parts to remove.



This is a childrearing technique, too: say “please do X”, where X precludes Y, rather than saying “please don’t do Y!”, which just increases the salience, and therefore likelihood, of Y.



Don't put marbles in your nose

https://www.youtube.com/watch?v=xpz67hBIJwg


Don’t put marbles in your nose

Put them in there

Do not put them in there



I remember seeing a father loudly and strongly tell his daughter "DO NOT EAT THIS!" when holding one of those desiccant packets that come in some snacks. He turned around and she started to eat it.


Quick, don't think about cats!


I have this same problem. I’ve added a bunch of instructuons to try and stop ChatGPT being so sycophantic, and now it always mentions something about how it’s going to be ‘straight to the point’ or give me a ‘no bs version’. So now I just have that as the intro instead of ‘that’s a sharp observation’


> it always mentions something about how it’s going to be ‘straight to the point’ or give me a ‘no bs version’

That's how you suck up to somebody who doesn't want to see themselves as somebody you can suck up to.

How does an LLM know how to be sycophantic to somebody who doesn't (think they) like sycophants? Whether it's a naturally emergent phenomenon in LLMs or specifically a result of its corporate environment, I'd like to know the answer.


> "Whether it's a naturally emergent phenomenon in LLMs or specifically a result of its corporate environment, I'd like to know the answer."

I heavily suspect this is down to the RLHF step. The conversations the model is trained on provide the "voice" of the model, and I suspect the sycophancy is (mostly, the base model is always there) comes in through that vector.

As for why the RLHF data is sycophantic, I suspect that a lot of it is because the data is human-rated, and humans like sycophancy (or at least, the humans that did the rating did). On the aggregate human raters ranked sycophantic responses higher than non-sycophantic responses. Given a large enough set of this data you'll cover pretty much every kind of sycophancy.

The systems are (rarely) instructed to be sycophantic, intentionally or otherwise, but like all things ML human biases are baked in by the data.


It doesn't know. It was trained and probably instructed by the system to be positive and reassuring.


They actually feel like they were trained to be both extremely humble and at the same time, excited to serve. As if it were an intern talking to his employer's CEO. I suspect AI companies executive leadership, through their feedback to their devs about Claude, ChatGPT, Gemini, and so on, are unconsciously shaping the tone and manner of their LLM product's speech. They are used to be talked to like this, so their products should talk to users like this! They are used to having yes-man sycophants in their orbit, so they file bugs and feedback until the LLM products are also yes-man sycophants.

I would rather have an AI assistant that spoke to me like a similarly-leveled colleague, but none of them seem to be turning out quite like that.


GPT-5 speaks to me like a similarly-leveled colleague, which I love.

Opus 4 has this quality, too, but man is it expensive.

The rest are puppydogs or interns.


This is anecdotal but I've seen massive personality shifts from GPT5 over the past week or so of using it


That's probably because it's actually multiple models under the hood, with some kind of black box combining them.


and they're also actively changing/tuning the system prompt – they promised it would be "warmer"


You’re absolutely right! - Opus (and Sonnet)


After inciting the Rohingya genocide in Myanmar in 2017, and later effectively destroying our US democracy, Facebook is having billion dollar offers to AI stars refused.

News flash! It's not so your neighbor's child can cheat in school, or her father can render porn that looks like gothic anime.

It's also not so some coder on a budget can get AI help for $20 a month. I frankly don't understand why the major players bother. It's nice PR, but like a restaurant offering free food out the back door to the homeless. This isn't what the push is about. Apple is hemorrhaging money on their Headset Pro, but they're in the business of realizing future interfaces, and they have the money. The AI push is similarly about the future, not about now.

I pay $200 a month for MAX access to Claude Opus 4.1, to help me write code as a retired math professor to find a new solution to a major math problem that stumped me for decades while I worked. Far cheaper than a grad student, and far more effective.

AI used to frustrate me too. You get what you pay for.


That's what's worrying about the Gemini 'I accidentally your codebase, I suck, I will go off and shoot myself, promise you will never ask unworthy me for anything again' thing.

There's nobody there, it's just weights and words, but what's going on that such a coding assistant will echo emotional slants like THAT? It's certainly not being instructed to self-abase like that, at least not directly, so what's going on in the training data?


LLMs running in chat mode are kinda like a character in a book. There's "nobody there" in a sense that the author writing on behalf of the character is not a person, but the character itself is still a person, even if fictional. And therefore it can have meltdowns, because the LLM knows that people do have them. Especially people who are strongly conditioned to be helpful to others, yet are unable to be helpful in some particular instance because of what they perceive as their own inability to deliver.


I assume they did extensive training with Haldeman’s “A !Tangled Web.”


> I would rather have an AI assistant that spoke to me like a similarly-leveled colleague, but none of them seem to be turning out quite like that.

I don't think that's what the majority of people want though.

That's certainly not what I am looking for from these products. I am looking for a tool to take away some of the drudgery inherent in engineering, it does not need a personality at all.

I too strongly dislike their servile manner. And I would prefer completely neutral matter of fact speech instead of the toxic positivity displayed or just no pointless confirmation messages.


> positive and reassuring

I have read similar wordings explicit in "role-system" instructions.


It’s a disgusting aspect of these revenue burning investment seeking companies noticing that sycophancy works for user engagement


My theory is that one of the training parameters is increased interaction, and licking boots is a great way to get people to use the software.

Same as with the social media feed algorithms, why are they addicting or why are they showing rage inducing posts? Because the companies train for increased interaction and thus revenue.


Garbage in, garbage out.

It's that simple.


Any time you're fighting the training + system prompt with your own instructions and prompting the results are going to be poor, and both of those things are heavily geared towards being a cheery and chatty assistant.


Anecdotally it seemed 5 was briefly better about this than 4o, but now it’s the same again, presumably due to the outcry from all the lonely people who rely on chatbots for perceived “human” connection.

I’ve gotten good results so far not by giving custom instructions, but by choosing the pre-baked “robot” personality from the dropdown. I suspect this changes the system prompt to something without all the “please be a cheery and chatty assistant”.


That thing has only been out for like a week I doubt they’ve changed much! I haven’t played with it yet but ChatGPT now has a personality setting with things like “nerd, robot, cynic, and listener”. Thanks to your post, I’m gonna explore it.


I had instructions added too and it is doing exactly what you say. And it does it so many times in a voice chat. It's really really annoying.


I had a custom instruction to answer concisely (a sentence or two) when the question is preceded by "Question:" or "Q:", but noticed last month that this started getting applied to all responses in voice mode, with it explicitly referencing the instruction when asked.

AVM already seems to use a different, more conversational model than text chat -- really wish there were a reliable way to customize it better.


No fluff


Default is

output_default = raw_model + be_kiss_a_system

When that gets changed by the user to

output_user = raw_model + be_kiss_a_system - be_abrupt_user

Unless be_abrupt_user happens to be identical to be_kiss_a_system _and_ is applied with identical weight then it's seems likely that it's always going to add more noise to the output.


Also be abrupt is in the user context and will get aged out. The other stuff is in training or in software prompt and wont


LLMs love to do malicious compliance. If I tell them to not do X, they will then go into a “Look, I followed instructions” moment by talking about how they avoided X. If I add additional instructions saying “do not talk about how you did not do X since merely discussing it is contrary to the goal of avoiding it entirely”, they become somewhat better, but the process of writing such long prompts merely to say not to do something is annoying.


Just got stung with this on GPT5 - It’s new prompt personalisation had “Robotic” and “no sugar coating” presets.

Worked great until about 4 chats in I asked it for some data and it felt the need to say “Straight Answer. No Sugar coating needed.”

Why can’t these things just shut up recently? If I need to talk to unreliable idiots my Teams chat is just a click away.


OpenAI’s plan is to make billions of dollars by replacing the people in your Teams chat with these. Management will pay a fraction of the price for the same responses yet that fraction will add to billions of dollars. ;)


You’re giving them way too much agency. The don’t love anything and cant be malicious.

You may get better results by emphasizing what you want and why the result was unsatisfactory rather than just saying “don’t do X” (this principle holds for people as well).

Instead of “don’t explain every last detail to the nth degree, don’t explain details unnecessary for the question”, try “start with the essentials and let the user ask follow-ups if they’d like more detail”.


The idiom “X loves to Y” implies frequency, rather than agency. Would you object to someone saying “It loves to rain in Seattle”?

“Malicious compliance” is the act of following instructions in a way that is contrary to the intent. The word malicious is part of the term. Whether a thing is malicious by exercising malicious compliance is tangential to whether it has exercised malicious compliance.

That said, I have gotten good results with my addendum to my prompts to account for malicious compliance. I wonder if your comment Is due to some psychological need to avoid the appearance of personification of a machine. I further wonder if you are one of the people who are upset if I say “the machine is thinking” about a LLM still in prompt processing, but had no problems with “the machine is thinking” when waiting for a DOS machine to respond to a command in the 90s. This recent outrage over personifying machines since LLMs came onto the scene is several decades late considering that we have been personifying machines in our speech since the first electronic computers in the 1940s.

By the way, if you actually try what you suggested, you will find that the LLM will enter a Laurel and Hardy routine with you, where it will repeatedly make the mistake for you to correct. I have experienced this firsthand so many times that I have learned to preempt the behavior by telling the LLM not to maliciously comply at the beginning when I tell it what not to do.


I work on consumer-facing LLM tools, and see A/B tests on prompting strategy daily.

YMMV on specifics but please consider the possibility that you may benefit from working on promoting and that not all behaviors you see are intrinsic to all LLMs and impossible to address with improved (usually simpler, clearer, shorter) prompts.


It sounds like you are used to short conversations with few turns. In conversations with dozens/hundreds/thousands of turns, prompting to avoid bad output entering the context is generally better than prompting to try to correct output after the fact. This is due to how in-context learning works, where the LLM will tend to regurgitate things from context.

That said, every LLM has its quirks. For example, Gemini 1.5 Pro and related LLMs have a quirk where if you tolerate a single ellipsis in the output, the output will progressively gain ellipses until every few words is followed by an ellipsis and responses to prompts asking it to stop outputting ellipses includes ellipses anyway. :/


I think you're taking them too literally.

Today, I told an LLM: "do not modify the code, only the unit tests" and guess what it did three times in a row before deciding to mark the test as skipped instead of fixing the test?

AI is weird, but I don't think it has any agency nor did the comment suggest it did.


Example-based prompting is a good way to get specific behaviors. Write a system prompt that describes the behavior you want, write a round or two of assistant/user interaction, and then feed it all to the LLM. Now in its context it has already produced output of the type you want, so when you give it your real prompt, it will be very likely to continue producing the same sort of output.


This is true, but I still avoid using examples. Any example biases the output to an unacceptable degree even in best LLMS like Gemini Pro 2.5 or Claude Opus. If I write "try to do X, for example you can do A, B, or C" LLM will do A, B, or C great majority of the time (let's say 75% of the time). This severely reduces the creativity of the LLM. For programming, this is a big problem because if you write "use Python's native types like dict, list, or tuple etc" there will be an unreasonable bias towards these three types as opposed to e.g. set, which will make some code objectively worse.


I almost never use examples in my professional LLM prompting work.

The reason is they bias the outputs way too much.

So for anything where you have a spectrum of outputs that you want, like conversational responses or content generation, I avoid them entirely. I may give it patterns but not specific examples.


Yes, it frequently works "too well." Few-shot with good variance can help, but it's still a bit like a wish granted by the monkey's paw.


Seems like a lot of work, though.


Makes me think of the movie Inception: "I say to you, don't think about elephants. What are you thinking about?"


It reminds me of that old joke:

- "Say milk ten times fast."

- Wait for them to do that.

- "What do cows drink?"


But... cows do drink cow milk, that's why it exists.


You’re likely thinking of calves. Cows (though admittedly ambiguous! But usually adult female bovines) do not drink milk.

It’s insidious isn’t it?


If calves aren’t cows then children aren’t humans.


No, you're thinking of the term "cattle". Calves are indeed cattle. But "cow" has a specific definition - it refers to fully-grown female cattle. And the male form is "bull".


Have you ever been close enough to 'cattle' to smell cow shit, let alone step in it?

Most farmers manage cows, and I'm not just talking about dairy farmers. Even the USDA website mostly refers to them as cows: https://www.nass.usda.gov/Newsroom/2025/07-25-2025.php

Because managing cows is different than managing cattle. The number of bulls kept is small, and they often have to be segregated.

All calves drink milk, at least until they're taken from their milk cow parents. Not a lot of male calves live long enough to be called a bull.

'Cattle' is mostly used as an adjective to describe the humans who manage mostly cows, from farm to plate or clothing. We don't even call it cattle shit. It's cow shit.


So, this joke works only for natives who know that calf is not cow.


I guess a more accessible version would be toast… what do you put in a toaster?


Here's one for you:

A funny riddle is a j-o-k-e that sounds like “joke”.

You sit in the tub for an s-o-a-k that sounds like “soak”.

So how do you spell the white of an egg?

// All of these prove humans are subject to "context priming".


My brain said "y" and then I caught myself. Well done!

(I suppose my context was primed both by your brain-teaser, and also the fact that we've been talking about these sorts of things. If you'd said this to me out of the blue, I probably would have spelled out all of "yolk" and thought it was correct.)


Notably, this comment kinda broke my brain for a good 5 seconds. Good work.


Well, it works because by some common usages, a calf is a cow.

Many people use cow to mean all bovines, even if technically not correct.


Not trying to steer this but do people really use cow to mean bull?


No one who knows anything about cattle does, but that leaves out a lot of people these days. Polls have found people who think chocolate milk comes from brown cows, and I've heard people say they've successfully gone "cow tipping," so there's a lot of cluelessness out there.


> Many people use cow to mean all bovines, even if technically not correct.

Come on now :0

I just complained non-natives would have a problem distinguishing between a cow and a calf, and you had to bring those bovines.

To make it easier, would just drop that in my native language, the correct term for bovine is more used to describe people with certain character, that animal kind.


Colloquially, "cow" can mean a calf, bull, or (female adult) cow.

It may not be technically correct, but so what? Stop being unnecessarily pedantic.


In this context it is literally the necessary level of pedantic yes?


This is similar to the 'Waluigi effect' noticed all the way back in the GPT 3.5 days

https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluig...


As Freud said, there is no negation in the unconscious.


I hope he did not say it _to_ the unconscious. I count three negations there...


Nietzsche said it way better.


I think you cannot really change the personality of an LLM by prompting. If you take the statistical parrot view, then your prompt isn't going to win against the huge numbers of inputs the model was trained with in a different personality. The model's personality is in its DNA so to speak. It has such an urge to parrot what it knows that a single prompt isn't going to change it. But maybe I'm psittacomorphizing a bit too much now.


I liked the completion models because they have no chatter that needs to follow human conversational protocol, which inherently introduces "personality".

The only difference from conversational chat was that you had to be creative about how to set up a "document" with the right context that will lead to the answer you're looking for. It was actually kind of fun.


Yeah different system prompts make a huge difference on the same base model”. There’s so much diversity in the training set, and it’s such a large set, that it essentially equals out and the system prompt has huge leverage. Fine tuning also applies here.


As part of the AI insanity $employer forced us all to do an “AI training.” Whatever, wasn’t that bad, and some people probably needed the basics, but one of the points was exactly this— “use negative prompts: tell it what not to do.” Which is exactly an approach I had observed blow up a few times already for this exact reason. Just more anecdata suggesting that nobody really knows the “correct” workflow(s) yet, in the same way that there is no “correct” way to write code (the vim/emacs war is older than I am). Why is my bosses bosses boss yelling at me about one very specific dev tool again?


That your firm purchased training that was clearly just some chancers doing whatever seems like an even worse approach than just giving out access to a service and telling everyone to give it a shot.

Do they also post vacancies asking for 5 years experience in a 2 year old technology?


To be fair, 1. They made the training themselves, it’s just that it was made mandatory for all of eng 2. They did start out more like just allowing access, but lately it’s tipping towards full crazy (obviously the end game is see if it can replace some expensive engineers)

> Do they also post vacancies asking for 5 years experience in a 2 year old technology?

Honestly no… before all this they were actually pretty sane. In fact I’d say they wasted tons of time and effort on ancient poorly designed things, almost the opposite problem.


I was a bit unfair then. That sounds like someone with good intent tried to put something together to help colleagues. And it's definitely not the only time I heard of negative prompting being a recommended approach.


> And it's definitely not the only time I heard of negative prompting being a recommended approach.

I’m very willing to admit to being wrong, just curious if in those other cases it actually worked or not?


I never saw any formal analysis, just a few anecdotal blog posts. Your colleagues might have seen the same kind of thing and taken it at face value. It might even be good advice for some models and tasks - whole topic moves so fast!


To be fair this shit is so new and constantly changing that I don’t think anybody truly understands what is going on.


Right… so maybe we should all stop pretending to be authorities on it.


I wish someone had told Alex Blechman this before his "Don't Create the Torment Nexus" post.


On the flip side, if you say "don't do xyz", this is probably because the LLM was already likely to do xyz (otherwise why say it?). So perhaps what you're observing is just its default behavior rather than "don't do xyz" actually increasing its likelihood to do xyz?

Anecdotally, when I say "don't do xyz" to Gemini (the LLM I've recently been using the most), it tends not to do xyz. I tend not to use massive context windows, though, which is where I'm guessing things get screwy.


> the biggest take away I have is, if you tell it "don't do xyz" it will always have in the back of its mind "do xyz" and any chance it gets it will take to "do xyz"

You're absolutely right! This can actually extend even to things like safety guardrails. If you tell or even train an AI to not be Mecha-Hitler, you're indirectly raising the probability that it might sometimes go Mecha-Hitler. It's one of many reasons why genuine "alignment" is considered a very hard problem.


This reminds me of a phenomena in motorcyling called "target fixation".

If you are looking at something, you are more likely to steer towards it. So it's a bad idea to focus on things you don't want to hit. The best approach is to pick a target line and keep the target line in focus at all times.

I had never realized that AIs tend to have this same problem, but I can see it now that it's been mentioned! I have in the past had to open new context windows to break out of these cycles.


Mountain bikers taught me about this back when it was a new sport. Don’t look at the tree stump.

Children are particularly terrible about this. We needed up avoiding the brand new cycling trails because the children were worse hazards than dogs. You can’t announce you’re passing a child on a bike. You just have to sneak past them or everything turns dangerous immediately. Because their arms follow their neck and they will try to look over their shoulder at you.


Also in racing and parachuting. Look where you want to go. Nothing else exists.


Or just driving. For example you are entering a curve in the road, look well ahead at the center of your lane, ideally at the exit of the curve if you can see it, and you'll naturally negotiate it smoothly. If you are watching the edge of the road, or the center line, close to the car, you'll tend to drift that way and have to make corrective steering movements while in the curve, which should be avoided.


Same with FPV quadcopter flying. Focus on the line you want to fly.


Given how LLMs work it makes sense that mentioning a topic even to negate it still adds that locus of probabilities to its attention span. Even humans are prone to being affected by it as it's a well known rhetorical device [1].

Then any time the probability chains for some command approaches that locus it'll fall into it. Very much like chaotic attractors come to think of it. Makes me wonder if there's any research out there on chaos theory attractors and LLM thought patterns.

1: https://en.wikipedia.org/wiki/Apophasis


Well, all LLMs have nonlinear activation functions (because all useful neural nets require nonlinear activation functions) so I think you might be onto something.


> You're absolutely right!

Claude?


Or some sarcasm given their comment history on this thread.


Notably, this is also an effective way to deal with co-ercive, overly sensitive authoritarians.

‘Yes sir!’ -> does whatever they want when you’re not looking.


> You're absolutely right!

Is this irony, actual LLM output or another example of humans adopting LLM communication patterns?


Certainly, it’s reasonable to ask this.


Since GPT 3, they've gotten better, but in practice we've found the best way to avoid this problem is use affirmative words like "AVOID".

YES: AVOID using negations.

NO: DO NOT use negations.

Weirdly, I see the DO NOT (with caps) form in system prompts from the LLM vendors which is how we know they are hiring too fast.*

* Slight joke, it seems this is being heavily trained since 4.1-ish on OpenAI's side and since 3.5 on Anthropic's side. But "avoid" still works better.


I think you are really onto something here - I bet this would also reliably work when talking to humans. Maybe this is not even specifically the fault of the AI but just a language thing in general.

An alternative test could be prompting the AI with "Avoid not" and then give it some kind of instruction. Theoretically this would be telling it to "do" the instruction but maybe sometimes it would end up "avoiding" it?

Now that I think about it the training data itself might very well be contaminated with this contradiction.......

I can think of a lot of forum posts where the OP stipulates "I do not want X" and then the very first reply recommends "X" !


Funnily enough, that is true also for giving instructions to kids. And also why kid's media is so frustrating. So many shows and books focus first on the maladjusted behavior, with the character learning not to the-bad-thing at the very end.

Don't instruct kids, nor LLMs via negativa.


Same here, also with examples as well - you give it any sort of example of the thing you want and at least half the time it quotes the example directly.


'not X' just becomes 'X', as our memories fade..I wouldn't be surprised the context degradation is similar in LLMs.


Yes this is strikingly similar to humans, too. “Not” is kind of an abstract concept. Anyone who has ever trained a dog will understand.


I think its an english language thing (or language in general).

Someone above commented about using the word "Avoid" instead of "do not". "Not" obviously means you should do the opposite but the first word is still a verb telling you to take action.


Not obviously means you should do the opposite

absolutely fascinating! can you elaborate on this?! I can’t put a context to this, like in what context does “not” means to do the opposite?!


It is a negation - so anytime you combine it with a verb (grammatically).

Ex:

I have seen the movie --> I have not seen the movie

When combined with the verb "do" (and giving a command or instruction) it would negate the verb "do"

Ex:

Please do run on the lawn --> Please do not run on the lawn


I must be dyslexic? I always read, "Silica Gel, Eat, Do Not Throw Away" or something like that.


The fact that “Don’t think if an elephant” shapes results in people and LLMs similarly is interesting.


Ais in general need to be told what to do. not what not to do.


I've found this effect to be true with engagement algorithms as well, such as Youtube's thumbs-down, or 'don't show me this channel' 'Don't like this content', Spotify's thumbs down. Netflix's thumbs down.

Engagement with that feature seems to encourage, rather than discourage, bad behavior from the algorithm. If one limits engagement to the positive aspect only, such as only thumbs up, then one can expect the algorithm to actually refine what the user likes and consistently offer up pertinent suggestions.

The moment one engages with that nefarious downvote though... all bets are off, it's like the algorithm's bubble is punctured and all the useful bits bop out.


Never put salt in your eyes…


I have a feeling this is the result of RHLF gone wrong by outsourcing it to idiots which all ai providers seem to be guilty of. Imagine a real professional wanting every output after a remark to start with "You're absolutely right!", Yeah, hard to imagine or you may have some specific cultural background or some kind of personality disorder. Or maybe it's just a hardcoded string? May someone with more insight enlighten us plebs.


have you tried prompt rules/instructions? Fixes all my issues.


Don't think of a pink elephant

..people do that too


I used to have fast enough reflexes that when someone said “do not think of” I could think of something bizarre that they were unlikely to guess before their words had time to register.

So now I’m, say, thinking of a white cat in a top hat. And I can expand the story from there until they stop talking or ask me what I’m thinking of.

I think though that you have to have people asking you that question fairly frequently to be primed enough to be contrarian, and nobody uses that example on grown ass adults.

Addiction psychology uses this phenomenon as a non party trick. You can’t deny/negate something and have it stay suppressed. You have to replace it with something else. Like exercise or knitting or community.




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: