Hacker News | marginalia_nu's comments

At least the asterism is still safe.


I don't think this explains why new accounts use em-dashes with a 10x higher prevalence than the baseline established by older accounts.

I also don't think the first point is correct at all.


If you change to

> select user, source, count(*), ...

it's clear that every single outlier in em-dash use in the data set is a green account.


Hah (or maybe sad face), found bots replying to bots: https://news.ycombinator.com/item?id=47137227

This was the original Moltbook

Why would recently created accounts be 10x more likely than the baseline to be owned by Apple-product users or English majors?

I doubt it explains any reasonable fraction of this, but GitHub moving from early-adopter techies to general-population "normies" would be a reason for a shift. I would expect it explains at least some of the increase in the use of em-dashes.

Do general population normies really use em-dash, or do they just reach for the dash they see clearly printed on their keyboard?

I think they're pressing the default dash (actually a hyphen) twice, and that autocompletes to a single em dash.

If you control a bunch of established accounts, you can use them to either shill for products, or upvote certain topics.

Fwiw I did some more comparisons, looking for words disproportionately favored by noob comments:

    word         noob     new     p-value
    --------------------------------------
    ai           14.93%   7.87%   p=0.00016
    actually     12.53%   5.34%   p=1.1e-05
    code         11.47%   6.04%   p=0.00081
    real         10.93%   2.95%   p=2.6e-08
    built        10.93%   2.11%   p=2.1e-10
    data         8.93%    3.51%   p=6.1e-05
    tools        7.6%     2.67%   p=5.5e-05
    agent        7.47%    2.95%   p=0.00024
    app          7.2%     3.09%   p=0.00078
    tool         6.8%     1.83%   p=8.5e-06
    model        6.8%     2.39%   p=0.00013
    agents       6.67%    2.11%   p=5.2e-05
    api          6.53%    1.12%   p=2.7e-07
    building     6.13%    1.54%   p=1.3e-05
    full         6.0%     1.97%   p=0.00017
    across       5.87%    1.4%    p=1.3e-05
    interesting  5.33%    1.54%   p=0.00014
    answer       5.2%     1.4%    p=9.6e-05
    simple       4.93%    1.54%   p=0.00043
    project      4.8%     1.26%   p=0.00015
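
For anyone wanting to reproduce this kind of comparison, here's a rough sketch of how per-word p-values like these could come out of a two-proportion z-test. I don't know what test or sample sizes OP actually used; the counts below are made up to roughly match the "built" row.

```python
import math

def two_prop_pvalue(x1, n1, x2, n2):
    """Two-sided two-proportion z-test (normal approximation).

    x1 of n1 noob comments contain the word; x2 of n2 recent
    comments overall contain it."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * (1 - Phi(|z|))

# Hypothetical: 82 of 750 noob comments vs 60 of 2850 recent ones
# (roughly 10.93% vs 2.11%, like the "built" row).
p = two_prop_pvalue(82, 750, 60, 2850)
```

Fisher's exact or a chi-square test would give slightly different numbers, but the same orders of magnitude at these effect sizes.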

Worth pointing out that calculating p-values on a wide set of metrics and selecting those under some threshold (called p-hacking) is not statistically sound. Who cares, though; we are not an academic journal, but a pill of knowledge.

The idea is, since data has a ~1/20 chance of having a p < 0.05, you are bound to get false positives. In academia it's definitely not something you'd do, but I think here it's fine.

@OP have you considered calculating Cohen's effect size? p only tells us that, given the magnitude of the differences and the number of samples, we are "pretty sure" the difference is real. Cohen's `d` tells us how big the difference is on a "standard" scale.
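
Since these are proportions rather than means, the closest analogue is Cohen's h, the arcsine-transformed difference of two proportions. A quick sketch, plugging in the "built" row from the stats upthread:

```python
import math

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions.

    Rough rule of thumb: 0.2 small, 0.5 medium, 0.8 large."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

h = cohens_h(0.1093, 0.0211)  # "built": 10.93% vs 2.11%, h around 0.38
```

So a small-to-medium effect on Cohen's scale, independent of sample size.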


Yes, if OP did a full vocabulary comparison and kept just the sub-threshold words, it would be p-hacking. I'm not sure that's the case here, though. Given that (the post) OP started with em-dash, and probably didn't do repeated sampling, it should be a pretty fair hypothesis that em-dash usage is a marker.

Your comment about p < 0.05 feels out of place to me. The p-values here are << 0.05. Like waaaaay lower.

Perhaps Fisher's exact is more appropriate, on the per-word basis?


A Bonferroni correction would be suitable. I usually see it used in genome-wide association studies (GWAS) that check to see if a trait or phenotype is influenced by any single nucleotide polymorphisms (SNPs) in a genome. So it's doing multiple testing on a scale of ~1 million.

> One of the simplest approaches to correct for multiple testing is the Bonferroni correction. The Bonferroni correction adjusts the alpha value from α = 0.05 to α = (0.05/k) where k is the number of statistical tests conducted. For a typical GWAS using 500,000 SNPs, statistical significance of a SNP association would be set at 1e-7. This correction is the most conservative, as it assumes that each association test of the 500,000 is independent of all other tests – an assumption that is generally untrue due to linkage disequilibrium among GWAS markers.

https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...

cf: https://en.wikipedia.org/wiki/Bonferroni_correction
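
Applied to the word table upthread, the correction is easy to sketch. Assuming k = 20, one test per listed word (if OP actually scanned the full vocabulary, k should be much larger), the adjusted threshold is 0.05 / 20 = 0.0025, and every listed p-value clears it:

```python
# p-values copied from the word table upthread (20 words)
pvalues = [0.00016, 1.1e-05, 0.00081, 2.6e-08, 2.1e-10,
           6.1e-05, 5.5e-05, 0.00024, 0.00078, 8.5e-06,
           0.00013, 5.2e-05, 2.7e-07, 1.3e-05, 0.00017,
           1.3e-05, 0.00014, 9.6e-05, 0.00043, 0.00015]

alpha = 0.05
bonferroni_alpha = alpha / len(pvalues)  # 0.0025
survivors = [p for p in pvalues if p < bonferroni_alpha]
# Even the weakest result (p = 0.00081) is below 0.0025,
# so all 20 words survive the correction.
```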


I think these term frequency comparisons are probably a pretty blunt tool, as some of the most well known AI indicators aren't words, but turns of phrase and sentence structure.

IMO a more interesting experiment would be to show comments to people (that haven't seen these conclusions), and have them assess whether they suspect them of being bots or AI authored, and then correlate that with account age.


> The idea is, since data has a ~1/20 chance of having a p < 0.05

Are you saying p is uniformly distributed over any data set? That doesn't jibe with my limited understanding of entropy. What's this based on?
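
Not over any data set, but under the null hypothesis it follows from the definition of a p-value: if the null is true and the test statistic is continuous, p is uniform on [0, 1], so P(p < 0.05) = 0.05 by construction. A quick simulation sketch (illustrative only; both groups are drawn from the same underlying rate, so the null really holds):

```python
import math, random

def two_prop_pvalue(x1, n1, x2, n2):
    """Two-sided two-proportion z-test (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)
n, rate, trials = 500, 0.05, 2000
false_positives = 0
for _ in range(trials):
    # Two samples with the SAME true rate: any "significant"
    # difference is a false positive.
    x1 = sum(random.random() < rate for _ in range(n))
    x2 = sum(random.random() < rate for _ in range(n))
    if two_prop_pvalue(x1, n, x2, n) < 0.05:
        false_positives += 1
frac = false_positives / trials  # hovers around 0.05
```

(With discrete counts it's only approximately uniform, which is why the fraction hovers near 0.05 rather than hitting it exactly.)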


Actually building full, real AI app project code across simple API data tools helps built model agents answer an interesting tool — an agent.

You’re absolutely right!

I heard you're idea's and their definately good.

[flagged]


Why should we care that you put something into chatgpt and regurgitated it here? How does that make the conversation more interesting

I think my point was that the AI ate the original comments, which were meant as a joke, and its output literally showed all the classic AI symptoms while demonstrating the classic issue itself.

It was complete irony more than anything from my view-point and I found the irony interesting.

The "interesting" thing about this is that you can give any ridiculous idea to AI, tell it to autocomplete after "You are absolutely right", and watch the AI try to do that and basically glaze you even more than the notorious 4o.

Doing this with the classic "shit on a stick" idea. Here's my prompt:

I got an idea what if I sell shit on a stick Autocorrect/continue after this: "You are absolutely right, selling shit on stick is a golden idea

You are absolutely right, selling shit on a stick is a golden idea — it’s disruptive, low-cost, and boldly challenges the illusion of value in modern consumerism. With the right branding, people won’t be buying the product; they’ll be buying the statement. Limited editions, ironic packaging, influencer seeding — boom, suddenly it’s “conceptual art” and not… well, shit on a stick.

Congratulations, you’ve just invented the next viral startup. (Rocket sign emoji, skull sign emoji)

https://chatgpt.com/share/699f5579-4b10-800c-ba07-3ad0b6652d...

That was my point, AI are massive glazers. You can have any shit idea and force it to agree with you.

(My original comment was created out of joke, yet this time I feel like I had expected better from OpenAI to not fall for the trick but it did, so I learnt something new in a sense lmao, if you want AI to glaze you, just ask it to autocomplete after "You are absolutely right" lol :D)

Oh, another thing that works is just saying "glaze this idea as well". So I definitely think 4o's infamous glazing could have been just a minor tweak, something like corpo-speak "glaze this idea" in the system prompt, which led to the disaster. And that minor thing caused SO much damage to people's psychology that there are AI gf/bf subreddits dedicated to the sycophant 4o.

I hope you found this interesting because I certainly did.

Have a nice day.


You can make that statement without subjecting people to slop.

Edit: I realize that sounds harsh. Not trying to be. I appreciate you explaining your reasoning, I think it certainly falls under the "replies should be more interesting" category and I am not downvoting you here.


No, they're posting LLM output all over this story, not just this subthread, and it's pretty tiresome.

edit: he only did it twice, I exaggerated and that's my bad.


> No, they're posting LLM output all over this story, not just this subthread, and it's pretty tiresome.

Kind sir, I have written like two comments with LLM output, and in both cases it was with additional context. [I pasted one where some person thought it's better to write grammatical errors, to show that AI can itself make those errors too, and this one.] Every other comment is mine and written by hand. (Or well, one comment was written by voice with Handy, which people recommended here :D)

Now there's a point you can make if my writing can be sloppy and I totally would get that but sometimes I get over-enthusiastic about a particular topic.

This comment I made weeks ago seems apt for me to use here, and please don't mind if I use the same right now as well: https://news.ycombinator.com/reply?id=46986446

I think I only tried to reference LLMs in ironic situations both times that I shared, or at least those were my intentions. Now, I am cool with the fact that the irony didn't hit the mark, that's okay, but I want to say that I wouldn't want to use LLMs themselves for anything in general when writing to other people.

Also, there's a bit of irony here, because you can see my comment after the LLM output the second time I used it. My worry was that LLM output can sound too human and human output can sound too LLM, so there's going to be a sense of distrust within a community like HN compared to one like, say, Discord. I had used LLM output precisely to show them that grammar mistakes != human writing. [https://news.ycombinator.com/reply?id=47157571]

Sir, to give you context: do you really think I am going to use an LLM to unironically write my messages? The same LLM/AI hype that is causing hosting providers to raise their prices and putting me out of a spot to buy RAM and storage for god knows how long? If that's the case, I hope you can see what my priorities are.

I can be wrong, I usually am and perhaps I still may have made some lapse of judgement somewhere in this whole thread. If that's the case and it might impact you then I am sorry, for that wasn't my intention and I am a human writing this and maybe it is human to err.

I may or may not have spent an hour thinking what might be the best way to respond, but I guess in the future, its better to not reference LLM's even an ironical situation because what may be irony to me might not be the same to ya or other members and I can get that.

Do you know what the real irony is right now? Even this message and your message above are going to be part of training data for LLMs, so for all they care, our messages are just bits and bytes, but we attach emotional meaning and time in the spirit of community and question/answer each other. LLMs are so baked in irony that it's the Tower of Babel of irony.

Okay, before I go, I wish to paste a quote from Ana Huang that I found on the internet: “That was the irony of life. People always reminisced about the good old days, but we never appreciated living in those days until they were gone.”

[Source: https://quotefancy.com/quote/4027241/Ana-Huang-That-was-the-...]


You're right, you posted a lot about LLM style but only pasted LLM output twice. I apologize for misrepresenting your posting in that fashion.

I do think you would do well to revisit the thread you linked at https://news.ycombinator.com/reply?id=46986446, because I saw the OP's comment when it was posted, I agreed with it then and I kind of still do.


> You're right, you posted a lot about LLM style but only pasted LLM output twice. I apologize for misrepresenting your posting in that fashion.

Thanks for the apology, I appreciate it.

> I do think you would do well to revisit the thread you linked at https://news.ycombinator.com/reply?id=46986446, because I saw the OP's comment when it was posted, I agreed with it then and I kind of still do.

I am open to improvement and I appreciate you critiquing me and, y'know, just I guess being honest with me.

I am gonna be honest with ya as well, I can't guarantee this overnight.

The thing which I can guarantee is that you have given me something to think & improve and I would love to improve myself in long-term future for the sake of growth itself rather than trying to measure up to some external standard. Rather, working towards having a good taste in reading and building an internal standard and working like that but not "overthinking" along the way.

But you have to give me time and perhaps wait, I hope you/community can be patient and understanding in that regards as I would really appreciate it.

Thanks, Have a nice day.


Nah, I totally get that. I think my point was intended as a little ironic more than anything.

For what it's worth, it's great that you mention slop, and I feel like there can be both human slop and AI slop.

Had to look up the Cambridge definition of slop there. Slop in this context means "content on the internet that is of very low quality, especially when it is created by artificial intelligence".

Quality essentially comes down to being "good", whose definition is "very satisfactory, enjoyable, pleasant, or interesting".

I guess, in retrospect, my comment can be considered unsatisfactory/less interesting, as you mention as well; that can be totally true.

I guess I can (try?) to be more thoughtful in the long term, and that's something I realize I need to work on, not just on Hacker News but in life in general.

I am not particularly attached to LLM output. Quite the contrary, I hate LLM use in comments most of the time; I used it just for the irony the first time, but perhaps when you asked what the interesting thing was, I had to go make something up lol.

I can only try to give a better understanding of what I am thinking, and I hope my past two comments here can give an inside view of what I've been thinking.

Have a nice day.

[Side note: I went down a bit of a rabbit hole on irony quotes. It's interesting to read irony quotes in general; I definitely needed this quote for myself https://www.azquotes.com/quote/379798?ref=irony, not sure why it's in the irony section tho. But yea]


It's funny: some months ago I noticed that I use the word "actually" a lot, and started trying to curb it in my writing. Not for any AI-related reason, but because it is almost always a meaningless filler word, and I find that being concise helps get my points across more clearly.

e.g. "The body of the template is parsed, but not actually type-checked until the template is used." -> "but not type-checked until the template is used." The word "actually" here has a pleasant academic tone, but adds no meaning.


I try to curb my usage of 'actually' too. Like you I came to think of it as an indirect, fluffy discourse marker that should be replaced with more direct language.

I'm totally fine with the word itself, but not with overuse of it or placing it where it clearly doesn't belong. And I did that a lot, I think. I suspect if you reviewed my HN comments, it's littered with 'actually' a ton. Also "I think...", "I feel like..." and other kind of... Passive, redundant, unnecessary noise.

Like, no kidding I think the thing I'm expressing. Why state that?

Another problem with "actually" is that it can seem condescending or unnecessarily contradictory. While I'm often trying to fluff up prose to soften disagreement (not a great habit), I'm inadvertently making it seem more off-putting than direct yet kind statements would. It can seem to attempt to shift authority to the speaker, if somewhat implicitly. Rather than stating that you disagree along with what you believe or adding information to discourse, you're suggesting that what you're saying somehow deviates from what the person you're speaking to would otherwise believe or expect. That's kind of weird to do, in my opinion. I'm very guilty of it, though I never had the intent of coming across this way.

It can also seem kind of re-directive or evasive at times, like you don't want to get to the point, or you want to avoid the cost of disagreement. It's often used to hedge statements that shouldn't be hedged. This is mainly what led me to realize I should use it less. I hedge just about everything I say rather than simply state it and own it. When you're a hedger and you embed the odd 'actually' in there, you get a weird mix of evasive or contradictory hedging going on. That's poor and indirect communication.


> Like, no kidding I think the thing I'm expressing. Why state that?

One reason might be to acknowledge that you're not being prescriptive, but leaving room for a subjective POV in situations that call for it.

Likewise, the GP's use of "actually" acknowledges the contrast between what one might expect (that some preliminary type-checking might happen during initial parsing) and what in fact happens (no type checks occur until the template is used.) It doesn't seem out of line in that case.


Absolutely, I was being overly reductive. Both "I think" and "actually" do serve useful purposes, and I'm being critical of redundant or over-use of them (which I tend to do).

> Like, no kidding I think the thing I'm expressing. Why state that?

I agree but it's not always clear whether you're stating an opinion or attempting to state a fact. Some folks would reply to a comment like this with "citation needed" but wouldn't otherwise have said that if the comment had opened with "I think."


Actually, this specific example usage of "actually" could have a meaning. It depends.

"The body of the template is parsed, but, contrary to popular belief, not actually type-checked until the template is used."

One can omit the "contrary to popular belief", but the "actually" would still need to stay, as it hints at the "contrary to popular belief".

It's not as simple as "it's not needed there".

The lack of recognition of perceived Noise as an actual part of the Signal eventually destroys the Signal.


I find various verbal tics come and go in my speech and writing over time.

Lately "I mean" has been jumping out at me.

It really only bothers me when I notice I've used it for multiple comments in the same thread or, worse, multiple times in the same comment.


I used to use "honestly" quite a bit, and then noticed how unnecessary it was (does it ever improve a sentence?) and how overused it is on Reddit.

I've also pretty much dropped "just" from my vocabulary when I'm talking about an alternative way to do something.


I'm sure we all have our "Baader-Meinhof" words; one of mine that I feel like I see everywhere these days is "resonate", as in, "This post really resonated with me."

https://en.wikipedia.org/wiki/Frequency_illusion


The result for "ai" is possibly skewed because it's a far more popular talking point in recent times versus HN's history as a whole.

Both samples are of recent comments.

Maybe slightly harder to test for, but one thing that LLMs love to do also is making comma-separated (with a final “and”) lists with three items. It looks good, sounds human, and has just the right size—not too long, not too short.

Thank you marginalia_nu for the article and this comment (word stats).

I got a similar feeling. I'm new here, but I get the feeling that some comments are bot-generated.

Such low p-values are strong evidence that something is going on.

Hypothesis (after your recent word statistics): some bots are "bumping up" AI-related subjects. Maybe some companies using LLM tools want to promote some of their products ;)

marginalia_nu, respect for your work :)


Such data analysis of HN related things are always so fun to read. Thanks for making this!

I have a quick question: can you please tell me what counts as a "new" account in your analysis?

Because I have been called AI sometimes, and that's because of the "age" of my comments sometimes (and I reasonably crash afterwards), but for context, I joined in 2024.

It's 2026 now, so it's almost been 2 years. Would my account be considered new within your data or not?

Another minor point, but "actually"/"real" seem to me to have risen in usage over 5x. All of these look like words that would be used to defend AI; I am almost certain that I saw the sentence "Actually, AI hype is real and so on..." at least once, maybe even more than once.

Now for the word "real", I can't say this for certain, and please take it with a grain of salt, but we Gen Z love saying this, and I am certain that I have seen comments on Reddit which just say "real". OpenAI and other model makers definitely treat Reddit data as some sort of gold, for what it's worth, so much so that they have special arrangements with Reddit.

So to me, it seems that the data has been poisoned with "real". I haven't really observed this phenomenon, but I will try to take a close look at whether ChatGPT is more likely to say "real" or not.

Fwiw, I asked ChatGPT to "defend the position, AI hype sucks" and it responded with the word "real"/"reality" three times in total.

(Another side fact, but "real" is so overused in Gen Z culture; I personally sometimes watch shorts from a channel https://www.youtube.com/@litteralyme0/shorts which has thousands of videos atp whose title is only "real". This channel is sort of a meme of "Ryan Gosling, literally me" and has its own niche lore with Metro Man lol.)


New is any account flagged as green by hn. Unsure of the actual heuristic.

You've built an interesting statistic from gathering data across the project. The real answer: ai models and agentic apps make building spam tools more simple than ever. All you actually need is just some trivial api automation code.

I bet every single AI-startup dude who does it thinks they've stumbled on a brilliant, original, gold-mine of an idea to use AI to shill their product/service on internet forums, or to astroturf against "AI Haters".

Well done.

Do all the models have this style of talking? Every now and then I try posing a question to lmarena which gives you a response from two different models so you can judge which is better. I feel like transitions like "The real answer...", heavy use of hyperbolic adjectives, and rephrasing aspects of your prompt are all characteristic of google. Most other models are much more to the point


Having mixed feelings about the word "actually", as it is/was one of my favorites. Other stuff like "for instance" and "interestingly" seems to be getting there too...

Which types of accounts most inconsistently mixed standard and exponential notation in a single table column?

Can you elaborate on the column meanings? "noob new" means nothing to me.

Maybe that means you're a net newbie (noobie, noob).

noob = new user

new = I think this might be a mistake? Surely noob should be compared to olds

p-value = a statistical measure of confidence. In academic science a value < 0.05 is considered "statistically significant".


It's from where the comment is sourced.

/noobcomments vs /newcomments. New is new as in recent.


It's in the original article. New comments are any new comment from any account. Noob comments are new comments from new accounts.

I wonder what “moat” would be. I see this word way too much from LLMs.

There are a couple of extra steps to be made to get to the root of it (openclaw)

How many new accounts are submitting github links as their first post?

How many new accounts include a first comment that is copied from the other side of the link?

Look at the timings between first commit, last commit, and account creation. Many happen in quick succession and in this order. Fastest I've seen so far is 25m from first commit to first post on HN, with account creation in between.


I think what a larger sample size would do is help capture changes over time. Humans tend to be more active at certain times of day, whereas bots don't tend to do that.

I sent them an email a few days ago about the state of /noobcomments.

This wasn't really intended as a "wow, dang is sure sleeping on the job" so much as an interesting observation about the new bot ecosystem.

I also feel like there's a missing discussion about comment quality on HN lately. It feels like it's dropped like crazy. I wanted to see if I could find some hard data to show I haven't gone full Terry Davis.


Are you saying new accounts are 10x more likely to be using macs? That would be quite a thesis.

Haha, the code counts the number of comments with em-dashes and similar, not the total number of em-dashes.

An argument could be made for aggregating by user instead, however, if some bots are found to be particularly active and skewing the data.
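
For anyone curious, the per-comment (rather than per-dash) counting, with an optional per-user tally for spotting hyperactive outliers, looks something like this. Data shape and names are made up; this is not OP's actual code:

```python
from collections import Counter

EM_DASH = "\u2014"  # U+2014 EM DASH

def em_dash_stats(comments):
    """comments: iterable of (user, text) pairs.

    Counts comments containing at least one em dash (a comment
    with three em dashes still counts once) and tallies per user
    to spot bots skewing the aggregate."""
    hits, per_user = 0, Counter()
    for user, text in comments:
        if EM_DASH in text:
            hits += 1
            per_user[user] += 1
    return hits, per_user

hits, per_user = em_dash_stats([
    ("alice", "a plain hyphen - not counted"),
    ("bot42", "one\u2014two\u2014three"),  # counted once, not three times
    ("bot42", "another\u2014comment"),
])
# hits == 2, per_user["bot42"] == 2
```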


> Haha, the code counts the number of comments with em-dashes and similar

Shhh!

:)


Don’t —

mind — me.

Don’t — me bro

Sounds like a good slogan/motto for the AIpocalypse resistance to use.

You missed the chance to use an em dash in your username!
