
Bots. Spam ruins it for everyone. Cloudflare, for example, provides an analytics platform based on server logs, but in most cases it overreports visits by a huge amount. JavaScript-based trackers don't have this problem to the same degree, because requiring JavaScript is a good filter for most bots.

Here's Cloudflare's page on why there is a discrepancy between their numbers and GA-type services: https://support.cloudflare.com/hc/en-us/articles/36003768411...

And here's a blog post by a self-hosted-analytics provider who shows exact numbers from his personal blog: https://markosaric.com/cloudflare-analytics-review/

He has a stake in the matter, but the discrepancy is so large that I'd guess it would be hard to fake. Anecdotally, friends of mine with small blogs report the same experience.



Bot activity in web server logs isn't that hard to spot. Bots often don't use "regular" user agents, they usually download only a page's HTML without fetching the images and other media, they often do a GET on /robots.txt before proceeding, they show unrealistically short times between loading one page and the next, and they often come from known IP subnets. My simple log-parsing heuristic already removes 90% of the bot traffic from the statistics.
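For illustration, the signals listed above could be combined roughly as follows. This is only a sketch, not the heuristic the comment refers to: it assumes logs in the Apache/nginx "combined" format, and the names `parse`, `bot_ips`, `KNOWN_BOT_NETS`, the user-agent keywords, the asset extensions and the one-second threshold are all made up for the example.

```python
import ipaddress
import re
from collections import defaultdict
from datetime import datetime

# Assumed input: "combined" log format
# ip - user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)
# Illustrative keyword list, not exhaustive.
BOT_UA_RE = re.compile(r"bot|crawl|spider|curl|wget|python-requests", re.I)
ASSET_RE = re.compile(r"\.(?:css|js|png|jpe?g|gif|svg|webp|ico|woff2?)(?:\?|$)", re.I)
# Placeholder subnet (a documentation range); substitute real crawler ranges.
KNOWN_BOT_NETS = [ipaddress.ip_network("203.0.113.0/24")]
MIN_SECONDS_BETWEEN_PAGES = 1.0  # assumed threshold for "unrealistically short"


def parse(line):
    """Return a dict for one log line, or None if it does not match."""
    m = LOG_RE.match(line)
    if not m:
        return None
    hit = m.groupdict()
    hit["time"] = datetime.strptime(hit["time"], "%d/%b/%Y:%H:%M:%S %z")
    return hit


def in_known_net(ip):
    """True if ip parses as an address inside one of KNOWN_BOT_NETS."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:  # the log may contain a hostname instead of an address
        return False
    return any(addr in net for net in KNOWN_BOT_NETS)


def bot_ips(hits):
    """Return the set of client IPs whose requests look automated."""
    by_ip = defaultdict(list)
    for h in hits:
        by_ip[h["ip"]].append(h)

    flagged = set()
    for ip, visits in by_ip.items():
        visits.sort(key=lambda h: h["time"])
        pages = [h for h in visits if not ASSET_RE.search(h["path"])]
        gaps = [(b["time"] - a["time"]).total_seconds()
                for a, b in zip(pages, pages[1:])]

        odd_ua = any(h["ua"] in ("", "-") or BOT_UA_RE.search(h["ua"])
                     for h in visits)
        asked_robots = any(h["path"] == "/robots.txt" for h in visits)
        no_assets = len(pages) == len(visits)
        too_fast = bool(gaps) and min(gaps) < MIN_SECONDS_BETWEEN_PAGES

        # Strong signals flag on their own; the two weaker ones must coincide,
        # since a human with a warm cache may also fetch no assets.
        if odd_ua or asked_robots or in_known_net(ip) or (no_assets and too_fast):
            flagged.add(ip)
    return flagged
```

Usage would be something like `hits = [h for h in map(parse, open("access.log")) if h]`, then counting only hits whose `ip` is not in `bot_ips(hits)`. Grouping by IP alone is crude (NAT, shared proxies), so treat the result as an estimate.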


GA also underreports because people block it. That might not matter for most sites, but if your audience is tech-related, don't expect the effect to be negligible.



