Ask HN: Training a model on all HN data?

minimaxir · 2025-03-20T19:12:45 1742497965

It's relatively straightforward to download all HN submissions and comments via BigQuery and then fine-tune an LLM on them; there's just not much point to it.

You can safely assume all modern LLMs have been trained in part on HN data.
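For anyone who wants to try the download step anyway, here is a rough sketch of pulling comments from BigQuery into a JSONL file that a fine-tuning script could read. The table name `bigquery-public-data.hacker_news.full` and its column names are assumptions about the public dataset and should be checked before running. The helper names (`build_query`, `to_jsonl_line`, `export_comments`) are made up for illustration.

```python
import json

# Assumed public dataset/table name; verify it before running.
HN_TABLE = "bigquery-public-data.hacker_news.full"


def build_query(limit=1000):
    """Return SQL selecting comments that have text and are not deleted or dead.

    Column names (by, text, parent, deleted, dead, type) are assumptions
    about the public table's schema.
    """
    return (
        "SELECT id, `by` AS author, text, parent, timestamp "
        f"FROM `{HN_TABLE}` "
        "WHERE type = 'comment' AND text IS NOT NULL "
        "AND deleted IS NOT TRUE AND dead IS NOT TRUE "
        f"LIMIT {int(limit)}"
    )


def to_jsonl_line(row):
    """Turn one row (any mapping with a 'text' key) into a JSONL record."""
    return json.dumps({"text": row["text"]}, ensure_ascii=False)


def export_comments(path, limit=1000):
    """Run the query and write one JSON object per line.

    Needs the google-cloud-bigquery package and GCP credentials.
    """
    # Imported lazily so the helpers above work without the package.
    from google.cloud import bigquery

    client = bigquery.Client()
    with open(path, "w", encoding="utf-8") as f:
        for row in client.query(build_query(limit)).result():
            f.write(to_jsonl_line(row) + "\n")
```

Calling `export_comments("hn_comments.jsonl", limit=1000)` would write a small sample; raising the limit scans and downloads more, which counts against your BigQuery quota.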

anigbrowl · 2025-03-20T19:18:08 1742498288

HN was part of the training set for ChatGPT. But it might be interesting to train or fine-tune on HN alone. You could weight by karma, or conversely, you might identify shortcomings in the karma system.
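One way the weighting idea could look, as a sketch only: sample training examples in proportion to a dampened score. This assumes each item carries a numeric `score` field, which is a hypothetical name here, and whether a usable score exists for a given item type is a separate question. The log dampening is an illustrative choice, not an established recipe.

```python
import math
import random


def sample_weight(score):
    """Dampened weight: log1p keeps a few huge scores from dominating.

    Negative or zero scores fall back to the baseline weight of 1.0.
    """
    return math.log1p(max(score, 0)) + 1.0


def weighted_sample(items, k, seed=0):
    """Draw k items with replacement, proportional to sample_weight(score).

    `items` is a list of dicts with a hypothetical 'score' key.
    """
    rng = random.Random(seed)
    weights = [sample_weight(it["score"]) for it in items]
    return rng.choices(items, weights=weights, k=k)
```

Comparing what a model trained this way produces against one trained on uniform samples would be one rough way to probe where the scoring system helps or hurts.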

minimaxir · 2025-03-20T19:22:45 1742498565

Comment vote data is not public, and that is the data you would need to make such a system useful.

pavel_lishin · 2025-03-20T19:59:20 1742500760

To what end?