Does the inferred "topic" of the domain match the topics of its individual pages? If not -> flag for manual review. And there are many more indicators like this.
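A minimal sketch of that coherence check, assuming you have some per-page topic classifier. Here infer_topic() is a hypothetical stand-in (swap in fastText, a keyword model, whatever you actually run), and the 0.6 agreement threshold is made up:

```python
from collections import Counter

def infer_topic(text: str) -> str:
    """Placeholder classifier -- replace with a real one."""
    keywords = {"gpu": "hardware", "recipe": "cooking", "loan": "finance"}
    for kw, topic in keywords.items():
        if kw in text.lower():
            return topic
    return "unknown"

def needs_manual_review(pages: list[str], min_agreement: float = 0.6) -> bool:
    """Flag a domain when too few of its pages share the majority topic."""
    if not pages:
        return False
    topics = Counter(infer_topic(p) for p in pages)
    _, top_count = topics.most_common(1)[0]
    return top_count / len(pages) < min_agreement
```

The point is cheapness: you only pay for the expensive human look when the distribution of page topics disagrees with itself.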
Hire a bunch of student workers, have them search GitHub for tarpit implementations, and let them write middleware to detect those.
If you are doing broad crawling, you already need to do this kind of thing anyway.
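For the middleware part, a rough sketch as a Scrapy downloader middleware. The regex patterns are illustrative guesses at common tarpit shapes (endless calendar pagination, a path segment that repeats forever), not a vetted blocklist, and the class name is mine:

```python
import re
from scrapy.exceptions import IgnoreRequest

# Hypothetical signatures a contractor might catalogue from GitHub tarpits.
TARPIT_PATTERNS = [
    re.compile(r"/calendar/\d{4}/\d{2}"),   # endless calendar pagination
    re.compile(r"(/[^/]+)\1{3,}"),          # same path segment repeated 4+ times
    re.compile(r"[?&]page=\d{4,}"),         # absurdly deep pagination
]

class TarpitFilterMiddleware:
    """Drop requests whose URLs match known tarpit signatures."""

    def process_request(self, request, spider):
        if any(p.search(request.url) for p in TARPIT_PATTERNS):
            spider.logger.debug("Dropping suspected tarpit URL: %s", request.url)
            raise IgnoreRequest(f"tarpit pattern matched: {request.url}")
        return None  # let normal requests through
```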
Do people still do this, or do they just offshore the task?