Not a torrent or a full solution, but applying the regex /wiki/(?!\w+:)[a-zA-Z0-9%()_]+ to the page source should match all the article links (the negative lookahead skips namespace pages like File: and Category:, though some generic Wikipedia links at the bottom of the page still get matched). Batch-prepending "https://en.wikipedia.org" to each line then gives full URLs.
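For example, a sketch assuming GNU grep built with PCRE support (-P, needed for the lookahead) and that the page source is saved as page.html:

  # extract article paths, dedupe, and prepend the domain
  grep -oP '/wiki/(?!\w+:)[a-zA-Z0-9%()_]+' page.html \
    | sort -u \
    | sed 's|^|https://en.wikipedia.org|' > wget.txt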

Here's one such list: https://hastebin.com/terugezeda

wget has an option (-i) to download URLs line by line from a text file, but it sadly makes a mess of the images. I'm using:

  wget --span-hosts --convert-links --adjust-extension --page-requisites --no-host-directories --no-parent --wait=1 --reject="robots.txt" -i wget.txt 
or, for short:

  wget -H -k -E -p -nH -np -w 1 -R "robots.txt" -i wget.txt

Maybe someone has a better idea for the last step.

edit: shorthand version



I'd recommend inliner for the last step: https://www.npmjs.com/package/inliner
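For example, looping over the URL list from above (a sketch assuming inliner's CLI writes the inlined page to stdout, and naming each output file after the article slug):

  # inline each page's assets into a single self-contained HTML file
  while read -r url; do
    inliner "$url" > "$(basename "$url").html"
  done < wget.txt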



