I’m not sure I would call this a failure... more just something you tried out of curiosity and abandoned. Happens to literally everyone. “Failed” to me would imply there was something fundamentally broken about the approach or the dataset, or that not realizing the result had some actual negative impact. It’s very hard to finish long-running side projects that aren’t generating income or attention, or driven by some quasi-pathological obsession. The fact you even blogged about it and made the HN front page qualifies as a success in my book.
> If I would have finished the project, this dataset would then have been released and used for a number of analyses using Python.
Nothing stopping you from releasing the raw dataset and calling it a success!
> Back then, I would have trained a specialised model (or used a pretrained specialised model) but since LLMs made so much progress during the runtime of this project from 2020-Q1 to 2024-Q4, I would now rather consider a foundational model wrapped as an AI agent instead; for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.
I actually just started (and subsequently --abandoned-- paused) my own news analysis side project leveraging LLMs for consolidation/aggregation... and yeah, the web scraping part is still the worst. And I’ve had the same thought that feeding raw HTML to the LLM might be an easier way of parsing web objects now. The problem is most sites are wise to scraping efforts, and it’s not so much a matter of finding the right element as bypassing the weird click-through screens, convincing the site that you’re on a real browser, etc.
Personally, I think it's helpful to feel disappointment and insufficiency when those emotions pop up. They are the voices of certain preferences, needs, and/or desires that work to enrich our lives. Recontextualizing the world into some kind of positive success story can often gaslight those emotions out of existence, which can, paradoxically, be self-sabotaging.
The piece reads to me like a direct and honest confrontation with failure. It means the author thinks they can do better and is working to identify unhelpful subconscious patterns and overcome them.
Personally, I found the author's laser focus on "data science projects" intriguing. I have a tendency to immediately go meta which biases towards eliding detail; however, even if overly narrow, the author's focus does end up precipitating out concrete, actionable hypotheses for improvement.
> Nothing stopping you from releasing the raw dataset and calling it a success!
Right. OP: release it as a Kaggle Dataset (https://www.kaggle.com/datasets) and invite people to collaboratively figure out how to automate the analyses. (Do you just want to get sentiment on a specific topic (e.g. vaccination, German energy supplies, German govt approval)? Or quantitative predictions?) Start with something easy.
> for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.
Huh? To find the specific date's news item corresponding to a given topic? Why not just predict the date range, e.g. "Apr-Aug 2022"?
> and yeah, the web scraping part is still the worst.
Sounds wrong. OP, fix your scraping. (unless it was anti-AI heuristics that kept breaking it, which I doubt since it's Tagesschau). But Tagesschau has RSS feeds, so why are you blocked on scraping? https://www.tagesschau.de/infoservices/rssfeeds
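For illustration, pulling links from a feed can be a handful of lines; this is just a sketch assuming the feedparser package, and the exact feed path below is a guess (the real list is at the infoservices link above):

    # Rough sketch: collect article links from a Tagesschau RSS feed
    # instead of hunting for liveblog URLs on the homepage by hand.
    import feedparser

    # Hypothetical feed path; Tagesschau publishes its actual feed list at
    # https://www.tagesschau.de/infoservices/rssfeeds
    FEED_URL = "https://www.tagesschau.de/index~rss2.xml"

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # Each entry carries a title, link, and publication date worth storing.
        print(entry.get("published", "?"), entry.get("title", ""), entry.get("link", ""))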
I'll put a shoutout for https://zenodo.org/ and https://figshare.com/ as places to put your data, where you'll get a DOI and can let someone that's not a company look after hosting and backing it up. Zenodo is hosted as long as CERN is around (is the promise) and figshare is backed by the CLOCKSS archive (multiple geographically distributed universities).
Google acquired Kaggle in 2017, and Appen acquired Figure Eight (formerly CrowdFlower) in 2019; both used to be open-source-friendly places to post datasets for useful comments/analyses/crowdsourced hacking, generally without heavy and restrictive license terms. (There is also still the UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/). Kaggle still may be, just beware of the following:
Kaggle at some point began silently disappearing some (commercial) datasets from useful old competitions (such as dunnhumby's Shopping Challenge 2011 [0], even though it was anonymized and only had three features). So you can't rely on the more commercial datasets being around to cite and for replicability.
Also, according to [1] "you can be banned on Kaggle without any warnings or reasons, all your kernels and datasets will became inaccessible even for downloading for yourself and support will not answer you for weeks (if ever)".
Usually, from what I've heard, bans are for (AI-based) suspicion of cheating (or using multiple accounts to bypass submission limits, or collusion between teams on submissions), or, post-2018, gaming and account-warming/transfers to boost rankings. But the AI can produce false positives, and it's reportedly nearly impossible to reach live human support.
Kaggle added DOIs in 2019 [2], at least for academic datasets, though not by default.
"The data collection process involved a daily ritual of manually visiting the Tagesschau website to capture links"
I don't know what to say... I'm amazed they kept this up so long, but this really should never have been the game plan.
I also had some data science hobby projects around covid; I got busy and lost interest after 6 months. But the scrapers keep running in the cloud in case I get motivated again (anyone need structured data on eBay laptop listings since 2020?). That's the beauty of automation for these sorts of things.
I'm not the person you're asking, but I maintain a number of scraping projects. The bills are negligible for almost everything. A single $3/mo VPS can easily handle on the order of 1M requests per day (enough for all the small projects put together), and most of these projects only accumulate O(10GB)/yr.
Doing something like grabbing hourly updates of the inventory of every item in every Target store is a bit more involved, and you'll rapidly accumulate proxy/IP/storage/... costs, but 99% of these projects have more valuable data at a lesser scale, and it's absolutely worth continuing them on average.
Inbound data is typically free on cloud VMs. CPU/RAM usage is also small unless you use chromedriver and scrape using an entire browser with graphics rendered on CPU. We're talking $5/mo for most scraping projects.
Best practice (for many reasons) is to separate scraping (and OCR) from parsing: store the raw text or raw HTML/JS, and separately the parsed intermediate result (cleaned scraped text or HTML, with all the useless parts/tags removed). That intermediate is then the input to the rest of the pipeline. You really want to separate those stages, both to minimize costs and to prevent breakage when the site format changes, anti-scraping heuristics change, etc. And not exposing garbage tags to the AI saves you time/money.
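As a rough illustration of that separation (just a sketch; the directory layout and the tags stripped are assumptions, not anyone's actual pipeline):

    # Two-stage pipeline: stage 1 only fetches and archives raw HTML,
    # stage 2 parses the archived files into a cleaned intermediate form.
    import datetime
    import pathlib
    import requests
    from bs4 import BeautifulSoup

    RAW_DIR = pathlib.Path("raw_html")        # hypothetical layout
    CLEAN_DIR = pathlib.Path("cleaned_text")

    def fetch_and_archive(url: str) -> pathlib.Path:
        """Stage 1: store the response body untouched, keyed by fetch time."""
        RAW_DIR.mkdir(exist_ok=True)
        stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        path = RAW_DIR / f"{stamp}.html"
        path.write_text(requests.get(url, timeout=30).text, encoding="utf-8")
        return path

    def parse_archived(path: pathlib.Path) -> pathlib.Path:
        """Stage 2: reduce raw HTML to text; rerunnable whenever the parser changes."""
        CLEAN_DIR.mkdir(exist_ok=True)
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        for tag in soup(["script", "style"]):  # the garbage no model needs to see
            tag.decompose()
        out = CLEAN_DIR / (path.stem + ".txt")
        out.write_text(soup.get_text(separator="\n", strip=True), encoding="utf-8")
        return out

Stage 2 never touches the network, so you can rerun it as often as the parsing logic changes.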
The author’s right about storytelling from day one, but then immediately throws cold water on the idea by saying it would have been a bad fit for this project.
This feels mistaken, because the big value of seeking feedback and results early and often on a project is that it forces you to confront whether you’re going to want, or be able, to tell stories in the space at all. It also gives you a chance to rekindle waning interest, get feedback on your project from others, and avoid ratholing into something for about 5 years without ever engaging with an audience.
If a project can’t emotionally bear day one scrutiny, it’s unlikely to fare better five years later when you’ve got a lot of emotions about incompleteness and the feeling your work isn’t relevant anymore tied up in the project.
Thinking Fast and Slow is a result of some 20 years of regularly publishing and talking about those ideas with others.
Most really memorable works fit that same mold if you look carefully. An author spends years, even decades, doing small scale things before one day they put it all together into a big thing.
Comedy specials are the same. Develop material in small scale live with an audience, then create the big thing out of individual pieces that survive the process.
Hamming also talks about this, as open-door vs. closed-door researchers, in his famous "You and Your Research" essay.
The title seems misleading. Unless I'm missing something, all he did was scrape a news feed, which should only require a couple days of work to set up.
The fact that he left it running for years without finding the time to do anything with the data isn't that interesting.
What I love about projects like this is they are dynamic enough to cover a number of interests all in one.
I personally have some side projects that started as X, transitioned into Y and Z, and then I stole some ideas and built A, which turned into B. Then a requirement in my professional job called for the Z solution mixed with the B solution, which resulted in something else that re-ignited my interest in X and helped me rebuild with a clearer sense of what I intended in the first place.
All that to say, these things are dynamic and a long list of "failed" projects is a historical narrative of learning and interests over time. I love to see it.
How many people have spent 4+ years on a thesis and then just completely gave up: tired, drained, no interest in continuing. The bright-eyed, bushy-tailed wonder, all gone.
Nice article OP. I and a great many others suffer from the same struggles of bringing personal projects to “completion”, and I’ve gotta respect the resilience in the length of time you hung in there.
However, not to be overly pedantic, but I always felt “data science” was an exploratory exercise to discover insights into a given data set. I always personally filed the efforts to create the pipeline and associated automation (i.e. identify, capture, and store a given data set - more commonly referred to as “ETL”) as a “data engineering” task, which these days is considered a different specialty.
Perhaps if you scope your problem a little smaller, you may yet be able to capture something demonstrably valuable to others (and something you might consider “finished”). You’d be surprised how simple something can be and still provide real value to others, as long as it addresses a real issue.
> The data collection process involved a daily ritual of manually visiting the Tagesschau website to capture links to both the COVID and later Ukraine war newstickers. While this manual approach constituted the bulk of the project’s effort, it was necessitated by Tagesschau’s unstructured URL schema, which made automated link collection impractical.
> The emphasis on preserving raw HTML proved vital when Tagesschau repeatedly altered their newsticker DOM structure throughout Q2 2020.
Another big takeaway is that it's not sustainable to rely on this type of data source. Your data source should be stable. If the site offers APIs, that's almost always better than parsing HTML.
Website developers do not consider scrapers when they make changes. Why would they? So if you are ever trying to collect some unique dataset, it doesn't hurt to reach out to the web devs to see if they can provide a public API.
Please consider it an early Christmas present to yourself if you can pay a nominal amount for an API instead of spending your time scraping, unless you actually enjoy the scraping.
Why not open source it? I've been slaving away at some possibly pointless data-scraping sites that collect app data and the SDKs that apps use. I figure if I at least open source it, the data and code are there for others to use.
I see some recommendations about running a small version of the analysis first to see if it's going to work at all. I agree, and the next level up is to also estimate the value of performing the full analysis. I.e. not just whether or not it will work at all, but how much it is allowed to cost and still be useful.
You may find, for example, that each unit of uncertainty reduced costs more than the value of the corresponding uncertainty reduction. This is the point at which one needs to either find a new approach, or be content with the level of uncertainty one has.
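A back-of-envelope version of that check might look like this (every number below is invented purely for illustration):

    # Rough expected-value check: is finishing the full analysis worth it?
    hours_to_finish = 120       # estimated remaining effort
    value_per_hour = 50         # what an hour of your time is worth to you
    cost = hours_to_finish * value_per_hour

    value_of_answer = 2_000     # what the reduced uncertainty is worth to someone
    prob_it_works = 0.5         # chance the analysis yields a usable answer
    expected_value = value_of_answer * prob_it_works

    print(f"cost {cost}, expected value {expected_value}")
    print("worth finishing" if expected_value > cost else "find a cheaper approach or accept the uncertainty")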
I think whether you 'succeed' or 'fail' on a side project they are still valuable. No matter if you can't finish it or it turns out different to how you imagined -- you get to come away as a better version of yourself. A person who is more optimized for a new strategy. And sometimes 'failure' is a worthwhile price for that ability. Who knows, it might be exactly what prepares you for something even bigger in the future.
I guess the kind of extreme effort that doesn't usually have a promising conclusion is more common in scientific research, or experimentation in general, but sometimes you just have to get accustomed to it.
Eventually it doesn't really make much difference why there was no breathtaking milestone: it turned out to be impossible by nature, the runway ran out, or interest faded after a more or less valiant attempt.
What can be gained is the strength to overcome the near-impossible next time. The next attempt only has to be a certain degree less impossible, and you'll know, like few others, whether it can be taken over the goal line, because you've been there.
You also don't have to worry as much about whether you'll lose interest, which is a lot less stress and pressure when you think about it.
This can more realistically enable you to succeed in areas where peers, without as big an inconclusive project behind them, may find it impossible or may not do as well.
> Store raw data if possible. This allows you to condense it later.
I have some daily scripts reading from an HTTP endpoint, and I can't really decide what to do when it returns HTML instead of JSON. Should I store the HTML, since it is "raw data", or should I just dismiss it? The API in question has a tendency to return 200 with a webpage saying that the API can't be reached (typically because of a timeout).
I wouldn't store that usually, I'd use that to trigger retries.
For you, storing the raw data means storing the JSON that the HTTP endpoint returns, rather than something like

    import requests

    content = requests.get(url).json()            # fetch and parse in one go...
    info_i_care_about = content['data']['title']  # ...extract a single field...
    store(info_i_care_about)                      # ...and keep only that field

as otherwise you'll get stuck when the JSON response moves the title to data.metadata.title or whatever.
It's usually less of an issue with structured data (things like HTML change more often), but keeping that raw data means you can process it in various different ways later.
You also decouple errors so your parsing error doesn't stop your write from happening.
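For the 200-with-an-HTML-error-page case above, a minimal sketch of "retry, then store the raw body" could look like this (the retry counts and file layout are made up for illustration):

    # Store the raw JSON body; treat an HTML "success" page as a failed attempt.
    import datetime
    import json
    import pathlib
    import time
    import requests

    RAW_DIR = pathlib.Path("raw_json")  # hypothetical location for raw responses

    def fetch_raw(url: str, attempts: int = 3, backoff_s: float = 30.0) -> str | None:
        for attempt in range(attempts):
            body = requests.get(url, timeout=30).text
            try:
                json.loads(body)          # only checking that it is valid JSON
            except json.JSONDecodeError:  # the 200-with-HTML error page case
                time.sleep(backoff_s * (attempt + 1))
                continue
            RAW_DIR.mkdir(exist_ok=True)
            stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
            (RAW_DIR / f"{stamp}.json").write_text(body, encoding="utf-8")
            return body
        return None  # every attempt returned something that wasn't JSON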
I know the feeling. I managed 9 months scraping supermarket data before I gave up mostly because a few other people were doing it and I was short on time.
I for one don't want to start counting everything I lose interest in as a "failure"; that would be too depressing. I actually think this is a feature, not a flaw. You have very few attention tokens and should be aggressive in getting them back.
I think this is very different from the "finishing" decision. That should focus on scope and iterations, while attempting to account for effort vs. reward and avoiding things like sunk cost influences.
Combine both and you've got "pragmatic grit": the ability to get valuable shit done.
I still don't understand what he even tried to do. So he manually collected news articles for a few years without any plan for what to do with them; so what? Where is the project? Honestly, he could probably just have asked the Tagesschau people and they would have given him their archive.
The learning from this seems to be: collecting data and never doing anything with it is not a worthy project
1. The title makes it sound like the author spent a lot of time on this project. But really, this mostly consisted of noting down a couple of URLs per day. So maybe 5 min / day = ~130h spent on the project. Let's say 200h to be on the safe side.
2. "Get first analyses results out quickly based on a small dataset and don’t just collect data up front to “analyse it later”" => I think this actually killed the project. Collecting data for several years w/o actually doing anything doesn't with it is not a sound project.
3. "If I would have finished the project, this dataset would then have been released" ==> There is literally nothing stopping OP from still doing this. It costs maybe 2h of work and would potentially give a substantial benefit to others, i.e., turn this project into a win after all. I'm very puzzled why OP didn't do this.
My wife only did 50 hours of duolingo in total the past 2 years. Combine that with me teasing her in Dutch and she’s actually making progress.
Duolingo is a chill tool to learn some vocab. That vocab then gets acquired by talking to me. We talk 2 minutes Dutch per day at most. So about 11 hours in total per year.
She is 67% done with duolingo. So we bought the first real book to learn Dutch (De Opmaat).
That book is IMO not for pure beginners. But for the level my wife was at, it seems perfect.
Human speech is around 150-200 words per minute; even going slow, 2 minutes a day of real talk is probably more vocab than 10 minutes of Duo. And with better feedback, a human rather than a cartoon casino.
I finished the whole tree in French and had nothing to show for it either. It really is a fun way to feel like you're learning, without connecting you to the language or culture in any significant way.
For me, short of a native speaker you can interact with, nothing beats in-person classes. Being forced to actually speak the language in “mock settings” makes all the difference.
And even if you don’t get your grammar completely right, you will learn enough to survive in a real-life setting.
I learned Spanish through a combination of both - I took Spanish classes after I started dating my Mexican wife, enough to get conversational. Then I started interacting in Spanish with her family, which helps me now maintain the language without needing the classes.
I feel this while learning (or trying to learn) German: when I think "how would I say this in German?" I draw nothing but a blank. But I'm a good "speaker" otherwise, and sadly, I feel I'm not getting anywhere either...
Watch Dark on Netflix in original German on repeat, great way to subconsciously make note of tones and pronunciation while also watching an awesome show. Be very intentional about it though.
Surround yourself in the language. In Germany we have almost everything dubbed, so you can watch pretty much any popular movie or TV series in German or read any popular book in German. Besides that there are also quite a lot of German productions.
Wow, I forgot about that! When I was using it for French many years ago, I imagined they were using it as a way to generate free translations, but I still found it enjoyable and useful.
Point number 2 is super important for non-hobby projects. Collect a bit of data, even if you have to do it manually at first, and do a "dry run" / first cut of whatever analysis you're thinking of doing, so you confirm you're actually collecting what you need and that what you're doing is even going to work. Seeing a pipeline get built, run for like two months, and then the data scientist come along and say "this isn't what we needed" was a complete goddamn shitshow. I'm just glad I was only a spectator to it.
They touch on something relevant here, and it's a great point to emphasise:
> The emphasis on preserving raw HTML proved vital when Tagesschau repeatedly altered their newsticker DOM structure throughout Q2 2020. This experience underscored a fundamental data engineering principle: raw data is king. While parsers can be rewritten, lost data is irretrievable.
I've done this before, keeping full, timestamped, versioned raw HTML. That still risks breaking when sites shift to JavaScript-rendered content, but keeping your collection and processing as distinct as you can, so you can rerun things later, is incredibly helpful.
Usually, processing raw data is cheap. Recovering raw data is expensive or impossible.
As a bonus, collecting raw data is usually easier than collecting and processing it, so you might as well start there. Maybe you'll find out you were missing something, but it's no worse than if you'd tied things together.
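To make the "rerun things later" part concrete, here's a sketch of replaying an archive of snapshots through several parsers; the selectors and directory name are invented for illustration:

    # Replay archived snapshots through whichever parser understands them,
    # so a DOM change means adding a parser, not losing data.
    import pathlib
    from bs4 import BeautifulSoup

    def parse_old_layout(soup: BeautifulSoup) -> list[str] | None:
        items = soup.select("div.liveblog-entry")   # hypothetical pre-change markup
        return [i.get_text(strip=True) for i in items] or None

    def parse_new_layout(soup: BeautifulSoup) -> list[str] | None:
        items = soup.select("article.ticker-item")  # hypothetical post-change markup
        return [i.get_text(strip=True) for i in items] or None

    PARSERS = [parse_new_layout, parse_old_layout]  # try the newest layout first

    def reprocess_archive(raw_dir: str = "raw_html") -> dict[str, list[str]]:
        results = {}
        for path in sorted(pathlib.Path(raw_dir).glob("*.html")):
            soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
            for parser in PARSERS:
                entries = parser(soup)
                if entries:
                    results[path.name] = entries
                    break
        return results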
edit
> Huh? To find the specific date's news item corresponding to a given topic? Why not just predict the date range, e.g. "Apr-Aug 2022"?
They say they had to manually find the links to the right liveblog subpage. So they had to go to the main page, find the link and then store it.
While I understand the points, I think it's worth being kinder to someone who has come out to write about how they failed at a project.
> 1. The title makes it sound like the author spent a lot of time on this project. But really, this mostly consisted of noting down a couple of URLs per day. So maybe 5 min / day = ~130h spent on the project. Let's say 200h to be on the safe side.
Consistent work over multiple years shouldn't be looked down on like this. If you've done something every day for years, it's still a lot of time in your life. We're not Econs, so I don't think summing up the hours really captures it either.
> 3. "If I would have finished the project, this dataset would then have been released" ==> There is literally nothing stopping OP from still doing this. It costs maybe 2h of work and would potentially give a substantial benefit to others, i.e., turn this project into a win after all. I'm very puzzled why OP didn't do this.
They might not realise how to do this sustainably, or they might just be mentally done with it. It may simply be harder for them to think about.
I'd recommend also that they release the data. If they put it on either Zenodo or Figshare it'll be hosted for free and referenceable by others.
> 2. "Get first analyses results out quickly based on a small dataset and don’t just collect data up front to “analyse it later”" => I think this actually killed the project.
I agree, but again on the kinder side (because I think they also agree), there are multiple reasons for doing this, and focusing on the why might be more productive.
1. It gets you to actually process the data in some useful form. So many times I've seen things fail late on because people didn't realise something like how dates are formatted, or whether some field was often missing, or that they just didn't capture something that turns out to be pretty key (e.g. you scrape times, then realise that at some point the site changed them to "two weeks ago" and you never noticed).
This can be as simple as just plotting some data, counting uniques, anything (see the sketch after this list). The automated system will fall over when things go wrong, and then you can check it.
2. What do people care about? What do you care about? Sometimes I've had a great idea for an analysis, only to realise later that maybe I'm the only one who cares, or worse, that the result is so obvious it's not even interesting to me.
3. Keeping interest. Keeping interest in a multi-year project that's giving you something back can be easier than something that's just taking.
4. Guilt. If I spend a long time on something, I feel it should be better. So I want to make it more polished, which takes time, which I don't have. So I don't add to it, then I'm not adding anything, then nothing happens. It shouldn't matter, but I've long realised that just wishing my mind worked differently isn't a good plan and that I should instead plan for reality. For that, doing something fast feels much better: I am happier releasing something that's taken me half a day and looks kinda-ok, because it doesn't have to justify years of effort.
5. Get it out before something changes. COVID had, or has, no known endpoint up front.
6. Ensure you've actually got a plan. Unless you've got a very good reason, you can probably build what you need to analyse things and release it earlier. You can't run an analysis on an upcoming election, but even then you could do it on a previous year and see things working. This can help with motivation, because at the end you don't face an "oh right, now I need to write and run loads of things" moment; you just need to hit go again.
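For point 1, a quick sanity pass over a small sample can be as little as this (the column names are invented; swap in whatever the scrape actually produces):

    # Quick sanity check on a small sample before committing to years of collection.
    import pandas as pd

    df = pd.read_csv("sample_scrape.csv")    # hypothetical small export

    print(df.shape)                          # how much did we actually get?
    print(df.isna().mean().sort_values())    # share of missing values per column
    print(df.nunique())                      # spot constant or exploding columns

    # Do the dates even parse? "two weeks ago"-style strings show up here as NaT.
    parsed = pd.to_datetime(df["published_at"], errors="coerce")
    print("unparseable dates:", parsed.isna().sum())
    print(parsed.dt.to_period("M").value_counts().sort_index())  # rough volume over time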