I’m not sure I would call this a failure... more just something you tried out of curiosity and abandoned. Happens to literally everyone. “Failed” to me would imply there was something fundamentally broken about the approach or the dataset, or that not realizing the result had some actual negative impact. It’s very hard to finish long-running side projects that aren’t generating income or attention, or driven by some quasi-pathological obsession. The fact you even blogged about it and made the HN front page qualifies as a success in my book.
> If I would have finished the project, this dataset would then have been released and used for a number of analyses using Python.
Nothing stopping you from releasing the raw dataset and calling it a success!
> Back then, I would have trained a specialised model (or used a pretrained specialised model) but since LLMs made so much progress during the runtime of this project from 2020-Q1 to 2024-Q4, I would now rather consider a foundational model wrapped as an AI agent instead; for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.
I actually just started (and subsequently --abandoned-- paused) my own news analysis side project leveraging LLMs for consolidation/aggregation... and yeah, the web scraping part is still the worst. And I’ve had the same thought that feeding raw HTML to the LLM might be an easier way of parsing web objects now. The problem is most sites are wise to scraping efforts, and it’s not so much a matter of finding the right element as bypassing the weird click-through screens, convincing the site that you’re on a real browser, etc.
Personally, I think it's helpful to feel disappointment and insufficiency when those emotions pop up. They are the voices of certain preferences, needs, and/or desires that work to enrich our lives. Recontextualizing the world into some kind of positive success story can often gaslight those emotions out of existence, which can, paradoxically, be self-sabotaging.
The piece reads to me like a direct and honest confrontation with failure. It means the author thinks they can do better and is working to identify unhelpful subconscious patterns and overcome them.
Personally, I found the author's laser focus on "data science projects" intriguing. I have a tendency to immediately go meta which biases towards eliding detail; however, even if overly narrow, the author's focus does end up precipitating out concrete, actionable hypotheses for improvement.
> Nothing stopping you from releasing the raw dataset and calling it a success!
Right. OP: release it as a Kaggle Dataset (https://www.kaggle.com/datasets) and invite people to collaboratively figure out how to automate the analyses. (Do you just want to get sentiment on a specific topic (e.g. vaccination, German energy supplies, German govt approval)? Or quantitative predictions?) Start with something easy.
> for example, I would try to find a foundation model to do the job of for example finding the right link on the Tagesschau website, which was by far the most draining part of the whole project.
Huh? To find the specific date's news item corresponding to a given topic? Why not just predict the date range, e.g. "Apr-Aug 2022"?
> and yeah, the web scraping part is still the worst.
Sounds wrong. OP, fix your scraping. (unless it was anti-AI heuristics that kept breaking it, which I doubt since it's Tagesschau). But Tagesschau has RSS feeds, so why are you blocked on scraping? https://www.tagesschau.de/infoservices/rssfeeds
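For illustration, pulling links from a feed can be a handful of lines; this is just a sketch assuming the feedparser package, and the exact feed path below is a guess (the real list is at the infoservices link above):

    # Rough sketch: collect article links from a Tagesschau RSS feed
    # instead of hunting for liveblog URLs on the homepage by hand.
    import feedparser

    # Hypothetical feed path; Tagesschau publishes its actual feed list at
    # https://www.tagesschau.de/infoservices/rssfeeds
    FEED_URL = "https://www.tagesschau.de/index~rss2.xml"

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # Each entry carries a title, link, and publication date worth storing.
        print(entry.get("published", "?"), entry.get("title", ""), entry.get("link", ""))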
I'll put a shoutout for https://zenodo.org/ and https://figshare.com/ as places to put your data, where you'll get a DOI and can let someone that's not a company look after hosting and backing it up. Zenodo is hosted as long as CERN is around (is the promise) and figshare is backed by the CLOCKSS archive (multiple geographically distributed universities).
Google acquired Kaggle in 2017, and Appen acquired Figure Eight (formerly CrowdFlower) in 2019; both used to be open-source-friendly places to post datasets for useful comments/analyses/crowdsourced hacking, generally without heavy and restrictive license terms. (There is also still the UC Irvine Machine Learning Repository, https://archive.ics.uci.edu/). Kaggle still may be, just beware of the following:
Kaggle at some point began silently disappearing some (commercial) datasets from useful old competitions (such as dunnhumby's Shopping Challenge 2011 [0], even though it was anonymized and only had three features). So you can't rely on the more commercial datasets being around to cite and for replicability.
Also, according to [1] "you can be banned on Kaggle without any warnings or reasons, all your kernels and datasets will became inaccessible even for downloading for yourself and support will not answer you for weeks (if ever)".
Usually, from what I've heard, bans are for (AI-based) suspicion of cheating (or using multiple accounts to bypass submission limits, or collusion between teams on submissions), or, post-2018, gaming and account-warming/transfers to boost rankings. But the AI can produce false positives, and it's reportedly nearly impossible to reach live human support.
Kaggle added DOIs in 2019 [2], at least for academic datasets, though not by default.
"The data collection process involved a daily ritual of manually visiting the Tagesschau website to capture links"
I don't know what to say... I'm amazed they kept this up so long, but this really should never have been the game plan.
I also had some data science hobby projects around covid; I got busy and lost interest after 6 months. But the scrapers keep running in the cloud in case I get motivated again (anyone need structured data on eBay laptop listings since 2020?). That's the beauty of automation for these sorts of things.
I'm not the person you're asking, but I maintain a number of scraping projects. The bills are negligible for almost everything. A single $3/mo VPS can easily handle on the order of 1M requests per day (enough for all the small projects put together), and most of these projects only accumulate O(10GB)/yr.
Doing something like grabbing hourly updates of the inventory of every item in every Target store is a bit more involved, and you'll rapidly accumulate proxy/IP/storage/... costs, but 99% of these projects have more valuable data at a lesser scale, and it's absolutely worth continuing them on average.
Inbound data is typically free on cloud VMs. CPU/RAM usage is also small unless you use chromedriver and scrape using an entire browser with graphics rendered on CPU. We're talking $5/mo for most scraping projects.
Best practice (for many reasons) is to separate scraping (and OCR) from parsing: store the raw text or raw HTML/JS, and separately the parsed intermediate result (cleaned scraped text or HTML, with all the useless parts/tags removed). That intermediate is then the input to the rest of the pipeline. You really want to separate those stages, both to minimize costs and to prevent breakage when the site format changes, anti-scraping heuristics change, etc. And not exposing garbage tags to the AI saves you time/money.
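As a rough illustration of that separation (just a sketch; the directory layout and the tags stripped are assumptions, not anyone's actual pipeline):

    # Two-stage pipeline: stage 1 only fetches and archives raw HTML,
    # stage 2 parses the archived files into a cleaned intermediate form.
    import datetime
    import pathlib
    import requests
    from bs4 import BeautifulSoup

    RAW_DIR = pathlib.Path("raw_html")        # hypothetical layout
    CLEAN_DIR = pathlib.Path("cleaned_text")

    def fetch_and_archive(url: str) -> pathlib.Path:
        """Stage 1: store the response body untouched, keyed by fetch time."""
        RAW_DIR.mkdir(exist_ok=True)
        stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        path = RAW_DIR / f"{stamp}.html"
        path.write_text(requests.get(url, timeout=30).text, encoding="utf-8")
        return path

    def parse_archived(path: pathlib.Path) -> pathlib.Path:
        """Stage 2: reduce raw HTML to text; rerunnable whenever the parser changes."""
        CLEAN_DIR.mkdir(exist_ok=True)
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        for tag in soup(["script", "style"]):  # the garbage no model needs to see
            tag.decompose()
        out = CLEAN_DIR / (path.stem + ".txt")
        out.write_text(soup.get_text(separator="\n", strip=True), encoding="utf-8")
        return out

Stage 2 never touches the network, so you can rerun it as often as the parsing logic changes.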
The author’s right about storytelling from day one, but then immediately throws cold water on the idea by saying it would have been a bad fit for this project.
This feels mistaken, because the big value of seeking feedback and results early and often on a project is that it forces you to confront whether you’re going to want, or be able, to tell stories in the space at all. It also gives you a chance to rekindle waning interest, get feedback on your project from others, and avoid ratholing into something for about 5 years without ever engaging with an audience.
If a project can’t emotionally bear day one scrutiny, it’s unlikely to fare better five years later when you’ve got a lot of emotions about incompleteness and the feeling your work isn’t relevant anymore tied up in the project.
Thinking Fast and Slow is a result of some 20 years of regularly publishing and talking about those ideas with others.
Most really memorable works fit that same mold if you look carefully. An author spends years, even decades, doing small scale things before one day they put it all together into a big thing.
Comedy specials are the same. Develop material in small scale live with an audience, then create the big thing out of individual pieces that survive the process.
Hamming also talks about this, as open-door vs. closed-door researchers, in his famous "You and Your Research" essay.
The title seems misleading. Unless I'm missing something, all he did was scrape a news feed, which should only require a couple days of work to set up.
The fact that he left it running for years without finding the time to do anything with the data isn't that interesting.
What I love about projects like this is they are dynamic enough to cover a number of interests all in one.
I personally have some side projects that started as X, transitioned into Y and Z, and then I stole some ideas and built A, which turned into B. Then a requirement in my professional job called for the Z solution mixed with the B solution, which resulted in something else that re-ignited my interest in X and helped me rebuild with a clearer sense of what I intended in the first place.
All that to say, these things are dynamic and a long list of "failed" projects is a historical narrative of learning and interests over time. I love to see it.
How many people have spent 4+ years on a thesis and then just completely gave up: tired, drained, no interest in continuing. The bright-eyed, bushy-tailed wonder, all gone.
Nice article OP. I and a great many others suffer from the same struggles of bringing personal projects to “completion”, and I’ve gotta respect the resilience in the length of time you hung in there.
However, not to be overly pedantic, but I always felt “data science” was an exploratory exercise to discover insights into a given data set. I always personally filed the efforts to create the pipeline and associated automation (i.e. identify, capture, and store a given data set - more commonly referred to as “ETL”) as a “data engineering” task, which these days is considered a different specialty.
Perhaps if you scope your problem a little smaller, you may yet be able to capture something demonstrably valuable to others (and something you might consider “finished”). You’d be surprised how simple something can be and still provide real value to others, as long as it addresses a real issue.
> The data collection process involved a daily ritual of manually visiting the Tagesschau website to capture links to both the COVID and later Ukraine war newstickers. While this manual approach constituted the bulk of the project’s effort, it was necessitated by Tagesschau’s unstructured URL schema, which made automated link collection impractical.
> The emphasis on preserving raw HTML proved vital when Tagesschau repeatedly altered their newsticker DOM structure throughout Q2 2020.
Another big takeaway is that it's not sustainable to rely on this type of data source. Your data source should be stable. If the site offers APIs, that's almost always better than parsing HTML.
Website developers do not consider scrapers when they make changes. Why would they? So if you are ever trying to collect some unique dataset, it doesn't hurt to reach out to the web devs to see if they can provide a public API.
Please consider it an early Christmas present to yourself if you can pay a nominal amount for an API instead of spending your time scraping, unless you actually enjoy the scraping.
Why not open source it? I've been slaving away at some possibly pointless data-scraping sites that collect app data and the SDKs that apps use. I figure if I at least open source it, the data and code are there for others to use.
I see some recommendations about running a small version of the analysis first to see if it's going to work at all. I agree, and the next level up is to also estimate the value of performing the full analysis. I.e. not just whether or not it will work at all, but how much it is allowed to cost and still be useful.
You may find, for example, that each unit of uncertainty reduced costs more than the value of the corresponding uncertainty reduction. This is the point at which one needs to either find a new approach, or be content with the level of uncertainty one has.
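A back-of-envelope version of that check might look like this (every number below is invented purely for illustration):

    # Rough expected-value check: is finishing the full analysis worth it?
    hours_to_finish = 120       # estimated remaining effort
    value_per_hour = 50         # what an hour of your time is worth to you
    cost = hours_to_finish * value_per_hour

    value_of_answer = 2_000     # what the reduced uncertainty is worth to someone
    prob_it_works = 0.5         # chance the analysis yields a usable answer
    expected_value = value_of_answer * prob_it_works

    print(f"cost {cost}, expected value {expected_value}")
    print("worth finishing" if expected_value > cost else "find a cheaper approach or accept the uncertainty")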
I think whether you 'succeed' or 'fail' on a side project they are still valuable. No matter if you can't finish it or it turns out different to how you imagined -- you get to come away as a better version of yourself. A person who is more optimized for a new strategy. And sometimes 'failure' is a worthwhile price for that ability. Who knows, it might be exactly what prepares you for something even bigger in the future.
I guess the kind of extreme effort that doesn't usually have a promising conclusion is more common in scientific research, or experimentation in general, but sometimes you just have to get accustomed to it.
Eventually it doesn't really make much difference why there was no breathtaking milestone: it turned out to be impossible by nature, the runway ran out, or interest faded after a more or less valiant attempt.
What can be gained is the strength to overcome the near-impossible next time. The next attempt only has to be a certain degree less impossible, and you'll know, like few others, whether it can be taken over the goal line, because you've been there.
You also don't have to worry as much about whether you'll lose interest, which is a lot less stress and pressure when you think about it.
This can more realistically enable you to succeed in areas where peers, without as big an inconclusive project behind them, may find it impossible or may not do as well.
> Store raw data if possible. This allows you to condense it later.
I have some daily scripts reading from an HTTP endpoint, and I can't really decide what to do when it returns HTML instead of JSON. Should I store the HTML, since it is "raw data", or should I just dismiss it? The API in question has a tendency to return 200 with a webpage saying that the API can't be reached (typically because of a timeout).
I wouldn't store that usually, I'd use that to trigger retries.
For you, storing the raw data means storing the JSON that the HTTP endpoint returns, rather than something like

    import requests

    content = requests.get(url).json()            # fetch and parse in one go...
    info_i_care_about = content['data']['title']  # ...extract a single field...
    store(info_i_care_about)                      # ...and keep only that field

as otherwise you'll get stuck when the JSON response moves the title to data.metadata.title or whatever.
It's usually less of an issue with structured data (things like HTML change more often), but keeping that raw data means you can process it in various different ways later.
You also decouple errors so your parsing error doesn't stop your write from happening.
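For the 200-with-an-HTML-error-page case above, a minimal sketch of "retry, then store the raw body" could look like this (the retry counts and file layout are made up for illustration):

    # Store the raw JSON body; treat an HTML "success" page as a failed attempt.
    import datetime
    import json
    import pathlib
    import time
    import requests

    RAW_DIR = pathlib.Path("raw_json")  # hypothetical location for raw responses

    def fetch_raw(url: str, attempts: int = 3, backoff_s: float = 30.0) -> str | None:
        for attempt in range(attempts):
            body = requests.get(url, timeout=30).text
            try:
                json.loads(body)          # only checking that it is valid JSON
            except json.JSONDecodeError:  # the 200-with-HTML error page case
                time.sleep(backoff_s * (attempt + 1))
                continue
            RAW_DIR.mkdir(exist_ok=True)
            stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
            (RAW_DIR / f"{stamp}.json").write_text(body, encoding="utf-8")
            return body
        return None  # every attempt returned something that wasn't JSON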
I know the feeling. I managed 9 months scraping supermarket data before I gave up mostly because a few other people were doing it and I was short on time.
I for one don't want to start counting everything I lose interest in as a "failure"; that would be too depressing. I actually think this is a feature, not a flaw. You have very few attention tokens and should be aggressive in getting them back.
I think this is very different from the "finishing" decision. That should focus on scope and iterations, while attempting to account for effort vs. reward and avoiding things like sunk cost influences.
Combine both and you've got "pragmatic grit": the ability to get valuable shit done.
I still don't understand what he even tried to do. So he manually collected news articles for a few years without any plan for what to do with them; so what? Where is the project? Honestly, he could probably just have asked the Tagesschau people and they would have given him their archive.
The learning from this seems to be: collecting data and never doing anything with it is not a worthy project
1. The title makes it sound like the author spent a lot of time on this project. But really, this mostly consisted of noting down a couple of URLs per day. So maybe 5 min / day = ~130h spent on the project. Let's say 200h to be on the safe side.
2. "Get first analyses results out quickly based on a small dataset and don’t just collect data up front to “analyse it later”" => I think this actually killed the project. Collecting data for several years w/o actually doing anything doesn't with it is not a sound project.
3. "If I would have finished the project, this dataset would then have been released" ==> There is literally nothing stopping OP from still doing this. It costs maybe 2h of work and would potentially give a substantial benefit to others, i.e., turn this project into a win after all. I'm very puzzled why OP didn't do this.
My wife only did 50 hours of duolingo in total the past 2 years. Combine that with me teasing her in Dutch and she’s actually making progress.
Duolingo is a chill tool to learn some vocab. That vocab then gets acquired by talking to me. We talk 2 minutes Dutch per day at most. So about 11 hours in total per year.
She is 67% done with duolingo. So we bought the first real book to learn Dutch (De Opmaat).
That book is IMO not for pure beginners. But for the level my wife was at, it seems perfect.
Human speech is around 150-200 words per minute; even going slow, 2 minutes a day of real talk is probably more vocab than 10 minutes of Duo. And with better feedback, a human rather than a cartoon casino.
I finished the whole tree in French and had nothing to show for it either. It really is a fun way to feel like you're learning, without connecting you to the language or culture in any significant way.
For me, short of a native speaker you can interact with, nothing beats in-person classes. Being forced to actually speak the language in “mock settings” makes all the difference.
And even if you don’t get your grammar completely right, you will learn enough to survive in a real-life setting.
I learned Spanish through a combination of both - I took Spanish classes after I started dating my Mexican wife, enough to get conversational. Then I started interacting in Spanish with her family, which helps me now maintain the language without needing the classes.
I feel this while learning (or trying to learn) German: when I think "how would I say this in German?" I draw nothing but a blank. But I'm a good "speaker" otherwise, and sadly, I feel I'm not getting anywhere either...
Watch Dark on Netflix in original German on repeat, great way to subconsciously make note of tones and pronunciation while also watching an awesome show. Be very intentional about it though.
Surround yourself in the language. In Germany we have almost everything dubbed, so you can watch pretty much any popular movie or TV series in German or read any popular book in German. Besides that there are also quite a lot of German productions.
Wow, I forgot about that! When I was using it for French many years ago, I imagined they were using it as a way to generate free translations, but I still found it enjoyable and useful.
Point number 2 is super important for non-hobby projects. Collect a bit of data, even if you have to do it manually at first, and do a "dry run" / first cut of whatever analysis you're thinking of doing, so you confirm you're actually collecting what you need and that what you're doing is even going to work. Seeing a pipeline get built, run for like two months, and then the data scientist come along and say "this isn't what we needed" was a complete goddamn shitshow. I'm just glad I was only a spectator to it.
They touch on something relevant here, and it's a great point to emphasise:
> The emphasis on preserving raw HTML proved vital when Tagesschau repeatedly altered their newsticker DOM structure throughout Q2 2020. This experience underscored a fundamental data engineering principle: raw data is king. While parsers can be rewritten, lost data is irretrievable.
I've done this before, keeping full, timestamped, versioned raw HTML. That still risks breaking when sites shift to JavaScript-rendered content, but keeping your collection and processing as distinct as you can, so you can rerun things later, is incredibly helpful.
Usually, processing raw data is cheap. Recovering raw data is expensive or impossible.
As a bonus, collecting raw data is usually easier than collecting and processing it, so you might as well start there. Maybe you'll find out you were missing something, but it's no worse than if you'd tied things together.
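To make the "rerun things later" part concrete, here's a sketch of replaying an archive of snapshots through several parsers; the selectors and directory name are invented for illustration:

    # Replay archived snapshots through whichever parser understands them,
    # so a DOM change means adding a parser, not losing data.
    import pathlib
    from bs4 import BeautifulSoup

    def parse_old_layout(soup: BeautifulSoup) -> list[str] | None:
        items = soup.select("div.liveblog-entry")   # hypothetical pre-change markup
        return [i.get_text(strip=True) for i in items] or None

    def parse_new_layout(soup: BeautifulSoup) -> list[str] | None:
        items = soup.select("article.ticker-item")  # hypothetical post-change markup
        return [i.get_text(strip=True) for i in items] or None

    PARSERS = [parse_new_layout, parse_old_layout]  # try the newest layout first

    def reprocess_archive(raw_dir: str = "raw_html") -> dict[str, list[str]]:
        results = {}
        for path in sorted(pathlib.Path(raw_dir).glob("*.html")):
            soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
            for parser in PARSERS:
                entries = parser(soup)
                if entries:
                    results[path.name] = entries
                    break
        return results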
edit
> Huh? To find the specific date's news item corresponding to a given topic? Why not just predict the date range, e.g. "Apr-Aug 2022"?
They say they had to manually find the links to the right liveblog subpage. So they had to go to the main page, find the link and then store it.
While I understand the points, I think it's worth being kinder to someone who has come out to write about how they failed at a project.
> 1. The title makes it sound like the author spent a lot of time on this project. But really, this mostly consisted of noting down a couple of URLs per day. So maybe 5 min / day = ~130h spent on the project. Let's say 200h to be on the safe side.
Consistent work over multiple years shouldn't be looked down on like this. If you've done something every day for years, it's still a lot of time in your life. We're not Econs, so I don't think summing up the hours really captures it either.
> 3. "If I would have finished the project, this dataset would then have been released" ==> There is literally nothing stopping OP from still doing this. It costs maybe 2h of work and would potentially give a substantial benefit to others, i.e., turn this project into a win after all. I'm very puzzled why OP didn't do this.
They might not realise how to do this sustainably, or they might just be mentally done with it. It may simply be harder for them to think about.
I'd recommend also that they release the data. If they put it on either Zenodo or Figshare it'll be hosted for free and referenceable by others.
> 2. "Get first analyses results out quickly based on a small dataset and don’t just collect data up front to “analyse it later”" => I think this actually killed the project.
I agree, but again on the kinder side (because I think they also agree), there are multiple reasons for doing this, and focusing on the why might be more productive.
1. It gets you to actually process the data in some useful form. So many times I've seen things fail late on because people didn't realise something like how dates are formatted, or whether some field was often missing, or that they just didn't capture something that turns out to be pretty key (e.g. you scrape times, then realise that at some point the site changed them to "two weeks ago" and you never noticed).
This can be as simple as just plotting some data, counting uniques, anything (see the sketch after this list). The automated system will fall over when things go wrong, and then you can check it.
2. What do people care about? What do you care about? Sometimes I've had a great idea for an analysis, only to realise later that maybe I'm the only one who cares, or worse, that the result is so obvious it's not even interesting to me.
3. Keeping interest. Keeping interest in a multi-year project that's giving you something back can be easier than something that's just taking.
4. Guilt. If I spend a long time on something, I feel it should be better. So I want to make it more polished, which takes time, which I don't have. So I don't add to it, then I'm not adding anything, then nothing happens. It shouldn't matter, but I've long realised that just wishing my mind worked differently isn't a good plan and that I should instead plan for reality. For that, doing something fast feels much better: I am happier releasing something that's taken me half a day and looks kinda-ok, because it doesn't have to justify years of effort.
5. Get it out before something changes. COVID had, or has, no known endpoint up front.
6. Ensure you've actually got a plan. Unless you've got a very good reason, you can probably build what you need to analyse things and release it earlier. You can't run an analysis on an upcoming election, but even then you could do it on a previous year and see things working. This can help with motivation, because at the end you don't face an "oh right, now I need to write and run loads of things" moment; you just need to hit go again.
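For point 1, a quick sanity pass over a small sample can be as little as this (the column names are invented; swap in whatever the scrape actually produces):

    # Quick sanity check on a small sample before committing to years of collection.
    import pandas as pd

    df = pd.read_csv("sample_scrape.csv")    # hypothetical small export

    print(df.shape)                          # how much did we actually get?
    print(df.isna().mean().sort_values())    # share of missing values per column
    print(df.nunique())                      # spot constant or exploding columns

    # Do the dates even parse? "two weeks ago"-style strings show up here as NaT.
    parsed = pd.to_datetime(df["published_at"], errors="coerce")
    print("unparseable dates:", parsed.isna().sum())
    print(parsed.dt.to_period("M").value_counts().sort_index())  # rough volume over time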