This is very cool! I love how you brought back the original Kimono UI with the checkmark and Xs for adding and removing data tags.
We built WrapAPI (https://wrapapi.com) back in the day, before we ended up starting Wanderlog (https://wanderlog.com), our current travel planning Y Combinator startup. This definitely is still an unsolved problem.
However, from a business point of view, we found that it was rather difficult to make a business out of an unspecialized scraping tool. The Kimono founders expressed a similar sentiment: ultimately, scraping is a solution looking for a problem.
Developers can often roll their own solution too, which limits your customer base and how much you can charge. Instead, vertical-specific tools that target particular industries seem to be the way to go (see Plaid as an example!)
Alternatively, you have to be good at Enterprise and B2B sales. This is a product you need to get the word out about, land a champion for, and do customer success on, since it has a substantial learning curve. We were not, which is why we chose to focus on other projects instead.
Best of luck, and feel free to get in touch if you'd like to chat more
Thanks! Yeah the checkmark confirmation just feels effortless. Haven't got it perfected yet, but soon.
Really appreciate the insights.
You're right that much depends on mapping the solution to a particular problem. Are you selling yet another scraping tool, or are you freeing data to drive better decisions / save time / yada yada?
With the right frame, a sensible price point, and as much complexity abstracted away as possible, there may exist a business model; there seem to be many opportunities hiding in plain sight.
Will reach out soon for sure. Best of luck with Wanderlog
I tried your site and am curious why, for Ko Pha Ngan, there is only one recommended resource. Shouldn't there be more?
FYI, on my mobile device in Brave on iOS, entering the date in the calendar was janky: I had to click another text box to keep my date selection and make the calendar widget disappear so I could submit the form.
Plaid, Yodlee, and others abstract away extracting data from various banks and financial services providers, so they're providing a solution built on top of the same data extraction techniques that this tool uses.
Oh, interesting. I thought they just provided secure authentication to an app’s end users’ bank accounts for things like payments (an alternative to someone like PayPal doing two microtransactions, then having you confirm the amounts as a way of validating it’s your account). It’s not like Plaid is scraping financial data though, right?
Hey HN, I posted this in a comment thread the other day and (to my surprise) it got a positive reception, so I added a few more updates and decided to post it properly.
The idea is to be able to choose a website, select the data you want, and make it available (as JSON, CSV or an API) with as little friction as possible.
Kimono was the gold standard for a while, so I did yoink some of their ideas, while doing some other things differently.
Still needs some work, but as an MVP I'd appreciate any feedback. Cheers.
When I saw this service last week, I think you had a section about a paid service where you do the scraping on a server and send the results. Do you offer that? How do you get around anti-scraping technology, if it exists?
Yeah, that's offered although it's currently free.
No particular tricks to avoid detection. It's Puppeteer under the hood with a few customizations, which works well on the majority of sites tested so far.
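For the curious, the core of that approach is roughly the standard headless-Chrome pattern. A minimal sketch (the URL, selector, and user agent are placeholders for illustration, not the actual implementation):

    // Minimal Puppeteer sketch of "load a page, pull text out of it".
    // URL, selector, and user agent below are placeholders, not this tool's code.
    import puppeteer from "puppeteer";

    async function scrape(url: string, selector: string): Promise<string[]> {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      // A realistic user agent helps on sites that block obvious headless browsers.
      await page.setUserAgent(
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
      );
      await page.goto(url, { waitUntil: "networkidle2" });
      // Grab the text content of every element matching the selector.
      const values = await page.$$eval(selector, (els) =>
        els.map((el) => el.textContent?.trim() ?? "")
      );
      await browser.close();
      return values;
    }

    scrape("https://example.com/products", ".product .price").then(console.log);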
Given the cat-and-mouse game around web scraping you may never cover every website, and that's ok.
I don't feel it is right to describe it as "turns a website into an API", rather "gives scraped data through an API".
"Turn website into an API", for me, evokes the image that I can automate (say) placing an order in Amazon as an API, or paying my bills automatically. It includes scraping, of course, but requires a lot more (mechanize/twill/selenium/phantom/etc power).
There was a company called Orsus that did exactly that. Last I heard about them it was the year 2000.
I like the idea but was skeptical as to how well it works, and noticed the video on the main page of your website that scans coinmarketcap seems to be wrong: it gets 200 cryptocurrency names but only 100 prices, which means only the first result is correct.
I have a similar idea that I'm working on, your site is definitely bookmarked and will try the extension later.
Hi, it's both. There's 'local' scraping, where the results of what you selected are ready to download as soon as you click results; no signup or server needed.
And then anything that's saved as a recipe runs in the cloud.
It's a moot point. I very much believe in copyright, but you can't just put info in the public domain and yell, "Take a look but don't remember/retain it" in the name of copyright. If I redistribute it or reuse it for commercial purposes without your consent, then maybe there is a case. But if I am just scraping it, i.e. remembering it... come on now.
Otherwise everyone who gets the lyrics to copyrighted songs, or memorizes them and sings them in the shower, is also in violation of copyright, which would reduce the whole copyright thing to ridiculousness.
All I said was that it's against their terms of use. I didn't try to make a point about whether it should be or not. If you are curious about it, and whether using pen and paper is allowed, take a look at them:
> Copy, modify or create derivative works of the Service or any Content;
> Copy, manipulate or aggregate any Content (including data) for the purpose of making it available to any third party;
> Trade, sell, rent, loan, lease or license any Content or access to the Service, whether commercially or free of charge;
> Use or introduce to the Service any data mining, crawling, "scraping", robot or similar automated or data gathering or extraction method, or manually access, acquire, monitor or copy any portion of the Service, or download or store Content (unless expressly authorized by CMC).
What is it about this service as a business model that prevents it from taking off? I’ve known at least two YC startups that tried to build businesses around this idea.
I think one or both were acquired and immediately shut down, but I’m not 100% sure about that.
I think there are 3 things that contribute to this:
1. It is very easy to make a prototype that looks "magical" but very hard to build something that works in real applications. There is an enormous number of quirks that a browser allows, and each site you encounter will use a different set of those quirks. Sites also tend to be unreliable, so whatever you build has to be very resistant to errors.
2. There is a technological wall that every company in this space reaches, where it is not yet possible to mass-specialize for different websites. So even if you're able to build a tool that works very well on any individual website, the technology is not there yet to generalize the instructions across websites in the same category. So if a customer wants to scrape 1000 websites, they still have to build custom instructions for each website (a 5-10x reduction in labor vs. scripting), when what they really want, and what is economically viable for them, is to build a single set of instructions that will work for all similar websites (a 10000x reduction in labor vs. scripting). This is something we're working on for the next version of parsehub, but it's still a couple of years away from launch.
3. Many of the YC startups you hear about have raised funding from investors and have short term pressures to exit.
The combination of the three makes it very tempting to give up and sell.
#2 is what would transform this from a nice niche tool into something very valuable. In the ecommerce space, tracking competitor pricing is a great example of this type of thing. I can also see use cases for recipes, finance, healthcare, you name it. Those B2B use cases are worth real money.
Just curious, in your experimentation, have you found it necessary to train a new model for each "category"? Or have you found a way to generalize it?
Training a new model for each category is already possible today, but doesn't achieve the goal (mass-specialization).
The problem is that when you pre-train a model, you can only solve for the lowest common denominator of what every customer might want.
In ecommerce, for example, you might pre-train to get price, product name, reviews, and a few other things that are general to all ecommerce. But you won't pre-train it to get the mAh rating of batteries, because that's not common to the vast majority of customers (even within ecommerce). It turns out that most customers need at least a few of these long-tail properties that are different than what almost every other customer wants, even if most of the properties they need are common.
And so the challenge is to dynamically train a model that generalizes to all "battery sites" based on the (very limited) input from a customer making a few clicks on a single "battery site".
1. It's possible to make it "easy to switch" by having common building blocks and only changing the "selector" across sites (see the rough sketch after this list); lots of companies in the space do this
2. It's impossible to do "just DOM" or "just vision/text" if you want to be able to generalize "get the price of the items":
- DOM doesn't represent spatial positioning very well (see: fixed/absolute positioning, IDs and the DOM changing without the visuals changing, ...), so you'd need the equivalent of an entire browser rendering engine in your "model" anyways!
- vision/text is messed up by random marketing popups (see: medium, amazon, walmart, ...), it's significantly more computationally expensive to do, and can't currently get >95% accuracy (which makes it useless, scraping needs very close to 100% accuracy in most use cases)
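To make point 1 concrete, here's a rough sketch of the "common building blocks, per-site selectors" idea; the domains and selectors are made up for illustration:

    // Shared extraction logic; only the selectors change from site to site.
    // Domains and selectors here are invented for illustration.
    interface SiteRecipe {
      nameSelector: string;
      priceSelector: string;
    }

    const recipes: Record<string, SiteRecipe> = {
      "shop-a.example": { nameSelector: "h1.title", priceSelector: ".price" },
      "shop-b.example": { nameSelector: ".product-name", priceSelector: "[data-price]" },
    };

    function extractProduct(doc: Document, recipe: SiteRecipe) {
      return {
        name: doc.querySelector(recipe.nameSelector)?.textContent?.trim(),
        price: doc.querySelector(recipe.priceSelector)?.textContent?.trim(),
      };
    }

The catch, per point 2, is that someone still has to write and maintain those selectors for every site.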
> So if a customer wants to scrape 1000 websites, they still have to build custom instructions for each website...
Can't this be crowdsourced in some way? Having each individual entity reinvent the same wheel feels like the main problem to me. What if there was a marketplace? The ability to buy / trade / sell? Maybe subscription based in some way?
If I wanted to scrape 100 sites, it might be worth $1 per year per site. Those who put in the time make money. Those who don't have the time would pay.
This isn't a technology issue per se. It's about scaling a solution across the final gap the technology can't cover. A different kind of Mechanical Turk?
Yes. But there might be some who would not be interested but still do it for minimal pay.
It would also lower the barrier to entry and thus increase the size of the market. Imagine if the first X sites I tried all needed more work. I'd likely quit. But if that didn't happen, I'd be more likely to continue.
Crowdsourcing isn't The Answer. But it's certainly a step in the right direction.
Possibly a mix of use cases, maintainability, and economics. We used to scrape economic indicators data at a fintech startup and monetized it; every slight change to the website created issues with the data feeds. It was a huge nightmare to maintain. Scraping any website is quite generic and doesn't really speak to a specific audience with a specific need. But more importantly, having been in the data and analytics industry for years, I've seen that data has far lower margins than insights and recommendations. The market is willing to pay a crazy premium (look at how much all the consultants are being billed out for) to get insights and recommendations. Data itself isn't inherently valuable to most companies.
Repeatedly being acquired to be immediately shut down sounds like quite a good business model, if your goal is to be paid.
I wonder what other kinds of products and services would be good for that model. In other words, would tend to be acquired for good money in order to stop them.
Presumably, a company that wants your product to be shut down.
Potentially apocryphal example: I've heard of a certain FPGA company that bought a startup which produced FPGA compilation tools that could target multiple vendors' devices, in order to stop multi-vendor tools from existing because it made switching vendors too easy.
1. Your market is people who need scraped data to input into some kind of app/program/code, but don't have the resources/skills/time to use scrapy or whatever.
2. Sensitive to configuration
This is also the problem with visual coding and ML apps: even a small issue with the source you are scraping from -- say, a captcha, a login, or some weird format or CSS you did not anticipate -- makes it almost useless, whereas if you were coding up a solution you can (usually, not always) deal with it more easily.
Those are the reasons they shut down.
The reasons why they launch:
1. Many developers have this need
Many developers have built scrapers internally and then used them, so a lot of people have worked on this problem.
What follows from this is that they can productize it, see that other people have the need, imagine the market, etc.
Maybe a better business model is to offer this as a service to site owners who are not tech savvy. Site owners then have the ability to offer an API to new customers, making it a win/win: site owners can now offer an API (free or paid), and API consumers can rely on getting data in the future.
I just gave this a shot on the ISO website to get a list of country codes[1], but it seems the selection algorithm breaks down when there are no specific classes applied to elements: every td.v-grid-cell is selected, which is all of them, instead of the values of the alpha2 column, for example.
This seems hard to solve entirely programmatically; maybe having a way to be more specific by providing a selector yourself, or selecting multiple entries and having the plugin figure it out, could add a lot of utility in such cases.
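For instance, a user-supplied positional selector could isolate one column even when every cell shares the same class. The column index below is a guess, just to illustrate the idea; you'd inspect the table to find the real one:

    // Hypothetical manual-selector escape hatch: every cell is td.v-grid-cell,
    // but :nth-child can still pick out a single column (e.g. the alpha-2 codes).
    const alpha2Cells = document.querySelectorAll("tr > td.v-grid-cell:nth-child(2)");
    const alpha2Codes = Array.from(alpha2Cells, (td) => td.textContent?.trim());
    console.log(alpha2Codes);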
I must have skimmed past that. Whoops. I avoided trying it out because it's not available on Firefox, so I couldn't correct my assumption by testing it. Also, I couldn't easily find a copy of the extension source and gave up.
The site/extension basically has to do that each time it scrapes locally (or use a generic parametrised scraper). If you wanted to use it in an API, my impression is that you either run it in Chrome as an extension you need to get from the Chrome Web Store, or tunnel your data through a third-party server. Is that wrong?
Can you scrape data locally without running Chrome/the extension? I can't tell from reading the site, sorry. (If it's actually there, please link an anchor to it or something.)
Please consider adding the ability to script clicks on elements, e.g. buttons.
I manage a site where we load a subset of articles on initial page load and then have a "Load more" button that executes JavaScript to load another batch of articles. Getting a list of articles from our CMS is a bit of a hassle, so being able to scrape it easily instead would be ideal.
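If it helps, here's a rough sketch of what that could look like under the hood with Puppeteer; the selectors are placeholders for whatever the real markup uses, not a feature the extension has today:

    // Keep clicking a "Load more" button until it disappears, then scrape titles.
    // "button.load-more" and "article h2" are placeholder selectors.
    import type { Page } from "puppeteer";

    async function loadAllArticles(page: Page): Promise<(string | undefined)[]> {
      while (await page.$("button.load-more")) {
        await page.click("button.load-more");
        // Crude but simple: wait a moment for the next batch to render.
        await new Promise((resolve) => setTimeout(resolve, 1000));
      }
      return page.$$eval("article h2", (els) =>
        els.map((el) => el.textContent?.trim())
      );
    }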
Yes - you're able to save data behind a login using the point and click functionality as it extracts whatever data is loaded in your browser ("local scraping").
And no - if you choose to also create a cloud recipe that runs on the server, the remote browser instance won't be able to access data behind a login.
It's possible but I'd rather not store third-party credentials for the time being.
This is super cool. I really enjoyed and missed the Kimono workflow. Automating something like this with browserless.io would be really fun (I run that project). Extensions are one of the things we're looking to support.
Anyway, send me an email at joel at browserless dot io if you ever want to chat.
Right now it's free and will be until it's stable. Starting price will be about $25 for 4000 scraping credits, 200k API calls and data storage.
This will likely change as I have more stats and feedback on usage and expenses. But the goal is to offer a price point that's fair and low relative to other options.