Show HN: I scraped 200M Shopify products to build a search engine (searchagora.com)
23 points by pencildiver on Feb 21, 2024 | 42 comments
Hi HN! In December I launched an MVP for Agora here: https://news.ycombinator.com/item?id=38635695

After posting, we got thousands of users and hundreds of comments with valuable feedback from the community. I spent a couple sleepless nights frantically pacing around my room trying to keep the product live and, relatively, performant. After getting some sleep, I got back to work to make the product better.

A few updates:

1. We've grown from 25 million to 200 million products on Shopify and WooCommerce. The team at WooCommerce reached out after the HN launch to help us figure out how to index their stores. Similar to Shopify, we found that there’s a public file available for all stores that use WordPress and WooCommerce at [Base URL]/wp-json/wc/store/v1/products. For example, the file for Good Works Tractors is available here: https://www.goodworkstractors.com/wp-json/wc/store/v1/produc... So I bought a list of 3.5 million active WooCommerce stores on a website called BuiltWith, adapted the product data model, and started the crawler to go down the list. We've indexed around 515k stores so far.
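
For readers curious what "adapted the product data model and started the crawler" might look like, here is a minimal Python sketch. It is not Agora's actual code. The path comes from the Good Works Tractors example URL above; the `page`/`per_page` parameters, the response field names (`prices`, `currency_minor_unit`, `permalink`, `images`), and the output record shape are all assumptions for illustration. The HTTP call is passed in as `fetch_json` so the logic can be exercised without a network.

```python
from typing import Callable, Iterator, List, Optional
from urllib.parse import urlencode

# Path taken from the example URL in the post; pagination parameters are assumed.
STORE_API_PATH = "/wp-json/wc/store/v1/products"


def products_url(base_url: str, page: int, per_page: int = 100) -> str:
    """Build the URL for one page of a store's public product listing."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{base_url.rstrip('/')}{STORE_API_PATH}?{query}"


def normalize(store_url: str, raw: dict) -> dict:
    """Map one raw product onto a small, hypothetical shared data model."""
    prices = raw.get("prices") or {}
    minor_unit = int(prices.get("currency_minor_unit", 2))
    raw_price = prices.get("price")
    price: Optional[float] = None
    if raw_price not in (None, ""):
        # Assumed: prices arrive as integer strings in minor units ("1999" -> 19.99).
        price = int(raw_price) / (10 ** minor_unit)
    images = raw.get("images") or []
    return {
        "store": store_url,
        "name": raw.get("name"),
        "url": raw.get("permalink"),
        "price": price,
        "currency": prices.get("currency_code"),
        "image": images[0].get("src") if images else None,
    }


def crawl_store(
    base_url: str,
    fetch_json: Callable[[str], List[dict]],
    per_page: int = 100,
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Walk a store's product pages until an empty or short page is returned."""
    for page in range(1, max_pages + 1):
        batch = fetch_json(products_url(base_url, page, per_page))
        if not batch:
            break
        for raw in batch:
            yield normalize(base_url, raw)
        if len(batch) < per_page:
            break  # a short page means this was the last one
```

In a real crawler, `fetch_json` would wrap an HTTP client with timeouts, retries, and rate limiting per store.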

2. We improved the search experience. We're using Mongo to host the 200 million product records. First, we switched from Mongo Atlas Search to Typesense. After testing Typesense with our product records, we found most searches to be under 200ms. We're not storing the product images which slows down the loading speed at times. This week, we set up a server using Paperspace to run SBERT embeddings on a GPU (new to the AI workflow so apologies if I get the lingo wrong). We quickly realized that the dimension size of the embeddings matters a lot here, given the size of the data set. The GPU is still running to process all 200 million records and we're about a week away from releasing AI-powered search.
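
A quick back-of-the-envelope calculation shows why the embedding dimension matters at this scale. The sketch below assumes each value is stored as a 4-byte float32 and ignores any index overhead; 384 and 768 are used only as illustrative dimensions that are common for sentence-embedding models, not as the sizes Agora chose.

```python
def embedding_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw size in GB (10^9 bytes) of dense vectors, excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


# 200 million product records, as stated in the post.
for dims in (384, 768):
    print(dims, embedding_storage_gb(200_000_000, dims))
```

Under these assumptions, the raw vectors alone come to roughly 307 GB at 384 dimensions and roughly 614 GB at 768, so halving the dimension halves the storage and memory footprint.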

3. We localized the user experience. There's now frontend and backend IP detection to only show users products that are 'based in' or 'ship to' their specific country. This 'ships to' filter (the data is stored for all Shopify stores at the /meta.json route, e.g. https://wildfox.com/meta.json) significantly slows down the search results but we're trying to get creative on the loading process and animation. For example, we're using revalidation in Next.js to give several pages a 'hard coded' feel while the data refreshes every 60 seconds. https://nextjs.org/docs/app/building-your-application/data-f...
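
As a rough illustration of the 'based in' or 'ship to' check, here is a small Python sketch. The field names `country` and `ships_to_countries` are assumptions about what a store's metadata record looks like, and the `stores` lookup keyed by store URL is hypothetical; the real filter presumably runs inside the search backend rather than in application code.

```python
from typing import Dict, List


def visible_in_country(store_meta: dict, country_code: str) -> bool:
    """True if the store is based in, or ships to, the given country code."""
    code = country_code.upper()
    based_in = (store_meta.get("country") or "").upper() == code
    # Assumed field: a list of ISO country codes the store ships to.
    ships_to = code in {c.upper() for c in store_meta.get("ships_to_countries") or []}
    return based_in or ships_to


def filter_products(products: List[dict], stores: Dict[str, dict], country_code: str) -> List[dict]:
    """Keep only products whose store is visible to a user in country_code."""
    return [
        p for p in products
        if visible_in_country(stores.get(p["store"], {}), country_code)
    ]
```

Precomputing this per store rather than per product is one way to keep the filter cheap, since the shipping data lives at the store level.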

4. We got our first few paying customers. Store owners can sign up for free to track their store's performance on Agora. We validate that they are the store owner by making sure the email address and store URL match on sign up, and then send them an email verification link. They can upgrade to a subscription tier to 'verify' their products to get better placement in relevant search results. Additionally, they can pay to 'boost' products and guarantee that they'll show up in the first row of results. Given the high purchase-intent searches on Agora, I'm finding this to be the right business model.
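
The "email address and store URL match" check could be sketched roughly as below. This is an assumed interpretation (comparing the email's domain to the store's hostname, ignoring a leading `www.`), not a description of Agora's actual sign-up logic, and it would still need the email verification link mentioned above to prove the address is real.

```python
from urllib.parse import urlparse


def store_host(store_url: str) -> str:
    """Extract a lowercase hostname from a store URL, without a leading 'www.'."""
    # urlparse only fills in the hostname when a scheme or '//' prefix is present.
    parsed = urlparse(store_url if "://" in store_url else f"//{store_url}")
    host = (parsed.hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def email_matches_store(email: str, store_url: str) -> bool:
    """True if the email's domain is the store's domain or a subdomain of it."""
    if "@" not in email:
        return False
    domain = email.rsplit("@", 1)[1].strip().lower()
    host = store_host(store_url)
    if not domain or not host:
        return False
    return domain == host or domain.endswith("." + host)
```

For example, `email_matches_store("owner@wildfox.com", "https://www.wildfox.com/")` is accepted, while an address at a free email provider is not.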

The next challenge to solve: We need to improve the quality of products on Agora. There's a lot of resellers, dropshipping stores, and low quality images. Now, just because a product is sold on a reseller or dropshipping website, doesn't mean it's a bad product. There's a lot of exceptions and edge cases to solve. One potential solution: we're considering coming up with an "Agora Score" that takes in several factors including the image quality, store name, brand name, website SEO, etc. to tell users how trustworthy we think the product is.
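
One simple way an "Agora Score" could combine those factors is a weighted average, sketched below in Python. The signal names mirror the factors listed above, but the weights are made up for illustration, and each signal is assumed to have already been scaled to the range 0 to 1 by some upstream step that is not shown.

```python
from typing import Dict

# Hypothetical weights; the real factors and their importance are undecided.
DEFAULT_WEIGHTS: Dict[str, int] = {
    "image_quality": 4,
    "store_name": 3,
    "brand_name": 2,
    "website_seo": 1,
}


def agora_score(signals: Dict[str, float], weights: Dict[str, int] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of 0..1 signals, reported on a 0..100 scale."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(
        # Missing signals count as 0; out-of-range values are clamped to [0, 1].
        weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in weights.items()
    )
    return round(100 * weighted / total_weight, 1)
```

A weighted average keeps the score explainable, which helps with the edge cases mentioned above: a dropshipped product with strong images and a recognizable brand can still score well.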

I'd love any feedback or advice. I did solve my original problem of finding 'red shoes' for my wife, but inadvertently created more problems for myself. I'm loving every minute of it though. My wife jokes that everything is now "Agora this...Agora that". Open to any advice on that as well.



Don't take this as a harsh criticism but I want to know what problem you are solving. Is this just for fun?

This is a lose-lose game. You will never be able to catch up to the providers (shopify and woocommerce and others).

What you are doing is not a search problem. It is a traffic problem, and you have little to no traffic. There is a reason why Instagram and FB work as drivers for ecommerce products. My suggestion is to test the market before you invest too much in this area.


Finding products that aren't on Amazon or other top sites is a problem for consumers. Google Shopping is pretty bad. Connecting buyers direct to merchants is a win and something they all want.

Every single entity that sells online is trying to boost their revenue share for direct to consumer as a mandate. Because they get more margin and they get to know their customer for ongoing marketing. The pandemic exposed many product manufacturers and brands that had such a high % in brick and mortar partners. They frantically tossed inventory onto Amazon. The only problems with that are

1) amazon tax on sales and

2) they don't get to know their customers.

Indeed aggregated job postings until it had traffic and then it became the place to post jobs.

Google was nothing against other providers until it did better search. And then they got traffic and built a solid ad product that provided bidding on effective CPC, not highest bid like competitors were doing. They rewarded better-performing ads and rewarded themselves ($$$) at the same time, and the rest is history.

Your thesis is not accurate. This can work. Typically though, these things require a lot of capital until they make real money. Tech costs and then marketing which is really education can be expensive.

Most users originate ecomm searches from Amazon (60%) [1]. Next highest is Google.

If the search works really well though, word of mouth can be everything.

I'd probably focus on the audience that has income and would prefer to buy direct from merchants and not search and buy from Amazon which can be painful (too much crap, fakes, bad reviews, doesn't always have high end products, not always cheaper, etc). Maybe even focus on higher priced items first ($250 or $500 and up). They'll likely have better content associated with them, there's typically more margin, and there are a lot of high end products not sold on Amazon, only direct.

Hacker news has a good launch audience for this. Also probably Etsy buyers and merchants.

Not that you would be interested in this - but ecomm search for influencer sponsored products would probably do very well as a niche product. About 11% of searches originate on TikTok. Younger people are into who is hawking what for some reason lol. You could even rank most popular sponsored posts for the celebs too as a leaderboard for the biggest shills haha.

[1] https://www.insiderintelligence.com/content/online-shoppers-...


You refute my point and then agree with me that it can be expensive.

I worked for the largest eCommerce employer twice and have built scraping and search systems at scale. I know a thing or two about eCommerce shopping sites. Do you know how much infra and how many people it takes to keep pricing, product, and variant data up-to-date? Hint: It's a significant capex investment.

The problem is the entry point of the funnel in the shopping journey. If all he becomes is an aggregator, it is no different than the thousands out there. There is Shop.app, which has full access to seller data and analytics. As I said, this is not a search problem. It is a product discovery problem, which you also seem to agree with. Unfortunately, not a lot of people/teams have managed to solve the discovery problem when the site itself is fairly unknown.

Michael from YC also talks about product discovery being a tarpit idea.

I am not saying his search won't work. Search in itself is a solved problem here. You enter criteria/product name and you get results. But, unless you truly solve discovery, which in my opinion he can't do due to lack of first-party data, it won't do much besides a hopper site.


You never even commented that it can be expensive, you just said it's a terrible idea because it's "lose-lose" and a "traffic problem".

Anyway, I refuted ALL of your points and even supplied angles for keeping costs down while driving value to defined audiences.

I personally scraped and ML'd all AirBnB listings and calendar data every day for over a year to build a product. It didn't cost that much in time and $. Got a shit ton of value out of it. Someone even built a nice business doing exactly what I was doing.

If you look on the other thread for the OP, someone had done 100m products for $550/mo. You also must not understand that shopify and woocommerce have structured data available because you are talking about people being needed to keep product data up to date. They aren't scraping the frontend.

Personally I'm a riches in niches guy and a more defined audience and product set would be the way to go off of what they've started.

The ocean doesn't need to be boiled to provide value to consumers and to merchants here.


> You enter criteria/product name and you get results

Not on Amazon. At least not since 2020 or so.


GP is correct. There is not much value in search itself.

Are you serious about Amazon? I can absolutely find everything I need on Amazon. Who needs exact search when there are better alternatives through recommendation, people also buy etc?

I am curious to know: if search on Amazon is as broken as you think, why do you think people keep buying from Amazon? They have a strong logistics network. But to even start that logistics part, you need to search and buy.


Amazon (as of today, in Europe, for me) is a Wish with better logistics and replacement policy.

If I search for a product like a milk frother or a wine opener or a phone case, I already expect the results to be 80% copycat crap and the reviews fake.

Adjusted ratings by various fake spotter tools usually bring 4.5+ ratings down to 3.0–3.5.

That the search engine absolutely refuses sometimes to include words I want and exclude words I don’t want is the icing on the cake.

If I didn’t forget the renewal date, I’d have quit Prime (which I subscribed to since its introduction) already last year.


I don't take it as harsh criticism. I try to remain intellectually honest about things so feedback and criticism are welcome. Just couldn't get back to you earlier as our search experience went down.

For users (demand-side), the problem we solve: There are user groups that currently have a fragmented experience. For example, a specialized solar technician (just throwing out a random example) has to look through a handful of speciality stores to find and compare products that are only sold there. I think there are specific user groups we can go after that really feel the pain of this process right now. Additionally, as the number of e-commerce platforms increases, it becomes tougher for everyday users to find the products they are looking for. They have to either go to Amazon or go store-by-store to discover products. The shop.app solves it for Shopify stores but there are also millions of sites on WooCommerce, Squarespace, Wix, etc. We get around the empty-state problem with the crawler and now have merchants signing up to get their products indexed.

For merchants (supply-side), the problem we solve: If they sell on their own website, they have to compete against non-product pages on Google. For example, if you sell "red shoes" on your own site, you have to compete against the IMDB entry for the movie "Red Shoes" for people to find you. Additionally, if they sell on their own website and use Amazon (or any physical retailer) for distribution, they give up a percent of their margin. This increases their sell-through but puts less money in their pocket.

I'll note that I've seen this problem first hand. In 2016 I launched a game called The 2016 Election Game, which was like Cards Against Humanity for the 2016 US elections. Sold about 5k units, fulfilling orders myself. Then in 2020 I launched another one called DoneWith2020, which was like Cards Against Humanity for the absurdity of the 2020 year. Sold about 34k units using mostly dropshipping. I remember losing out on search / discovery by choosing to sell on my own store but made a much higher margin on each sale (i.e. made about $15 on each $24.99 unit sold). We did work with a company to get on Amazon but always preferred people purchasing on our own site. It was also really hard to get high intent traffic to my store from ads. Would have been nice to send people searching for "funny card game" to my site. Now, whether everyone shares my dark sense of humor once they land on the site is up for debate.

The goal isn't to catch up to Shopify, WooCommerce, etc. but rather to aggregate products across platforms. I do think we can index most of the e-commerce products sold on these platforms (my best guess is that it's somewhere between 10 - 20 billion products). This is obviously a very tough data hosting and search problem at that scale. Even Mongo, which is what we use as our primary database, has a limit of 2 billion records.

I agree that it's a traffic problem. Everything comes down to getting users. Based on the number of merchants signing up, we are validating that others have the supply-side problem. It's a matter of nailing down the demand-side problem (i.e. finding the right user groups, building the right features for engagement, etc). We use 'search' as the conduit, assuming that exceptional search will lead to more traffic. But agreed that there are several other factors to solve.


Oh is the site currently down? I tried a few queries, including the ones on the landing page. It gave me empty results.


Yes, we're currently down. Working as fast as possible to get our search API back live. Really sorry about that, server crashed a few hours ago.


Seems like a fun scraping project, I think you have to work on extracting more accurate categories though, for example this link does not really include snowboards for me: https://www.searchagora.com/search?query=Snowboard And the first products I clicked have rather weird descriptions, https://www.searchagora.com/products/snowboard-bd2a90aa-6808...

Maybe it's my location (South Africa) but I also cannot visit the product store when I click through


Agreed. I'm currently saving a data field for 'product category' but this is defined by the store owner and currently not used for search. Trying to figure out the most reliable way to categorize products, to then make it available as a filter to narrow down the selection.

Additionally, searching for "snowboard" vs "snowboards" returns different results, which isn't ideal since the user intent is the same. This is something I'm hoping to resolve with AI-powered search.
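
Until embeddings are in place, a much simpler stopgap for the singular/plural mismatch is to normalize tokens the same way at index time and at query time. The Python sketch below is a deliberately naive plural stripper, not a real stemmer and not the AI-powered approach described above; a search engine's built-in stemming or a proper stemming library would handle far more cases.

```python
def normalize_token(token: str) -> str:
    """Lowercase a token and strip a simple trailing plural 's'."""
    t = token.lower()
    # Skip short words ("bus") and words ending in "ss" ("glass").
    if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
        return t[:-1]
    return t


def normalize_query(query: str) -> str:
    """Normalize every whitespace-separated token in a query."""
    return " ".join(normalize_token(t) for t in query.split())


print(normalize_query("Snowboards") == normalize_query("snowboard"))
```

The output does not need to be a real word (for example "shoes" becomes "shoe" but "dresses" becomes "dresse"); it only needs to be consistent between the indexed text and the query.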

In the footer, I changed my location to be South Africa and a few other countries to try the same search to see what products come up. Thanks for the feedback and heads up on this!


How do you plan to drive customer traffic to this site? As others mentioned, it's a bare-bones, raw search engine. I think these days, consumers need something more than just a bare choice because it's too much. People get paralyzed when they are presented with multiple options. I think if you could develop something that works similar to Pinterest or Instagram, that would be more interesting, especially for female consumers who love to spend time scrolling endless feeds with items to buy.


Generally I'm of the belief that there are 2 types of users, with different acquisition strategies:

1. Users who want to find a very specific product, with the intent to purchase. This assumes that the search functionality and quality filters are working very well. User comes to Agora, finds exactly what they are looking for, and continues with their day. We save them time and provide them with a lot of immediate value. I think we'll acquire these with word-of-mouth, if the search is actually exceptional and we have the most comprehensive data set. I'll also note that limiting the search results for this user is likely better than an endless scroll (to your point about being paralyzed with choice). We're playing around with the "view more" button at the bottom of search results: showing a user 20 results and then letting them click on view more to see an endless scroll.

2. Users who spend a lot of time online shopping, with the intent to browse. For these users, the plan is to introduce features that give us a viral loop. For example, users can currently create an account to create 'lists' of products they like. We have users making 'birthday gifts list', 'party decor', 'Christmas wish list', 'bachelorette party inspiration', etc. Think of it like a Spotify playlist. The goal is to come up with more features like this that drive a viral loop back to Agora.

Hope that helps answer the question and how I'm thinking about it. Separately, we need to upgrade the design soon. Been so focused on functionality that need to shift to improving the look / feel. Open to any ideas :)


This man really knows what he is talking about.

I am one such customer, either looking for that Peripheral / Tech with exact criteria, or just browsing around for new releases and recommendations (mostly on blogs & tech forums).

I am not interested in the toy or makeup advertised by the top post on Instagram. I am interested in products with a long review by a pro reviewer and some honest discussion around it from random customers.


How often and how do you plan to update images, prices and descriptions? Also, I noticed some "more from merchant" links don't work, for example: https://www.searchagora.com/buy-online/https:/www.bigbuy.eu/...


Unfortunately it seems the underlying search API is throwing '{ "message": "Not Ready or Lagging"}' for every search


Just woke up (in Madrid currently) and seeing all the errors. Working on getting the product back live.


We're back live now. Had to set up a new search server and we're quickly importing 100k products at a time. The results should be better by the minute now (as the data set increases).
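
Importing "100k products at a time" amounts to chunking a large stream of records into fixed-size batches. A minimal Python sketch of that chunking step is below; the `batches` helper is hypothetical, and the actual upload call to the search server is intentionally left out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items without loading everything."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Example: 250,000 records split into batches of 100,000.
print([len(b) for b in batches(range(250_000), 100_000)])
```

Because it works on any iterable, the same helper can sit on top of a database cursor so only one batch is held in memory at a time.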


Love to see that you have posted again, I commented on your post last time! I have two main questions here. Firstly, why would Shopify or Woocommerce not build this themselves? And secondly, how do you intend to drive traffic to the web? I can see how you will solve the search function at scale, but I see a bigger hurdle in driving initial traffic to the site


Thanks! Looking back at your comment history, apologies for not responding to your comment last time. And appreciate your email after that post with advice as well.

So, Shopify does have something similar called the shop.app which is only for Shopify stores. My best guess is that the e-commerce platforms are solving a different problem: store creation. I'll also note that each individual e-commerce platform isn't incentivized to aggregate and send users to stores not built with their platform. When we launched in December, we only had Shopify stores. Now we have Shopify and WooCommerce, and working on support for Squarespace and Wix sites as well.

For user traffic, the primary strategy is to build features with a viral loop back to the product. I mentioned this in another comment but we have the concept of making shareable 'lists' of products you like. This is already working for us, at a small scale.

The general plan is to aggregate as many e-commerce products as possible from different platforms, keep improving the search experience, add automated filters to ensure high quality products, and then keep layering in features that drive a viral loop back to Agora.


Makes sense, so you will basically focus on discovery of e-commerce products which would then allow you to "sell" traffic to merchants? On the user strategy side, hope that works, if you can get the viral loops going and constant organic traffic, this could be a success. I'll get back to you if I come up with any ideas which can help!


Generally, yes. From early conversations, it's both about the quantity and quality of traffic. For example, 100 qualified leads to their site that searched for a very specific product on Agora is better than 1,000 random leads landing on their site. The baseline assumption is that a user with higher purchase-intent will lead to a higher conversion rate once on the store site.


You should get metrics on searches that yield zero results and investigate why. Getting zero is a turn off! My example: timber
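
A minimal version of such a metric could be an in-memory counter of queries that came back empty, as in the Python sketch below. The class and method names are hypothetical; a production setup would more likely log these events to an analytics store and should distinguish "no matching products" from "search backend is down".

```python
from collections import Counter
from typing import List, Tuple


class ZeroResultTracker:
    """Count how often each normalized query returns no results."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, query: str, num_results: int) -> None:
        """Record a search; only queries with zero results are counted."""
        if num_results == 0:
            self.counts[query.strip().lower()] += 1

    def top(self, n: int = 10) -> List[Tuple[str, int]]:
        """Return the n most frequent zero-result queries."""
        return self.counts.most_common(n)


tracker = ZeroResultTracker()
tracker.record("timber", 0)
tracker.record("Timber ", 0)
tracker.record("soap", 0)
tracker.record("backpack", 42)
print(tracker.top(2))
```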


Absolutely. Search went down for a few hours. The 'no search results found' message also went down, which made it even more confusing :/

Here's the search results for timber now: https://www.searchagora.com/search?query=timber&price=&type=...

I'll note that this is a good use-case for AI-Powered search, as Agora isn't detecting the user's intent to find 'timber wood' here. There's a couple products made with timber which is a good start.


> The next challenge to solve: We need to improve the quality of products on Agora. There's a lot of resellers, dropshipping stores, and low quality images.

Glad to see you’re thinking about this. The sheer prevalence of dropshipped junk on Amazon is a huge problem and I’d happily shop elsewhere if I could find a good way to discover products.


Absolutely. I think the key is quality over quantity. Still figuring out how to automatically remove bad products but keep the hidden gems.


Well done! A lot of progress since last time. Have you guys considered using AI to categorise products (i.e. create labels using product images), instead of using the text to match the search? I say this because I sometimes see some irrelevant products and I can tell you guys are basing the search on text


Thanks! Definitely a work in progress but getting better by the day.

We have run tests with image-detection to try to categorize products. We currently do search based on the name, description, price, store, and brand.

The problem with image-detection is cost. Given the size of our data set, it's very costly to run 800m - 1b images through a model (most products have 4 - 5 images). We've considered only doing the first 'hero' image to start. Open to any cost-effective ideas though.
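
The arithmetic behind those numbers, and the saving from processing only the hero image, can be sketched as follows. The product count and images-per-product range come from this thread; the per-image price is left as a parameter because no actual cost figure is given here.

```python
def images_to_process(num_products: int, images_per_product: int) -> int:
    """Total images a model would need to see."""
    return num_products * images_per_product


def estimated_cost(num_images: int, cost_per_thousand_images: float) -> float:
    """Cost for a given price per 1,000 images (the price is a caller-supplied assumption)."""
    return num_images / 1000 * cost_per_thousand_images


products = 200_000_000
low = images_to_process(products, 4)        # 800 million
high = images_to_process(products, 5)       # 1 billion
hero_only = images_to_process(products, 1)  # 200 million
print(low, high, hero_only)
```

Whatever the per-image price turns out to be, restricting the run to the hero image cuts the volume, and therefore the cost, by a factor of 4 to 5.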

For example, if you search for "wooden chair", it would be nice to select a filter for 'category' to narrow down if I want to see "office furniture", "dining room", or "art".

https://www.searchagora.com/search?query=wooden+chair&count=...


I found some things on Github you could use, I'm not a dev myself and I'm not sure how scalable these are, but have a look, maybe there's something useful. https://github.com/jhc13/taggui

The category filtering is what I wanted to get at, I think the search would improve a lot.


Super cool, thanks! Will check it out.


Search is not working. Also, I seem to get shoes for anything I search. Did you hardcode it by any chance?

Are you open to collaborate with others? I might have an automated method of curating products. Please drop a line to comp [dot] turkey [at] gmail.com.


Sorry about that, we're back live now. You may have searched when we restarted the server which caused the shoes issue (i.e. "red shoes" is the go-to search for testing).

I'd love to collaborate. Automating curated product lists is very top of mind. Right now, a registered user can create 'lists' of products and then share those lists with others (similar to a Spotify playlist). Creating curated lists as inspiration and to drive a viral loop is the next step. I'll reach out via email.


Hey! Cool project, my co-founder told me about this. I suppose you're getting initial traffic from search engines, isn't this just adding an extra step for users as most search engines already display products at first level?


Thanks! We actually don't get much traffic from search engines. Mostly all through direct and social, with a few spikes due to HN. The product URLs listed on Agora aren't indexed by Google currently as we don't facilitate purchasing (i.e. like how you'd find a product sold on Amazon when doing a Google search).

Generally, Google Shopping shows big retailers or ads. E-commerce stores have a tough time competing against these big retailers, on both Google Shopping and a normal Google Search.

That all said, we are running tests to track engagement of a "search result page" indexed by Google or Bing. For example, searching for a "backpack" and then landing on the below link with a selection to choose from.

https://www.searchagora.com/search?query=backpack


Interesting. Thanks for getting back to me with such a detailed explanation!


It doesn't work. "Soap" returned 0 results.


Sorry about it not working. We went down for a few hours. We're back live now:

https://www.searchagora.com/search?query=soap


I don't get a single result for any searches.


Sorry about that. The Search API went down. We're back live now.


‘Paper’ zero results


Sorry about that, the Search API went down. We're back live now: https://www.searchagora.com/search?query=paper&price=&type=p...



