I've implemented Google Custom Search Engine on a few sites before, and it's not clear if this is the same (repackaged?) or different. The biggest question (as with many Google products) is when this is going to be shut down, given that Google Search Appliance itself reached its end of life in 2019.
If you want a dead simple way to integrate search onto your site/app but find the learning curve of something like Elasticsearch steep, take a look at this project that I've been working on:
https://github.com/typesense/typesense
Typesense is an attempt to offer a great search experience out of the box -- with minimal configuration. It also supports clustering/HA, so you can certainly go a long way before considering something like ES (which is a great product but also pretty beefy).
Happy to hear any feedback on what I can do to make it easier. Typesense does not require you to define a full index-time sorting order (except for a default sorting field) so it's pretty flexible on what fields you can sort on at query time.
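As a rough sketch of what query-time sorting looks like: the collection name "recipes", its fields, and the buildSearchUrl helper below are made up for illustration, and the localhost:8108 address assumes a default local install. The schema would be POSTed to /collections, and real requests also need an X-TYPESENSE-API-KEY header. Check the Typesense docs for the exact parameters of your version.

```javascript
// Hypothetical collection: only a default sorting field is declared up front.
const schema = {
  name: 'recipes',
  fields: [
    { name: 'title', type: 'string' },
    { name: 'rating', type: 'int32' },
    { name: 'created_at', type: 'int64' },
  ],
  default_sorting_field: 'rating',
};

// Illustrative helper (not part of any client library): builds a search URL,
// optionally overriding the sort order for this one query via sort_by.
function buildSearchUrl(baseUrl, collection, query, queryBy, sortBy) {
  const params = new URLSearchParams({ q: query, query_by: queryBy });
  if (sortBy) params.set('sort_by', sortBy);
  return `${baseUrl}/collections/${encodeURIComponent(collection)}/documents/search?${params}`;
}

// Falls back to the default sorting field (rating).
const byDefault = buildSearchUrl('http://localhost:8108', schema.name, 'negroni', 'title');
// Sorts on a different numeric field, chosen at query time.
const byDate = buildSearchUrl('http://localhost:8108', schema.name, 'negroni', 'title', 'created_at:desc');
```

The point is that created_at needed no special index-time setup beyond being a numeric field in the schema; the sort order is picked per request.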
Oh, the old Google Custom Search Engine is still around, rebranded. I think it was also called Google Co-op once.
I used it for some hacks, e.g. a people search engine once, and some SEO indexing checks before Google Search Console existed. It became pretty useless once its index was separated from the real Google index.
Looking at it, there is a way to upload XML files. Mhmmmmm... probably a good starting point for the next Google bug bounty hunt.
I had the pleasure of working with two of these. They did not give you the same quality of results that Google's website did. Utter garbage in a blue rack mounted case.
One thing I've heard from multiple googlers is that their internal wiki search is crap exactly because there's limited hyperlinking to give a good ranking for pages. Google's algorithm basically crunches a lot of human provided data (inbound links, clicks, dwell, etc.) into an algorithmic score, but with limited data it's not much better than tf-idf.
That's probably the same thing that you've encountered with the search appliance.
tf-idf is a search scoring algorithm that combines how frequently the search term is used in a document with how rarely it appears in other documents.
A word that shows up in one document and no others gets a very high score, whereas "and" shows up in every document and thus doesn't score well for any.
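To make that concrete, here is a minimal sketch of one common tf-idf variant (term frequency times the log of inverse document frequency). The tokenizer, the three sample documents, and the function names are assumptions for illustration; real engines use smoothed and normalised variants.

```javascript
// Split text into lowercase alphanumeric tokens (a deliberately naive tokenizer).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Score one term for one document against the whole collection.
function tfIdf(term, docTokens, allDocsTokens) {
  const occurrences = docTokens.filter((t) => t === term).length;
  if (occurrences === 0) return 0;
  // Term frequency: share of the document's tokens that are this term.
  const tf = occurrences / docTokens.length;
  // Inverse document frequency: rarer across the collection means a larger value.
  const docsWithTerm = allDocsTokens.filter((d) => d.includes(term)).length;
  const idf = Math.log(allDocsTokens.length / docsWithTerm);
  return tf * idf;
}

const docs = ['gin and tonic', 'rum and cola', 'whisky and soda'].map(tokenize);

const scoreAnd = tfIdf('and', docs[0], docs); // "and" is in every document: idf = log(3/3) = 0
const scoreGin = tfIdf('gin', docs[0], docs); // "gin" is in one document: (1/3) * log(3)
```

With only term statistics to go on, nothing here knows which page people actually find useful, which is the gap that links and click data fill on the public web.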
Pretty much. Internal search at Google actually used these appliances for a long time. Fortunately it has continued to evolve and was externalised as a Google Cloud product a few years back.
I used to sell those. They were actually quite good for their time, especially when they first came out, but they were really expensive.
It's weird that Google has been so half-hearted about «enterprise search» since they stopped selling GSAs. You would think that they would be itching to know what lies behind your firewall.
Googled for confirmation that the Google CSE index is different from "the real Google Index" and haven't been able to confirm that.
I've tinkered with Google CSE a lot and it does have its uses. The knowledge map options are fun to tinker with even if a bit of a black box.
As for the xml - when you download the xml that's intended to be your starting point for an engine, it doesn't include parts of the definition of the CSE like sites you've listed to exclude. So that functionality seems to be a little broken, but I've confirmed you can add and remove elements of the knowledge graph at least via the xml file.
From a relevance PoV, I think the docs here[1] tell you everything you need to know
In short it's a rules and keyword-tagging heavy experience, not at all comparable to Elasticsearch or Lucene (which let you really get down to the metal to customize core algorithm behavior).
BTW if you want this kind of experience in an open source stack, I would check out Querqy for Elasticsearch and Solr[2]
Looks nice, but as you say that just searches the current page. Here's how I search my entire site with slightly fewer lines of JavaScript:
// Construct the API query
const apiEndpoint = 'https://searchmysite.net/api/v1/search/michael-lewis.com';
const urlParams = new URLSearchParams(window.location.search);
let queryParam = urlParams.get('q');
if (queryParam == null || queryParam === '') { queryParam = '*'; }
// Encode the query so characters such as & or # don't break the URL
const apiQuery = apiEndpoint.concat('?q=', encodeURIComponent(queryParam));
// Build the results (using fetch rather than XMLHttpRequest)
fetch(apiQuery)
  .then((resp) => resp.json())
  .then(function(data) {
    const searchResults = data.results;
    document.getElementById('query').value = queryParam; // Set the value of the search box to the query
    if (searchResults && searchResults.length > 0) {
      // forEach rather than map, since we only want the side effect of adding each result to the page
      searchResults.forEach(function(result) {
        // Each result is going to be displayed as <li><a href="${result.url}" class="title">${result.title}</a></li>
        const li = document.createElement('li'), a = document.createElement('a');
        a.appendChild(document.createTextNode(result.title));
        a.href = result.url;
        a.classList.add('title');
        li.appendChild(a);
        // Each result is added to the <ul id="results"></ul>
        document.getElementById('results').appendChild(li);
      });
    } else {
      // If there are no results update the <h1 class="title" id="results-title">Results</h1>
      document.getElementById('results-title').innerText = 'No results';
    }
  })
  .catch(function(error) {
    console.log(error);
  });
Hmmm, I just tried setting up a new site search for an existing site and can't get any search results to show up. I'm guessing it takes a while for the search index to populate, but that isn't really communicated anywhere in the product UX.
I set this up a while ago but never turned it on. I have a very simple web site - it's basically my mixology recipe book - and I started adding https://lunrjs.com to it this weekend. I ended up yak-shaving for the bulk of my time, but initial experiments were promising; anyone used it?
I highly recommend lunr.js. I've used it on side projects both client side [1] and server side [2] and it works great. My only yak-shaving was tweaking the separator regex.
How does this compare to Algolia? I was shocked at the irony when I discovered that Firebase docs recommended using Algolia for full-text search queries over Firebase data.
I think that has to do with the fact that both Algolia and Firebase went through YC. IIRC, PG stated that the first 100 or so customers of YC companies are in fact YC Companies. I think Firebase decided to give Algolia a try early on and decided to stick with them since.
The thing that’s kind of wild is that Firestore, which is their new reimplementation of the original Firebase db (now Realtime Database), is what is recommending Algolia:
Given a clean slate I would have thought they might push Cloud Search or this service, even without any special integration. Candidly I haven’t taken the time to understand either of those offerings but I have to assume you could index JSON documents just as easily as with Algolia. Not that there’s anything wrong with that; it was just surprising to see them promote a third party in general and a search engine specifically. After having used 90+ Google products over the last 17 years, transacted many millions with them, and read innumerable pages of docs, I’ve not really seen anything quite like that, where a quite common use case of their product actively promotes a competitor and links to them.
EDIT: To me it's really Custom Search Engine rebranded. It doesn't index all of your website, so I'd still prefer Algolia if I had to have an external search provider.
EDIT 2: Seems very customizable. Maybe by adding all your URLs it will index all of your site, making it a viable alternative to Algolia.
Was curious, and found a "custom search engine" (previous name of this product) for my tumblr blogs (which are still online) that I made back in the day.
I tried it ... and it's terrible for my use case. Because these blogs are mostly not hyperlinked anywhere (one is a personal quote collection, another is a collection of hacks and tech tips I found useful), Google doesn't seem to even have them indexed. Never mind that I manually submitted them for indexing back in the day.
Not that I expected much better from Google at this point.
What a horrible product name. I've always lamented the fact that big brands need very little creativity in product naming. Though Google is often lazy about naming, I can at least ascertain some reason behind their product names:
- The really boring ones like Google Docs, Google Maps, Google Talk, even GMail and Inbox, etc. were all potential generic B2B products so there's an incentive to make it as simple (read: boring) as possible. Even your Executive Emeritus couldn't second-guess what Google Docs does.
- Nexus, Pixel, and Stadia were all competing for a demographic that needed to catch attention; as names these aren't boring at all. "Google Phone"/"Google Smartphone" will be dead even before it launches against iPhone.
- Kubernetes kinda falls between the two. It's a B2B product but it introduces a completely new concept so it needs its own name. Plus they made it a standard so you can't slap the big G before it.
Now, "Programmable Search Engine" just sounds like a product name concocted by naive CS-undergrad seniors[1]. Did no one suggest "Google Search Appliance Plus" (no more confusing than "YouTube Red")?
[1] My CS undergrad senior batch would've named this "Programmable Search Engine System". God forbid anyone think what we're doing is not related to software engineering.
Off-Topic question for Search-heads. Anyone know of an in-browser solution to make web search start with your bookmarks (rank previous bookmarked items higher)?
[Solution to search browser-URL-history first before the rest of the WWW][1]