
boyter · on April 10, 2024

Yes, although the lack of detail about the sparse grams is frustrating.

boyter · on April 10, 2024

You could, but I don't know what you gain out of it. The underlying index would be almost the same size, and n-grams would also let you search for e.t, for example, which you lose in this process.

boyter · on April 10, 2024

Code search is indeed hard. Stop words, stemming and such do rule out most off-the-shelf indexing solutions, but you can usually turn them off. You can even get around the splitting issues of things like

    a.toString()

with some pre-processing of the content. However, where you really get into a world of pain is allowing someone to search for ring in the example. You can use partial term search (prefix, infix, or suffix), but this massively bloats the index and is slow to run.

The next thing you try is trigrams, and suddenly you have to deal with false positive matches. So you add a positional portion to your index, and all of a sudden the underlying index is larger than the content you are indexing.
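To make the false positive problem concrete, here is a minimal sketch of a trigram index without positions. The function names and the two sample documents are made up for illustration; this is not how searchcode.com or any particular engine does it. The query ring only needs the trigrams rin and ing, and a document can contain both without containing ring, so candidates have to be verified against the real text.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def candidates(index, query):
    """Ids of documents containing every trigram of the query.

    Without positional information this can include false positives.
    """
    grams = trigrams(query)
    if not grams:
        return set()  # queries shorter than 3 characters need another strategy
    return set.intersection(*(index.get(g, set()) for g in grams))


def search(index, docs, query):
    """Drop false positives by checking the candidates' actual text."""
    return {d for d in candidates(index, query) if query in docs[d]}


# Hypothetical sample content, chosen to trigger a false positive.
docs = {
    1: "a.toString()",
    2: "printf(x); doing = 1",  # has "rin" and "ing" but never "ring"
}
index = build_index(docs)
print(candidates(index, "ring"))
print(search(index, docs, "ring"))
```

Document 2 survives the trigram intersection and is only rejected by the final substring check. Storing positions in the index would avoid that check, at the cost of the index growth described above.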

It's good fun though. For those curious about it I would also suggest reading posts by Michael Stapelberg https://michael.stapelberg.ch/posts/ who writes about Debian Code Search (which I believe he started) in addition to the other posts mentioned here. Shameless plug, I also write about this https://boyter.org/posts/how-i-built-my-own-index-for-search... where I go into some of the issues when building a custom index for searchcode.com

Oddly enough, I think you can go a long way brute forcing the search if you don't do anything obviously wrong. For situations where you are only allowed to search a small portion of the content, say just your own (which looks applicable in this situation), that's what I would do. Adding an index is really only useful when you start searching at scale or you are getting semantic search out of it. For keyword search, which is what the article appears to be talking about, brute force is what I would be inclined to use.

sgift · on April 11, 2024

The preprocessing that you need is (in Lucene nomenclature, but it's the same principle for search in general) an Analyzer made for code search: the component that knows how to prepare incoming plain text for storage in an index, together with the corresponding component for a search query. That's no different from analyzers for other languages (stemming sucks for almost everything but English). Thinking about it, the frontend of most compilers for a language could maybe make a pretty good Analyzer. It already knows the language-specific components and can split them into the parts it needs for further processing, which is basically what an analyzer does.
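As a rough illustration of the idea (not Lucene's actual Analyzer API, and far cruder than a real compiler frontend), an analyzer for code can be a single function applied to both the indexed text and the query. The name analyze and the splitting rules below are assumptions made for this sketch: it keeps each identifier whole and also emits its camelCase parts.

```python
import re

# Identifiers as most C-like languages define them.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# camelCase pieces: acronym runs, capitalised or lowercase words, digit runs.
_CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def analyze(text):
    """Turn source text (or a query) into lowercase index terms.

    Each identifier is emitted whole, followed by its camelCase parts
    when it has more than one.
    """
    tokens = []
    for ident in _IDENT.findall(text):
        tokens.append(ident.lower())
        parts = [p.lower() for p in _CAMEL.findall(ident)]
        if len(parts) > 1:
            tokens.extend(parts)
    return tokens


print(analyze("a.toString()"))
print(analyze("toString"))
```

Because the same function runs at index time and query time, a query for toString or string lines up with the terms stored for a.toString(). A query for ring still produces no matching term, which is the substring problem raised earlier in the thread.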

boyter · on March 6, 2024

I actually half wrote an RFC of a spec and 2 implementations of a federated search last year, rather than do the distributed hash table that yacy does.

I wanted results to be re-rankable by the peers by sharing the scores that went into them. The idea being with a common protocol based on the ideas of ActivityPub you could get peers of searches working together to hopefully surface interesting things.

Something I should probably finish and publish at some point. It worked up to the hundreds of peers I tested with.

The reason I mention this is because I also wanted to add a front end into yacy, which turned out to be harder than I expected. It’s a wonderful project and you can find great stuff through it, but because of the way the peers return results it’s sometimes hard to find it again. It’s also not quite as hackable as I would have hoped at the time, probably due to the project’s age.

I still think there is value in it though, and I’d love to see yacy have its protocol explained as an apex so people could build implementations in other languages more easily.

detourdog · on March 6, 2024

I remember the first days of gopher browsing were like that. Gopher browsing to me was like swinging from vine to vine. The trick was remembering/documenting where each vine went.

boyter · on Feb 23, 2024

In China it is/was.

It's very expensive for the average Chinese person and is considered fairly fancy or upscale. At least it was when I lived there.

Same rule applies to Pizza Hut.

boyter · on Feb 8, 2024

Can confirm this book is excellent.

It's one I point people at all the time when they ask me why something isn't working as expected in any standard search tool, and something I reference from time to time to refresh my own knowledge.

Well worth the money.

boyter · on Feb 8, 2024

I'm on a mobile device, but those are the standard weighting values for either TF/IDF or BM25. In this case BM25.

A comment would be useful but they are also instantly recognisable to anyone familiar with the problem.
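For anyone not familiar with the problem, here is a sketch of where such constants sit in a BM25 term score. The values k1 = 1.2 and b = 0.75 are commonly used defaults and are an assumption here; the code under discussion may use different numbers, and the function name is made up for this example.

```python
import math

K1 = 1.2   # term-frequency saturation; assumed common default
B = 0.75   # document-length normalisation; assumed common default


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=K1, b=B):
    """BM25 score contribution of one query term for one document.

    tf          occurrences of the term in the document
    doc_len     length of the document in terms
    avg_doc_len average document length in the collection
    n_docs      number of documents in the collection
    doc_freq    number of documents containing the term
    """
    # Rarer terms get a larger inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average documents are penalised in proportion to b.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / norm


# Same term frequency, different document lengths.
print(bm25_term_score(tf=3, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10))
print(bm25_term_score(tf=3, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=10))
```

Raising k1 lets repeated occurrences of a term keep adding to the score for longer before it saturates, and raising b penalises long documents more heavily.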

6510 · on Feb 8, 2024

> instantly recognisable to anyone familiar with the problem.

I always love reading those when not familiar. It's almost as funny as reading something one already knows, waiting for the punch line...

boyter · on Jan 8, 2024

Just had all my connections cancelled. So extra day in San Fran for me which is less than ideal, but probably better than being on the flight if something happens.

It was total bedlam at the airport when I got in this morning, however, with almost no flights available to replace the grounded ones.

Another red eye special for me tonight but at least no connections.

boyter · on Dec 11, 2023

That was my first thought. It looks perfect for use with HTMX.

boyter · on Dec 11, 2023

I don't think this would be practical for a static site. You still need to maintain a list of followers of your account somewhere and that needs to be dynamic if you want it to work the way people expect it to where they follow you from other instances.

Assuming you kept the @ list of accounts through some other means, and had your WebFinger set up with your public key, you could sign the publish events after creating new content and push them to those followers.

I don't know of anyone doing this though.