Hacker News
Implementation of various string similarity and distance algorithms in Java (github.com/tdebatty)
88 points by based2 on Aug 6, 2016 | 27 comments


This library is built for specific use cases: it measures how many characters differ or have changed between two strings.
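To make "how many characters have changed" concrete, here is a minimal sketch of Levenshtein edit distance, one of the measures this kind of library covers. It is a textbook dynamic-programming version written for illustration, not the library's own code; the class name `EditDistance` is hypothetical.

```java
// Hypothetical class for illustration; not taken from the library.
class EditDistance {
    // Textbook Levenshtein distance: the minimum number of single-character
    // insertions, deletions, or substitutions needed to turn a into b.
    // Only two rows of the table are kept in memory.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("My string", "My $tring"));
    }
}
```

The nested loops visit every pair of positions once, so the running time grows with the product of the two string lengths.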

String similarity and distance algorithms are different from semantic (knowledge-based) similarity, which is calculated from the relationship of one word to another (in a hierarchy/tree). So, if you need a Java library that calculates semantic similarity rather than string similarity and distance, I would recommend this [1].

[1]: https://github.com/sharispe/slib/


So, I just started working in Java at an internship this summer. Maven (mvn) is a fantastic package manager. I was naive and thought things like `npm` were unique in their philosophy, but I've realized the two are very similar. I hate debugging Java in an enterprise environment, but I have to say that it's really elegant to code in Java and use the standard library.


What's wrong with debugging Java? Between reliable remote debugger attachment, mvn + IntelliJ's support for automatically building and distributing sources, decent decompilation in case that fails, JMX, and tools like VisualVM, it seems to me that Java is head and shoulders above most "trendy" languages in the debugging department. Is there a cool new tool that I'm not aware of that hasn't made it to Java yet?


Yes, Java is a superior dev env. And then someone says something like "We need DI aspects with our annotated IoC!"

Spring (and others) is an exception (stack trace) obfuscation framework.


Yeah. And with the bean configuration XML files, the error messages suck if there is anything wrong with them.


I think my case was an isolated one. I had a Java package that wouldn't start in Tomcat because I had to merge trunk into my branch. It was very annoying and made me hate seeing Java exceptions with call stacks 100+ frames deep.


Not rare. Wrestling with Maven and Spring pretty much dominates IT-style development.


Similar library for Python: https://github.com/jamesturk/jellyfish


Good job! One question though. In cases like this when there is generally no state, isn't it better to make the methods static?


Depends, but objects aren't just about state. There's dynamic dispatch too. You can have different stateless strategies, and at runtime you might want to be able to swap one out for another. If that is a requirement, static methods won't work.
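A minimal sketch of that point, assuming a made-up `StringDistance` interface and two toy implementations (none of these names come from the library): both strategies are stateless, yet the caller can still pick one at runtime because the call goes through an interface.

```java
// Hypothetical interface for illustration; not the library's actual API.
interface StringDistance {
    double distance(String a, String b);
}

// Stateless strategy: counts positions whose characters differ,
// plus the difference in length.
class PositionalMismatch implements StringDistance {
    @Override
    public double distance(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        int diff = Math.abs(a.length() - b.length());
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                diff++;
            }
        }
        return diff;
    }
}

// Another stateless strategy: only the difference in length.
class LengthDifference implements StringDistance {
    @Override
    public double distance(String a, String b) {
        return Math.abs(a.length() - b.length());
    }
}

class StrategyDemo {
    // The implementation is chosen at runtime; a call to a static method
    // would be bound at compile time and could not be swapped like this.
    static StringDistance choose(String name) {
        return "length".equals(name) ? new LengthDifference() : new PositionalMismatch();
    }

    public static void main(String[] args) {
        StringDistance d = choose(args.length > 0 ? args[0] : "positional");
        System.out.println(d.distance("My string", "My $tring"));
    }
}
```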


Looks like the parent post is on the money about why this library was written without static methods. Note that there are interfaces in the library.

https://github.com/tdebatty/java-string-similarity/blob/mast...

The example code in README.md can probably do this instead:

    MetricStringDistance msd = new Levenshtein();
    System.out.println(msd.distance("My string", "My $tring"));

    MetricStringDistance msd2 = new MetricLCS();
    System.out.println(msd2.distance("ABDEF", "ABDIF"));


Fancy seeing you here! Hope you're enjoying your vacation :)


Static method calls in Java code are a nightmare when testing.


Why? Especially if we're talking stateless methods?


Am I missing something? I regularly write static methods and test them with JUnit.


> I regularly write static methods and test them with JUnit

Implementing mock objects which return "unexpected" results is impossible with static methods.

The standard 1 interface + 1 impl pattern in Java exists just so that Proxy.newProxyInstance can create a decorated or mocked object for testing.

So you can test the static methods directly, but you can't write failure-inducing versions of them (like one that throws a connect exception) to test the methods that call them.
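As a rough sketch of what that looks like, assuming a made-up `Connector` interface and `Client` class (both hypothetical): `java.lang.reflect.Proxy.newProxyInstance` builds a stand-in whose every call throws, so the caller's error handling can be exercised. A static method called directly from `Client` could not be replaced this way.

```java
import java.io.IOException;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.net.ConnectException;

// Hypothetical interface, invented for illustration.
interface Connector {
    String fetch(String host) throws IOException;
}

// Hypothetical code under test: must fall back when the connection fails.
class Client {
    private final Connector connector;

    Client(Connector connector) {
        this.connector = connector;
    }

    String fetchOrDefault(String host) {
        try {
            return connector.fetch(host);
        } catch (IOException e) {
            return "unavailable";
        }
    }
}

class ProxyMockDemo {
    // Builds a Connector whose every method call throws ConnectException,
    // using a JDK dynamic proxy instead of a hand-written implementation.
    static Connector failingConnector() {
        InvocationHandler handler = (proxy, method, methodArgs) -> {
            throw new ConnectException("simulated failure");
        };
        return (Connector) Proxy.newProxyInstance(
                Connector.class.getClassLoader(),
                new Class<?>[] {Connector.class},
                handler);
    }

    public static void main(String[] args) {
        Client client = new Client(failingConnector());
        System.out.println(client.fetchOrDefault("example.org"));
    }
}
```

Because `fetch` declares `IOException`, the `ConnectException` thrown by the handler reaches `Client` unchanged rather than being wrapped.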


Are they any different in other languages? What's special about Java?


A nice variety of approaches. Any plans for implementing others, e.g., Smith-Waterman and mutual info. similarities?


Thanks, really useful.

Any plans on including this in Apache Commons?

How does it compare to StringUtils in Apache Commons?


Apache Commons Lang has some string distance algorithms. But we started another project in the sandbox for text/strings, as [lang] was getting a bit overcrowded with so many things.

You can take a look at the current project here https://commons.apache.org/sandbox/commons-text/

Source: https://github.com/apache/commons-text


Slightly off-topic: I'm still looking for a good 3-way merge algorithm for strings and structured data in JavaScript.


What does the dot mean in the big O notation there, as in O(m.n)? Shouldn't it be O(m*n)?


Dot is a multiplication symbol (in mathematics, that is; one of several). See https://en.wikipedia.org/wiki/Interpunct ; the choice between dot and cross as multiplication symbols is further taken advantage of in vector multiplication (dot product vs cross product). Etc.


That's actually a middle dot, not a regular dot.


Do you really need a multiplication symbol at all?


That was the point precisely


The examples might read better if they were written as assertions.

Very nice variety and a good README! Good job.



