Do you end up with lots of duplicates when you are scraping? If you also scrape IG, YouTube and LinkedIn, would you link them all to the same influencer?
That might be quite an interesting identity resolution challenge (disclosure: I build identity resolution tech).
I would not mind taking a look. Always interested to see how others are handling such data.
Hello HN! We (Steven, Hendrik and Stefan) built a real-time identity resolution system that can handle hundreds of millions of customer records, and recently launched a LangChain integration to use it as a RAG source for LLMs.
We built this while working at a European credit bureau, where we needed to deduplicate and match millions of monthly record updates from various sources. Traditional approaches using graph databases and Spark couldn't handle the scale, so we built our own solution using AWS Serverless.
Each identity is stored as an individual graph structure, with records linked using rules-based and ML matching. Performance: <300ms ingest (tested to 5,000/sec), <150ms search regardless of graph size. Several fintech companies use it for fraud detection, KYC, and customer 360.
Unlike vector databases, which can blur similar entities together, IdentityRAG maintains distinct customer identities while pulling data from multiple systems - even when customer details differ across databases.
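To give a rough feel for what rules-based matching into per-identity groups can look like, here is a minimal sketch. It is not our production code: the rules (exact match on normalized email or phone), the field names and the sample records are all made up for illustration, and it uses a plain union-find instead of a real graph store or any ML.

```python
from collections import defaultdict

def normalize(value):
    # Lowercase and drop all whitespace so "Anna@Example.com " == "anna@example.com".
    return "".join(value.lower().split())

def match_keys(record):
    # Hypothetical rules: one key per rule the record can take part in.
    keys = []
    if record.get("email"):
        keys.append(("email", normalize(record["email"])))
    if record.get("phone"):
        digits = "".join(ch for ch in record["phone"] if ch.isdigit())
        keys.append(("phone", digits))
    return keys

def resolve(records):
    # Union-find: every record starts as its own identity.
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Two records sharing any rule key are linked into the same identity.
    first_seen = {}
    for r in records:
        for key in match_keys(r):
            if key in first_seen:
                union(r["id"], first_seen[key])
            else:
                first_seen[key] = r["id"]

    groups = defaultdict(list)
    for r in records:
        groups[find(r["id"])].append(r["id"])
    return sorted(sorted(g) for g in groups.values())

# Made-up records from three imaginary source systems.
records = [
    {"id": "crm-1", "name": "Anna Schmidt", "email": "anna@example.com", "phone": ""},
    {"id": "shop-7", "name": "A. Schmidt", "email": "Anna@Example.com ", "phone": "+49 30 1234567"},
    {"id": "bank-3", "name": "Anna Schmitt", "email": "", "phone": "+49 (30) 1234567"},
    {"id": "crm-2", "name": "Anna Schmidt", "email": "anna.s@example.org", "phone": ""},
]
identities = resolve(records)
print(identities)
```

Here crm-1, shop-7 and bank-3 end up in one identity even though no single field matches across all three, while crm-2 stays separate despite having the same name, because no rule links it to the others.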
You can try it out with our sample chatbot in the GitHub repo (linked above). It is free to sign up, and we charge based on the number of unified customer records (it is free for playing and testing). We would love to hear your comments and questions.
There is also a demo video in the repo and you can find more details about us here: https://tilores.io/
Yes, I know we can register a UG, but in the end you don't. And it is not just the share capital that is annoying, it is everything else.
It literally costs 10x more to do the bookkeeping for a German company vs a UK one. Plus getting investors is much more difficult because of the notary requirements.
We used an SPV for our first round, but even that is annoying. I had some angel investors pull out purely because we are a German GmbH.
Just to be clear - the 61 duplicate voting cases were only for Ohio and Pennsylvania - the 400k duplicate profiles were across all 7 states we looked at.
Indeed there is certainly not mass voter fraud. We were glad not to find that, but tbh surprised that we found any at all. Originally we were only going to look for duplicate profiles - it didn't even occur to us to look for actual fraud.
But why not make it a complete non-issue? It would be so easy to fix this data so that there were no duplicates. Then there would not even be any accusations like there were in 2020.
What I want to create is complete trust in the data to avoid the... bickering later.
/edit - as the poster below mentions, the 61 were just the ones that were manually confirmed. There were 1000 potential cases.
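For anyone wondering how you get from a big list to "potential cases" that then need manual confirmation, here is a rough sketch of the general idea, not the exact process we ran. It assumes hypothetical fields (`id`, `name`, `dob`), only compares profiles that share a date of birth (blocking), and flags pairs whose names are similar according to Python's `difflib`. The 0.85 threshold and the sample rows are invented.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def candidate_duplicates(profiles, threshold=0.85):
    # Blocking: only profiles with the same date of birth are compared,
    # which avoids comparing every profile with every other one.
    blocks = defaultdict(list)
    for p in profiles:
        blocks[p["dob"]].append(p)

    candidates = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            # Fuzzy name comparison; the threshold is an arbitrary assumption.
            score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if score >= threshold:
                candidates.append((a["id"], b["id"]))
    return candidates

# Invented sample rows, not real voter data.
profiles = [
    {"id": "v1", "name": "Jon Smith", "dob": "1980-02-14"},
    {"id": "v2", "name": "John Smith", "dob": "1980-02-14"},
    {"id": "v3", "name": "Mary Jones", "dob": "1980-02-14"},
    {"id": "v4", "name": "John Smith", "dob": "1975-09-01"},
]
candidates = candidate_duplicates(profiles)
print(candidates)
```

Everything this returns is only a potential duplicate. Two different people can share a name and a birthday, which is why a manual confirmation step still matters before calling anything a confirmed case.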
We come from Germany - where there is unlikely to be a big issue, as citizens have to be quite careful about registering where they live, and in one place only.
I suspect the data in the UK (where I originate) would be pretty messy. The voting lists there are a free-for-all, I reckon!
This is a problem I also have as a founder of a German company. Meetup literally can't add the company name to the invoice unless it is part of the name on the credit card.