A lot of folks are reevaluating their crawling engines now that headless Chrome is maturing. To me, CPU and memory footprint are the key considerations when distributing a large headless crawling architecture.
What we aren't seeing open-sourced are the solutions companies are building around trimmed-down, specialized versions of headless browsers like headless Chrome, Servo, and WebKit. People are running distributed fleets of these headless browsers using Apache Mesos, Kubernetes, and Kafka queues.
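For a sense of what one of those distributed workers might look like, here's a minimal sketch in Python: each worker consumes URLs from a Kafka topic and shells out to headless Chrome to render them. It assumes kafka-python is installed, a google-chrome binary is on the PATH, and the crawl-urls topic and broker address are hypothetical placeholders.

```python
import subprocess

from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker; substitute your own.
consumer = KafkaConsumer(
    "crawl-urls",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: v.decode("utf-8"),
)

def render(url: str, timeout: int = 30) -> str:
    """Render a page with headless Chrome and return the final DOM."""
    result = subprocess.run(
        ["google-chrome", "--headless", "--disable-gpu", "--dump-dom", url],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout

for message in consumer:
    url = message.value
    try:
        html = render(url)
        print(f"{url}: fetched {len(html)} bytes of rendered DOM")
    except subprocess.TimeoutExpired:
        print(f"{url}: render timed out")
```

Run enough copies of this under Kubernetes or Mesos and the Kafka topic becomes the natural work queue; consumer groups give you partitioned load balancing for free.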
I started one of the first 56k dial-up ISPs in Nebraska when I was 13 years old, with the help of my parents. In high school I partnered with some friends after reading a CNET article about a WISP in Washington state. My friend's dad had a few businesses that needed to share a T1 line, so we connected them wirelessly and sold the excess capacity to farmers in rural Nebraska and Kansas. It was an amazing experience: working with networking gear, setting up FreeBSD servers, learning about NEMA enclosures, antennas, polarity, frequency hopping (FHSS vs. DSSS), and the 900 MHz, 2.4 GHz, 5.2 GHz, and 5.8 GHz bands. We had our ups and downs and learned a lot about what gear worked best. My senior year of high school we sold the company. I still browse my Google history to feel nostalgia from those days :)
+1. I had a similar experience, starting one of the first ISPs in Brazil, in '94-'95. My two co-founders and I came from an academic background, so we had been using BITNET for a few years by then, and we saw firsthand when the internet started to open up to commercial access.
Overall it was a great learning experience, but incredibly challenging. There was nothing like OP's guide (or HN, or Google for that matter), so the knowledge had to come from books, Usenet news, and mailing lists. It helped that we had experience managing Unix servers and WAN networks at the university, but we still had tons of things to figure out, from setting up dial-up lines to keeping httpd servers up and running (we decided to use the Linux 0.99 beta from the get-go, but it kept crashing, usually in the middle of the night; later we discovered it was a race condition in the multi-serial-port card driver, which manifested only when several ports were under heavy use, hence only at night).
Although we could solve most of the technical challenges, we were absolutely clueless about how to run a business: selling, bizdev, billing customers, hiring and building teams, etc. I have fond memories of those (usually bad) decisions, but I can't stop thinking about how different things would be if I could magically go back and do it all over again :)
We quickly reached a few thousand customers and realized it was turning into a capital-intensive business: phone lines in Brazil in the '90s were crazily expensive, plus Cisco routers, server upgrades, etc. We decided to pivot towards B2B, becoming one of the first corporate ISPs: web hosting, security consulting, leased lines, and some web development.
We sold the company 5 years later, when the market started consolidating, right before the dotcom crash.
That's a great story, and incidentally a similar one to another Nebraska ISP I'm familiar with (KDSI). I think they got started when a larger industrial company had excess bandwidth. Sounds like your story was a lot of fun for a high schooler. Who did you sell it to, and why?
I know of KDSI! We ended up selling to a guy starting a new business called RCOM in Kearney, NE. I worked with him for half a year getting everything transitioned. We sold our Kansas operations to a telco called Nex-Tech. At the time the business was funded by my good friend's dad. Collectively we decided it was best for us to focus on our further education and go to college.
Can anyone recommend a good tutorial or reading on how to set up content-based (visual) image search using a CNN to process images? I'm looking to build a POC of a reverse image search trained on in-house product data. In the past I've used imgSeek, but it's dated and doesn't use neural nets.
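Not a tutorial, but the common recipe is: run every image through a pretrained CNN, keep the penultimate-layer activations as an embedding, and do nearest-neighbor search over those vectors. A minimal sketch with torchvision, where the file paths are hypothetical placeholders for your product catalog:

```python
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Pretrained ResNet-50 with the final classifier removed:
# the remaining network maps an image to a 2048-d feature vector.
backbone = models.resnet50(pretrained=True)
embedder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized embedding for one image file."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        vec = embedder(img).flatten()
    return vec / vec.norm()

# Index the catalog (paths are hypothetical).
catalog = ["products/shoe.jpg", "products/bag.jpg"]
index = torch.stack([embed(p) for p in catalog])

# Query: cosine similarity against the index, best match first.
query = embed("query.jpg")
scores = index @ query
for rank in scores.argsort(descending=True):
    print(catalog[rank], float(scores[rank]))
```

For a real catalog you'd swap the brute-force matmul for an ANN index like Faiss, and optionally fine-tune the backbone on your own product data.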
Doesn't this data already exist in the form of ISP data brokers? I'm thinking of data that makes its way into the hands of some marketing companies that show anonymized URL-level traffic for a given website, essentially giving you the ability to see analytics on a website you don't own. Anybody know who the big players/ISP data brokers are?
An old friend of mine and I visited a corn field in Nebraska ten years ago in search of wreckage left behind by a crash in 1966. With the help of a metal detector we located metal identification plates, various small pieces of metal, and an ashtray (smoking on planes was once a thing). https://en.m.wikipedia.org/wiki/Braniff_Flight_250
Interesting read. I was speaking with some colleagues just yesterday about a potential pet project to identify which of our customers have eCommerce websites.
The concept would involve processing millions of company names found in the "Bill TO" field of sales records, then using those records to populate an Elasticsearch index for use with the Graph API to help further normalize/dedupe company names that share similar string semantics. The next stage would be to scan the normalized, deduped list of company names and attempt to locate each company's website URL by crawling the first page of Google search results. This would need to be rate-limited, because I assume Google would block me if I performed rapid requests. After gathering a list of company URLs, the plan would shift gears into trying to identify whether any of the companies' websites contain the typical components that make up an eCommerce website. Think searching the HTML for all variations of "add to cart", "shopping cart", "my account", etc.
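That last stage is cheap to prototype. Here's a minimal sketch, assuming requests is available; the signal list is a hypothetical starting point, and a real version would also need to render JavaScript-heavy storefronts before scanning them:

```python
import re

import requests  # pip install requests

# Hypothetical markers that typically appear on eCommerce sites.
ECOMMERCE_SIGNALS = [
    r"add\s+to\s+(cart|bag|basket)",
    r"shopping\s+cart",
    r"my\s+account",
    r"checkout",
]

def looks_like_ecommerce(url: str) -> bool:
    """Fetch a homepage and check it for common storefront markers."""
    try:
        html = requests.get(url, timeout=10).text.lower()
    except requests.RequestException:
        return False
    return any(re.search(pattern, html) for pattern in ECOMMERCE_SIGNALS)

print(looks_like_ecommerce("https://example.com"))
```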