Ask HN: I just got $100k in AWS credits, how should I use it?
109 points by chrischen on May 14, 2015 | 156 comments
I got $100k in AWS credits, with a 1-year time limit. I built a scrapy-powered image crawler that crawls over 300 art sites and finds the most popular posts with clustering algorithms and perceptual hashing (www.arthunted.com), but in the end it takes at most a few hours of a high-CPU instance per day to scrape and process (at most several dollars per day). Over a year that barely makes a dent.
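
(For the curious, a perceptual-hashing step like the one described could look roughly like this difference-hash (dHash) sketch. It assumes Pillow; the crawler's actual scheme isn't specified here, so treat the details as illustrative.)

    from PIL import Image

    def dhash(path, size=8):
        # Grayscale, then resize to (size+1) x size so each row yields
        # `size` left-vs-right brightness comparisons.
        img = Image.open(path).convert("L").resize((size + 1, size))
        px = list(img.getdata())
        bits = 0
        for row in range(size):
            for col in range(size):
                i = row * (size + 1) + col
                bits = (bits << 1) | (px[i] > px[i + 1])
        return bits

    # Near-duplicate images sit at a small Hamming distance:
    # bin(dhash("a.jpg") ^ dhash("b.jpg")).count("1")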

I'm looking to build something that would make a splash, that would otherwise be constrained by budget, and that would have long-term self-sustaining value after the $100k runs out.

So no arbitrage, reselling, bitcoin mining, etc.

What type of project would require high-storage or high-amounts of processing? What can I build that would only be possible with that much money in infrastructure or compute power? Preferably the monthly budget would be about $10-20k.

Also I have a 40-instance limit on EC2 (which I may be able to raise).




Please oh please help the Tor network. You will be eternally loved.


Haha wow, you could set up a pretty impressive number of relays for that much money.


Do I remember right that the Tor project wouldn't accept a very large number of relays from a single party because it gives too much influence over the network?


IIRC, it takes about 50% of the network to do that. This is a waste of $100k though: the network is only used by a small number of people, it would be a permanent sinkhole for the money, and the added servers would simply disappear after a year.

This guy/gal definitely needs something more interesting. Perhaps something meaningful, or perhaps just something insanely cool. A temporary addition to an existing worker farm is not a great idea.


It would be great to help Tor and set up a ton of relays. But then again how much can we trust Amazon? And unfortunately AWS doesn't allow exit nodes.

A good alternative is Folding@Home


Try providing branch-and-bound solvers as a service. Spin up some massive EC2 instances, run something like the COIN-OR CBC solver (or license Gurobi by the EC2 hour), and let people run optimization problems on shared hosts. Charge per minute and assume you get ~60% utilization on the instances with a queue. Maybe allow problem formulation using the JuliaOpt / JuMP metalanguage.

The hard thing about optimization problems is that they take on the order of minutes to run, but you're billed by the hour.

Sounds crazy, but lots of startups - ranging from OnFleet to Lyft Line to Postmates - are probably computationally bound on problems like the Traveling Salesman Problem / bin-packing problem / knapsack problem. It's not worth $1400/instance/month to spin up the biggest computation nodes because they would get low utilization, but they still want their problems to solve quickly. If you bill by the minute, they save money and time.

Implementation would be straightforward - set up a queuing model, a timeout on each problem, and do a callback when the problem finishes. I think it's an untapped market because there has been lots of software developed for ML, but little for decision making built on top of the regression models that ML outputs.
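
A minimal sketch of that queue/timeout/callback loop (solve_with_cbc is a hypothetical helper that would shell out to CBC with a hard time limit; everything here is illustrative):

    import queue
    import threading

    jobs = queue.Queue()

    def solve_with_cbc(problem, time_limit_s):
        # Hypothetical: run CBC on the uploaded model with a time limit,
        # e.g. `cbc model.lp -seconds 60 solve` via subprocess, then
        # parse the solution file.
        ...

    def worker():
        while True:
            job = jobs.get()
            try:
                result = solve_with_cbc(job["problem"], job["timeout"])
                job["callback"](result)  # e.g. POST to the customer's webhook
            finally:
                jobs.task_done()

    # One worker per core on a big instance; bill per minute of solve time.
    for _ in range(8):
        threading.Thread(target=worker, daemon=True).start()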


I like your point about software for ML, vs software for decision making. I reckon there is a lot of existing software for decision making, but it is focused around particular domains/industries. A few months ago there was some news about a new startup offering experimental design as a service -- that's another related, seemingly under-explored idea.

Branch & bound solvers (aka combinatorial optimisation) as a service doesn't sound crazy to me, but perhaps one would need to think very carefully about the market.

What kind of customers would this service have?

If you are aiming to win customers by offering a cheaper service than alternatives, then I suppose the individual customers would need to be only the ones where it made more sense to rent this infrastructure instead of invest in their own. I.e., they would need some usage, but not usage heavy enough to justify investment in their own infrastructure, which would be cheaper for them in the long run.

Might have to think about data-security issues too, if commercially sensitive data is being uploaded to the solver back-ends.

Context:

I have worked somewhere that internally runs a service vaguely similar to what you describe. E.g. licensed commercial solver, sitting on a server with a decent amount of memory and compute, used as a back-end by various services to solve sufficiently valuable business problems for clients.

If you built a service like this, another idea is to keep it to yourself, partner with some operations research consultants, and go directly after the business problems.


Offer a performance testing service.

I saw a company pay $30K for 5 days access to a similar service.

Bootstrap the service by using this script under the hood and improve it over time: https://github.com/newsapps/beeswithmachineguns

Use your $100K credit for spawning instances, but also as a marketing hook: "first day free!"


Or...

Try building any product that offers a generous free evaluation period.

If, before the year is up, you have $paid_customers > 100,000, keep going; otherwise, just kill it.

Also, there are more services in AWS than EC2 and S3.

Just exploring them gives plenty of ideas: machine learning, transcoding, scalable NoSQL and SQL databases, mailing and DNS services, worldwide advanced networking and delivery, etc, etc, etc... all in a scalable and elastic way, with high composability and awesome APIs and docs.

Heck, you can even use that money on yourself and put it toward your own self-learning of AWS: deploy your $idea with worldwide high availability, automate all the AWS-integrable components, and provision it with time zones, usage peaks, etc. in mind.

Any idea related to social network effects or massive concurrency works nicely with an "elastic" architecture. For example, build a game that you can play from different social networks!

Lastly, if you're free to choose where the money goes, in your situation I would donate some part to my favorite open-source projects and supporters. It should feel good.


> I saw a company pay $30K for 5 days access to a similar service.

Do you know what features this service had that made it worth $30k for 5 days access? Did they have some unique and useful features, or was it more about the available bandwidth/RPS that could be generated?


This.

Enterprise pays a boat-load to HP and others for performance/load/stress testing. Someone needs to figure out how to offer serious performance testing at an affordable price.


Which company are you hinting at?


You could calculate the highest-quality rendering ever of the Buddhabrot (http://upload.wikimedia.org/wikipedia/commons/7/77/Buddhabro...). The way it's generated makes it impossible to "zoom in". You have to process the whole thing.

To make it really make a splash, you could make an incredible video by iterating through the parameters of the function - fractals turn into beautiful videos when you manipulate random parameters. This requires rendering the Buddhabrot N times for N frames of video.
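
For a sense of why this is so compute-hungry, here's a toy sketch (constants are toy values): the Buddhabrot is a histogram of the entire orbits of escaping points, so every pixel depends on samples drawn across the whole plane - there's no way to restrict computation to a zoomed region.

    import random

    SIZE, SAMPLES, MAX_ITER = 256, 200_000, 500
    hist = [[0] * SIZE for _ in range(SIZE)]

    for _ in range(SAMPLES):
        c = complex(random.uniform(-2.0, 1.0), random.uniform(-1.5, 1.5))
        z, orbit = 0j, []
        for _ in range(MAX_ITER):
            z = z * z + c
            orbit.append(z)
            if abs(z) > 2:  # escaped: plot the whole orbit, not just c
                for p in orbit:
                    x = int((p.real + 2.0) / 3.0 * SIZE)
                    y = int((p.imag + 1.5) / 3.0 * SIZE)
                    if 0 <= x < SIZE and 0 <= y < SIZE:
                        hist[y][x] += 1
                break
    # hist is now a (very low-res) Buddhabrot density map.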


This is what I would do.

3 Steps:

1. Scrape every possible image along with location data (if available). Save all these images in Amazon storage. It's best if you can scrape photo galleries that include building names, sites, or other location descriptive data. Questionable gray area, but this is a mashup of thousands of images.

2. This is now your photogrammetry grid. Take all those photos and generate 3d scenes from the data you scraped.

3. Open up shop with these 3d assets. Charge for quality of object. Extra money if you make it easy to import into UE4, Unity, or Torque and make it "Ready for the Oculus Rift".


4. Get sued for copyright infringement.


That would make a splash.


It's a new work, with substantial creativity involved. I don't see it getting anywhere.


"Substantial creativity" may get you past being a derivative work in Europe, but not in the US. Copyright law differs between jurisdictions. You'd have to be very careful.

I like the original idea. What I'd do is make sure the resulting images don't have any significant reliance on any small set of originals. So if challenged, you could re-create the scene w/o the challenged images and show a court that the scene is not closely derived from any single source.


Having "substantial creativity involved" does not prevent something from being legally considered a derivative work.


And furthermore, you cannot pay your lawyer in Amazon credits.


Unless your lawyer's name is Ed Felten, your name is Barack Obama, and the Amazon credits can be applied to GovCloud!


... 6. Profit.


That's part of business.


You should try getting into a more honest business, then.


Ok. Prove I used Exhibit A in the making of this 3d scene.


Well, we've got the server logs showing your AWS instance accessing that image on our server. As copyright is a tort, the usual burden of proof is the balance of probabilities [UK; "preponderance of evidence" in the USA, I think; please correct if this is wrong in your jurisdiction]. I'd say that's enough to swing it so that you're going to need to prove you didn't use that image... oh, and we have a HN post replying to you suggesting you do this, which swings the balance a touch further.

We can probably have an expert witness testify the scene could use that image (ie they don't visually disagree so much that the scene couldn't have derived from inter alia that image).

Not enough perhaps to prove a criminal case ...


"Here is a subpoena showing the IP address of the AWS instance you controlled along with server logs showing you accessed that image.

Based on public statements stating how you compiled the images and comparison between the client's photo and your image, it is not beyond reasonable doubt that infringement likely occurred."


Unrelated, but are you talking about creating 3d models out of point cloud data or something else?


Nope. A point cloud is just depth data and potentially color data.

Photogrammetry is the technique of 3d scanning that correlates feature points within multiple pictures in order to back-project a 3d scene.

The more pictures you have of an area, the higher quality the overall scan. So if we have 3000 images of a building's exterior in NYC, we can recreate the building in 3d.

My idea was that, for X thousand images, a single image is a trivial datapoint, and could be easily removed with little loss in quality of scan. It may technically be in violation of copyright, but is used for a substantially different work.

I believe it could possibly qualify as fair use.


Using a SIFT pipeline for photogrammetry, I have successfully recreated small objects, Comet 67P, and some buildings from quadcopter pics.

First I search for features with DoG and match them with SIFT, do a bit of extra crawling along matched edges, and the result is a dense coloured point cloud.

The point clouds are converted to a mesh with poisson surface reconstruction and retextured with fragments of the original images.

The Poisson surfaces are never quite as nice as the point clouds - I am using Meshlab for this part.

Processing a few hundred big images takes ages so I send the jobs up to EC2 for a few hours so each job is usually a couple dollars.

It is pretty useful as a 3D scanner.
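
For anyone curious, the matching front end of such a pipeline could look roughly like this OpenCV sketch (filenames hypothetical; the poster's exact tooling isn't stated):

    import cv2

    img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()  # DoG keypoint detection + SIFT descriptors
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Lowe's ratio test prunes ambiguous matches before the
    # correspondences are triangulated into a point cloud.
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    print(len(good), "putative correspondences")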


That-sounds-amazing. Any chance you could share a link of the end result? Also, any suggestions on how to get started with photogrammetry & computer vision?


would love to chat about this - do you have a contact email?


If you have correlated feature points you essentially have a point cloud no? It's just that photogrammetry adds an additional step of back-projecting the 3d scene?


Train some mad, distributed neural nets or some shit. Gather image/art data and train a net to determine beauty. Solve all possible sudokus so the world can finally be free of that junk. Make cloud-driven instant facial recognition (via social media images) a thing. Build a huge pi-as-a-service, and use it with pifs (https://github.com/philipl/pifs)

Bonus points: worlds biggest lolcat host.


Training the models is expensive, but after you have the parameters, saving and using the model becomes cheap (which meets your needs).


Actually I like the gimmicky ideas. They are usually the most newsworthy.

Is solving all possible sudokus actually possible? I can see that making the tech news rounds. Though I suppose you could just solve them on-demand.



Possible? Sure.

The brute-force solution is to consider all possible arrangements of the numbers 1-9 in a 9x9 grid and filter that down to those that satisfy the conditions of a completed grid. Then iterate over the list of completed grids; for each grid, find all possible arrangements of missing numbers and filter those arrangements down to the ones that can still be solved, i.e. where a solution is computable from the remaining information, etc.

I didn't say efficient.
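
For concreteness, the "filter down to valid completed grids" step is just a constraint check like this sketch - it's everything around it that's intractable:

    def is_valid_completed(grid):
        # grid: 9x9 list of lists containing the ints 1-9
        digits = set(range(1, 10))
        for i in range(9):
            if set(grid[i]) != digits:  # row i
                return False
            if {grid[r][i] for r in range(9)} != digits:  # column i
                return False
        for br in range(0, 9, 3):  # the nine 3x3 boxes
            for bc in range(0, 9, 3):
                box = {grid[br + r][bc + c] for r in range(3) for c in range(3)}
                if box != digits:
                    return False
        return True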


A cursory search indicates that there are 6,670,903,752,021,072,936,960 distinct 9x9 sudoku grids. That's a whole lot of cycles and a bit of a storage issue.


Genome sequencing requires a lot of CPU power and disk space. You could build an application that performs these tasks and use the first $100k in compute power to help it grow. There is a lot of domain-specific knowledge in this field and I hear bioinformatics is difficult to monetize.


I suggest you take your algorithm and apply it to other niches. There might be good horizontal scalability there. For example, why not crawl photography sites to discover popular photos and sort them by style or "taste"?

You could then create your own photo discovery service and call it Find my Style or something like that.

There are two ways to do this: you can either directly scan the photos looking for graphical patterns, or you could analyze the text of the page the photo is posted on.


I like this idea. Crawling art/photos online, classifying them, and otherwise mining the data would tie in well with my existing business, and also effectively convert $100k of computing into reusable stored assets.


Do something that matters: https://folding.stanford.edu/


In a similar vein: https://boinc.berkeley.edu/


You can try turning 100k into 150k:

https://www.eff.org/awards/coop


Unless OP finds some way to turn our understanding of prime numbers on its head, the best case in one year is turning the 100k into 50k by finding the first prime with 1,000,000 digits.


Is that prize still being offered?

I think it was claimed in 2000, "The $50,000 prize will go to Nayan Hajratwala of Plymouth, Michigan, a participant of the Great Internet Mersenne Prime Search (GIMPS), for the discovery of a two million digit prime number found using the collective power of tens of thousands of computers on the Entropia.com network."

https://www.eff.org/press/releases/big-prime-nets-big-prize


The article linked refers to the smallest of four prizes offered.

$50,000 to the first individual or group who discovers a prime number with at least 1,000,000 decimal digits (awarded Apr. 6, 2000)

$100,000 to the first individual or group who discovers a prime number with at least 10,000,000 decimal digits (awarded Oct. 22, 2009)

$150,000 to the first individual or group who discovers a prime number with at least 100,000,000 decimal digits

$250,000 to the first individual or group who discovers a prime number with at least 1,000,000,000 decimal digits


Unfortunately that prize has already been awarded :(.


Current largest prime: 2^57885161-1

So take random prime numbers larger than 57885161 (such as 57885167), find a program that can compute with numbers that large within EC2's server constraints, then see if 2^(large_prime_number)-1 is prime. Is that the correct method of doing this?

https://www.eff.org/awards/coop/primeclaim-43112609

What are the stats on testing large prime numbers on EC2 instances?
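
For reference, the exponent p must itself be prime, and the standard check for a Mersenne number 2^p - 1 is the Lucas-Lehmer test - this is what GIMPS runs at enormous scale. A toy sketch:

    def lucas_lehmer(p):
        # Primality test for 2^p - 1, where p is an odd prime.
        m = (1 << p) - 1
        s = 4
        for _ in range(p - 2):
            s = (s * s - 2) % m
        return s == 0

    # Sanity check: 2^7 - 1 = 127 is prime; 2^11 - 1 = 2047 = 23 * 89 is not.
    print(lucas_lehmer(7), lucas_lehmer(11))  # True False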


It would take a while to set up, but offensive infosec people need virtual networks to do pen testing against, because they can't pen test all day every day against their own corporate network. Let the user choose a single machine or group of machines to try to compromise. Go for low cost to attract more users, or go after big corporate training budgets a la sans.org

Put me in a lab of vulnerable servers and I will spend all weekend trying to get an admin / root shell and learn way more than any book would have taught me.

BTW, computer security is sort of taking off on indeed.com http://www.indeed.com/jobtrends?q=metasploit&l=&relative=1


Amazon is not super keen on people pentesting from or against their infrastructure.

You will be better off using labs available as ISOs or VMs.


All pen testing labs I have seen are VMs


You could potentially help millions of people by computing bite-size versions of English Wikipedia each month: run PageRank on it, render a list of the top articles to HTML, and encode it in some way that's quickly decompressable on mobile.

Network connections are often slow, and English Wikipedia is really handy but too big for lots of people to store offline.

I have some code based on Sean Harnett's work here: https://github.com/lukestanley/wiki_pagerank

Additionally, Wikipedia has awesome stats on pageviews that need crunching - there is a wealth of cultural, zeitgeist info that can be parsed and used to prioritise with more than PageRank.
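
The ranking step itself is small - the heavy lifting is building the link graph from a dump. A sketch with networkx, where `links` is a hypothetical iterable of (from_title, to_title) pairs:

    import networkx as nx

    def top_articles(links, n=10000):
        g = nx.DiGraph()
        g.add_edges_from(links)
        scores = nx.pagerank(g, alpha=0.85)  # standard damping factor
        return sorted(scores, key=scores.get, reverse=True)[:n]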


Use it to build web scrapers hunting for AWS Keys on GitHub, to spin up more instances to scrape for AWS keys to..


AWSception


Use it to help the world: throw up a few thousand Tor relays/bridges.


You could make something that lets other people spawn instances in your cloud (e.g. via CloudFormation). For instance, take <random open source project not available in AWS> and let people easily spawn clusters of it. This way you can run your beta phase for free, test, then set the price to the public.


One idea I had was to turn http://www.arthunted.com into a service, so you can specify parameters and sites to crawl and gather data on demand. $5 a day times xxxx users.


scrapinghub seems to be the top dog in this space. You might want to check it out.

What's your background? Are you looking for business idea that can be built upon your existing site, or are you looking for something completely new?


It doesn't have to be a business idea! Just an interesting idea that would otherwise require a lot of compute or storage, or something else provided by AWS.


chrischen, I am also facing a similar situation. So, which of all the suggestions did you find the most interesting? And what have you actually planned to do with it?


WHOA. This is like Google-scale computing power... AI for sure. Let me do some research, but here are some ideas:

disclaimer: these are a bit of a stretch to be sure, but hey

Never-ending learner:

Part 1: machine vision on video + language learning on the web + pattern recognition => new semantic constructs

Part 2: semantic constructs + reinforcement learning + intrinsic exploration => action selection

Result: an agent that can take in real-world scenarios, including language, text, and vision, and learn to do tasks by itself, improving how it learns as it goes. EPIC.

Autonomous programmer:

Code with comments (from GitHub, community-based labeling, etc.) + machine learning + knowledge representation => "understanding code" machine + natural language => PROFIT

Result: you can say, "what are the sales this quarter?" and it'll deduce the logical steps (parse, read from db, etc.) and tell you the answer.


Do a better version of iThenticate, which helps to prevent and find plagiarism in published content. We use it to help verify that content someone has sent us is unique and not just copy-pasted in part or whole. It also helps us find uses of our published content and course material that have been re-purposed or copied verbatim. The entry-level price point for iThenticate is about $5k per year. And unfortunately copyscape.com is not the same as iThenticate.


some considerations

- budget for bandwidth... esp if you are doing something more than text CRUD and want to serve it - e.g. image/video

- video is "heavy" and so necessarily takes a lot of compute, ram and storage and can also leverage gpu and pricier instances, depending on what you are trying to do... e.g. understanding video content with opencv/opencl or specific types of drawing like raytracing

- instance limit increases with AWS are perfunctory... so i wouldn't consider 40 a limit; certainly don't design anything interesting with that limit in mind

- spot instances can save you a lot and stretch that $100K 1.5x - 4x depending on region, availability zone, and instance type

- unless you've done it before, time is your enemy to get into position to spend that money on something useful... so your monthly budget target range makes sense


Go talk to the guys at ClusterK.com about their balancer device. It will logically auto-balance your spawn of spot instances across many AZs and make you very resilient. Tell them you want to prove out their balancer.

Do this, because intelligently following the cheapest spot price will save you 90% of the typical costs.

This will make your $100K able to support a ton more instances than it would otherwise.

Make sure that whatever you do though is not chatty between the nodes - if you have a ton of instances talking across zones the data transfer fee can be significant.

Make sure that you store large data on an EBS volume that an instance mounts, to prevent large transfer fees between instances and S3.

All instance limits can be raised. The only hard limit in AWS is 100 S3 buckets per account.

Just put in a limit increase request via the console. Email gilleyt@amazon.com if you have issues.


>>> I got $100k in AWS Credits

May I ask how ?


It's a credit given to companies backed by an accelerator.


Even ones that can't think of anything to do with it? I wish I had these problems :-/


It's given out like candy; if not the $100k one, you can surely get several thousand dollars' worth of credits for server hosting, either on Azure or through Amazon.


Can confirm - with my MSDN license I get a $150/mo credit for any development/test instances. As long as I have a valid license (and they offer the deal) I'll get the credit.


which one?


I know Startupbootcamp has these deals for their batches. A similar deal is available for Google business users.


Look up some of the prime factorization prizes and do the math to see if it is achievable.


I would invest it in 5 early-stage startups - $20K of computing time each.


Well, if you're literally not going to use it - and you'll lose it at the end of a year... Can you donate it to Folding@Home or something?

I'm guessing AWS wouldn't want you to...


Why don't you use the credits to try to create a publicly queryable index of the web in a standardised format? Read: open-source search engine. As you've got the money, just ignore efficiency.

Else... commit part of it to one of the computing@home projects?


If you're interested in a publicly queryable index of the web, you could try running a search server such as Elasticsearch on the Common Crawl[1] corpus. Elasticsearch runs the search backend of WordPress, 600 million+ documents in total[2], so extending it to a Common Crawl archive seems possible.

n.b. I'm a data scientist at Common Crawl, so have a vested interest!

Also, whatever experiment you end up pursuing, remember to use spot instances if your setup allows for transient nodes - it'll substantially decrease your burn rate (usually 1/10th the price) allowing for even larger and more insane experiments :)

[1]: http://commoncrawl.org/

[2]: http://gibrown.com/2014/01/09/scaling-elasticsearch-part-1-o...
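
For what it's worth, bidding for spot capacity is a few lines with boto3 (the AMI, price, and instance type below are hypothetical placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    resp = ec2.request_spot_instances(
        SpotPrice="0.10",               # max bid per instance-hour, USD
        InstanceCount=10,
        LaunchSpecification={
            "ImageId": "ami-12345678",  # hypothetical worker AMI
            "InstanceType": "c4.xlarge",
        },
    )
    print([r["SpotInstanceRequestId"] for r in resp["SpotInstanceRequests"]])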


I had a crawling project where I wanted to get a sense of a few ad-related things on the internet and came upon Common Crawl. I was initially excited, since I thought it would have incidentally captured the data I wanted, but I was disappointed to find that they did not do any kind of JS execution, which limited the effectiveness for me pretty drastically.


I'd never heard of Common Crawl before but it looks like an awesome project! Keep up the good work!


How up-to-date is commoncrawl data?


The idea we had at my work was to set up a web service that just puts URLs in your S3 bucket. Nothing more, nothing less.

Filepicker's storeUrl function without the baggage. Designed for server-side apps that prefer to avoid streaming / downloading-and-re-uploading files locally in order to get them into a bucket.

Non-trivial compute-resource value-add that would take very little time to code. Low risk for the adopter: their files are stored in their bucket.


Whatever you do, only use spot instances.


I assume you had two options, 100k for a year or 10k for 2 years (I think)? (I've seen these offers)

Is that the case? And if so, why not take the other option?


$100k is a much bigger number.


Sure but it's like saying "here's a meal for 20, you have to eat it all today... or I'll feed you and only you for a year"


No, it's like saying "here's $20, you have to use it all today. Or I'll give you $4, half today, half tomorrow".


Take the meal for 20 and host an impromptu banquet. Charge a per-person fee and make some money.


I'd say do this http://lg.io/2015/04/12/run-your-own-high-end-cloud-gaming-s... and make it super affordable with generous free plans.


You could build a video conversion website where one uploads original high-resolution video and it spits out 1080p/720p/320p and other formats suitable for delivery across different devices and bandwidths. This could be an alternative for people hosting video on YouTube and getting slapped with ads. An effort like this would use a lot of CPU, but once a video is converted it's just the storage cost. The common challenge is copyright, but I can see ways to promote it as a professional service rather than a collection of random videos. $100K would cover the cost of offering it free for a limited period.
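
The fan-out transcode step could be as simple as shelling out to ffmpeg, as in this sketch (the rendition list and flags are illustrative, not tuned):

    import subprocess

    RENDITIONS = [("1080p", "1920x1080"), ("720p", "1280x720"),
                  ("320p", "480x320")]

    def transcode(src):
        for name, size in RENDITIONS:
            subprocess.run(
                ["ffmpeg", "-i", src, "-s", size, "-c:a", "copy",
                 f"out_{name}.mp4"],
                check=True,  # raise if ffmpeg fails on this rendition
            )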


As strange as it sounds, that much bandwidth would probably suck up $100k faster than you'd imagine. There's a reason a lot of large companies that deal in video have their own data centres and crazy bandwidth deals that make it cheap. I don't think $100k at Amazon's prices would last more than a few months if the service became popular.


That sounds pretty much identical to Zencoder, encoding.com, etc.


>> ... no arbitrage, reselling, bitcoin mining

You just scrapped my first 3 ideas :)


Actually, it's an interesting task to come up with a creative way to convert Amazon credit into hard cash (which doesn't expire).

I don't think bitcoin mining via CPU (virtualized especially) is feasible for that anymore.


Altcoin mining on the amazon gpu machines was still viable, last I checked.


Well, it WAS, until Scrypt ASICs came out about a year ago. Now it's not viable. You'd be lucky to get a percent back.


Would running a PaaS on top of EC2 be considered "reselling"?


Only if it is purely a subset of EC2.


Do what you like -- but reserve instances so you can do it for longer.


Sadly,

> You may not use Promotional Credit for any fees or charges for Reserved Instances, Amazon Mechanical Turk, AWS Support, AWS Marketplace, Amazon Route 53 domain name registration, any upfront fee for any Service

[1]: https://aws.amazon.com/awscredits/


Artificial intelligence? It needs huge datasets and lots of CPU to train. Then you could put it to work, maybe captioning video or something.


Solve chess. Write a paper: "White wins."


I will let others answer "how should I use it"; what I want to know is how you got $100k in AWS credits.


You could experiment with server farms and EC2 spot instances to figure out strategies for minimizing the cost of maintaining huge server farms. Once learnt, you can sell that skill to multiple companies, and with the money gained just do your next thing, like giving to charity, having a beer, etc.


Host art projects. Let people render fractals. Render video walkthroughs of highly detailed scenes.


Build an artificial life program, and use it to discover and optimize algorithms or electronic circuits relating to a problem that interests you.

If you don't have a problem of your own, build a better internal power supply for consumer electronics devices.


Build fuzzing infrastructure, to find bugs that people will pay for (example: google chrome). There's a revenue stream, interesting technical challenges, and you're helping raise the security bar.
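
The core loop is tiny - the credits go into running it at scale. A toy illustration (the target binary and seed file are hypothetical; real infrastructure, AFL-style, adds coverage feedback and corpus management):

    import random
    import subprocess

    def mutate(data: bytes) -> bytes:
        b = bytearray(data)
        for _ in range(random.randint(1, 8)):  # flip a few random bytes
            b[random.randrange(len(b))] = random.randrange(256)
        return bytes(b)

    seed = open("seed.bin", "rb").read()  # hypothetical non-empty seed input
    for i in range(10_000):
        sample = mutate(seed)
        proc = subprocess.run(["./target"], input=sample)  # hypothetical binary
        if proc.returncode < 0:  # killed by a signal => likely crash
            open(f"crash_{i}.bin", "wb").write(sample)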


For those who want to know what this $100k deal is...

http://aws.amazon.com/activate/benefits/



Put the computer power toward some good cause -- Folding@home or similar.


3d render farm?


He's a bit limited on instances though.

Still, could find some animation studios, offer them your farm for some % of profits.

Pretty big gamble though.


For those wondering how he got this much in credits:

http://aws.amazon.com/activate/


Would it be possible to create a site that shows in near real-time what images are being shared or are popular at any point of time?


We also got the $100k AWS credits for our startup. We're going to be using it to pay for our server and CDN costs :)


Well by the time I got it, I had already paid for reserved instances...


How did you manage to get that much credit?


But how did you manage to get that much??


would you mind sharing some details about how you are scraping?

I am also trying to build a crawler. The problem is each site has its own HTML structure. How do you handle this? Have you written scraping rules for each site? That's a nightmare to maintain, especially when you have a lot of sites to crawl.


I think you want to have a system that can use XPath or CSS queries to select the elements you want.

This way writing a scraper for a given page is almost as easy as right clicking on an element in dev tools and selecting "Copy as XPath" for what you want.

You definitely need some validation that your scraper is still returning accurate results, so that you can get notified when things go wrong. Things like following links from an item to the item's product page and comparing scraped prices, names & images should get you a lot of the way.

At some point this will definitely get unwieldy, and you can try to build a more general solution that can understand grids or layout, but despite my preference for this as both a shopper of long tail sites and a developer, this is probably not where you want to start unless the long tail is your actual niche.


You recommend staying specific to a few categories instead of crawling everything available on the internet? We are starting only with women's clothing.


I was more referencing the typical approach that people took of supporting the top N most popular sites and increasing N as they got bigger.

It's a solid approach for hitting the majority of the market, and works fine for alerting, but it leaves a pretty big gap in the market for people who are interested in comparison shopping for more boutique items. E.g., designer male fashion gets sold by piles of different boutiques, each with their own sales, etc., but the items are exactly the same, and I would really like to know when something I am interested in goes on sale at one of the 50 different stores that have that item - and only when it goes on sale in my size, and whether it's actually cheap after currency conversion and shipping. A person can dream, right?

Shit, I would love it if there was a platform that could guess my size across various items in different brands.

I've thought of this space a bit since I buy a decent amount of clothes, but I've never gone ahead and tried to execute.


If you're building a general-purpose crawler, use a regexp to select the content of the body tag, then strip out all the tags inside it. You'll be left with a long string of words that you can then index... Tags, generally speaking, are unimportant if you're not rendering the content.

Of course you might want to leave some tags in, like links and titles. They convey more than just layout.
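
A quick-and-dirty sketch of that approach (regexes are brittle against arbitrary HTML, but fine for rough text extraction):

    import re

    def body_text(html):
        m = re.search(r"<body[^>]*>(.*?)</body>", html, re.S | re.I)
        body = m.group(1) if m else html
        return re.sub(r"<[^>]+>", " ", body)  # strip the remaining tags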


Thanks for your reply. I am building a price comparison/alert engine, so I am interested in product descriptions, prices, images, or anything else closely related.


Shameless plug: You might be interested in checking out Diffbot (http://www.diffbot.com/). That use case is exactly what it was built for.


I just used scrapy. It lets you query using XPath or CSS selectors.
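
A minimal spider along those lines (the site URL and selectors are hypothetical):

    import scrapy

    class ArtSpider(scrapy.Spider):
        name = "art"
        start_urls = ["https://example.com/gallery"]

        def parse(self, response):
            # CSS selectors shown here; .xpath() works the same way
            for post in response.css("div.post"):
                yield {
                    "title": post.css("h2::text").get(),
                    "image": post.css("img::attr(src)").get(),
                }
            next_page = response.css("a.next::attr(href)").get()
            if next_page:  # follow pagination
                yield response.follow(next_page, self.parse)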


How about not using it, because AWS doesn't run on 100% renewable energy? Make a $100,000 statement!


You could build a middle-out compression algorithm, which would make data storage problems smaller.


I would suggest you buy servers on a reserved contract and sell them on to others.


If someone had an idea for how to get value out of $100K in AWS credits, do you think they would tell you? Why wouldn't they take the idea, pitch it, get $100K in AWS credits, and then double it themselves?


Because he's not asking for ways of generating more than $100k of profit out of his $100k in AWS credits, which would be quite difficult. He's in the unique situation of having $100k in free AWS credits and would probably be happy getting back half of that as cash.


Well I'm less interested in converting it to cash, as I could easily just start reselling the credits.

I'm trying to do some hacky project that would make a big splash... as there's $100k of value to be consumed.


Out of curiosity how'd you get the credits?


Any idea where you could resell the credits?


Because humans are often irrational and lazy.


In a word: Risk


Open up the machines and post credentials here :-)


Do something with video. That could suck up $100k.


Jarvis, of course


Why can't you mine bitcoin?


If you do the math, you won't get much more than $4-5k in bitcoin...


As above, but also, mining cryptocurrencies is against the AWS terms of service.


Citation please. I just searched the AWS ToS, AUP, and customer agreement and found no such restriction.

Or do you mean that the startup $100K credit is restricted from obvious reselling/mining/anything that isn't a value-add? Because of course AWS doesn't want to give away money, but to promote startups actually using their services.


Why is that? Are they oversubscribed?


Because that only makes ~1,000$ or less which seems like a huge waste.


Porn Site.


Host a massive minecraft server.


gpu instances -> bitcoin mining


DDoS a small country. Even if for a day.


train a convnet to recognize 80M tiny images: http://groups.csail.mit.edu/vision/TinyImages/


Resell it. Turn it into real money.


mine dogecoin? (j/k)


The altcoin ideas are getting downvoted, but with the right alt coin he could make $10k+ in the bank to keep.

Anything related to creating an ongoing service, whether for love or profit, is madness, because in 12 months that service has to stop unless the OP has $100k to spare to keep it going for the next year.


What are the right alt coins that you're hinting at?


How about an altcoin miner cluster ? :D


Crypto-currency mining? You could burn through that very quickly with enough machines.



