There actually is a limit of 100 pages per submission; however, it is obviously not storing the data for those 100 pages too well.
Looks like tomorrow is debugging day!
I wrote this app over the course of a day a couple of months ago and it's been languishing on my server since then. At the moment it's written in PHP and is not exactly what you'd call 'well tested'.
I'm considering migrating it to Rails (or Camping). Any suggestions for mini-frameworks (like Camping) suitable for a single-page app would be welcomed!
Framework isn't the issue in this case. It looks like you're loading all 100 pages into memory at one time, or keeping the previous pages in memory as you go "deeper" into the hierarchy.
I recommend you keep only a few pages in memory at any given time, plus a simple list of the URLs scraped out of that HTML. If you're using PHP, remember to unset() the variables storing the HTML and so on.
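To make that concrete, here is a rough sketch of the idea, written in Python rather than PHP just to keep it short. It is only an illustration under assumptions: `crawl`, `LinkExtractor` and the `fetch` callable are made-up names, not anything from the actual app. The point is that only one page's HTML is held at a time, and the only things that accumulate are URL strings.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Visit up to max_pages URLs, holding one page's HTML at a time.

    `fetch` is an assumed callable that takes a URL and returns HTML text.
    """
    queue = deque([start_url])   # URLs still to visit
    seen = {start_url}           # URLs already queued, to avoid repeats
    visited = []                 # URLs actually processed, in order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        parser = LinkExtractor()
        parser.feed(html)
        visited.append(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Drop the raw HTML and parser before the next page,
        # the rough equivalent of unset() in PHP.
        del html, parser
    return visited
```

If the raw HTML is needed for later analysis, writing each page to disk before dropping it would keep memory flat while still preserving the data.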
I haven't used PHP in a long time and of course I'm just speculating based on the error message, so slight disclaimer there.
I also meant to say:
re: "Framework isn't the issue in this case"
I agree. It is possible to turn out nice code in PHP and write fast, reliable apps. That being said, Rails has me addicted to the idea of unit tests within easy reach. I'm not exactly a massive fan of any of the unit testing systems that I have seen for PHP, and that is my main reason for wanting to switch away in this case...
I use PHP all the time, I am just a little dissatisfied with it in light of the alternatives that are out there.
Yeah, I'm pretty sure that I was storing all of the raw HTML for possible further analysis down the line. I haven't even looked at the code in months but it definitely needs overhauling!
Initially it started out as a 'what if I do this' kind of app. I would like to re-write the backend with more attention to detail...
Good suggestion!
I've been meaning to try out AppEngine for a while now but haven't had the chance. It would also be nice to keep any load away from my servers...
I wrote this app to streamline the process of checking medium/large websites for (X)HTML compliance... I basically want to get it out there so that other web designers/developers don't have to go through the pain of checking everything manually!
Let me know if you have any suggestions for improvements or find any bugs.
I find that once I have the site chrome (structural features) valid then I don't really care too much about small errors that don't impact the appearance. When you've got code dragged in, with JavaScript calls, that doesn't validate anyway (Google, etc.) ...
These things always give millions of entity errors (other people dictate the URLs used). Perhaps an initial view that just says which kinds of errors are on a page (broad categories: entity errors, failed to close a tag, incorrect attributes, etc.), with drill-down to the error details.
These details in the summary would be good too, as would the number of clean pages.
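Something along these lines is what I have in mind, sketched in Python purely as an illustration. The `summarize` function, the category names and the `(page, category, detail)` tuple shape are all assumptions on my part, not how the app stores its results.

```python
from collections import defaultdict


def summarize(errors, all_pages):
    """Group validator errors by broad category, keeping details for drill-down.

    `errors` is assumed to be a list of (page, category, detail) tuples;
    `all_pages` is the list of every page that was checked.
    """
    # category -> page -> list of error details
    details = defaultdict(lambda: defaultdict(list))
    for page, category, detail in errors:
        details[category][page].append(detail)

    # Total number of errors per category, for the top-level view
    counts = {
        category: sum(len(items) for items in pages.values())
        for category, pages in details.items()
    }

    # Pages that produced no errors at all
    pages_with_errors = {page for page, _, _ in errors}
    clean_pages = [p for p in all_pages if p not in pages_with_errors]

    return {"counts": counts, "clean_pages": clean_pages, "details": details}
```

The top-level view would show only `counts` and the number of clean pages; clicking a category would open the matching entry in `details`.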
The reason that both validators are consulted is that they can pick up different errors. The WDG validator is _much_ more reliable when it comes to finding Unicode issues and that has saved my ass at least once.
Your other thoughts are things I'll keep an eye on when I do the re-write.
That is certainly a valid way of doing it and is the methodology that I have used in several search-engine projects in the past (like this: http://intrasitesearchsupport.com/)
However, in this case I am offloading the work of building the site-map to the WDG Validation service, as I didn't want to have to obtain new servers to provide a free service. This means that I don't get the site-map until the WDG results come back...
Difficulty with the W3C validator is actually why I designed this. Their validator doesn't allow you to check all of a site in one go. You have to go through page by page. Even on a relatively small site that can take a lot of time!
Fatal error: Out of memory (allocated 47710208) (tried to allocate 35 bytes) in /home/aarongou/public_html/easy_web_qa/simple_html_dom.php on line 760
SUGGESTION: Sane-itize input.