Ask HN: Review my mini-app. Easily validate all the (X)HTML pages in a website. (aarongough.com)
12 points by aarongough on June 3, 2009 | 21 comments



INPUT: http://google.com/

Fatal error: Out of memory (allocated 47710208) (tried to allocate 35 bytes) in /home/aarongou/public_html/easy_web_qa/simple_html_dom.php on line 760

SUGGESTION: Sane-itize input.
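
One way to read this suggestion is to reject unusable start URLs before any crawling begins. A minimal sketch in Python (not the app's actual PHP); `clean_start_url`, `ALLOWED_SCHEMES`, and `MAX_URL_LENGTH` are hypothetical names, and the limits are assumptions:

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumption: only web URLs are accepted
MAX_URL_LENGTH = 2048                # assumption: arbitrary cap on input size


def clean_start_url(raw):
    """Return a trimmed start URL, or raise ValueError if it is unusable."""
    url = raw.strip()
    if not url or len(url) > MAX_URL_LENGTH:
        raise ValueError("URL is empty or too long")
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError("only http and https URLs are accepted")
    if not parts.hostname:
        raise ValueError("URL has no host name")
    return parts.geturl()
```

Input checks like this would not have prevented the out-of-memory error above, which came from a perfectly valid URL; they only keep obviously bad submissions out of the crawler.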


Haha, Ouch.

I hadn't thought of testing it with something quite as insane as Google. Nice catch!


SUGGESTION: Have a limit on how far down (and across) you'll recurse

I got the same error when trying CNN.com
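
A limit on depth and breadth could be sketched roughly as follows. This is Python rather than the app's PHP, and `crawl_order`, `get_links`, `max_depth`, and `max_pages` are made-up names with arbitrary defaults; `get_links` stands in for whatever returns the links found on a page:

```python
from collections import deque


def crawl_order(start, get_links, max_depth=3, max_pages=100):
    """Breadth-first list of URLs to visit, bounded in depth and total count."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit: do not follow links from this page
        for link in get_links(url):
            # breadth limit: stop accepting new URLs once the cap is reached
            if link not in seen and len(seen) < max_pages:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```

Because `seen` never grows beyond `max_pages`, both the queue and the result stay bounded no matter how many links a single page contains.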


There actually is a limit of 100 pages per submission; however, it is obviously not storing the data for those 100 pages very efficiently.

Looks like tomorrow is debugging day!

I wrote this app over the course of a day a couple of months ago and it's been languishing on my server since then. At the moment it's written in PHP and is not exactly what you'd call 'well tested'.

I'm considering migrating it to Rails (or Camping). Any suggestions for mini-frameworks (like Camping) suitable for a single-page app would be welcomed!


Very cool tool. Useful too. =]

The framework isn't the issue in this case. It looks like you're either loading all 100 pages into memory at one time or keeping the previous pages in memory as you go "deeper" into the hierarchy.

I recommend keeping only a few pages in memory at any given time, along with a simple list of the URLs scraped out of that HTML. If you're using PHP, remember to unset() the variables storing the HTML once you're done with them.
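
The idea of holding one page at a time and keeping only a URL list can be sketched like this. It is a Python illustration under assumptions, not the app's PHP code: `validate_site`, `LinkCollector`, `fetch`, and `validate` are hypothetical names, and `fetch` and `validate` are supplied by the caller:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def validate_site(start, fetch, validate, max_pages=100):
    """Validate pages one at a time; only URLs and small results are kept."""
    host = urlsplit(start).netloc
    pending = deque([start])
    seen = {start}
    results = {}
    while pending and len(results) < max_pages:
        url = pending.popleft()
        html = fetch(url)
        results[url] = validate(html)  # keep the summary, not the page
        collector = LinkCollector(url)
        collector.feed(html)
        del html  # drop the reference to the raw HTML, much like unset() in PHP
        for link in collector.links:
            # assumption: only links on the starting host are followed
            if link not in seen and urlsplit(link).netloc == host:
                seen.add(link)
                pending.append(link)
    return results
```

At any moment only one page's HTML is referenced, so memory use depends on the largest single page rather than on the total size of the site.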

I haven't used PHP in a long time and of course I'm just speculating based on the error message, so slight disclaimer there.


I also meant to say, re: "Framework isn't the issue in this case": I agree. It is possible to turn out nice code in PHP and write fast, reliable apps. That being said, Rails has me addicted to the idea of unit tests within easy reach. I'm not exactly a massive fan of any of the unit testing systems that I have seen for PHP, and that is my main reason for wanting to switch away in this case...

I use PHP all the time, I am just a little dissatisfied with it in light of the alternatives that are out there.


Yeah, I'm pretty sure that I was storing all of the raw HTML for possible further analysis down the line. I haven't even looked at the code in months, but it definitely needs overhauling!

Initially it started out as a 'what if I do this' kind of app. I would like to re-write the backend with more attention to detail...


Check out Sinatra for the micro-framework, and RestClient to get the websites.


How about Google's AppEngine? It scales quite nicely if you write your stuff in a non-braindead way.

A public service like that is a perfect candidate for their architecture, I think.


Good suggestion! I've been meaning to try out AppEngine for a while now but haven't had the chance. It would also be nice to keep any load away from my servers...


I wrote this app to streamline the process of checking medium/large websites for (X)HTML compliance... I basically want to get it out there so that other web designers/developers don't have to go through the pain of checking everything manually!

Let me know if you have any suggestions for improvements or find any bugs.


Is it really that different to the results with http://www.htmlhelp.com/tools/validator/ ? You're dragging in W3C results too it seems.

I find that once I have the site chrome (structural features) valid, I don't really care too much about small errors that don't impact the appearance. When you've got code dragged in, with JavaScript calls, that doesn't validate anyway (Google, etc.) ...

These things always give millions of entity errors (other people dictate the URLs used). Perhaps add an initial view that just says which errors are on a page by broad category (entity errors, failed to close a tag, incorrect attributes, etc.), with drill-down to the error details.

These details in the summary would be good too, as would the number of clean pages.
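
A category summary with drill-down, as suggested here, might be structured along these lines. This is a Python sketch; `summarize` and the `(url, category, message)` error tuples are assumptions for illustration, not the format either validator actually returns:

```python
from collections import Counter, defaultdict


def summarize(pages, errors):
    """Group errors by page and broad category, keeping details for drill-down.

    pages:  iterable of every URL that was checked
    errors: iterable of (url, category, message) tuples
    """
    by_page = {url: Counter() for url in pages}
    details = defaultdict(list)
    for url, category, message in errors:
        by_page.setdefault(url, Counter())[category] += 1
        details[(url, category)].append(message)
    clean_pages = sum(1 for counts in by_page.values() if not counts)
    return {
        "clean_pages": clean_pages,  # pages with no errors at all
        "by_page": by_page,          # first view: error counts per category
        "details": dict(details),    # drill-down: the individual messages
    }
```

The first view would render only `by_page` and `clean_pages`, and the individual messages in `details` would be shown when a category is expanded.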

I don't like the dark theme.


Both validators are consulted because they can pick up different errors. The WDG validator is _much_ more reliable when it comes to finding Unicode issues, and that has saved my ass at least once.

Your other thoughts are things I'll keep an eye on when I do the re-write.

-A


Useful tool. While waiting for results, a progress bar would definitely be more useful in terms of feedback than the spinning logo.


Noted! It may not be feasible for the first stage of the process, though, as the system does not actually know how many URLs are in the site.

I'll definitely have a look into that though.


You might want to look at doing this in several passes:

- Build a site-map, like parsers build a syntax tree.

- Follow that to validate one page at a time.
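
The two passes above can be sketched as follows. This is a Python illustration with hypothetical names (`run_in_passes`, `discover`, `validate`, `report`); how the site-map is actually discovered is left to the caller:

```python
def run_in_passes(start, discover, validate, report=print):
    """Pass 1 builds the full URL list; pass 2 validates one page at a time."""
    site_map = list(discover(start))  # pass 1: only URLs are held in memory
    total = len(site_map)
    results = {}
    for done, url in enumerate(site_map, start=1):
        results[url] = validate(url)  # pass 2: one page per iteration
        report(f"{done}/{total} {url}")  # total is known, so progress is exact
    return results
```

A side effect of this split is that the number of pages is known before validation starts, which is what a progress bar needs.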


That is certainly a valid way of doing it and is the methodology that I have used in several search-engine projects in the past (like this: http://intrasitesearchsupport.com/)

However, in this case I am offloading the work of building the site-map to the WDG validation service, as I didn't want to have to obtain new servers to provide a free service. This means that I don't get the site-map until the WDG results come back...


Curious: Why would people use this instead of the W3C validator?

And for your fix list:

it's entirety -> its entirety.


Difficulty with the W3C validator is actually why I designed this. Their validator doesn't allow you to check all of a site in one go. You have to go through page by page. Even on a relatively small site that can take a lot of time!


An extremely useful tool. Thank you so much.


No, thank you! I just hope that it saves some developers from unnecessary pain!



