For reasoning about tree-based data such as HTML, I also highly recommend the declarative programming language Prolog. HTML documents map naturally to Prolog terms and can be readily reasoned about with built-in language mechanisms. For instance, here is the sample query from the htmlq README, fetching all elements with id get-help from https://www.rust-lang.org, using Scryer Prolog and its SGML and HTTP libraries in combination with the XPath-inspired query language from library(xpath):
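A sketch of such a query (assuming http_open/3 from library(http/http_open) and load_html/3 from library(sgml); details may vary slightly):

?- http_open("https://www.rust-lang.org", Stream, []),
   load_html(stream(Stream), DOM, []),
   xpath(DOM, //(*(@id="get-help")), Element).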
In a certain sense (for example, when measuring brevity), it is indeed easy to write this example in Python. However, the Python version also illustrates how many different language constructs are needed to express the intended functionality: in comparison to Prolog, Python is a quite complex language, with loops, objects, methods, assignment, dictionaries etc., all of which are used in this example.
As I see it, a key attraction of Prolog is its simplicity: with a single language construct (Horn clauses), you are able to express all known computations, and the example queries I posted show that the same construct, again a Horn clause, is all that is needed to express and run a query. The Prolog query, and also every Prolog clause, is itself a Prolog term and can be inspected with built-in mechanisms.
As a consequence, an immediate benefit of using Prolog for such use cases is that you can easily reason about user-specified queries in your applications, and for example easily allow only a safe subset of code to be run by users, or execute a user-specified query with different execution strategies etc. In comparison, Python code is much harder to analyze and restrict to a particular subset due to the language's comparatively high syntactic complexity.
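For instance, a user-supplied goal could be vetted with a few clauses before it is called. A minimal sketch, where the whitelist of admissible goals is only illustrative:

% A goal is considered safe if it is a conjunction of whitelisted goals.
safe_goal((A,B)) :- safe_goal(A), safe_goal(B).
safe_goal(xpath(_,_,_)).
safe_goal(member(_,_)).

A goal such as (xpath(DOM, //a, E), member(E, Es)) passes this check, while any goal not covered by these clauses simply fails it.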
I don't think the OP's point was "how easy it would be to hire developers", or even "taking all the considerations a business is under, I feel Prolog makes sense". He was just touting how easy Prolog's built-in pattern matching and declarative style make implementing and using selectors at a language level.
Honestly, if we didn't talk about the benefits of a language irrespective of how easy it is to hire for it, we'd never have introduced anything beyond FORTRAN, if we even made it that far. Bringing "X is easier to hire for" into a conversation about the language is, at best, a non-sequitur.
We might have been better off that way. FORTRAN does have its downsides, but language churn itself has downsides that almost always outweigh the assumed upsides of a better language.
If we had just stuck with FORTRAN forever, how many problems would have been completely avoided!? There’d be better, and more, IDEs, since even if the language is hard to parse, it’s still just one parser that needs all the effort. So many unfortunate problems in education caused by language and ecosystem churn would have been avoided (the infamous “by the time you graduate, it’s always outdated” problem).
The only problem is that FORTRAN is too new. Should’ve stuck with the Hollerith tabulator.
Fun aside: In practice I've found that most people touting what's easy to hire for -vastly- overestimate how difficult it is to pick up a new language sufficiently well to be productive in it and able to support it in production. This is doubly amusing when you consider that the same people also frequently tout how they want to "hire the best".
Thanks, that's a rare example of something which is (a) simple enough to understand for a Prolog newbie like me, and (b) more practical than the ubiquitous family-tree example.
I'm always looking for opportunities to dip my toes into Prolog; in hindsight it's clearly a good fit for tree-structured data.
Interestingly, the only other context in which I've come across Prolog is from friends who studied at Cambridge, here in the UK. For some reason, the CS 'tripos' (course) there is really heavily focussed on Prolog, and everyone I know from there ended up a huge fan of the language. I'm not sure why that's the case, though, given that almost all other universities seem to use more common languages (Java, C++, etc).
"Prolog as a library" => Given "functional" constraints => $CONSTRAINTS.prolog( "query..." ) => results
...many languages can benefit greatly from offloading a portion of their logic to something Prolog-ish (similar to regex / state machines), but it's unfortunate that Prolog knowledge isn't as widely distributed.
I studied CS at a different university in UK and we used Prolog for one module on AI or perhaps machine vision. I really enjoyed working with it. This was 15 years ago. Looking through their current curriculum I can't see prolog being mentioned anymore. Shame!
cs.man.ac.uk, at least back in 1992, had a compulsory Prolog module in the first year. Don't know anyone from then who didn't hate that module with a burning passion.
(There was no Java, C++, etc. either. It was SML, Pascal, 68000, and Oracle Pascal-Embedded-SQL.)
AFAIK, this was first proposed and implemented in Ciao Prolog back in the late 90s (modern versions here: https://ciao-lang.org/ciao/build/doc/ciao.html/html.html). It was way before Python was popular and before JavaScript even existed.
I tried to run this on my computer just now, but as a complete Prolog noob, I'm getting errors running the script. How do you load the http_open module/library in the first place? I tried following some Prolog tutorials in the past but I always get stuck trying to run something in the REPL. I'm using scryer-prolog. Thanks in advance!
The libraries I mentioned can be loaded by invoking the use_module/1 predicate on the toplevel; here is a transcript that loads the SGML, HTTP and XPath libraries in Scryer Prolog:
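(Module paths as in current versions of Scryer Prolog; exact names may differ across versions.)

?- use_module(library(sgml)).
true.
?- use_module(library(http/http_open)).
true.
?- use_module(library(xpath)).
true.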
The second query also uses portray_clause/1 from library(format), which you can load with:
?- use_module(library(format)).
true.
After all these libraries are loaded, you can post the sample queries from above, and they should work.
There are also other ways to load these libraries: A very common way to load a library is to use the use_module/1 directive in Prolog source files. In that case, you would put for example the following 4 directives in a Prolog source file, say sample.pl:
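Using the same module paths as above, plus library(format) for the second query:

:- use_module(library(sgml)).
:- use_module(library(http/http_open)).
:- use_module(library(xpath)).
:- use_module(library(format)).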
You can then again post the goals from above on the toplevel, and it will work too.
Another way is to put these directives in your ~/.scryerrc configuration file, which is automatically consulted when Scryer Prolog starts. I recommend doing this for libraries you frequently need. Common candidates are for example library(dcgs), library(lists) and library(reif).
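For example, such a ~/.scryerrc could simply contain:

:- use_module(library(dcgs)).
:- use_module(library(lists)).
:- use_module(library(reif)).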
Personally, I start Scryer Prolog from within Emacs, and I have set up Emacs so that I can consult a buffer with Prolog code, and also post queries and interact with the Prolog toplevel from within Emacs.
This looks very useful, big fan of all the ^[a-z]+q$ utilities out there. But as a user, I would probably want to use XPath[0] notation here. Maybe that is just me. A quick search revealed xidel[1] which seems to be similar, but supports XPath.
I'd like to state my support for the author's choice of CSS selectors in this particular use case. I think it's a natural fit for this domain and already very well known, perhaps even known better than XPath.
I'd like to add my support here too, but with a note.
When scraping and parsing (or writing an integration-test DSL), I always start out with CSS selectors. But I always hit cases where they fall short or require hoop-jumping, and then I fall back on XPath. I then have a codebase with both CSS selectors and XPath, which is arguably worse than having only one method.
I suspect that here, one uses this tool until CSS selector limitations get in the way, after which one switches to another tool(chain).
XPath does general data processing, not just selection.
E.g. when you have a list of numbers on the website, XPath can calculate the sum or the maximum of the numbers
Or you have a list of names "Last name, First name", then you can remove the last name and sort the first names alphabetically. Or count how often each name occurs and return the most popular name.
Then it goes back to selection, e.g. select all numbers that are smaller than the average. Or calculate the most popular name, then select all elements containing that name
Like the other commenter says: parent/child. But also selecting by content (e.g. "click the button with the delete-icon" or "find the link with '@harrypotter'") or selecting by attributes (e.g. click the pager item that goes to the next page) or selecting items outside of body (e.g. og-tags, title etc). All are doable with CSS3 selectors, but everything shouts that they are not meant for this, whereas XPath does this far more naturally.
The element(s) before an element: //h3/preceding-sibling::p[1]
Match something's parent: //title/..
Match ancestors of a given kind: //title[@id = 'abc']/ancestor::comment
Element with src or href attr: //*[@src or @href]
or multiple conditions: //article[@state = "approved" and not(comments/comment)]
Element with more than two children: //ul[count(li) > 2]
Element with matching descendants: //article[.//video]
Element text containing substring: //p[contains(text(), "Foo")]
Attribute ending with a substring: //a[ends-with(@href, ".jpg")]
Attribute values: //a/@href
Text values with spaces normalised: //a/normalize-space(text())
Match all attributes, child nodes, text nodes or comments: //user/@* or //user/node() or //user/text() or //user/comment()
Basically, from any node in a document you can select its ancestors, children, descendants, siblings, attributes etc., and filtering has the same power as selecting does. In CSS there's :not(), which can apply to selection or filtering, with :has() finally on the way and no :or(). CSS selectors match against HTML elements and they're great for that almost all of the time, but while you can filter by attribute value, including by substring and even by regular expression, for text content all you get is :empty.
But for a query syntax you need to be able to select attributes and text content as well as elements. Either extend XPath to support #id and .class syntax
//#user-xyz//note/text()
//code.language-js/@name
or extend CSS to allow selecting attrs and text
#user-xyz note :text
code.language-js @name
The former is more powerful, the latter a quick hack (if they only appear at the end of the selector anyway) with instant payoff.
You can do it either way in XPath, thanks to how you can use a path expression and/or predicates almost everywhere in a query:
# Find all elements li and select the parent element for each
//li/..
# Find all element nodes with a child element named li
//*[li]
# Non-abbreviated queries
/descendant::li/parent::*
/descendant::*[child::li]
# CSS using :has
:has(> li)
The Playwright people had to solve this for themselves: you can mix CSS and XPath selectors, as they are distinct, and they have a few small custom modifications to help with selectors. Playwright-compatible selectors would be nice.
My web scraping tends to start with xidel. If I need a little bit more power I'll use xmlstarlet. If neither of those is enough, I'll use Python's beautifulsoup package :)
I like xmlstarlet too, if only because it's old enough that I can reliably get it in package repositories and the dependency footprint is tiny (less of an issue now with this tool written in Rust, but previously I was comparing to NPM- and PyPI-based affairs).
lxml is one of the most pleasing-to-use Python libraries ever, managing to wrap a hot mess of XML APIs in a consistent and Pythonic fashion that you rarely need to escape. IIRC I used beautifulsoup to parse the HTML of a site, and then lxml to either find items and fields by CSS in IPython for quick and dirty data munging, or to knock up an XSLT file to transform what I'd scraped into good data in an XML file :)
This looks really neat! It supports a bunch of different query types, and can even do things like follow links to get info about the linked-to pages!
It's also in nixpkgs, though for some reason the nixpkgs derivation is marked as linux-only (i.e. not Darwin). (Edit: probably because the fpc dependency is also Linux-only, with a linux-specific patch and a comment suggesting that supporting other platforms would require adding per-platform patches)
Once upon a time I was using pup[0] for such things, but later I changed to cascadia[1], which seemed much more advanced.
Comparing the two repos, it seems pup is dead, but cascadia may not be.
These tools, including htmlq, seem to sell themselves as "jq for HTML", which is far from the truth. jq is closer to awk, where you can do just about everything with JSON. Cascadia, htmlq, and pup seem closer to grep for HTML: they can essentially only select data from an HTML source.
Well, jq is grep as well as sed and awk, but yeah, htmlq seems to be just grep, for sake of comparison.
But I don't think html has any need for a sed/awk tool, or at least not as much. Json output could very well be piped forward to the next CLI tool after you've changed it slightly with jq. I don't see this scenario as likely with html.
"A good opportunity to introduce `gron` to those unfamiliar!"
Thank you - appreciated.
I haven't done much work with json but have had reasons recently to do so - and I immediately saw how difficult it was to pipeline to grep ...
But what I still don't understand is that some json outputs I see have multiple values with the exact same name (!) and that still seems "un-grep-able" to me ...
> But what I still don't understand is that some json outputs I see have multiple values with the exact same name
This is neither explicitly allowed nor explicitly forbidden by the JSON spec. How to handle it is implementation-dependent: does one value override the other? Should they be treated as an array?
In practice, this situation is usually carefully avoided by services that produce JSON. If you are interfacing with a service that does produce duplicate values, I'd be interested in seeing it for curiosity's sake. If you are writing a service and this is the output, then I implore you to reconsider!
You might be missing a change in index: `obj[0].prop` vs `obj[1].prop`. Or, your JSON might have the same property defined multiple times: `{"a": 1, "a": 2}` (though I'm not sure how gron handles that situation).
I have been using hxselect from the html-xml-utils package to do this for many, many years.
It doesn't handle malformed HTML that well, but it can be coaxed into working about 90% of the time with the help of hxclean (included in the same package) or something like html-tidy.
sed: "While in some ways similar to an editor which permits scripted edits (_such as ed_), sed works by making only one pass over the input(s)"
ed: "ed is a line-oriented text editor".
Defining software by reference to other software is somewhat confusing. Potential users come from different backgrounds (I had no idea what jq is), and it is not clear what the defining features of each project are. Is jq line-oriented? Does htmlq operate in a single pass?
1st sentence - Explaining the tool for those the tool was made for without beating around the bush.
2nd sentence - Explaining the tool to folks in the general web domain what it can do for them.
3rd sentence - Explaining where to learn how to use the tool if you've stumbled across it but web is not your area of expertise.
All that info fits in nearly 25 words, and then it lists the options for the tool and jumps straight into multiple examples (with outputs!). If the only explanation had been "htmlq: like jq, but for HTML" I'd agree, but having the comparison to explain what it does isn't a bad thing; it's _only_ having the comparison that would be bad.
Personally I think this is a model example of an opening for a GitHub README.
As a web developer for over a decade, "bits content" doesn't mean anything to me. But I understand what the tool does from the rest of the description. Try running a Google search for "bits content"[0]; it's not a commonly used phrase in web development or anything. It's a poor choice of words.
It's more than fair to say that, in technical documentation you intend others to use, a grammatical error or missing word is confusing and a problem. It's the writing equivalent of having a bug in your code. And it's definitely not "writing to a target audience" as the parent comment suggested. We all make mistakes, but don't try to call a mistake effective documentation.
Of course it is, but neither the parent nor anyone else is saying anything close to the mistake being effective documentation. There's a single missing word which needs to be added, but the overall text is clearly written for a target audience. You are aware of this, and of how small the mistake is, and you understand what the sentence should read as, so I'm not sure what your point is?
"htmlq is like jq but for html" is a very specific 'dog whistle' for people who use jq. I agree that people who don't know what jq is will get no value and pay no attention. But for people who use jq, the claim is, like a dog whistle, clear, concise, and means exactly what it says. In two seconds, everyone using both jq and html will instantly know what is available and log it away.
So for general purposes, it's a terrible marketing pitch. And yet I think it's a very, very valuable demonstration of knowing some of their 'customers'.
this isn't what a dogwhistle is. it's just explanation by analogy to a model presumed to be shared by the intended audience. a dogwhistle offers a surface meaning to the uninitiated that's anodyne but communicates a hidden, coded message to those who possess some undisclosed, shared knowledge with the author. this kind of analogy entirely lacks the surface meaning and the message shared via jargon also communicates something about how you might learn enough to understand the analogy.
I can't speak for people who don't know jq, but knowing jq, this is a great tagline: it gives me an immediate understanding of what it does, how I could expect to use it and what value and ease of use I can expect.
I agree; however, if you do know how to use jq, then "like jq, but for html" is extremely effective. I use jq all the time and that title hooked me; I immediately wanted to try it.
But if you haven't used jq, then I can see how that title is less than helpful.
The first three are not proper definitions per se but kind of an advertisement, trying to familiarize by self-comparison with a tried & true tool that has proven its worth.
You know Jimmy the famous mechanic? I'm Timmy, _his brother_ but an electrician.
IMO, at least `jq` has proven itself as the indispensable tool for json-data manipulation.
It's a person who does mathematical calculations all day. For example, creating range tables for artillery, calculating averages or totals of a large range of values, or solving complex integrals or differential equations, and so on. They're commonly used in industry or government, especially in astronomy, aerospace and civil engineering, for both simulation and analysis. Perhaps the most well-known computers were the Harvard Computers, who operated in the late 19th and early 20th centuries.
As a job, computers were largely automated out of existence by solid-state transistor based automated computers and integrated circuit transistor automated computers in the 60s, 70s and 80s, which replaced the enormously expensive and often largely experimental electro-mechanical automated computers while radically reducing cost and improving performance both by several orders of magnitude.
I mean...if you read the github readme it literally describes what it does in the next line: "Uses CSS selectors to extract bits content from HTML files".
> Software definition through a reference to another software is somewhat confusing.
Possibly, depending on background as you note, but not all promotion is aimed at the same audience. When submitting to HN, "like jq, but for X" is short and conveys what it is to most of the people that would care, I think. jq has been submitted and talked about here many times with lively discussion over the years.[1] At this point I think most of those that are interested in what that is and what this is will understand fairly quickly from the title. Those that don't might be missed, or they might look it up like you, or they might see it through some other submission some other time with a different title which isn't based on a chain of references.
JSON schemas are a thing that exists and can be useful, and `jq` probably covers the 80% of use cases for querying JSON.
XSLT can be an amazing tool when used properly and I've wondered about a JS equivalent over the years and started writing one on a couple of occasions. But JSON is just a data structure and not structured markup, and there's no sweet spot for a transformation tool like XSLT - you're more likely to be doing a "find items in JSON, filter() them, then map()/reduce() to output format" task that takes a minute or two in Node and then never gets used again, or doing a complete map from one domain to another where you'd need to do it in JS because of the complexity of processing and ability to handle errors, use third-party tools and even write tests.
An XQuery-esque language allowing you to select bits of JSON file(s) with filtering, grouping and ordering built in, combined with a way of projecting results that's no worse than JS allows for, i.e. not having to put quotes around everything and the like :)
Command just prints a bunch of text stitched together:
curl -s "https://news.ycombinator.com/news"|hq "tr .storylink"
Deploy a website on imgur.comMy £4 a month server can handle 4.2M requests a dayFirst Edition...
Not the author, but neither is the poster: jq got away with it because it's one of the few two-letter combinations that wasn't absolutely overloaded, and "jquery" was already taken. OTOH nobody shortens HTML to H, and HQ is an extremely common acronym, if not one of the most popular two-letter acronyms you could pick.
There is `xq` today, which parses XML like `jq` parses JSON. I think that it is relatively unknown because it is part of the `yq` package for parsing YAML. So just install `yq` via pip and you'll get `xq` as well.
There is also `xmlstarlet` for parsing XML in a similar fashion.
Just looked into this and I think it's worth mentioning that there are two different projects called `yq`. The first one that came up (written in go instead of python) is not the right one and doesn't have the `xq` tool.
xmlstarlet is really nothing like jq, as a language. But yes, I use it because it is the best commandline xml processor I'd found. That's the only similarity to jq.
This is a bash one-liner! But TBF it really is a 'jq for xml'. I think it would be horrible for some things, but you could also do a lot of useful things painlessly.
Thank you for the comments. I've only recently discovered both tools, and literally used them once each. Of the two, `xq` was easier for my particular use case (parsing a Magento config), but I keep both tools in my virtual toolbox.
If you have any other suggestions for parsing XML for exploratory purposes I'm very happy to hear them.
Thanks! Not actually a recommendation, but I have used xsltproc (command-line XSLT); it is horrible to use because XSLT syntax is horrible (though XSLT's concepts are pretty cool). One nice thing is that it enables you to use XPath in all its glory.
Just installed xq. It's nice just seeing the pretty-printed json output, so thanks for the pointer. Probably better than xmlstarlet for my usage, which just queries and outputs text, not xml. hmmm, that's probably true for most commandline uses...
If you make the HTML well formed, XPath also works great. Great stuff if you ever need to pick HTML apart. I used this quite a bit together with jtidy back when microformats were still a thing.
Jq is very loosely inspired by that, I guess. Might come full circle here and use some XSL transformations ...
You can usually find an HTML parser for your language that you can use XPath/XSL on. It will just make the same assumptions that the browser does, by adding missing closing tags etc.
I made a tool that extracted parts of web pages 10-15 years ago, and it worked well. There are of course cases where the html is so unstructured that the results were unpredictable, but it worked well in general.
Nice, I expected something based on XPath (like xpd), but web developers dealing with HTML are infinitely more familiar with CSS selectors, so a great choice!
Sure, that sounds nice, but having two simple tools each doing the job well in its own space is perfectly fine for me — do you imagine needing to combine Xpath and CSS queries in a single run?
shameless self promotion:
parsel[0] is a Python script in front of the identically named Python lib, and extracts parts of the HTML by CSS selector. The advantage of it compared to most similar tools is that you can navigate up and down the DOM tree to find precisely what you want if the HTML is poorly marked up, or the searched parts are not close to each other.
https://jsoup.org/ has been around for a long time and seems a bit more mature & maintained than this two-code-files 2-year-old repo. Highly recommend.
Well, once there's an HTML parser, then a PDF viewer, and then everything needed for PDFs (i.e. programming, an emailer, video support, etc.), we'll finally have that ideal operating system we've been waiting for.
JQ is not just a parser but a tool for doing operations, many of which are (or should be) generic across any tree-like data format. Reusing that part across different input formats makes a lot of sense.
In the Prolog query shown earlier, the selector //(*(@id="get-help")) is used to obtain all HTML elements whose id attribute is get-help; on backtracking, all solutions are reported. The other example from the README, extracting all links from the page, can be obtained with Scryer Prolog like this:
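Again a sketch; the exact selector for the href attributes is an assumption:

?- http_open("https://www.rust-lang.org", Stream, []),
   load_html(stream(Stream), DOM, []),
   xpath(DOM, //a(@href), Link),
   portray_clause(Link),
   false.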
This query uses forced backtracking to write all links on standard output.