The author focuses too heavily on a useless anti-PDF rant. Why is it a PDF? Because it's better than just sending us the Word file the PDF came from.
The PDF file also meets the exact requirements the author is complaining about: "You should have — at your fingertips — immediate, unrestricted digital access to the full text of any piece of legislation the very moment it’s released publicly by Congress." I'm reading the document on my laptop and I can select and copy text from it. How much more "unrestricted digital access" can we get?
Sure, the PDF in question doesn't have in-document links to jump to parts. But it has delineated pages. Never underestimate how un-tech-savvy congresspeople currently are. They probably want one authoritative version of their document. For modifications, they probably say "the third paragraph on page 20" for ease of reference [1].
The author ideally wants everything generated by a government to be handed over on a silken pillow in perfect Legislative DocBook markup with all metadata painstakingly entered and kept up to date. While demanding all information be free is a laudable goal, it would be better to spend effort implementing and backing such plans than screaming at people who don't have a clear path to fix their current shortcomings [2].
[1]: They could use section headings and then reference from there, but that incurs another lookup of "Go to the table of contents, find the page number, go to the page, find the reference" instead of "Turn to page, see reference."
[2]: If a clear path to fixing their current shortcomings does exist, please work with them instead of working at them.
It might be beneficial to get the digital copier companies involved in this discussion. If Congress is anything like my various clients who are attorneys, they are using a digital copier to "publish" the file as a PDF. The only other choices currently offered by the Minolta and Kyocera machines I have supported are TIFF... or paper.
The article proposes a process for openness and Web publication of congressional draft legislation that I'm pretty sure is already followed by the Minnesota Legislature on its website.
There should be no technical difficulty in doing this at all, if a large-population state has already been doing it for years, so the only barrier to Congress being equally open is cultural. Note that Minnesota has, officially, three "major" political parties and currently has divided state government, with the governor from one major party and the two houses of the state legislature controlled by another. If those competing politicians can get along well enough to publish current information on draft bills, Congress should too.
I agree that there should be no "technical difficulty" in that it's already a solved problem. But so are bridges, and those only get built when people recognize a bridge is necessary, and someone is willing to pay for the time and materials to engineer it.
I get the impression the author sees some malice behind how Congress publishes its information. I don't see malice; I just think no one with the power to start such a project even recognizes it's a possibility. That it's worthwhile would come second.
Documents should be more open, but please don't push for something data-like (i.e. send the XML proponents off the hill). It's time to make files human-readable again.
For instance, use something completely plain and deceptively simple, like reStructuredText. This can be parsed or edited by a ton of established tools. It's also easily "diffed". Text formats are also well handled by revision control tools (one of which should be used, publicly, by the editors of these documents). This even achieves the clean markup the article's author desires, because multiple tools can parse .rst files and produce very good HTML out of them.
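To make the "easily diffed" point concrete, here is a minimal sketch using Python's standard difflib. The two revisions and the file labels are made up for illustration; they are not taken from any real bill.

```python
import difflib

# Two hypothetical revisions of a bill section, stored as plain reStructuredText.
old_rev = """\
Section 2. Definitions
----------------------

The term "agency" means any executive department.
The term "record" means any document held by an agency.
"""

new_rev = """\
Section 2. Definitions
----------------------

The term "agency" means any executive department or commission.
The term "record" means any document held by an agency.
"""


def diff_revisions(old: str, new: str) -> list[str]:
    """Return a unified diff between two plain-text revisions."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="bill.rst (draft 1)",  # hypothetical labels
            tofile="bill.rst (draft 2)",
            lineterm="",
        )
    )


changes = diff_revisions(old_rev, new_rev)
print("\n".join(changes))
```

Any revision control tool would give you the same kind of line-level view for free, which is the whole appeal of keeping the source as text.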
Contrary to the article, the PDF doesn't appear to be locked in any way.
If you run the PDF through "pdftotext -layout", you get a very readable text file which is probably easier to parse than any XML they'd come up with. (Probably easier to deal with than the HTML that Word produces.)
If you want to extract metadata or break it into sections to load into your custom web app, it'd be a couple of hours' work with your favorite scripting language. (Anything with decent regular expression support.)
(If you want boldface/italics, I'd recommend tweaking the backend of pdftotext.)
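As a rough sketch of the section-splitting step, assuming the text has already been extracted with pdftotext and that section headings look like "SEC. 2. DEFINITIONS." (the heading pattern and the sample text below are assumptions, not taken from the actual bill):

```python
import re

# Sample text standing in for the output of `pdftotext -layout`.
# The "SEC. <n>. <TITLE>." heading style is an assumption for illustration.
sample = """\
SECTION 1. SHORT TITLE.
    This Act may be cited as the "Example Act".

SEC. 2. DEFINITIONS.
    In this Act, the term "agency" means an executive department.

SEC. 3. EFFECTIVE DATE.
    This Act takes effect 90 days after enactment.
"""

# One heading per line: optional indent, "SECTION" or "SEC.", a number, a title.
HEADING = re.compile(
    r"^[ \t]*(?:SECTION|SEC\.)[ \t]+(\d+)\.[ \t]+(.+?)\.?[ \t]*$",
    re.MULTILINE,
)


def split_sections(text):
    """Split extracted text into dicts with number, title, and body."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, m in enumerate(matches):
        # A section body runs from the end of its heading to the next heading.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = " ".join(text[m.end():end].split())  # collapse layout whitespace
        sections.append(
            {"number": int(m.group(1)), "title": m.group(2), "body": body}
        )
    return sections


sections = split_sections(sample)
for s in sections:
    print(s["number"], s["title"], "->", s["body"])
```

A real bill will have page headers, line numbers, and nested subsections that need extra patterns, so treat this as the starting point for the "couple of hours", not the finished job.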
For human-readable document content, there's nothing wrong with plain old HTML, though formats like RST or Markdown are somewhat easier to produce. (I save my blog documents in Markdown format; they're just easier to write and update, and I don't have to futz around with JS-heavy WYSIWYG editors.)
However, these aren't adequate for structured data that falls outside the scope of a generic print document. For that, I would recommend YAML or JSON over XML, not only because they're human-readable but also because they're far less verbose.
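For a quick feel of the verbosity difference, here is a small sketch that serializes the same record both ways with Python's standard library. The record fields and values are invented, and the element-per-field XML layout is only one possible encoding (attributes would be more compact).

```python
import json
import xml.etree.ElementTree as ET

# A made-up record; the bill number and field names are placeholders.
record = {
    "bill": "H.R. 0000",
    "sponsor": "Rep. Example",
    "status": "introduced",
    "cosponsors": 12,
}


def to_xml(rec):
    """Serialize a flat dict as XML, one child element per field."""
    root = ET.Element("record")
    for key, value in rec.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


as_json = json.dumps(record)
as_xml = to_xml(record)

print(as_json)
print(as_xml)
print(len(as_json), "characters as JSON vs", len(as_xml), "as XML")
```

Every field name appears twice in the XML (opening and closing tag), which is where most of the extra bulk comes from.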
I'm definitely fine with using data formats for things that really are just raw data (e.g. a series of records with specific fields), and I like JSON too.
I just hope that they don't try to chop up something that should be "just a document" into a pile of ugly marked-up containers that make it very hard to see what's even in the document anymore.