Introducing schema.org: Search engines come together for a richer web (googleblog.blogspot.com)
223 points by Uncle_Sam on June 2, 2011 | 78 comments


All I could think while reading through the getting started was: that is an awful lot of added text. After a little more thought: that is an awful lot of added work. And while it won't be hard to have tools that make the process easier, the sort of work that goes into adding that data can never be completely automated (otherwise, we would have no need for it). Given that all the search engines will be using it, all major sites basically have to implement this or they risk falling in their rankings.
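To give a flavour of it, marking up even a simple person listing looks roughly like this (a made-up example in the spirit of their samples, not copied from the guide):

    <div itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">Jane Doe</span>,
      <span itemprop="jobTitle">Professor</span> at
      <span itemprop="affiliation">Example University</span>
      (<a itemprop="url" href="http://www.example.com/janedoe">homepage</a>)
    </div>

Every one of those attributes has to be chosen and placed by someone who understands both the content and the vocabulary.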

So, at the end of the day, Google, Microsoft, and Yahoo just made web development more expensive. They probably just made the web a better place, too.


Sites that don't employ this don't risk falling in their rankings. This added data allows for richer snippets (which absolutely increase clickthrough rate), but it won't directly make you rank higher (or rank lower in its absence).

If your website employs an SEO or webstandardista, you should already have your sites marked up with metadata. Reviews and rich breadcrumbs etc. have been around and supported for years now.

I suppose that since Google wants to solve these problems algorithmically first and foremost, many of these structures already get recognized. Right now you don't _have_ to mark up your breadcrumbs for them to appear as rich snippets on your search result listing; Google recognizes the structure without added mark-up.

For now, I will build the new types in my CMS like a good web developer. That won't cost me any time in the future, and now I'll have a way to separate myself from those that won't add schemas or metadata to their mark-up. So, at the end of the day, I just got more expensive :)


Here's what Google Webmaster Central says about schema.org: http://www.google.com/support/webmasters/bin/answer.py?answe...

"Google currently supports rich snippets for people, events, reviews, products, recipes, and breadcrumb navigation, and you can use the new schema.org markup for these types, just as with our regular markup formats. Because we’re always working to expand our functionality and improve the relevance and presentation of our search results, schema.org contains many new types that Google may use in future applications."

"Google doesn’t use markup for ranking purposes at this time—but rich snippets can make your web pages appear more prominently in search results, so you may see an increase in traffic."


Google likely uses clickthrough rate (CTR) in their algorithm. If your site has a high CTR, it should hypothetically rank higher, so it makes sense for them to include it in their algorithm, to the extent that it can't be manipulated.

So, if the metadata doesn't directly increase rankings (which I'm pretty positive it won't), it can indirectly do so by grabbing the user's eye and improving CTR, which I am most certain it will.


Some people were already doing this before, but with other formats. I did a little comparison between this and the one Best Buy currently uses, GoodRelations.

https://gist.github.com/1005688

I like this quite a bit better.


Well, the two gists don't quite encode the same information. In the schema.org example, one would still have to write code to parse the string dates. In the RDFa example, one could link several sites on the shared purl.org URL of, e.g., Friday, to answer questions like "which restaurants are open on Monday". That's the whole Linked Data idea.

On the other hand, the RDFa way is more painful when the opening hours differ between the days of the week.


The datetime attribute is designed to be easy to parse. They stick to a consistent format on schema.org. They reference ISO 8601 which is quite a bit more complicated, but hopefully they'll add something saying they only support a tiny subset of what ISO 8601 allows to make it easier to write tools.

http://schema.org/Duration
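For example, something like this (roughly the style their samples use; the exact property names depend on the type):

    <time itemprop="cookTime" datetime="PT1H30M">1 1/2 hours</time>
    <time itemprop="startDate" datetime="2011-06-15T19:30">June 15th, 7:30pm</time>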

The tools aren't there yet so it's not as useful for linked data as RDFa right now, but hopefully it will be more useful soon.


I was just thinking about how I could actually see this being really awesome with something like Django.


Interesting idea; I'm a little rusty on my Django knowledge but I guess you would define a custom Model class for each data type, which in turn knows how to render itself with the correct markup.


Won't take long for someone to make a list of django models for each type


no point really, it's just a few extra attributes in the template.
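For example, a product template would only pick up a handful of attributes, something like this (a rough sketch using schema.org's Product/Offer types; the template variables are whatever your model already exposes):

    <div itemscope itemtype="http://schema.org/Product">
      <h2 itemprop="name">{{ product.name }}</h2>
      <p itemprop="description">{{ product.description }}</p>
      <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
        <span itemprop="price">{{ product.price }}</span>
      </div>
    </div>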


I'm torn as well. It's a lot of new information to remember and creates a lot of extra work but if the content is more accessible as a result then it might be worth it. I'm also a little wary of the fact that it just seems tacked on to the HTML but I can't think of any other way to handle it.

I'm guessing this is like salt... a little dash'll do it.


I would say it just makes it easier for them to parse/extract data; it shouldn't change much for the user (except special cases in browsers).


I phrased that poorly. I meant accessible to the crawlers.


Adding a few tags to your HTML templates is "expensive"? With Google, Bing, and Yahoo all supporting this, I don't think it will be long before formtastic-like plugins start popping up to make this really easy.


Exactly. We already have to support multiple browsers, multiple display sizes, ARIA roles, and more. I don't mind the extra work for accessibility (ARIA), but looking up and adding all this new meta information merely for search engines is going to be a pain. In the end, this extra work makes things easier for the search engine by requiring the developer to do more work. I would prefer the reverse; otherwise maybe I need to get into the search business. They are creating a standard that will make it incredibly easy for future search engines.


It depends how you present the idea of adding machine-readable information to pages. After all, nobody is forced to do it, so you need to show some benefit before the "semantic web" will happen.

Schema.org is doing it with the SEO angle: mark up your pages like this, and they'll be presented better in search engines.

With VIE (https://github.com/bergie/VIE) we take a different angle: mark up your pages with RDFa, and they'll become editable.


I added the 200 bytes/page to my ecommerce site in about 5 minutes.


This is going to sound curmudgeonly, but it seems like one more way search engines want to use your data without giving you the page view. It makes allot of technical sense and I can imagine some really great ways to use this data, but in the end I guess I would just need to ask "what's in it for us, the content providers?"


I think that's shortsighted thinking. This is about improving the user's search experience. If that happens I think everyone wins.

Most mainstream users are not tech savvy. Imagine a traveling person arrives in a city and spontaneously decides to see a movie. They enter the search 'good movie playing in nowheresville'. As it stands now their query will likely be matched by the keywords 'movie', 'playing', and 'nowheresville'. The returned results might include a news article about local theatres, with no actual focus on reviews. The searcher might get frustrated and just decide to rent a movie instead. However, with schemas in wide use search engines will know exactly what web sites are talking about movies and whether it's in the context of reviews. The searcher can then be passed on to the relevant site.

In other words, do you think it's better to tell search engines this is sort of what I have or this is exactly what I have?


Information in specific types (including reviews) exposed using microformats, RDFa, or microdata has already been used by Google for over 2 years; they call it "Rich Snippets", and it does improve the quality of experience for users, assuming you equate an increase in clickthroughs with the user perceiving that page to be more useful than other SERP results (and anecdotally, I always go for results which include rich snippet information gleaned from pages with the required semantic enhancements).

This announcement is not the proposal of a new technique, but rather the extension of one which is already working and is a good thing for the web.


Ideally, you can predicate your search on metadata. It would greatly simplify my searches for volcanoes with a 4 star or better rating.



there is only one l in alot.


And there's also a space in "a lot"


But, there are two Ls and no spaces in "allot"[1].

[1] http://www.merriam-webster.com/dictionary/allot


That's a different word than the OP's intended meaning. allot != a lot


For most of my "quick searches", I already have the answer from within the search results listing. It looks like this will go one step further: we are not going to have to leave the results page to get complex answers.

I am not sure if I want to provide all my hard work in a format which will maybe help the search engines a bit, but mainly help the spammers a lot, as they will be able to automate the creation of content farms even more.

Mixed feelings... all the world's data in a well-structured format is a wonder, but at the same time, what will be the incentive to create such easy-to-digest content if, in the end, the world doesn't even know you are the one who produced it?

Kind of the old media against new media dilemma but applied to the new media. Interesting.


Bingo, this was my first response as well. What happens when users stop clicking through to content because it's being served up by Google, Bing or Yahoo?

I guess it could actually hurt them as well. If users aren't providing information back to the algorithm in the form of a click through related to a search term, don't the search engines also risk losing a key signal of relevance?


Why unilaterally assert this will decrease click-throughs?

If I see immediately relevant data for restaurant hours, movie times, a person's bio, etc, I'm far more likely to click-through and start looking at a menu, making a reservation, or getting more background.

They may well expect increased click-throughs leading to more site traffic for those who adopt.


For me this varies based on device. If I'm at my laptop then I'll usually click through, but if I'm on my phone I won't unless I have to.

It takes too long to render sites with loads of images and adverts on my phone, so I always dread clicking a search result.


Assuming that authoritative sites, at least, don't abuse these schemas, this will help all search engines and data mining/NLP researchers build better models. The biggest gain isn't the quick view of search results; it's that search results will be better in general, because Google et al will now understand whether a page is really about a person and, in many cases, who specifically it is about, to give one concrete example.

Information extraction just got that much easier. Hello, baby semantic web.


If they actually wanted people to use this they'd write better documentation.

If I picked 10 web devs off SitePoint and instructed them to add 10 assertions to an HTML page and didn't give them a validator, I'd be amazed if more than 3 got 80% of them right.

I like the taxonomy though, but honestly I think instances are much more interesting than types... rather than saying "George Washington" is a :US_President, can't we say "George Washington" is :George_Washington, where :George_Washington is his identifier in Freebase?
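Microdata does have an itemid attribute for exactly that kind of global identifier, so in principle you could write something like this today (the Freebase URL is made up for illustration, and I don't know whether any of the engines will actually consume it):

    <div itemscope itemtype="http://schema.org/Person"
         itemid="http://www.freebase.com/view/en/george_washington">
      <span itemprop="name">George Washington</span>
    </div>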


Very cool, but why didn't they use the existing RDFa format/keywords instead of inventing their own? [ see http://en.wikipedia.org/wiki/RDFa ] [ itemprop vs property ??? ]

I guess when you are big G, you can do anything you want.


I am wondering why they didn't go for something that allows for richer semantics like XHTML+RDFa http://www.w3.org/TR/rdfa-syntax/


http://schema.org/docs/faq.html#14

It seems that they chose Microdata over RDFa because the latter's syntax was deemed to be unwieldy.

It's not really true that RDFa is more extensible than microdata; there are a small number of missing features related to XML data, but nothing too significant for these use cases. See, for example, [1]

[1] http://bnode.org/blog/2010/01/26/microdata-semantic-markup-f...


From http://schema.org/docs/datamodel.html "... In fact, all of Schema.org can be used with the RDFa 1.1 syntax as is. ..."


Since RDFa is not in the HTML5 spec yet, right now you can't make valid HTML5 websites that make use of RDFa.

But you don't have to use microdata or microformats; you can still use this schema with RDFa on an XHTML page.

You should be able to express everything in Microdata _and_ RDFa. I don't think RDFa is semantically richer than these other formats.
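For example, the same person could be written either way (a rough sketch; the RDFa 1.1 version just uses schema.org as the vocabulary):

    <!-- microdata -->
    <div itemscope itemtype="http://schema.org/Person">
      <span itemprop="name">Jane Doe</span>
    </div>

    <!-- RDFa 1.1 -->
    <div vocab="http://schema.org/" typeof="Person">
      <span property="name">Jane Doe</span>
    </div>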


As far as my understanding goes, this is basically equivalent to a subset of RDFa.

The differences as I understand them are three:

* schema.org has an implicit vocabulary; if you want to use more than one, you can still use RDFa and reference the schema.org vocab explicitly

* some syntactic hacks are missing (CURIEs, chaining), but these do not remove expressiveness; again, the schema is implicit

* typed literals are missing, and once more, not really needed when there is only one schema

I still would have preferred it if they had used straight RDFa 1.1, but I think their main motivation is that the way the web is going (HTML5) does not seem to be the same as it was when RDFa was initially invented (XHTML).

This solves a concrete, finite set of problems now, while in the semweb world people still have to agree on how to express a person's name :/


Hi all, two comments:

1. The GoodRelations RDFa vocabulary remains the superior way of sending rich data to Google and Yahoo; even Bing just announced they will support it in the future.

2. As for tooling, here are two super-easy ways of adding rich GoodRelations data to your site:

- http://www.ebusiness-unibw.org/tools/grsnippetgen/ creates a snippet of a few additional divs/spans based on your data; simply paste it before (!) the respective visible content and you are done ;) For products, see the effect in http://www.google.com/webmasters/tools/richsnippets

- If you are using a standard shop package, e.g. Magento, osCommerce, Joomla/Virtuemart, or WordPress/WPEC, there are free extension modules that add GoodRelations:

http://wiki.goodrelations-vocabulary.org/Shop_extensions

A similar module for Prestashop and Oxid eSales is in the making.

Best,
Martin Hepp

Disclaimer: I am the inventor and lead developer of GoodRelations. GoodRelations is free to use, remix, or adapt under a Creative Commons license.


This sounds rather similar to Facebook's Open Graph protocol (http://developers.facebook.com/docs/opengraph). I wonder if this is related to Google's planned entry into social. It would help if they knew the context of search terms so they could match them up to ads in Gmail, for example, or use your Gmail conversations to help reorder search results...


Am I alone in thinking schema.org is a direct answer to Facebook's Open Graph Protocol?


Did Google just deprecate microformat?


The problem I've found with microformats is that they misappropriate the class attribute, which ends up causing problems on websites that have extensive templates and stylesheets. I'd personally rather have an attribute that doesn't already have another purpose, as it's less likely to be abused, intentionally or not. I had been looking into using RDFa, but the syntax seemed burdensome and unwieldy. Microdata looks like a nice middle ground between the two previously supported rich snippet formats.


Yeah, what is itemscope and itemprop? That's a kick in the butt for us Microformats users.


The issue I found with microformats when I was evaluating them was that there was no way to automatically transform a microformatted file into a native data structure without first knowing the schema. Writing a microformat-to-JSON parser is hard because you have no way of knowing which classes are significant and which are just there for styling.
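For example, in a contrived snippet like this, "item", "fn" and "rating" are hReview vocabulary while "card" and "muted" are just CSS hooks, and nothing in the markup itself tells a generic parser which is which:

    <div class="hreview card">
      <span class="item fn">Chicken Marsala</span>
      <span class="rating">4</span>
      <span class="muted">out of 5</span>
    </div>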


Isn't that true of schema.org (microdata) too? That's not a very apt criticism.


The question remains: Who needs this?

I am still not really convinced that it is possible to integrate handcoded schemas for a wide range of use cases into search results in a meaningful way.

The solution Google proposes here will also constrain the content of websites in a lot of ways if it becomes widely adopted. Look at the recipes example: it defines markup for including nutrition information for recipes:

"Can contain the following child elements: servingSize, calories, fat, saturatedFat, unsaturatedFat, carbohydrates, sugar, fiber, protein, cholesterol"

Every company that serves recipes on the web and decides not to offer this information because it deems other properties of recipes more important is now at a disadvantage. Google will show more information about the recipes of their competitors and presumably also rank them higher because they have included 'valuable' markup information in their recipe.

This approach favors shallow information resources over complex ones, as the former can be more easily parsed by metadata crawlers.
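To be concrete, the markup itself would presumably look something like this (the property names come from the quoted list; the itemtype URLs and the nesting under a "nutrition" property are my guesses, not copied from the spec):

    <div itemscope itemtype="http://schema.org/Recipe">
      <h1 itemprop="name">Chicken Marsala</h1>
      <div itemprop="nutrition" itemscope itemtype="http://schema.org/NutritionInformation">
        <span itemprop="calories">480 calories</span>,
        <span itemprop="fat">21 g fat</span>,
        <span itemprop="protein">38 g protein</span>
      </div>
    </div>

It's only a few lines per recipe, but someone still has to have the data to fill in, which is exactly the disadvantage for sites that don't.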


I do, because I can get better search results:

Chicken recipe where calories<=350 and carbs<=20g


Enjoy your allrecipes.com dishes.


I usually do.

ranking>=4 and reviews>=50


As I point out elsewhere, machine learners and NLP researchers need this. This could offer tons of manually labeled training data.


They neglected a format for job listings, which is unfortunate. We (LinkUp) put one together at http://wp.me/pJYG0-1H; it will be interesting to see if it gets any traction and if they're actually seeking external input.


For a site put together by search engines, the URL structure for the site search is atrocious. "#q=Product" and not "?q=Product"? Who thought that was a good idea?

Site also looks a bit like spam. Needs more Firefox-esque awesome graphics, imo.


As an SEO, I think that site search structure is a good idea, especially since they don't employ canonical or noindex/follow on the search pages.

Let us say we both link to http://schema.org/docs/search_results.html#q=test and http://schema.org/docs/search_results.html#q=product .

Since it is a hash fragment we link to, all the link juice will get consolidated in the search page, which passes it on to the rest of the site.

If we had linked to http://schema.org/docs/search_results.html?q=test and http://schema.org/docs/search_results.html?q=product we would have created two (low-quality and near-duplicate) pages in the google index.

The same principle applies to pagination. If you can choose between javascript pagination (#page=2) and dynamic pagination (?page=2), you are nearly always better off with the hash pagination. If you do it right, you get the benefit of a single page, with the added bonus of being able to bookmark a certain page and having browser history work.
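A minimal sketch of the idea (hypothetical markup; showPage() stands in for whatever renders the requested slice of results):

    <a href="#page=2" onclick="showPage(2); return false;">Page 2</a>
    <script>
      // On load, honour a bookmarked or shared link like /results#page=2
      var m = location.hash.match(/page=(\d+)/);
      if (m) showPage(parseInt(m[1], 10));
    </script>

The crawler sees a single URL, while users still get addressable pages.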


Yeah, this seems like a lot more stupid work and it'll also make your site easier to scrape for blog network content stealing SEO dipshits.


What's going to stop people from gaming this by doing things like adding fake 5-star reviews to their website? (especially brick and mortar stores that show up in google maps/places)
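The markup itself is just an assertion that anyone can paste in, something like this (hypothetical snippet):

    <div itemscope itemtype="http://schema.org/AggregateRating">
      <span itemprop="ratingValue">5</span> stars from
      <span itemprop="reviewCount">1312</span> reviews
    </div>

Nothing in the format stops a store from claiming whatever numbers it likes.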


Putting aside the technology and schema decisions they've made, IMO it's great to see these three throwing their weight behind some common metadata, even if it does step on some toes.

Now if they'd only add some schema targeted towards downloadable public data sets. I'm dying for a good global public dataset search beyond competing data markets and data.gov.* sites.


They should start with http://wiki.ckan.net/Schema_for_Packages for datasets

There's a lot on schema.org for social media websites/business lookup, but there should be more for open data. I was looking for a linked data schema to represent financial transactions (X paid Y $999 for Z), but schema.org only goes so far as Sales. XBRL explicitly states it is not for "A transaction level activity".


I thought that working out the type of content on my pages should be their job. I'm so naive.

No problem, I'll add a few kbytes to every single page of my sites, so I can replicate the information I've already stated in a number of sitemaps, video sitemaps, headers and XML files.

P.S: I'm not against standardization at all, I'm just saying this comes a bit late.


Whatever happened to HTML5's data-* attributes?


Nothing an XML data island could not have solved. Even an external XML data island with internal references for better performance and less clutter. Even an external JSON data island would have been better for web consumption.

Microformats? I'll pass.


Ok, here is my proposal. Use a link to an external resource with all the information you want attached to that page like:

    <link rel="data" type="data/json" href="http://example.com/recipes/chicken.js" />
    <link rel="data" type="data/xml" href="http://example.com/recipes/chicken.xml" />
The resource can be cached, served statically, or even included in the page inside a <script> data block.

Here is the html:

    <div itemid="1234">Chicken marsala</div>
    <div itemid="1235">Fried chicken</div>
    <div itemid="1236">Chicken curry</div>

Here is the data island in json:

    {
      "head": {
        "title": "",
        "source": "",
        "version": ""
      },
      "items": [
        {
          "id": "1234",
          "type": "recipe",
          "title": "Chicken marsala",
          "ingredients": "here..."
        },
        {
          "id": "1235",
          "type": "recipe",
          "title": "Fried chicken",
          "ingredients": "here..."
        }
      ]
    }
Here is the data island in xml:

    <data>
      <head>
        <title>here</title>
        <source></source>
        <version></version>
      </head>
      <items>
        <item id="1234" type="recipe">
          <title>chicken marsala</title>
          <ingredients>here...</ingredients>
        </item>
        <item id="1235" type="recipe">
          <title>fried chicken</title>
          <ingredients>here...</ingredients>
        </item>
      </items>
    </data>


Actually, that's what rel="alternate" is used for: an alternative representation of the current page. So something like this could be done:

  <link rel="alternate" type="application/event+json" href="http://example.com/events/2010/06/03/schweet.json" />


Actually, after reading the microdata spec, there's an application/microdata+json format that would probably work better:

http://dev.w3.org/html5/md/#application-microdata-json

So you'd have an alternative resource

  <link rel="alternate" type="application/microdata+json" href="http://example.com/events/2010/06/03/schweet.json" />
That file would look like this:

    {
      "items": [
        {
          "id": "http://example.com/events/2010/06/03/schweet",
          "type": "http://schema.org/Event",
          "properties": {
            "startDate": ["2010-06-03"],
            "location": [{
              "id": "http://example.com/places/my-crib",
              "type": "http://schema.org/Place",
              "properties": {
                "url": ["http://example.com/places/my-crib"],
                "address": [{
                  "type": "http://schema.org/PostalAddress",
                  "properties": {
                    "addressLocality": ["Knoxville"],
                    "addressRegion": ["TN"]
                  }
                }]
              }
            }]
          }
        }
      ]
    }
Who knows if Google will actually use that file though.


So Google both ranks on page speed and encourages you to double your bits by adding a lot of cruft to your html. Wonderful.


I wonder if they'll reward websites with higher page ranks if they implement this though.


At this point, page speed is affected by things like HTTP requests and JavaScript; something as insignificant as a couple kilobytes of compressible text would have an impact measured in microseconds.


Maybe your pipes are a lot fatter than mine.

kb/us == mb/ms == gb/s

On the other hand, I agree that even a kilobyte of extra data to get fundamentally better search result pages is a big win for everyone.


Seems that if you use this, your documents won't be able to be considered 'valid' by validators (tested on the w3 validator). Unless, perhaps, you just mark your doctype as html and be done with it?


> Seems that if you use this, your documents won't be able to be considered 'valid' by validators

Why would I care? What matters to me is that Google gets people to my site.


Some tools will throw out misleading errors on 'invalid' markup, and you will have to spend time justifying to clients why these invalid markups are OK. Accessibility tools might also have problems (not sure now, but I remember some tools years ago having problems with some invalid markup).


Not likely, few clients even know what markup is. I'll take the extra traffic and suffer the occasional nosy client any day.


I've dealt with state agencies and higher ed depts that get their dander up over this sort of stuff.


Exactly, I wonder why they wouldn't incorporate the data-* attributes to help describe this data AND conform to HTML5 specifications.


Looks like they're using HTML5 microdata: http://dev.w3.org/html5/md/


The HTML data-* attributes are intended for private data only, i.e., to store data used as configuration for a JavaScript plugin, data which does not hold semantic value and cannot be represented as actual content. Microdata (which is what they're using, and which is also part of the HTML5 specification) is meant for describing how the content of the page maps to some schema.
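Roughly, the distinction looks like this (a contrived example):

    <!-- data-*: private to your own scripts, meaningless to search engines -->
    <div data-widget-color="blue">Chicken Marsala</div>

    <!-- microdata: a published claim about what the content means -->
    <div itemscope itemtype="http://schema.org/Recipe">
      <span itemprop="name">Chicken Marsala</span>
    </div>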


Which format and schema, with which doctype, did you use to get invalid results in the validator?

Microformats should never invalidate any doctype as they are just class-names.

Microdata can be used in valid HTML5 doctype pages.

RDFa can be used in valid xHTML+RDFa doctype pages.



