Note: I turned this post into a talk for Confab Central 2014, with added detail and updated info about Hummingbird. See the full write-up here.
This is how Hummingbird works, and why it matters to content strategists.
Hummingbird is only the latest update in Google's quest to create a better search experience for users, and it's probably the best thing to happen to content strategists since our job title was invented.
Hummingbird represents a truly fundamental change in the way that Google searches and classifies the internet--one that does not rely as heavily on keywords or links as it has in the past. Instead, it focuses on serving high-quality, semantically relevant search results, sometimes regardless of on-site keyword matches or high numbers of links.
And the result--hopefully--is better search results for all of us.
However, knowing that Hummingbird focuses on finding "high-quality, semantically relevant content," is not very useful unless you know how Hummingbird is determining quality and relevancy. Hopefully in laying out some of the aspects that are helping Hummingbird function, we'll all better understand how to optimize our content strategy and--by extension--our content marketing efforts.
So, let's jump in.
Google is shifting away from exact keyword matching and toward semantically understanding the intent of a query, serving results based on that intent. But in order to do that, the algorithm needs to recognize and classify "entities." Entities are people, places, things, or ideas in the real world.
To that end, back in 2010, Google bought a little company called Metaweb. Metaweb is a data company that compiles "triples" from the content of the internet and uncovers relationships between them. Yeah, I know. That's not very clear. So let's illustrate.
A triple is a grammatical construction of [subject], [predicate], and [object]. (Grammar nerds, unite!)
For the purposes of analysis, a simple triple would be:
"Miley Cyrus sings Wrecking Ball."
"(Miley Cyrus [subject]) (sings [predicate]) (Wrecking Ball [object])"
Identifying one triple is nifty, but when you can compile a lot of them, you start to see patterns and associations arise:
- Miley Cyrus sings Wrecking Ball
- Wrecking Ball is a great song
- Billy Ray Cyrus is the father of Miley Cyrus
- Miley Cyrus is Hannah Montana
- Hannah Montana was a Disney TV show
- Billy Ray Cyrus was a one-hit wonder
- Wrecking Ball is a controversial music video
- Miley Cyrus twerked at the MTV Video Music Awards
As these semantic constructions begin to pile up, you start to see a network of associations that link these people, places, things, and ideas. And each person, place, thing, or idea becomes an entity with associated attributes.
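To make the idea concrete, here is a minimal Python sketch of how triples could pile up into entities with associated attributes. It only illustrates the concept and is not how Metaweb or Google actually stores this data; the `build_entity_index` helper and the "(inverse)" label are made up for this example.

```python
from collections import defaultdict

# Each triple is (subject, predicate, object), based on the list above.
TRIPLES = [
    ("Miley Cyrus", "sings", "Wrecking Ball"),
    ("Wrecking Ball", "is", "a great song"),
    ("Billy Ray Cyrus", "is the father of", "Miley Cyrus"),
    ("Miley Cyrus", "is", "Hannah Montana"),
    ("Hannah Montana", "was", "a Disney TV show"),
    ("Billy Ray Cyrus", "was", "a one-hit wonder"),
    ("Wrecking Ball", "is", "a controversial music video"),
    ("Miley Cyrus", "twerked at", "the MTV Video Music Awards"),
]


def build_entity_index(triples):
    """Group triples by every entity they mention, as subject or object.

    Hypothetical helper for illustration only.
    """
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
        # Record the reverse direction so objects also link back to subjects.
        index[obj].append((f"(inverse) {predicate}", subject))
    return dict(index)


entity_index = build_entity_index(TRIPLES)
for predicate, other in entity_index["Miley Cyrus"]:
    print(f"Miley Cyrus -- {predicate} --> {other}")
```

Looking up "Miley Cyrus" returns every association she takes part in, whether she was the subject or the object of the original sentence. That is the "network of associations" in miniature.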
Since Google can now "understand" those associations, it's going to serve up results relevant to that entity, rather than simply showing results where the words "miley" and "cyrus" show up on the internet. For example, take a look at the SERP below:
Because Google recognizes Miley as an entity and not just a string of keywords, it is trying to metaphorically understand what you--the user--could possibly be looking for when you type "Miley Cyrus" in the search bar, and it is delivering results to match possible reasons for performing the search:
- What is the latest news about Miley? = Google News, Huffington Post, People.com, MTV.com
- What is her official website? = MileyCyrus.com
- How can I connect with her on social media? = Twitter, Facebook profiles
- Who is Miley Cyrus? = Wikipedia, Bio snippet in side bar
- What songs does she sing? = Songs in side bar
- What does she look like? = Images in side bar
- What are her music videos? = YouTube topic page
- What movies/tv shows has she been in? = IMDB, Movies and TV shows in side bar
Obviously, Miley Cyrus is a broad, rich example of an entity, and not everything that Google identifies as an entity will have the same rich results.
In fact, because there hasn't been an overnight dramatic shift in the SERPs, many SEOs have scoffed at Google's claim, made at the initial release event, that Hummingbird affects 90% of search results. However, it's important to keep in mind that compiling the data needed to affect every possible query will take some time, and highly competitive queries with a large quantity of diverse, rich content options available will see the most dramatic changes first.
Be that as it may, entity recognition is definitely something that you should think about in your content strategy, and it raises questions that you should be asking about your content. For instance, is the content on your website speaking to the questions and concerns that your users have when they are searching for you or the services you provide? And, perhaps more importantly, how are other people talking about your brand on the internet? What are the associations that would arise if Google is examining triples with your brand name in them?
...which leads us to co-occurrence.
Let's continue our trip into the semantic web by looking at another aspect of how Google determines entities and relevance in search results: co-occurrence.
The principle of co-occurrence is pretty much like it sounds. Beyond triples, when the Google algorithm crawls the web for information, it is also compiling semantic associations based on context. That is, when Google crawls a page of text, it is analyzing and compiling words that are often used together across the web--words that "co-occur" naturally.
For example, the words "Lebron James" and "basketball" co-occur at a high frequency across the web. Thus, Google can further determine that Lebron James (an entity) is highly associated with basketball (another entity). Based on information in a Google patent, this aspect of the algorithm works roughly like this:
1) Google culls the top 1000 or so results for a term like "Lebron James."
2) It takes each document and weeds out the most commonly used words on the internet (e.g., the, a, and, with). What is left is a sequential list of words as they appear on the page (mostly nouns, verbs, and adjectives).
"As power forward for the Miami Heat, Lebron James won two NBA championships and will be remembered as one of the greatest basketball players of his generation."
Becomes something like this:
"power forward Miami Heat Lebron James won two NBA championships remembered one greatest basketball players generation"
3) Google identifies the term "Lebron James" and then assigns a numerical value to all the other words on the page based on their proximity to the prime term.
For example, in the above sentence, "Heat" and "won" would be assigned a numerical value of 1. "Miami" and "two" would be assigned a value of 2. "forward" and "NBA" would be assigned a value of 3, and so on.
...(3)forward (2)Miami (1)Heat [Lebron James] won(1) two(2) NBA(3)...
4) Those association values are then compared across thousands of documents to determine the words and phrases that appear at a high frequency with "Lebron James."
5) At this point, Google can better determine the kinds of results Lebron James should show up in: the lower a word's score, the more closely that word is associated with the prime term.
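Steps 2 through 4 above can be sketched in a few lines of Python. This is a rough illustration of the process as described in this post, not Google's implementation: the stop-word list is a tiny hypothetical stand-in, the function names are invented, and averaging distances across documents is my own assumption about how the values might be compared.

```python
import re
from collections import defaultdict

# Hypothetical, tiny stop-word list; a real one would be far longer.
STOP_WORDS = {"a", "the", "and", "with", "as", "for", "of", "his", "will", "be", "is"}


def content_words(text):
    """Step 2: lowercase, tokenize, and drop the most common words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def proximity_scores(words, term):
    """Step 3: distance in words from each word to the nearest mention of `term`."""
    term_words = term.lower().split()
    n = len(term_words)
    starts = [i for i in range(len(words) - n + 1) if words[i:i + n] == term_words]
    scores = {}
    for i, word in enumerate(words):
        if any(s <= i < s + n for s in starts):
            continue  # skip the prime term itself
        for s in starts:
            distance = s - i if i < s else i - (s + n - 1)
            scores[word] = min(distance, scores.get(word, distance))
    return scores


def aggregate(documents, term):
    """Step 4 (assumed): average distance per word, plus how many documents used it."""
    totals = defaultdict(lambda: [0, 0])  # word -> [sum of distances, document count]
    for doc in documents:
        for word, distance in proximity_scores(content_words(doc), term).items():
            totals[word][0] += distance
            totals[word][1] += 1
    return {w: (total / count, count) for w, (total, count) in totals.items()}


sentence = ("As power forward for the Miami Heat, Lebron James won two NBA "
            "championships and will be remembered as one of the greatest "
            "basketball players of his generation.")
scores = proximity_scores(content_words(sentence), "Lebron James")
print(scores["heat"], scores["miami"], scores["forward"])  # 1 2 3
```

Run against the example sentence, the sketch reproduces the values from step 3: "heat" and "won" score 1, "miami" and "two" score 2, and "forward" and "nba" score 3.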
The result: If many people refer to Lebron James as "the greatest basketball player"--even if these words never appear on his official website or social profiles--Lebron James may begin to appear in queries for "greatest basketball player."
Take a look at this:
Notice that Lebron James appears multiple times in title, text, and images--and none of those mentions come directly from him or appear on his official website.
This is all fine and good for a celebrity, but what about your brand? The way that you are talked about on the internet--the context in which your brand name naturally occurs--can affect the searches that you appear in, despite your use or non-use of those terms on your own site.
For example, take a look at this:
Through co-occurrence analysis, Google can determine that Pixlr is essentially an "online photoshop"--which is precisely true--without Pixlr using that term on its site at all. It even outranks official Photoshop content. However, what is perhaps more interesting is that you can see Pixlr competitors have optimized their title tags and meta descriptions for "online photoshop," yet they do not rank nearly as well. This is an indication that Google's co-occurrence data can potentially outweigh on-site content in queries.
So, your content strategy needs to dictate not only how you talk about yourself on your own website (hello, style guide/brand book), but also the words that you want to be commonly associated with off-site.
In order to optimize your content, give your content marketing team an idea of the types of contexts (and the semantic constructions) in which you want to be mentioned off-site.
Speaking of context...
I'll keep this one short because it works under roughly the same principle as co-occurrence.
Not only is Google using co-occurrence to determine topic relevancy, it is also using co-citation to determine authority. Similar to co-occurrence, Google is looking for mentions of entities and the links they are commonly associated with.
But instead of me jumping into a long explanation, Haris Bascic has already created a nifty little graphic that explains it beautifully.
The principle is simple: Google can determine whether sites B and C are related by topic, subject, or authority by looking for citations of those sites in proximity to one another. Whether site A places both links on the same page or on different pages within the same site, Google can see that the two sites are related through third-party content.
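As a rough illustration of that principle, the Python sketch below counts how often two sites are linked from the same third-party page. The crawl data, the site names, and the `co_citation_counts` function are all hypothetical, and nothing here reflects how Google actually weighs these signals.

```python
from collections import Counter
from itertools import combinations

# Hypothetical crawl data: each citing page mapped to the sites it links to.
OUTBOUND_LINKS = {
    "site-a.example/roundup": {"site-b.example", "site-c.example"},
    "site-d.example/review": {"site-b.example", "site-c.example", "site-e.example"},
    "site-f.example/news": {"site-e.example"},
}


def co_citation_counts(outbound_links):
    """Count how many third-party pages link to both sites in each pair."""
    counts = Counter()
    for cited_sites in outbound_links.values():
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(cited_sites), 2):
            counts[pair] += 1
    return counts


counts = co_citation_counts(OUTBOUND_LINKS)
for (first, second), n in counts.most_common():
    print(f"{first} and {second} are cited together on {n} page(s)")
```

Here sites B and C are cited together on two pages, so they come out as the most strongly related pair even though neither links to the other. Keying the data by site instead of by page would capture the "different pages within the same site" case.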
This means that Google is associating you with your competitors or partners even if you don't mention each other on your respective sites.
This could have both positive and negative ramifications. If you are being cited on the same sites and in the same articles that mention your competitors or partners, that's a good thing. If you're not, Google may not understand that you are playing in the same field or that you are even a viable competitor or alternative.
So, in terms of content strategy, you need to expand your content beyond talking about yourself. Google is already determining appropriate relationships based on natural linking patterns on the internet, so you should be part of that conversation.
Are you being mentioned in the same places that the movers and shakers in your field are being mentioned? How are you acknowledging that relationship in your own content?
4) Panda and Penguin
Panda and Penguin have been around for a while, so there's nothing new here, but they definitely contribute to Google's quest to provide users with the highest-quality content. I won't spend a lot of time going through these aspects, but I do want to mention the interesting findings of a MathSight study of Penguin.
In trying to determine the traits of content that are correlated with high SERP rankings, MathSight found that content containing a high number of words that lie outside the 5,000 most frequently used words on the internet tended to rank higher than content that used those 5,000 words at an average rate.
So--and this is only conjecture--whether this is a conscious decision or not on the part of the Google engineers, Penguin is not just determining the uniqueness of content based on keyword strings (i.e., can this exact string of words be found anywhere else on the internet?), it is also determining the uniqueness of the style of the content.
In other words, if your content is written in a style that uses language in a unique way (i.e., a recognizable voice), it will probably do better in the results.
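If you want a crude way to check your own copy against this idea, the Python sketch below measures the share of words that fall outside a list of common words. It is my own approximation, not MathSight's method: the `COMMON_WORDS` set is a tiny hypothetical stand-in for a real list of the 5,000 most frequently used words, and the function name is invented.

```python
import re

# Hypothetical stand-in for a list of the 5,000 most frequently used words;
# a real analysis would load such a list from a corpus.
COMMON_WORDS = {"the", "a", "is", "and", "of", "to", "in", "our", "we", "you",
                "best", "product", "great", "for", "your"}


def uncommon_word_ratio(text, common_words=COMMON_WORDS):
    """Return the fraction of words in `text` that fall outside `common_words`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    uncommon = [w for w in words if w not in common_words]
    return len(uncommon) / len(words)


bland = "Our product is the best product for you"
vivid = "Our widget hums like a caffeinated hummingbird"
print(uncommon_word_ratio(bland), uncommon_word_ratio(vivid))
```

With this toy word list, the bland sentence scores 0 and the vivid one scores 5 out of 7. A higher ratio only tells you the vocabulary is less common; it says nothing about whether the writing is any good.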
This means that, if you want to rank well, spinning content in order to get unique grammatical constructions may not be nearly as effective as creating content that has a unique style and voice.
Again, this is one aspect of one study, so take it for what it's worth. But the idea of favoring unique content over mundane content fits nicely with Google's overall goals, so it couldn't hurt to take into account the style and voice of your content across your brand as you execute your content strategy.
Of course, executing and maintaining that style and voice across all aspects of your brand is always the difficult part (wink, wink, style guide).
Schema.org markup isn't technically part of any specific Google algorithm update, but it is important to mention because it contributes to Google’s goals with Hummingbird. And, as a content strategist, it is well worth understanding and implementing on your websites.
If Google is trying to semantically understand the information on the internet, it's going to have a tough time doing it without a little help from webmasters. Sure, triples and co-occurrence are great, but Google is still a long way off from being able to crawl your website and inherently understand what all the information on your site actually means.
So, Google got together with Bing, Yahoo, and other search engines to create a shared markup vocabulary they could roll out to developers and webmasters, so that site owners can help search engines identify and classify content on the web. The result is Schema.org.
Go to the site and you'll find a number of schema HTML tags that you can use to identify content on your site for search engines, like tags for events, prices, product descriptions, authors, organizations, brands, offers, reviews, addresses, phone numbers, and much more.
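For a feel of what this looks like in practice, here is a small Python sketch that builds a Schema.org description of an organization and wraps it for a page. Note the assumptions: it uses JSON-LD, one of the formats Schema.org can be expressed in, rather than inline HTML attributes; the company details are made up; and `to_json_ld_script` is a hypothetical helper, not part of any library.

```python
import json

# Hypothetical organization details, used only for illustration.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co.",
    "url": "https://www.example.com",
    "telephone": "+1-555-0100",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "123 Example Street",
        "addressLocality": "Minneapolis",
        "addressRegion": "MN",
    },
}


def to_json_ld_script(data):
    """Wrap a Schema.org description in a script tag for a page's HTML."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


print(to_json_ld_script(organization))
```

The output is a block you (or your developer) could place in a page's HTML. The "@type" values and property names come from the Schema.org vocabulary, so check the site for the types that fit your content.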
I realize that not every content strategist is an HTML master (I'm far from one, myself). But I recognize the significance of offering Google and other search engines clues to the types of granular content contained on a website, so that they can more easily classify your content for user findability. If your content is tagged and easy for the algorithm to understand, you'll be better off than your competitors who have not yet clued into this aspect of content.
It's kind of like UX design for Google. Make it easy for Google to "understand" your content, and Google will keep coming back for more.
It's a Bird! It's a Plane! ... Well, Yeah, It Actually Is a Bird
Okay, take a breath. I know this is a lot of technical information to throw at you. And it's not the kind of thing that we content strategists are used to dealing with. Historically, the Google algorithm has been the purview of SEOs. But semantic meaning is our realm. It's what we're good at.
Content strategists are adept at creating meaningful experiences through language, images, and design--the gooey, hard-to-quantify stuff we all love. And Hummingbird is the first step in a process to help Google actually measure, classify, and rank relationships between all that gooey content on the internet.
So, if you want to get ahead of Hummingbird, here are some things to keep in mind in your content strategy:
- Focus on becoming an entity through on- and off-site content.
- Get mentioned in contextually and semantically relevant places across the internet.
- If you've been focusing on optimizing your content for keywords, audit it with a focus on forming a unique voice and creating real value for your audience.
- Implement Schema.org markup to help Google "understand" your site more easily.
Hummingbird is not scary. In my opinion, it is one of the best presents Google could give to content strategists. Now it's our turn to take the wheel, understand what it's all about, and create strategies that simultaneously plan for great content and site optimization.