Speakers: Peter Linsley (senior product manager at Ask.com), John Riccardi (Product Manager at Yahoo! Search Europe), Aaron D’Souza (Software Engineer at Google Inc.)
Peter Linsley - Ask.com
He begins by letting us know that Ask.com is the number 6 global web property, and the number 5 global search property with 25% reach in the USA.
He goes on to explain how to control crawl behaviour: user-agent, disallow and crawl-delay in your robots.txt file, and noarchive, noindex and nofollow in robots meta tags. He emphasizes efficiency and recommends compression to save bandwidth. You should avoid duplicate content by using 301 redirects.
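To check what a robots.txt file actually permits, here is a minimal sketch using Python's standard `urllib.robotparser`. The file content, the paths and the delay value are made up for illustration, and using `Teoma` as the user-agent token is an assumption based on the crawler name mentioned in the session. The noarchive, noindex and nofollow controls are page-level and are not covered here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; paths and delay are invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: Teoma
Disallow: /drafts/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The group that names the crawler applies to that crawler.
print(parser.can_fetch("Teoma", "/drafts/page.html"))           # False
# Crawlers without their own group fall back to the "*" group.
print(parser.can_fetch("SomeOtherBot", "/private/page.html"))   # False
print(parser.crawl_delay("SomeOtherBot"))                       # 10
```

Running a check like this before deploying a new robots.txt is a cheap way to avoid locking a crawler out by accident.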
Crawler identification: different user agents may be showing up in your logs.
- Ask Jeeves / Teoma for the web crawler and image crawler.
- Bloglines for the RSS feed crawler
How to be crawled efficiently
- Check crawler permissions, i.e. your robots.txt file
- Clean your site organization: the crawler needs a way through, so provide text links (use a text-only browser to check this). Use site maps to help the crawler and humans along. Remove session IDs etc.
- Character set, language, country: normalize data in 1 format. Unicode is a great choice for character set. Identify locale with meta tags. Language and / or country delivery on the basis of browser settings or IP are okay, but do provide links to other languages.
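The "remove session IDs" advice above can be sketched as a small URL clean-up function. The parameter names in `SESSION_PARAMS` are hypothetical; replace them with whatever your own site uses. This only handles session IDs passed as query parameters, not ones embedded in the path.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names (compared case-insensitively).
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}


def strip_session_ids(url: str) -> str:
    """Return the URL with session-like query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_session_ids("http://www.example.com/products?id=7&PHPSESSID=abc123"))
# http://www.example.com/products?id=7
```

Linking internally to the cleaned URL means a crawler sees one address per page instead of one per visit.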
The Ask site submission has been discontinued. Page quality is considered by the crawler. More information about Ask crawls can be found in the Crawler FAQ.
John Riccardi - Yahoo
John begins by saying that the principles of Ask also apply to Yahoo and that hypertext navigation is very important for Yahoo as well.
He continues to promote Yahoo’s latest service: Yahoo Answers. Yahoo Answers is a community to help users find, use and expand human knowledge. It is available only in English at the moment, but will be released in other major languages later this summer.
Over to algorithmic search:
It is important that there are inbound links to your site’s content in order for it to be visited by the crawler. You can submit your site via RSS or you can enter the URL at Yahoo site submission.
To see what Yahoo has indexed already, you can use the Yahoo Site Explorer. It also shows reverse links, which creates an understanding of how Yahoo views your site. There is also a web services API.
All major changes to the Yahoo index or ranking algorithm are posted on the Yahoo blog.
Yahoo too has multiple crawlers (user agents):
Slurp for Yahoo Search
YahooSeeker for Yahoo Shopping
Yahoo-Newscrawler/3.9 for Yahoo News
Yahoo-MMCrawler/3.x for Image search
Yahoo-MMAudVid/1.0 for video search
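To spot these crawlers in your own logs, a rough sketch: match the user-agent string against the names listed above. Treating each name as a case-insensitive substring is an assumption, and the sample strings in the usage lines are made up. A user-agent header can also be spoofed, so this identifies what a visitor claims to be, nothing more.

```python
from typing import Optional

# Crawler names as listed in the session, mapped to the service they feed.
# More specific names come first so they win over the shorter "Slurp" token.
CRAWLER_TOKENS = [
    ("Yahoo-Newscrawler", "Yahoo News"),
    ("Yahoo-MMCrawler", "Yahoo Image search"),
    ("Yahoo-MMAudVid", "Yahoo video search"),
    ("YahooSeeker", "Yahoo Shopping"),
    ("Slurp", "Yahoo Search"),
]


def identify_crawler(user_agent: str) -> Optional[str]:
    """Return the Yahoo service a user-agent string claims to crawl for, if any."""
    lowered = user_agent.lower()
    for token, service in CRAWLER_TOKENS:
        if token.lower() in lowered:
            return service
    return None


print(identify_crawler("Yahoo-MMCrawler/3.x"))               # Yahoo Image search
print(identify_crawler("ExampleAgent (compatible; Slurp)"))  # Yahoo Search
print(identify_crawler("SomeBrowser/1.0"))                   # None
```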
Redirect handling
Redirects (301 and 302) between domains: Yahoo! keeps the “target”.
Permanent Redirects (301) within a domain: Yahoo! keeps the “source” if it is a root, and the “target” if the redirect is between deep pages.
Temporary Redirects (302) within a domain: Yahoo! keeps the “source”.
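The three rules above can be written down as one small decision function. This is only a sketch of the rules as noted in the session, not Yahoo's implementation. Two assumptions: "same domain" is taken to mean an identical hostname, and a 301 from a deep page to a root, which the rules do not cover, is treated like the deep-page case.

```python
from urllib.parse import urlsplit


def kept_url(source: str, target: str, status: int) -> str:
    """Return which URL is kept under the redirect rules described in the session."""
    if status not in (301, 302):
        raise ValueError("the rules only cover 301 and 302 redirects")

    src, dst = urlsplit(source), urlsplit(target)

    # Redirects between domains (301 and 302): the target is kept.
    # Assumption: "same domain" means the same hostname.
    if src.hostname != dst.hostname:
        return target

    # Temporary redirects within a domain: the source is kept.
    if status == 302:
        return source

    # Permanent redirects within a domain: the source is kept if it is a root,
    # otherwise the target.
    source_is_root = src.path in ("", "/") and not src.query
    return source if source_is_root else target


print(kept_url("http://example.com/", "http://example.com/home", 301))
# http://example.com/
print(kept_url("http://example.com/old", "http://example.com/new", 301))
# http://example.com/new
```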
Aaron D’Souza - Google
Aaron begins by explaining the crawling frequency. The Google index is structured pyramid-style, with the pages at the top being crawled more frequently:
At the top of the pyramid: news
A small set of sites that are crawled quickly (on an hourly basis).
Next: fresh
Sites that change rapidly are crawled daily or every couple of days.
At the bottom of the pyramid: main
The rest of the pages on the web.
Rapidly changing content on your web sites means frequent crawls.
Guidelines DO:
- provide appropriate and relevant content
- submit to web directories (Yahoo and Google)
- let other sites that are relevant link to you
- read the guidelines for webmasters
Google’s goal (and yours): provide what people are really looking for.
Guidelines DON’T:
- cloak
- send automated queries
- hide text or links on the page (this is spotted easily).
The value of signals is being tuned by Google: how would a user react to a page.
Aaron goes on to talk about new stuff at Google: Calendar, Trends, Notebook, Web Toolkit (Ajax via Java), Picasa for Linux, maps coverage of western Europe, local search in FIGS, English - Arabic translation, added features for Sitemaps (feed for mobile devices), Co-op.
Not without pride he tells us that the English - Arabic translation tool was developed by a team of researchers that won an award for machine translation last year (didn’t catch which one).
He elaborates on sitemaps:
- using Google sitemaps helps Google to discover more of your web pages and to prioritize for crawling.
- it directly informs Google of the existence of all your pages
- it enables Google to crawl your site more effectively
- it helps you with crawler issues (you can see crawl success and error distribution).
Moreover, it alerts you to violations of Google’s webmaster guidelines.
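For readers who have not built a sitemap yet, here is a minimal sketch that generates one with Python's standard library. It assumes the sitemaps.org 0.9 XML format, which may differ from the schema Google Sitemaps accepted at the time of the session, and the URLs and priority values are invented. The `priority` element is only a hint about which pages matter most to you.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org 0.9 namespace.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, priority) pairs and return it as a string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # Priority is a relative hint between 0.0 and 1.0.
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages for illustration.
sitemap_xml = build_sitemap([
    ("http://www.example.com/", 1.0),
    ("http://www.example.com/archive/2006/06/", 0.5),
])
print(sitemap_xml)
```

Listing every page explicitly is what lets the search engine learn about pages that have few or no inbound links.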
Co-op
He gives an ego-query example to illustrate the use of Co-op:
- create a personal onebox for your site and target it for your subject (in this case that would be your name)
- get friends and family to subscribe to your feed
- when a subscriber searches for your subject you turn up above the normal results.
The concept: I am an expert on a certain subject and I share my knowledge with other people. This way, you build your own vertical-specific search engine.
Q & A section
The first question is about duplicate content issues on multilingual sites. John Riccardi answers that the geographical hosting is not a factor. Duplicate content is not an issue if the content is in a different language. In the case of .com and .co.uk web sites it could be an issue, and it is better to diversify your content by about 30%. Peter Linsley advises hosting the site on a local IP address. Aaron D’Souza nods at this.
The next question is about rebranding and transferring your web site to a new domain. Aaron says that a 301 redirect transfers the link popularity to the new domain. He says this takes about a week, based on how your pages are distributed in the index. John recommends you leave the 301 there forever (for human users who use the old address). For Yahoo, he says it takes about a week, but it could also be as much as 1 or 2 months depending on the section of the index you are in.
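As an illustration of such a permanent redirect, here is a minimal WSGI sketch that answers every request on the old domain with a 301 to the same path on the new one. The host name `www.new-example.com` is hypothetical, and in practice most sites would configure this in the web server rather than in application code.

```python
# Hypothetical new domain; replace with your own.
NEW_HOST = "www.new-example.com"


def new_location(path: str, query: str = "") -> str:
    """Map a path (and optional query string) on the old domain to the new domain."""
    location = f"http://{NEW_HOST}{path or '/'}"
    if query:
        location += f"?{query}"
    return location


def redirect_app(environ, start_response):
    """WSGI app: permanently redirect every request to the new domain."""
    location = new_location(
        environ.get("PATH_INFO", "/"), environ.get("QUERY_STRING", "")
    )
    start_response(
        "301 Moved Permanently",
        [("Location", location), ("Content-Length", "0")],
    )
    return [b""]


print(new_location("/about", "lang=en"))
# http://www.new-example.com/about?lang=en
```

Keeping the path and query string intact means each old page points at its direct replacement, rather than sending everyone to the new home page.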
What does the “partially indexed” message in Sitemaps mean? Aaron: it means that Google saw a link to the page, it knows that the page exists but hasn’t gotten round to crawling it yet.
Is the identifier meta tag taken into account? John and Aaron: yes
Is there some kind of timetable for tweaking algorithms? Aaron: changes are always occurring. A fundamental change in the infrastructure (new software, new design etc.), as occurred with BigDaddy, allows Google to do much more with the algorithm. Some ranking changes go unnoticed; what really makes certain changes “big” ones is that somebody notices and a buzz is created. John Riccardi says that changes in the Yahoo algorithm occur as often as Yahoo gets them tested, coded and deployed.
Are outbound links taken into account? John: you will not be penalized for not having outbound links. When it comes to inbound links, algorithms are being tuned to find out what is good for users. Link quality is high for links that are placed by users voluntarily. Aaron: search engines want to know what users want. It is the search engine’s job to estimate what users are looking for. Links say “this is what users are looking for.”
Technorati tags: ses - search engine strategies - search - google - yahoo - ask

