Qiankun Zhao's Blog

Recording all the happy and unhappy stuff in my life, as a man, as a bachelor, also as a PhD candidate and a lonely heart abroad :)

11/29/2004

Diving in the Deep End of the Web



The Web is more complex than it seems on the surface. There is a hidden Web that lies below the Web that we see in our daily surfing. This hidden Web contains structured information dynamically generated by online Web databases that aren't easy to access or crawl.

I think this is another way of organizing the huge Web, compared with the Semantic Web approach.

11/25/2004

A talk by MSRA DMS group manager

Today, I attended a talk by Wei-Ying Ma, the manager of the DMS group in Microsoft Research Asia.
His talk focused on web structure and web search. In his opinion, there are many different
types of structure in web data that have not been exploited besides the link structure.
The outline of his talk was as follows:
Web page----------->Link structure [on which PageRank and HITS are based]
---------->Layout structure [block structure in his WWW2004 paper]
---------->Category structure [semantic structure within a group of web pages, such as the structure of search results]
---------->Discussion thread structure [structure of information in a discussion board]
---------->Community structure [object-based structure]
---------->Deep structure [structure of the deep web].

In conclusion, he pointed out that the next-generation search engine should move:
from page to block; from the surface Web to the deep Web; from unstructured data to structured data; from relevance to intelligence; from desktop search to mobile search.

11/23/2004

Search and Suggest



If two brains are better than one, then 100 million brains must be better yet. That's the idea behind Query Graph, a new project from Microsoft Research that combines the thinking processes of everyone on the Web to make search more relevant.

New Ways to Search the Web (From MS Research News and Headlines)
by Suzanne Ross

Sometimes the whole is not greater than the sum of its parts. Sometimes the whole doesn't even represent its parts. Take a Web page for instance. Is all the text on a Web page a variation on the whole? Probably not. There might be weather reports mixed with tips on the newest hairdos, opinion pieces mixed with ads for whiter teeth, articles about national security mixed with links to vacations in Brazil.

What does this mean to you? Poor search results.

Researchers at Microsoft Research Asia have been working diligently on algorithms to fix this. Because a Web page usually contains multiple topics, ranking the search relevance on the entire page isn't always useful. Wei-Ying Ma, the research manager for the Web Search and Mining group, said that they don't treat a Web page as a single unit.

A single Web page contains multiple topics, and different parts of the page have different importance. In addition, the hyperlinks often point to pages on different topics. Every Web page is made up of blocks of information. Some might match your Web search, some might not.

Search engines generally look at each Web page as a unit in assigning a page rank. If the page is viewed as a whole, the rankings might not distinguish advertising content on the page from a feature story, or a feature story from a link. Page rankings can discount the fact that the majority of the page might not have relevant content, but certain blocks of the text might be highly relevant. That means it would rank a page low in your search results even if one paragraph on that page has exactly the info you need. You'll never find it because it's on page ten of the search results.
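To make the contrast concrete, here is a minimal sketch of the classic whole-page approach the article describes: PageRank-style link analysis, which scores each page as a single unit regardless of which block on it actually matches. The four-page link graph is hypothetical, invented for illustration.

```python
# Minimal PageRank sketch over a hypothetical 4-page link graph.
# Each page gets one score for the whole page -- no notion of blocks.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(graph)
# "C" is linked from everywhere, so it ends up ranked highest;
# "D" receives no links and ends up lowest -- even if one paragraph
# on D were exactly what you searched for.
```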

"It is necessary to segment a Web page into semantically independent units or blocks so that noisy information, such as ads, can be filtered out, and multiple topics can be distinguished," said Ma.

The researchers found that breaking the page up using visual cues takes advantage of the characteristics of a Web page. Web pages contain a lot of visual information in HTML tags and properties. Typical visual hints are lines, blank areas, colors, pictures, and fonts. These visual cues make it easy to detect semantic regions or blocks.

They developed an algorithm called Vision-based Page Segmentation (VIPS), which takes various visual cues into account to find the content structure of a Web page. However, they found that VIPS alone didn't completely solve the problem because it didn't account for blocks of varying length. So they used a combined algorithm that considered both visual cues and length normalization.
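The real VIPS algorithm works on the rendered page layout, but the core idea of splitting at visual separators can be sketched with a toy example. The node dictionaries and the blank-gap threshold below are illustrative assumptions, not the actual VIPS data structures.

```python
# Toy approximation of vision-based segmentation (NOT the actual VIPS
# algorithm): split a flat list of page nodes into blocks wherever a
# visual separator appears -- an <hr> tag or a large blank gap.

def segment(nodes, blank_threshold=20):
    """Split nodes into blocks at visual separators."""
    blocks, current = [], []
    for node in nodes:
        is_separator = (node.get("tag") == "hr"
                        or node.get("gap_above", 0) > blank_threshold)
        if is_separator and current:
            blocks.append(current)
            current = []
        if node.get("tag") != "hr":   # separators carry no content
            current.append(node)
    if current:
        blocks.append(current)
    return blocks

# A page mixing a weather story with an ad, as in the article's example:
page = [
    {"tag": "h1", "text": "Weather"},
    {"tag": "p", "text": "Sunny today."},
    {"tag": "hr"},
    {"tag": "p", "text": "Whiter teeth, cheap!"},
]
blocks = segment(page)
# The story and the ad land in separate blocks, so the ad can be
# filtered out instead of diluting the page's relevance.
```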

Once the Web page is segmented into blocks, the researchers can assign value to each block to determine how closely it might match your search query. They look at the position of the block on the page — blocks closer to the center of a page are usually more important. They look at the size of the block, since larger blocks of content will usually dominate the overall meaning of the page.

They also analyze the links on a page to determine block importance. If a link is a navigational link, or a link to an advertisement, the system will rank the block in which the link is contained of lower importance. This helps remove 'noisy' information such as ads, menus, and decoration from the page ranking.
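The three cues above — position, size, and link type — could be combined into a single block score roughly like this. The weighting scheme and field names are made up for illustration; the article does not give the actual formula.

```python
# Hypothetical block-importance score combining the cues the article
# mentions: position (central blocks matter more), size (large blocks
# dominate the page's meaning), and a penalty for navigation/ad links.

def block_importance(block):
    # Position: 1.0 at the page center, falling off toward the edges.
    cx, cy = block["center"]          # normalized 0..1 coordinates
    position = 1.0 - (abs(cx - 0.5) + abs(cy - 0.5))
    # Size: fraction of total page area the block occupies.
    size = block["area"]
    # Links: each navigational or advertising link lowers the score.
    noisy = sum(1 for link in block["links"] if link in {"nav", "ad"})
    return max(0.0, 0.5 * position + 0.5 * size - 0.2 * noisy)

story = {"center": (0.5, 0.4), "area": 0.35, "links": ["content"]}
sidebar = {"center": (0.9, 0.5), "area": 0.10, "links": ["nav", "ad", "ad"]}
# The central feature story outscores the noisy sidebar, so the
# sidebar's content contributes little to the page's ranking.
```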

Though this is still a prototype, they have gotten good results from their initial research. By analyzing the page-to-block relationship, or page layout, and the block-to-page relationship, which is link analysis, they can significantly improve the results you get back on a search query.

How Much is Your Time Worth?



How would you feel if a co-worker barged into your office every few minutes to blurt out updates about their life or project? You might tell them that you are busy, but the damage has been done. You've been interrupted, and getting back on task might be difficult.

11/22/2004

An Exceptionally "EEVL" Search Resource



One of the most respected engineering gateways on the web has just released four new databases providing free access to hundreds of online scientific and technical journals.

There are quite a lot of freely accessible databases on the Web, most of which focus on certain fields, such as CiteSeer, SMEALSearch, etc.
Thanks to all of the providers.

11/21/2004

Google seems more like Microsoft than Microsoft

Not to put too fine a point on it, but Google seems more like Microsoft than Microsoft does. It is like the Borg on Star Trek: Google is a company staffed by legions of cool and hyper-rational PhDs, and it is an irresistible force of nature.

Managing the Firehose of Real-Time Information



RSS feeds, search alerts and other information monitoring technologies are great, but often end up providing too much of a good thing. PubSub is a 'matching engine' that offers a promising new way to keep up to date while alleviating information overload.

AI is the future mainstream

Talk by Adam Bosworth (who left BEA for Google recently) at his ICSOC 2004:

You want to see the future. Don’t look at Longhorn. Look at Slashdot. 500,000 nerds coming together everyday just to manage information overload. Look at BlogLines. What will be the big enabler? Will it be Attention.XML as Steve Gillmor and Dave Sifry hope? Or something else less formal and more organic? It doesn’t matter. The currency of reputation and judgment is the answer to the tragedy of the commons and it will find a way. This is where the action will be. Learning Avalon or Swing isn’t going to matter. Machine learning and inference and data mining will. For the first time since computers came along, AI is the mainstream.

As far as I am concerned, I think our research is becoming more useful hehe. Data mining with XML.