Tom Reamy, who presented on folksonomies and tagging last year, led this session. Since it’s in the San Carlos Ballroom, back to no Wi-Fi access. Then again, from what I’m hearing, even if there’s Internet access in a given room it’s spotty at best.
His talk is about integrating semantics, taxonomy and faceted navigation, with a look at what works and what doesn’t for media sites.
Facet – Orthogonal dimension of metadata
Taxonomy – Aboutness of documents
Ontology – Relationship between entities and facts
Software – Text analytics and auto-categorization
People – Tagging and evaluation of tags, fine-tuning rules and taxonomies, social tagging and suggestions
Facets are metadata attributes (e.g., people, location) and are not the same as categories (which are limited in number and involve aboutness). By orthogonal, Reamy means mutually exclusive: an event is not a person, is not a document, is not a place. Facets have a variety of units and structures, and are designed to be used in combination.
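To make "used in combination" concrete, here is a minimal sketch of facet selection. The documents, facet names and the `select` helper are all invented for illustration and are not from Reamy's talk.

```python
# Hypothetical documents, each tagged with one value per independent facet.
DOCS = [
    {"id": 1, "people": "Obama", "place": "Chicago", "source": "news"},
    {"id": 2, "people": "Obama", "place": "Washington", "source": "video"},
    {"id": 3, "people": "Reamy", "place": "Chicago", "source": "news"},
]

def select(docs, **chosen):
    """Keep documents that match every chosen facet value, in any order."""
    return [d["id"] for d in docs if all(d.get(f) == v for f, v in chosen.items())]

print(select(DOCS, people="Obama"))                   # [1, 2]
print(select(DOCS, place="Chicago", source="news"))   # [1, 3]
print(select(DOCS, people="Obama", place="Chicago"))  # [1]
```

Because the facets are independent of one another, the user can start from any of them and narrow in any order, rather than following one fixed path.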
Faceted navigation is more intuitive to end-users and allows for dynamic selection of categories (rather than forcing the user to go down a single path to find information). It involves fewer elements – 4 facets of 10 nodes each can yield the equivalent of a 10,000-node taxonomy (10 × 10 × 10 × 10 combinations). It is also flexible and easier to maintain.
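The arithmetic behind the 10,000 figure can be checked with a short sketch. The facet names and values below are placeholders I made up; only the "4 facets of 10 nodes" sizing comes from the talk.

```python
from itertools import product

# Hypothetical facets: 4 facets with 10 values each (names are illustrative only).
facets = {
    name: [f"{name}_{i}" for i in range(10)]
    for name in ("people", "place", "source", "date")
}

# Nodes someone has to maintain: the sum of the facet sizes.
maintained_nodes = sum(len(values) for values in facets.values())

# Distinct combinations a user can reach: the product of the facet sizes.
combinations = sum(1 for _ in product(*facets.values()))

print(maintained_nodes)  # 40
print(combinations)      # 10000
```

The maintenance burden grows with the sum of the facet sizes, while the number of reachable intersections grows with their product.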
Taxonomies deal with semantics (meaning, aboutness) and documents and are complementary to facets. Taxonomies support multiple meanings and purposes, and can be relatively small if combined with facets. Formal taxonomies (is-a-kind-of, is-a-part-of) work better than broader classifications.
Ontologies deal with relationships between entities – e.g., Vice Presidents have employees and bosses. They can be represented with XML, RDF, OWL or inference rules.
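As a rough illustration of the Vice President example, relationships can be held as subject–predicate–object triples (the shape RDF uses) with one inference rule on top. This is plain Python rather than an RDF library, and the names, predicates and rule are assumptions of mine.

```python
# Hypothetical facts stored as (subject, predicate, object) triples.
triples = {
    ("alice", "is_a", "vice_president"),
    ("alice", "manages", "bob"),
    ("carol", "manages", "alice"),
}

def infer_reports_to(facts):
    """Inference rule (assumed): if X manages Y, then Y reports_to X."""
    return {(o, "reports_to", s) for (s, p, o) in facts if p == "manages"}

all_facts = triples | infer_reports_to(triples)

def objects(facts, subject, predicate):
    """List every object linked to the subject by the given predicate."""
    return sorted(o for (s, p, o) in facts if s == subject and p == predicate)

print(objects(all_facts, "alice", "manages"))     # ['bob']
print(objects(all_facts, "alice", "reports_to"))  # ['carol']
```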
Best approach is dynamic search and browse. Reamy gave the panda, monkey, banana example (which two terms go together best) which I seem to recall Dave Snowden using at KM World last year.
Sample sites:
- Wine shop (pure facets – uses sliders, which is increasingly common)
- Search engine (Source facet – news, video, pictures, Web – with a few selection filters)
- CNNMoney.com (Source facet again, doesn’t allow for detailed drilldown, too many sponsored results)
- Search engine (includes Source and Date facets)
- New York Times (uses semantic technologies to suggest search terms and to offer related stories, has a Source facet identifying section of the ‘newspaper,’ but semantic technologies have issues – Obituaries is one section Obama appears in)
- Forbes.com (chaotic interface)
- Factiva (true faceted navigation – uses Source facet but has sub-categories, graph showing number of documents in a given date range, clusters of co-occurring terms displayed in a tag cloud, multiple traditional facets such as company but also a Subject facet)
- Financial Times (limited number of results, but auto-summarization. Standard facets but also taxonomic elements – Topics)
Common themes:
- Balance of commerce and information
- Basic facets – Source and Type
- Standard (People, Companies, Place, Industry)
- Interactive interface (sliders, date ranges)
- Keywords vs. simple taxonomy
- Tag clouds/clusters – how genuinely useful are they? Reamy hints the answer is not very, but doesn’t go into details.
Common issues:
- Advertiser dominance
- Auto-ads
- Non-orthogonal facets (Topics and Issues for one client)
- One or two filters (don’t provide enough intersections)
- Semantic component is still the hardest
- Good information architecture – Summary or full facet display? Simplicity vs. research power
Design issues:
What is the right combination of elements? What is the right balance of elements (are all facets treated equally)? When should elements be combined – before or after search?
Tools and approach:
Text analytics software extracts entities and noun phrases from a document or set of documents, and can be designed to feed facets, signatures and ontologies.
Auto-categorization software feeds subject facets. It can be ‘taught’ using training sets, sets of terms (with literal strings, stemming and a dictionary of related terms), simple rules such as position in text, saved search queries or Boolean expressions. Advanced features can include fact extraction and sentiment analysis.
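A toy version of the term-and-rule approach might look like the sketch below. The categories, term lists, the crude `stem` function and the title weighting (standing in for "position in text") are my own assumptions, not how any particular product works.

```python
import re

# Hypothetical category rules: each category lists stemmed terms.
RULES = {
    "finance": {"bank", "loan", "market"},
    "sports": {"team", "score", "league"},
}

def stem(word):
    """Very crude stemmer (assumption): strips a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def categorize(title, body, title_weight=3):
    """Score categories by matched terms; matches in the title count more."""
    scores = {}
    for text, weight in ((title, title_weight), (body, 1)):
        for word in re.findall(r"[a-z]+", text.lower()):
            for category, terms in RULES.items():
                if stem(word) in terms:
                    scores[category] = scores.get(category, 0) + weight
    # Highest score first, ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

result = categorize("Banks cut loans", "The market fell as one team of analysts predicted.")
print(result)  # [('finance', 7), ('sports', 1)]
```

In the enterprise setting described later, output like this would be a suggestion for a human tagger to accept or correct rather than a final answer.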
Entity extraction can be dictionary-based or rules-based. A collection of entities can be the aboutness of a document.
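The dictionary-based variant can be sketched as a simple lookup. The gazetteer entries and the facet each one maps to are illustrative assumptions; real tools use far larger dictionaries and handle name variants.

```python
import re

# Hypothetical gazetteer mapping known surface strings to a facet.
GAZETTEER = {
    "new york times": "Source",
    "factiva": "Source",
    "obama": "People",
    "san carlos": "Place",
}

def extract_entities(text):
    """Return sorted (facet, entity) pairs found by whole-phrase dictionary lookup."""
    lowered = text.lower()
    found = set()
    for phrase, facet in GAZETTEER.items():
        # \b keeps "obama" from matching inside a longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add((facet, phrase))
    return sorted(found)

entities = extract_entities("Obama appears in the New York Times and on Factiva.")
print(entities)
# [('People', 'obama'), ('Source', 'factiva'), ('Source', 'new york times')]
```

The resulting set of entities is one rough stand-in for what the document is about, and each pair could populate a facet value directly.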
Documents are more complicated than products! Facets can’t be an add-on. There has been some progress on semantics. Future of search will be smarter ways to refine results, not better relevance.
When do you add metadata and how? Depends on the environment. Using content management in the enterprise, one can balance taggers, results from software and company policies. Software can suggest categorization and facet values. Relevance is best based on ontologies.
10/27/08 UPDATE: Tom’s presentation is up.
Technorati Tags: IL2008