AUTHOR'S ABSTRACTS: INVITED SPEAKERS
Building biolexical resources: techniques and challenges
Ananiadou, Sophia
New biological discoveries are being reported at an outstanding rate. This information is registered in scientific literature, public databases and ontologies. In order to link biological knowledge with textual information in various resources we need techniques that discover and structure new terms in literature. In order to bridge the gap between biological knowledge (ontologies, controlled vocabularies) and literature, we explore techniques for managing term variability and ambiguity. The aim is to build and update linguistically rich lexical resources for the biomedical domain by discovering new terms and term variants in literature and clustering these into equivalence classes.
UIMA Overview and Approach to Interoperability
Brown, Eric, Slides
IBM began work on the Unstructured Information Management Architecture (UIMA) in 2001 in an effort to enable sharing across IBM's various research projects in unstructured information management and to facilitate technology transfer into IBM products. The UIMA mission quickly grew beyond a purely internal project as UIMA's potential to enable sharing and interoperability between a broad range of independently developed multi-modal analysis components became clear. Over the last several years, IBM has developed UIMA into a robust architecture and framework for unstructured information management, collaborated with several government agencies, universities, and companies on UIMA-based projects, released a UIMA SDK on IBM AlphaWorks, published the source code for the UIMA Framework, created an Apache UIMA incubator open source project, and initiated an OASIS standards effort around UIMA. In this talk I will briefly review the UIMA project, explain the key architectural concepts and components in UIMA, and discuss the UIMA approach to supporting interoperability for unstructured information analysis components.
New Paradigms for Machine Translation and Information Retrieval
Carbonell, Jaime, Slides
Language Technologies, like many other sciences, advance both via gradual improvements in established methods and occasionally via more radical paradigm shifts. Two instances of the latter are presented.

Machine Translation received a major boost with the advent of corpus-based methods: Example-Based MT and especially Statistical MT. However, both paradigms require large quantities of professionally translated parallel text. We present a new paradigm, Context-Based MT, that requires only large quantities of monolingual text, easily acquirable via web spidering and other means. Moreover, the new paradigm yields very high accuracy MT.

Information Retrieval has focused on finding relevant documents, given a user's query. Google goes beyond relevance by adding popularity measures via PageRank. But are there other retrieval criteria, such as novelty, trustworthiness, and comprehensibility? We discuss how such metrics may come into play, especially in learning user profiles to improve retrieval.

SciBorg: Deep Processing and Chemical Informatics
Copestake, Ann, Slides
The objective of the SciBorg project is to use combined deep and shallow language processing methods to produce a semantic representation which is compatible with Semantic Web standards and which will support a variety of information extraction tasks on Chemistry texts. In this talk, I will discuss the progress on the project, concentrating in particular on aspects of the applications which require relatively deep language processing.
Recognizing Contradictions in Text Mining
Harabagiu, Sanda, Slides
Contradictions between multiple text sources can be recognized in text mining by relying on three forms of linguistic information: (a) negation; (b) antonymy; and (c) semantic and pragmatic information associated with the contrast discourse relations. Two views of contradictions are considered, in which novel methods of recognizing contrast and of finding antonymies are described. Contradictions are used for informing fusion operators in question answering and text mining. The talk will discuss experiments that show promising results for the detection of contradictions.
Information Extraction and Annotation as a Methodology for Complex Domain Modeling
Hovy, Eduard, Slides
In the past, Information Extraction (IE) systems focused on certain well-defined classes of entities (people, organizations, locations, etc.) and on certain easily specified kinds of events. But recently, IE seems to have grown increasingly in demand, and the kinds of items desired by domain experts seem to be increasingly complex and hard to define. In our experiments in several domains (biomedicine, government intelligence, and eGovernment), we have found that what the domain expert needs from an IE engine is usually neither easy to extract nor even easy to define. In fact, it seems rather common that experts, in trying to specify what exactly they want, go through a process of discovery that frequently leaves them surprised at how little they understood some of the details of their own domain. Our solution has been to engage experts in a process of annotation, by which they highlight in text examples of what they need, followed by joint analysis and decomposition of the complex concepts present. Following this, we train IE engines to extract the information they desire, and use low performance as an indicator of potential problems in the definition of the target concepts. Using this methodology, we have learned to factor into extractable types some surprisingly sophisticated notions in psychology and to identify argumentation structure in emails.
The NAIST Text Corpus and Predicate-Argument Structure Analysis
Inui, Kentaro, Slides
A crucial part of predicate-argument structure analysis is the identification of omitted arguments of predicates. While having significant overlap with Propbank-style semantic role labeling (SRL), this subtask can be seen as a special case of coreference/anaphora resolution (AR) in the sense that it searches for the antecedents of zero-anaphors (omitted arguments). In spite of this overlap between SRL and AR, there are some important findings that are yet to be exchanged between them, partly because the two fields have been evolving somewhat independently. This talk will start with a brief introduction to our predicate-argument structure-tagged corpus (called the NAIST Japanese Text Corpus) addressing several design issues, and then will present a machine learning-based model for predicate-argument structure analysis that benefits from the state of the art of both SRL and AR, followed by a report on the current results of our experiments.
Bayesian Inference of Grammars
Johnson, Mark, Slides
Even though Maximum Likelihood Estimation (MLE) of Probabilistic Context-Free Grammars (PCFGs) is well-understood (the Inside-Outside algorithm can do this efficiently from the terminal strings alone) the inferred grammars are usually linguistically inaccurate. In order to better understand why maximum likelihood finds poor grammars, this talk examines two simple natural language induction problems: morphological segmentation and word segmentation. We identify several problems with the MLE PCFG models of these problems and propose Hierarchical Dirichlet Process (HDP) models to overcome them. In order to test these HDP models we develop MCMC algorithms for Bayesian inference of these models from strings alone. Finally, we discuss to what extent the lessons learnt from these examples can be put into a unified framework and applied to the general problem of grammar induction. Joint work with Sharon Goldwater and Tom Griffiths.
The Extraction of Enriched Protein-Protein Interactions from Biomedical Text
Haddow, Barry, Slides
There has been much recent interest in automatically identifying protein-protein interactions (PPIs) in biomedical literature, in order to assist the human curation of databases. However, such databases typically require additional information about the interactions, such as the experimental method used to detect the interaction, and the names of any drugs used to influence the behaviour of the proteins. Furthermore, curators may only be interested in interactions which are experimentally proven within the paper, or where the proteins physically touch during the interaction.

This talk describes a system which not only extracts mentions of PPIs from biomedical text, but also enriches those PPIs with additional information of biological interest. The enriched information consists of properties (i.e., functions which map from a PPI to a predefined finite set of classes) and attributes (i.e., relations between PPIs or between a PPI's participating entities and other entities). An example property is *IsProven*, which expresses whether the interaction is (i) experimentally demonstrated in the paper, (ii) referenced from another paper, or (iii) has no specified source. An example attribute is *ModificationAfterEntity*, which relates an entity in an interaction to a modification (such as phosphorylation) that results from that interaction.
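The distinction between properties and attributes can be pictured as a small data model. The sketch below is only an illustration under assumptions: the class names, field names, enum member names, and the example proteins are hypothetical and are not taken from the system described in the talk. A property is modelled as a field whose value comes from a fixed, finite set; an attribute is modelled as a named relation between two entities.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class IsProven(Enum):
    """Property: maps a PPI to one of a predefined finite set of classes."""
    DEMONSTRATED_IN_PAPER = "demonstrated"  # hypothetical member names
    REFERENCED = "referenced"
    UNSPECIFIED = "unspecified"


@dataclass(frozen=True)
class Entity:
    text: str
    kind: str  # e.g. "protein" or "modification" (assumed labels)


@dataclass
class Attribute:
    """Attribute: a named relation between two entities."""
    name: str
    source: Entity
    target: Entity


@dataclass
class PPI:
    participants: Tuple[Entity, Entity]
    is_proven: IsProven
    attributes: List[Attribute] = field(default_factory=list)


# Hypothetical example: an interaction that results in a phosphorylation.
kinase = Entity("ProteinA", "protein")
substrate = Entity("ProteinB", "protein")
phospho = Entity("phosphorylation", "modification")

ppi = PPI(
    participants=(kinase, substrate),
    is_proven=IsProven.DEMONSTRATED_IN_PAPER,
    attributes=[Attribute("ModificationAfterEntity", substrate, phospho)],
)
```

Modelling the property as an enumeration makes the "predefined finite set of classes" explicit, so a classifier over properties is an ordinary multi-class problem.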

In order to apply machine learning to this task, and to provide data for testing, a total of 220 papers selected from PubMed and PubMedCentral were annotated by trained biologists with appropriate entities, relations, attributes and properties. We implemented the PPI extractor using a maximum entropy model trained on a variety of shallow linguistic features (such as context, parts-of-speech, chunks, and simple patterns) that had been extracted from the training data. The association of properties with relations was also implemented with a maximum entropy model, using mainly unigram and bigram features. For attribute recognition, we developed both rule-based and machine-learning approaches. Current performance (F1) is approximately 55% for PPI extraction, 85% for property extraction, and 45% for attribute extraction. These results are very promising relative to the level of inter-annotator agreement on the annotation task.

Information Credibility Criteria Project
Kurohashi, Sadao, Slides
Along with the rapid progress of computers and computer networks, a huge volume of linguistic information such as web documents, emails and enterprise documents has been accumulated and circulated. Such information provides judgement criteria for people's daily lives, and is starting to have a strong influence on governmental policy decisions and enterprise management. A fundamental and necessary technology for a healthy society from now on will be to extract credible information related to a given topic or query from huge document collections, and to organize it, clarifying background, facts, opinions, opinion distribution and so on. This project addresses overall research and development related to such information credibility criteria.
Learning to Rank: A new technology for text information processing
Li, Hang, Slides
Ranking is the key issue for many text processing tasks. They include document retrieval, collaborative filtering, key term extraction, expert finding, important email routing, sentiment analysis, product rating, and anti web spam. In ranking, given a set of entities, the ranking model assigns scores to the entities and sorts the entities in descending order of the scores. The scores may represent the degrees of relevance, preference, or importance, depending on applications. Learning to rank is aimed at automatically creating a ranking model using some training data and machine learning techniques. Learning to rank has been gaining increasing attention in information retrieval, natural language processing, data mining, and other related fields. Many methods for learning to rank have been proposed. In this talk, I will give a survey on learning to rank and introduce recent research on the technology conducted at Microsoft Research Asia.
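The score-then-sort step described above can be sketched in a few lines. This is a minimal illustration, not any method from the talk: the scoring function `term_overlap` is a hand-written stand-in (an assumption) for the model that learning to rank would fit from training data, and the documents are toy examples.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def rank(entities: Iterable[T], score: Callable[[T], float]) -> List[Tuple[T, float]]:
    """Assign a score to every entity, then sort in descending order of score."""
    scored = [(entity, score(entity)) for entity in entities]
    # sorted() is stable, so entities with equal scores keep their input order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def term_overlap(query: str) -> Callable[[str], float]:
    """Hypothetical scorer: count the document tokens that appear in the query."""
    terms = set(query.lower().split())
    return lambda doc: float(sum(1 for word in doc.lower().split() if word in terms))


docs = [
    "ranking models for retrieval",
    "spam filtering",
    "learning ranking models for ranking",
]
ranked = rank(docs, term_overlap("ranking models"))
```

In a learning-to-rank setting only the `score` argument changes: it would be a function whose parameters are estimated from labelled examples rather than written by hand.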
Statistical NLP: From linguistic strip mining to deep linguistic processing
Manning, Chris, Slides
Statistical NLP began by doing very superficial counts of word patterns in text, but recently much work has moved to using machine learning methods on deeper linguistic representations. In this talk I want to focus on two aspects: (1) How much of a gap is there between traditional deep linguistic processing and the alternative tool chain that has emerged as the standard in statistical NLP work? (2) How can we extend machine learning methods to deal with questions of semantics as well as sentence structure?
Text Analysis and Knowledge Mining (TAKMI) at Contact Centers -- From written summary to spoken conversation --
Nasukawa, Tetsuya, Slides
Text mining has been adopted in various customer contact centers for identifying valuable business insights. Because of rising customer expectations and improvements in speech recognition technology, the type of data in contact records for text mining is shifting from written summaries of each contact to complete transcripts of the spoken conversations. This talk gives an overview of text mining applications for contact centers and results of a feasibility study for spoken conversation mining.
Linking two disparate sets of articles in MEDLINE
Smalheiser, Neil R., Slides
Identifying information that implicitly links two disparate sets of articles is a fundamental and intuitive data mining strategy that can help investigators address real scientific questions. The Arrowsmith system finds title words and phrases (so-called B-terms) that are shared across two literatures within MEDLINE and displays them in a manner that facilitates human assessment. Using a public two node search interface, field testers devised a set of two node searches under real life conditions and marked relevant B-terms. These were employed as "gold standards;" each B-term was characterized according to eight complementary features that were strongly correlated with relevance. A logistic regression model was developed that permits one to estimate the probability of relevance for each B-term, to rank B-terms according to their likely relevance, and to estimate the overall number of relevant B-terms inherent in a given two node search.
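The two steps above, finding shared title terms and then scoring them with a logistic model, can be sketched as follows. This is a simplified illustration under assumptions, not the Arrowsmith implementation: it handles single words only, uses a tiny hypothetical stopword list, and the feature names and weights passed to `relevance_probability` are placeholders rather than the eight features of the actual model.

```python
import math
from typing import Dict, Iterable, List, Set

STOPWORDS = {"and", "of", "in", "the", "a", "for"}  # assumed, illustrative only


def title_terms(titles: Iterable[str]) -> Set[str]:
    """Collect the lowercase non-stopword tokens found in a set of titles."""
    return {w for t in titles for w in t.lower().split() if w not in STOPWORDS}


def b_terms(literature_a: Iterable[str], literature_c: Iterable[str]) -> Set[str]:
    """Terms that occur in the titles of both literatures."""
    return title_terms(literature_a) & title_terms(literature_c)


def relevance_probability(features: Dict[str, float],
                          weights: Dict[str, float],
                          bias: float) -> float:
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def expected_relevant(probabilities: List[float]) -> float:
    """Estimated number of relevant B-terms: the sum of per-term probabilities."""
    return sum(probabilities)


shared = b_terms(
    ["Magnesium deficiency and vascular tone"],
    ["Migraine and vascular reactivity"],
)
```

Summing the per-term probabilities is one standard way to turn a probabilistic classifier into an estimate of how many relevant items a search contains; whether the original model computes its estimate exactly this way is not stated in the abstract.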
Creating Biomedical Resources with NLP-based Information Extraction
Park, Jong C.
We have been using combinatory categorial grammar for the accurate extraction of diverse kinds of information from various natural language texts, and creating resources with such extracted information. In this talk, I will focus on the description of our approaches to biomedicine, after briefly introducing the general research methodologies that we take. In particular, I will explain our recent progress in semi-automatic ontology extension, gene summary generation, and E3 database creation.
Knowledge-intensive Approach to Man-Web Interaction
Torisawa, Kentaro, Slides
I'm going to talk about our on-going project that aims at helping Internet users to obtain useful information from the Web and to communicate with each other by using a large-scale knowledge base, which was partially constructed by automatic acquisition from Web documents.
Building Fast and Accurate Taggers for Biomedical Text
Tsuruoka, Yoshimasa
Most bio-text mining systems need part-of-speech (POS) information on the words, and the accuracy of POS tagging has a great impact on the performance of subsequent processes such as parsing and relation extraction. I will talk about how to build a fast and accurate POS tagger with log-linear models, and then present some techniques for domain adaptation, which can reduce the cost for building a tagger customized for biomedical text.
Bootstrapping Relation Extraction Grammars from Semantic Seeds
Uszkoreit, Hans and Xu, Feiyu, Slides
We will present a new minimally supervised machine learning framework for extracting relations of various complexity. Bootstrapping starts from a small set of n-ary relation instances as "seeds" in order to automatically learn pattern rules from parsed data, which then can extract new instances of the relation and its projections. We propose a novel rule representation model that enables the composition of n-ary relation rules on top of the rules for projections of the relation. The compositional approach to rule construction is supported by a bottom-up pattern extraction method working on dependency structures. In comparison to other automatic approaches, our rules can not only localize relation arguments but also assign their exact target argument roles. The evaluation results compare favorably with those of existing pattern acquisition approaches in both recall and precision. For one extraction task a single seed event suffices to get patterns that find most of the relevant events. For another task we need a larger number of seed events in order to get satisfactory performance. We use known results from graph theory to describe the relevant differences between extraction domains and propose some strategies for improving the selection and acquisition of effective seeds.
Protein Function Inference Enhanced by Text Mining
Wong, Limsoon, Slides
Protein function prediction is a key problem in computational biology. It has traditionally been accomplished primarily using "guilt by association" of sequence similarity. However, if good sequence similarity is unavailable, one must appeal to guilt by association of other types of similarity and even to a combination of multiple types of similarity information. In this talk, we first present a framework for the fusion of multiple types of similarity information, and then we investigate simple co-occurrences of protein names in MEDLINE abstracts as a form of similarity information. We demonstrate that, for protein function prediction, (1) our similarity information fusion method works well, (2) simple co-occurrence count gives reasonable sensitivity and precision, and (3) combining multiple information sources outperforms any single information source.
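A simple co-occurrence count of the kind mentioned above can be sketched as follows. This is an illustration under assumptions rather than the method used in the talk: protein mentions are found by exact whitespace-token match against a given name set (real text would need name normalization), and the abstracts are invented toy strings.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Set


def cooccurrence_counts(abstracts: Iterable[str], protein_names: Set[str]) -> Counter:
    """Count, for each pair of protein names, the abstracts mentioning both."""
    counts: Counter = Counter()
    for text in abstracts:
        tokens = set(text.split())
        # Sort so each unordered pair is always stored under the same key.
        present = sorted(protein_names & tokens)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Toy abstracts (invented for illustration).
abstracts = [
    "BRCA1 binds RAD51 in vitro",
    "RAD51 and BRCA1 colocalize",
    "TP53 regulates MDM2",
]
names = {"BRCA1", "RAD51", "TP53", "MDM2"}
counts = cooccurrence_counts(abstracts, names)
```

Each abstract contributes at most one count per pair, so the resulting number is an abstract-level co-occurrence frequency that could serve as one similarity source alongside sequence similarity.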
AUTHOR'S ABSTRACTS: PROJECT PRESENTATION
Linking Text with Knowledge - Challenges in Text Mining for Biology
Tsujii, Jun'ichi
With an overwhelming amount of biomedical knowledge recorded in texts, it is not surprising that there is so much interest in techniques which can identify, extract, manage, integrate and exploit this knowledge, and moreover discover new, hidden or unsuspected knowledge. The number of MEDLINE searches in January 1997 was 0.163 million, compared to 82.027 million in March 2006. MEDLINE contains approximately 15 million bibliographic units and its size is increasing at a rate of more than 10% each year. The demand for tools dealing with ever-increasing knowledge embedded in text is real.

While there are a few TM tools on the market, they hardly satisfy the actual requirements of biologists. Simple application of data mining techniques to text does not work. Since language and text have their own inherent structures, it is essential for TM tools to be able to recognize and exploit these structures to explicitly capture information encoded in them. However, although diverse techniques have been developed in NLP research, they were considered, until very recently, non-deployable for large scale text mining. In this talk, I will argue that NLP-based techniques have become robust and efficient enough for large scale text mining applications, and demonstrate two systems, MEDIE and Info-Pubmed, to show how we can use the technology of deep parsing in real world applications. I will also discuss possible applications of NLP-based Text Mining in the near future.

Acquiring a wide-coverage lexicalized grammar from Penn Treebank
Miyao, Yusuke, Slides
Difficulty in the development of wide-coverage lexicalized grammars has been an obstacle to the application of deep parsing to real NLP tasks. In this talk, I present a method for acquiring a wide-coverage HPSG grammar at low cost. The approach is to convert the Penn Treebank into an HPSG treebank. Since the Penn Treebank includes annotations for traces and coindexes, they are exploited for annotating HPSG-conformant features such as SLASH. I explain several syntactic constructions and describe how they are treated in our approach. The acquired grammar is evaluated in terms of coverage for real-world sentences.
Corpus, Ontology and Annotation: Mapping Natural Language Expressions with Facts
Kim, Jin-Dong, Slides
An ontology of a subject domain defines and classifies entities in the domain. The entities then become vocabulary for discrete descriptions of knowledge or facts. In literature, information is written in natural language texts, which means the information is not directly accessible by computers. Information extraction (IE) is the process of finding information which is originally written in natural language and describing it in a discrete representation. From the perspective of IE, corpus annotation simulates ideal performance, mapping facts in discrete description to the corresponding natural language expressions. I will present our methods and results in annotating corpora to support the development of IE systems for biomedical literature.
HPSG Parsing with Shallow Dependency Constraints
Sagae, Kenji
While linguistically motivated lexicalized grammar formalisms are capable of richer syntactic analysis than many of the common dependency and PCFG-based approaches, building data-driven disambiguation models for complex syntactic structures is generally more difficult than building models for simpler formalisms. I will present a framework for deep syntactic analysis with HPSG parsing that takes advantage of highly accurate models of shallow syntax represented by surface dependencies. By using bilexical surface dependencies to constrain the application of wide-coverage HPSG rules, we can benefit from a number of models designed for high accuracy dependency parsing, while actually performing deep syntactic analysis. The use of recently developed dependency parsing techniques results in a significant improvement in accuracy over state-of-the-art wide-coverage HPSG parsing.
Efficient HPSG Parsing with Supertagging and CFG-filtering
Matsuzaki, Takuya
One of the main difficulties in using lexicalized grammars for large-scale NLP applications has been the inefficiency of parsing caused by the complicated data structures used in the grammars. Recent research showed that supertagging is a very effective technique for improving the efficiency of parsing with large lexicalized grammars. I will talk about a parsing system for HPSG, where the supertagging technique is combined with another pre-processing technique called CFG-filtering. The system gives a six-fold speed-up compared to the best previous result by a supertagger-based parser.
Reranking for biomedical named-entity recognition
Yoshida, Kazuhiro
This research investigates improvement of automatic biomedical named-entity recognition by applying a reranking system to the COLING 2004 JNLPBA shared task of bioentity recognition. The system consists of a pipeline of two statistical classifiers, the n-best tagger and its reranker, both of which are based on log-linear models. The architecture enables the reranker to take advantage of features which are globally dependent on the label sequences, and features from the labels of sentences other than the target sentence. According to our experimental results, the reranker contributed a 1.55-point improvement in F-score over the single-best output of the n-best tagger. We examine the features that contributed to the improvement, and discuss possible directions for further improvements.
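The second stage of such a pipeline, choosing among n-best label sequences with a log-linear model, can be sketched as follows. This is a minimal illustration under assumptions: the feature names (`tagger_logprob`, `consistent_with_other_sentences`), the weights, and the candidate label sequences are hypothetical and are not the features or values used in the work described above.

```python
import math
from typing import Dict, List, Sequence, Tuple

# A candidate is a label sequence plus its (possibly global) feature values.
Candidate = Tuple[Sequence[str], Dict[str, float]]


def log_linear_probs(candidates: List[Candidate],
                     weights: Dict[str, float]) -> List[float]:
    """p(y) proportional to exp(w . f(y)), normalized over the n-best list."""
    scores = [
        sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for _, feats in candidates
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rerank(candidates: List[Candidate], weights: Dict[str, float]) -> Sequence[str]:
    """Return the label sequence with the highest reranker probability."""
    probs = log_linear_probs(candidates, weights)
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best][0]


# Hypothetical 2-best list: the tagger prefers the first candidate, but a
# global feature (agreement with labels in other sentences) favors the second.
n_best: List[Candidate] = [
    (["B-protein", "O"], {"tagger_logprob": -0.5, "consistent_with_other_sentences": 0.0}),
    (["B-DNA", "O"], {"tagger_logprob": -0.9, "consistent_with_other_sentences": 1.0}),
]
weights = {"tagger_logprob": 1.0, "consistent_with_other_sentences": 1.0}
best_labels = rerank(n_best, weights)
```

Because the reranker sees whole candidate sequences, it can use features that a left-to-right tagger cannot compute locally, which is the advantage the abstract attributes to the architecture.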

COPYRIGHT © TSUJIILAB, UNIVERSITY OF TOKYO. ALL RIGHTS RESERVED.