Below are the titles/authors/abstracts of six papers presented
in Session 5D: Biomedical Mining of:

Advances in Knowledge Discovery and Data Mining
8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, 
May 26-28, 2004, Proceedings
Series: Lecture Notes in Computer Science
Subseries: Lecture Notes in Artificial Intelligence, Vol. 3056
Dai, Honghua; Srikant, Ramakrishnan; Zhang, Chengqi (Eds.)
2004, XIX, 713 p. Also available online. Softcover
ISBN: 3-540-22064-X
Publisher: Springer Verlag
Available online in SpringerLink 
DOI: 10.1007/b97861

The information below was kindly gathered by and contributed
by Oriane Matte-Tailliez, Oriane.Matte@lri.fr http://www.lri.fr/~oriane/

Conceptual Mining of Large Administrative Health Data
Tatiana Semenova (1), Markus Hegland (2), Warwick Graco (3) and Graham Williams (4)
(1)  CSL, RSISE, Australian National University
(2)  School of Mathematical Sciences, Australian National University
(3)  Australian Health Insurance Commission
(4)  Mathematical and Information Sciences, CSIRO
Abstract
Health databases are characterised by a large number of records, a large
number of attributes and mild density. This encourages data miners to use
methodologies that are more sensitive to health industry specifics. For
conceptual mining, the classic pattern-growth methods are found limited
due to their great resource consumption. As an alternative, we propose a
pattern splitting technique which delivers as complete and compact
knowledge about the data as the pattern-growth techniques, but is found to
be more efficient.

A Semi-automatic System for Tagging Specialized Corpora
Ahmed Amrani (1), Yves Kodratoff (2) and Oriane Matte-Tailliez (2)
(1)  ESIEA Recherche, 9 rue Vésale, 75005 Paris, France
(2)  LRI, UMR CNRS 8623, Bât. 490, Université de Paris-Sud 11, 91405
Orsay, France
Abstract
In this paper, we treat the problem of the grammatical tagging of
non-annotated specialized corpora. The existing taggers are trained on
general language corpora, and give inconsistent results on specialized
texts, such as technical and scientific ones. In order to learn rules
adapted to a specialized field, the usual approach is to manually label a
large corpus of this field. This is extremely time-consuming. We propose
here a semi-automatic approach for tagging specialized corpora. ETIQ, the
new tagger we are building, makes it possible to correct the base of rules
obtained by Brill's tagger and to adapt it to a specialized corpus. The
user visualizes an initial and basic tagging and corrects it either by
extending Brill's lexicon or by the insertion of specialized lexical and
contextual rules. The inserted rules are richer and more flexible than
Brill's. To help the expert in this task, we designed an inductive
algorithm biased by the correct knowledge he acquired beforehand. By using
techniques of machine learning and enabling the expert to incorporate
knowledge of the field in an interactive and friendly way, we improve the
tagging of specialized corpora. Our approach has been applied to a corpus
of molecular biology.

A Tree-Based Approach to the Discovery of Diagnostic Biomarkers for
Ovarian Cancer
Jinyan Li (1) and Kotagiri Ramamohanarao (2)
(1)  Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore
119613
(2)  Dept. of CSSE, The University of Melbourne, VIC 3010, Australia
Abstract
Computational diagnosis of cancer is a classification problem, and it has
two special requirements on a learning algorithm: perfect accuracy and a
small number of features used in the classifier. This paper presents our
results on an ovarian cancer data set. This data set is described by 15154
features, and consists of 253 samples. Each sample corresponds to a woman
who either has ovarian cancer or does not. In fact, the raw data is
generated by the so-called mass spectrometry technology measuring
the intensities of 15154 protein or peptide-features in a blood sample for
every woman. The purpose is to identify a small subset of the features
that can be used as biomarkers to separate the two classes of samples with
high accuracy. Therefore, the identified features can be potentially used
in routine clinical diagnosis for replacing labour-intensive and expensive
conventional diagnosis methods. Our new tree-based method can achieve the
perfect 100% accuracy in 10-fold cross validation on this data set.
Meanwhile, this method also directly outputs a small set of biomarkers.
Then we explain why support vector machines, naive Bayes, and k-nearest
neighbour cannot fulfill the purpose. This study also aims to
elucidate the communication between contemporary cancer research and data
mining techniques.
Keywords: Decision trees, committee method, ovarian cancer, biomarkers,
classification.

A Novel Parameter-Less Clustering Method for Mining Gene Expression Data
Vincent Shin-Mu Tseng (1) and Ching-Pin Kao (1)
(1)  Department of Computer Science and Information Engineering, National
Cheng Kung University, Tainan, Taiwan, R.O.C.
Abstract
Clustering analysis has been applied in a wide variety of fields. In
recent years, it has even become a valuable and useful technique for
in-silico analysis of microarray or gene expression data. Although a
number of clustering methods have been proposed, they have difficulty
meeting the requirements of automation, high quality, and high
efficiency at the same time. In this paper, we explore the issue of
integration between clustering methods and validation techniques. We
propose a novel, parameter-less, and efficient clustering algorithm,
called CST, which is suitable for analysis of gene expression data.
Through experimental evaluation, CST is shown to outperform other
clustering methods substantially in terms of clustering quality,
efficiency, and automation under various types of datasets.

Extracting and Explaining Biological Knowledge in Microarray Data
Paul J. Kennedy (1), Simeon J. Simoff (1), David Skillicorn (2) and Daniel
Catchpoole (3)
(1)  Faculty of Information Technology, University of Technology, Sydney,
PO Box 123, Broadway, NSW 2007, Australia
(2)  School of Computing, Queen's University, Kingston, Ontario, Canada
(3)  The Oncology Research Unit, The Children's Hospital at Westmead,
Locked Bag 4001, Westmead NSW 2145, Australia
Abstract
This paper describes a method of clustering lists of genes mined from a
microarray dataset using functional information from the Gene Ontology.
The method uses relationships between terms in the ontology both to build
clusters and to extract meaningful cluster descriptions. The approach is
general and may be applied to assist explanation of other datasets
associated with ontologies.
Keywords: Cluster analysis, bioinformatics, cDNA microarray.

Further Applications of a Particle Visualization Framework
Ke Yin (1) and Ian Davidson (1)
(1)  SUNY-Albany, Department of Computer Science, 1400 Washington
Ave., Albany, NY 12222, USA
Abstract
Previous work introduced a 3D particle visualization framework that viewed
each data point as a particle affected by gravitational forces. We showed
the use of this tool for visualizing cluster results and anomaly
detection. This paper generalizes the particle visualization framework and
demonstrates further applications such as determining the number of
clusters and identifying clustering algorithm biases. We don't claim
visualization itself is sufficient in answering these questions. The
methods here are best used when combined with other visual and analytic
techniques. We have made our visualization software that produces standard
VRML available to allow its use for these and other applications.
Our software is available at www.cs.albany.edu/~davidson/ParticleViz. We
strongly encourage readers to refer to the 3D visualizations at the above
address while reading the paper. The visualizations can be viewed by any
internet browser with a VRML plug-in
(http://www.parallelgraphics.com/products/downloads/).