Below are the titles, authors, and abstracts of the six papers presented in Session 5D: Biomedical Mining, of:

Advances in Knowledge Discovery and Data Mining: 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26-28, 2004, Proceedings
Series: Lecture Notes in Computer Science; Subseries: Lecture Notes in Artificial Intelligence, Vol. 3056
Dai, Honghua; Srikant, Ramakrishnan; Zhang, Chengqi (Eds.)
2004, XIX, 713 p. Also available online. Softcover
ISBN: 3-540-22064-X
Publisher: Springer Verlag
Available online in SpringerLink
DOI: 10.1007/b97861

The information below was kindly gathered and contributed by Oriane Matte-Tailliez, Oriane.Matte@lri.fr, http://www.lri.fr/~oriane/

Conceptual Mining of Large Administrative Health Data
Tatiana Semenova (1), Markus Hegland (2), Warwick Graco (3) and Graham Williams (4)
(1) CSL, RSISE, Australian National University
(2) School of Mathematical Sciences, Australian National University
(3) Australian Health Insurance Commission
(4) Mathematical and Information Sciences, CSIRO

Abstract
Health databases are characterised by a large number of records, a large number of attributes, and mild density. This encourages data miners to use methodologies that are more sensitive to health industry specifics. For conceptual mining, the classic pattern-growth methods are found to be limited by their heavy resource consumption. As an alternative, we propose a pattern-splitting technique that delivers knowledge about the data that is as complete and compact as that of the pattern-growth techniques, but is found to be more efficient.

A Semi-automatic System for Tagging Specialized Corpora
Ahmed Amrani (1), Yves Kodratoff (2) and Oriane Matte-Tailliez (2)
(1) ESIEA Recherche, 9 rue Vésale, 75005 Paris, France
(2) LRI, UMR CNRS 8623, Bât. 490, Université de Paris-Sud 11, 91405 Orsay, France

Abstract
In this paper, we address the problem of grammatical tagging of unannotated specialized corpora.
Existing taggers are trained on general-language corpora and give inconsistent results on specialized texts, such as technical and scientific ones. To learn rules adapted to a specialized field, the usual approach is to manually label a large corpus from that field, which is extremely time-consuming. We propose here a semi-automatic approach to tagging specialized corpora. ETIQ, the new tagger we are building, makes it possible to correct the rule base obtained by Brill's tagger and to adapt it to a specialized corpus. The user views an initial, basic tagging and corrects it either by extending Brill's lexicon or by inserting specialized lexical and contextual rules. The inserted rules are richer and more flexible than Brill's. To help the expert in this task, we designed an inductive algorithm biased by the correct knowledge the expert has already acquired. By using machine learning techniques and enabling the expert to incorporate domain knowledge in an interactive and friendly way, we improve the tagging of specialized corpora. Our approach has been applied to a molecular biology corpus.

A Tree-Based Approach to the Discovery of Diagnostic Biomarkers for Ovarian Cancer
Jinyan Li (1) and Kotagiri Ramamohanarao (2)
(1) Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
(2) Dept. of CSSE, The University of Melbourne, VIC 3010, Australia

Abstract
Computational diagnosis of cancer is a classification problem, and it places two special requirements on a learning algorithm: perfect accuracy and a small number of features used in the classifier. This paper presents our results on an ovarian cancer data set. This data set is described by 15154 features and consists of 253 samples. Each sample corresponds to a woman who either has ovarian cancer or does not.
The raw data are generated by mass spectrometry, which measures the intensities of 15154 protein or peptide features in a blood sample from each woman. The purpose is to identify a small subset of the features that can be used as biomarkers to separate the two classes of samples with high accuracy. The identified features could then potentially be used in routine clinical diagnosis to replace labour-intensive and expensive conventional diagnostic methods. Our new tree-based method achieves 100% accuracy in 10-fold cross-validation on this data set, and it directly outputs a small set of biomarkers. We then explain why support vector machines, naive Bayes, and k-nearest neighbour cannot fulfil this purpose. This study also aims to elucidate the communication between contemporary cancer research and data mining techniques.
Keywords: Decision trees, committee method, ovarian cancer, biomarkers, classification.

A Novel Parameter-Less Clustering Method for Mining Gene Expression Data
Vincent Shin-Mu Tseng (1) and Ching-Pin Kao (1)
(1) Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.

Abstract
Clustering analysis has been applied in a wide variety of fields. In recent years, it has also become a valuable technique for in-silico analysis of microarray or gene expression data. Although a number of clustering methods have been proposed, they have difficulty meeting the requirements of automation, high quality, and high efficiency at the same time. In this paper, we explore the integration of clustering methods and validation techniques. We propose a novel, parameter-less, and efficient clustering algorithm, named CST, which is suitable for the analysis of gene expression data.
Through experimental evaluation, CST is shown to outperform other clustering methods substantially in terms of clustering quality, efficiency, and automation across various types of datasets.

Extracting and Explaining Biological Knowledge in Microarray Data
Paul J. Kennedy (1), Simeon J. Simoff (1), David Skillicorn (2) and Daniel Catchpoole (3)
(1) Faculty of Information Technology, University of Technology, Sydney, PO Box 123, Broadway, NSW 2007, Australia
(2) School of Computing, Queen's University, Kingston, Ontario, Canada
(3) The Oncology Research Unit, The Children's Hospital at Westmead, Locked Bag 4001, Westmead NSW 2145, Australia

Abstract
This paper describes a method of clustering lists of genes mined from a microarray dataset using functional information from the Gene Ontology. The method uses relationships between terms in the ontology both to build clusters and to extract meaningful cluster descriptions. The approach is general and may be applied to help explain other datasets associated with ontologies.
Keywords: Cluster analysis, bioinformatics, cDNA microarray.

Further Applications of a Particle Visualization Framework
Ke Yin (1) and Ian Davidson (1)
(1) SUNY-Albany, Department of Computer Science, 1400 Washington Ave., Albany, NY, USA, 12222

Abstract
Previous work introduced a 3D particle visualization framework that viewed each data point as a particle affected by gravitational forces. We showed the use of this tool for visualizing cluster results and for anomaly detection. This paper generalizes the particle visualization framework and demonstrates further applications, such as determining the number of clusters and identifying clustering algorithm biases. We do not claim that visualization alone is sufficient to answer these questions. The methods here are best used in combination with other visual and analytic techniques. We have made our visualization software, which produces standard VRML, available for these and other applications.
Our software is available at www.cs.albany.edu/~davidson/ParticleViz. We strongly encourage readers to refer to the 3D visualizations at the above address while reading the paper. The visualizations can be viewed in any internet browser with a VRML plug-in (http://www.parallelgraphics.com/products/downloads/).