Tools for Systems Biology

Course Description:

We will cover advanced software development techniques such as adpative and aspect-oriented software development. To apply those techniques to interesting problems we need an application domain. Because I spent my sabbatical 2005/2006 at Novartis in the Systems Biology group, we will choose tools for Systems Biology as application domain. Systems biology models the quantitative and simultaneous integration of different biological components and their relationships with one another.

No background in biology and modeling techniques in biology is expected. Students taking this course should have had an undergraduate course in algorithms. Having taken CS G113 oe CS G713 is beneficial but is not expected. Prerequisites: CSG 111 or CSG 711.

Systems Biology is a big area and we will focus on reasoning under uncertainty. Normally, if a system is constrained by a set of constraints, all constraints must hold in a solution. In biology (and other applications), we are not sure that all constraints must hold in a solution. Instead, we assign a positive weight to each constraint that indicates its importance. Instead of satisfying all constraints, we look for a solution that satisfies as many constraints as possible.

The Maximum Satisfiability (MAX-SAT) problem is an example of a constraint satisfaction problem and it plays an important role (also outside Systems Biology). Bayesian Networks are important in Systems Biology and there is a strong similarity between the MPE problem and MAX-SAT (Park 2002). The Most Probable Explanation (MPE) is the problem of finding the variable instantiation of a Bayesian network that has the higest probability given some evidence.

Markov Logic Networks (Pedro Domingos et al. 2005) combine probability and first-order logic in a single representation and to find the most likely world involves solving a MAX-SAT problem. Markov Logic Networks are believed to be important in Systems Biology. In Markov Logic Networks, the constraints are first-order predicate logic formulas. By normalizing and grounding those formulas, a MAX-SAT problem is obtained.

The MAX-SAT problem also plays an important role in inferring protein interactions (Zhang et al.). The SAT problem (MAX-SAT where all clauses are satisfied) is also important in haplotype inference (Ines Lynce et al. 2006). Using SAT, large instances can be solved.

Given the central role of MAX-SAT we will first apply advanced software development techniques to implement MAX-SAT and its variations. We will focus on the Generalized MAX-SAT problem where the constraints are formulated in terms of a fixed set of relations. Generalized MAX-SAT research falls into two categories:

1. Algorithms that do well in practice (based on empirical comparisons)

2. Algorithms that do well in theory. There are two camps: (2.1) algorithms that come provably close to the optimum. There is a large number of specialized results in this area. (2.2) algorithms that provably satisfy a fraction of the total number of constraints. There is a very nice uniform algorithm (called MAXMEAN) that solves the problem in an optimal way: If there would be a provably better algorithm, then P=NP. It is this uniform algorithm that we are going to implement using adaptive and aspect-oriented programming and we will compare its performance with other published algorithms. This comparison has not been done before so there is a good chance that we discover some new, exciting knowledge during the course. Of course, my hope is that MAXMEAN will outperform the currently best algorithms.

Why is MAXMEAN an interesting candidate for implementation?

1. I know it intimately: I have developed it with Ernst Specker while at ETH Zurich and Princeton University: P-Optimal algorithms.

2. Since the algorithm was developed some 30 years ago, the importance of the satisfiability problem steadily increased. There are conferences dedicated to satisfiability: SAT Live.

3. While the original Journal of the ACM paper:

```@article{322260,
author = {K. J. Lieberherr and E. Specker},
title = {Complexity of Partial Satisfaction},
journal = {J. ACM},
volume = {28},
number = {2},
year = {1981},
issn = {0004-5411},
pages = {411--421},
doi = { http://doi.acm.org/10.1145/322248.322260 },
publisher = {ACM Press},
address = {New York, NY, USA},
}
```
has been frequently cited by the theoretical algorithms community, the more practical paper:
```Karl J. Lieberherr, Algorithmic extremal problems in combinatorial optimization,
Journal of Algorithms, Volume 3, Issue 3, , September 1982, Pages 225-244.
(http://www.sciencedirect.com/science/article/B6WH3-4D7JMY4-2X/2/0013e87a3b79ff12cd3c71596646ac77)
```
seems to be forgotten by the empirical algorithms community. One reason is that the Journal of Algorithms paper is safely hidden behind Elsevier's web page although it would be much more accessible to practically motivated computer scientists athan the JACM paper.

3. MAXMEAN is very simple, broadly applicable, provably optimal in a very interesting way and we have an excellent opportunity to compare it with other practically useful MAXSAT algorithms.

MAXMEAN is basically a randomized algorithm that has been derandomized using the Method of Conditional Expectations (Erdoes/Selfridge 1973). The randomized algorithm uses carefully bent coins. MAXMEAN is actually a family of algorithms, one algorithm for each fixed family of relations. Adaptive Programming (AP) is an advanced software development technique that is designed for implementing families of algorithms and it will be interesting to explore AP on MAXMEAN.

Aspect-Oriented Programming (AOP) will be used to incrementally develop the algorithm by adding new features using aspects. AOP will also be used to instrument the algorithm with statistics without modifying the source code of the basic algorithm. AOP will be used to turn the maximization algorithm into a minimization algorithm using an aspect.

There is an extensive literature on empirically comparing algorithms for MAX-SAT. We will have to find the best and compare with our own implementation. Algorithms we should consider are: Discrete Lagrangian Multipliers (DLM) (Wah and Shang 1997) and Guided Local Search (GLS) (Mills and Tsang 2000) and MaxWalkSat. We want a useful general purpose tool for systems biology that is easily adaptable and specializable. An AOP organization should help.

A new programming technology is emerging, called Aspect-Oriented Programming (AOP) and more generally, Aspect-Oriented Software Development (AOSD, see aosd.net). This technology provides well-modularized implementations of crosscutting concerns leading to more flexible and simpler programs. Adaptive Object-Oriented Software Development is an interesting subfield of AOSD focussing on a very useful family of crosscutting concerns, called traversal-related concerns. Conversely, we consider aspect-oriented software development as a special case of a general notion of adaptive programming in that good aspect-oriented programs are adaptive.

This course provides state-of-the-art techniques and concepts for software development with a focus on proper separation of concerns. We will review the history of software development and encounter different techniques for separation of concerns like functions and objects. We will identify limitations in current software development practice that lead to bad separation of concerns. We will touch on general-purpose aspect-oriented techniques (AspectJ) that lead to better separation of concerns. Then we will identify limitations in those general-purpose techniques and point to special purpose aspect-oriented techniques. We will use the Demeter Method as an example of a special purpose aspect-oriented technique.

This course introduces both the aspect-oriented and adaptive approach to software development and compares them to other approaches. Loose coupling between software artifacts is a theme used throughout the course. Specifically, we will learn about loose coupling between structure and behavior which leads to adaptiveness. Adaptive programming views "structure" as an aspect which crosscuts behavior. We will also look at many other aspects such as synchronization and remote invocation. We will study the basic concepts of aspect-oriented programming (an AOP system has five key ingredients) and how they relate to the concepts of adaptive programming.

Course Objectives: Learn new skills in software development which allow you to develop significantly more flexible software. Acquire a working understanding of AOP through the use of AspectJ. Acquire a clear understanding of the adaptive object-oriented paradigm through five architectural patterns; Structure-Shy Traversal, Selective Visitor, Structure-Shy Object, Class Graph and Growth Plan. Apply the aspect-oriented paradigm and the adaptive object-oriented paradigm to problems solving, including the implementation of a project. Understand the connections between the adaptive object-oriented approach and the aspect-oriented approach and how they fit into generative programming.

Qualified students are encouraged to produce a publishable paper as part of this course. A good example is an OOPSLA 2003 publication in the 3D track produced by students in this course. You are encouraged to read this paper early in the course: XAspects: Winter 2003 Quarter OOPSLA 2003 Publication .

```Course Information:

Tuesday 6-9 pm

Required text book: None. Reading material will be assigned
on a weekly basis.

``Adaptive Object-Oriented Software: The Demeter Method
with Propagation Patterns'' by
Karl J. Lieberherr, PWS Publishing Company, ISBN:           0-534-94602-X.

There are several books on AspectJ available. Choose one.

Prerequisites: CSG 111 or CSG 711.
Qualified students may take the class without this prerequisite.

Course requirements:

The grade will be based on an open-book midterm (twenty per cent) and
an open-book final examination (twenty per cent),
and the homework solutions (twenty percent)
which are primarily programming exercises
and the project
(forty percent).
In some courses, there is no final exam and the weight is given
to the project.

Homework: Five homeworks, one project

Teaching Assistant: To be announced.

Course URL:

Computer Language: Java and AspectJ.
AspectJ is learned in the course.
Other languages may be used: for example Scala from EPFL or Scheme.

Project: In the second half of the course you implement a project

```

You are encouraged to use AspectJ in your project. AspectJ is a popular extension to Java and an early version was developed by Dr. Crista Lopes, a former Northeastern PhD student, while working at PARC.

The use of DJ has been very successful and we will continue to initially use DJ to write adaptive programs in plain Java. DJ: A simple tool for Java programmers. Read the DJ Fact Sheet first.

First assignment: answer a questionnaire, and send me your answers by noon on Monday of second week of classes.

UML information you find on the web: Rational, OMG. Look for UML 1.3, the latest version now. UML 2.0 is in preparation.
Using Java compilers at CCS.
Course Directories .
Syllabus (Postscript).
Homeworks.
Links to individual project pages (under construction) .
Recent publications (they are sources for projects). Check the last few entries.
Project ideas.
Lecture Notes .
DemeterJ and AP-Studio Resources.
Course progress.
Old exams (Practice exams) .