The systems seminar is held on Thursdays
at 1:30 in 366 WVH (if available) or else 166 WVH,
during the Fall 2012 semester.
Systems is the study of how to organize computations in the presence
of limited resources (which typically requires _system_ calls). In contrast
to software engineering, systems begins with a problem for which the
machine abstraction is deficient (where "machine" means the machine
presented by the hardware/OS/compiler/run-time library),
and one must find novel ways to improve the abstraction at the
machine level.
We have discussed subjects ranging from parallel computing, to overcoming
system bottlenecks due to RAM/network/disk, to reversible debugging via
checkpointing, as well as a variety of scientific applications. This area of
research typically involves a fine interplay among algorithms, software,
and applications.
-
Nov. 29, 2012
Title: The Recent Port of DMTCP to Android by 0xlab
Speaker: Gene Cooperman (work of Jim Huang and Kito Cheng)
-
About three weeks ago, the 0xlab
in Taiwan announced the port of DMTCP from Linux to
Android. The 0xlab has particular expertise in Android.
Jim Huang and Kito Cheng performed the work. Recall that DMTCP (Distributed MultiThreaded
CheckPointing) allows one to checkpoint a running process to a file, and
to later restart the process from that file. This was announced at the
Embedded Linux Conference Europe 2012, with
their slides here. After hearing about this work via a Google alert,
we wrote to them. The 0xlab team, like the DMTCP team, is
committed to open source software, and the two groups are now discussing
a future collaboration. One of the barriers to the port to Android was
that Android uses Bionic libc, with subtle differences from Linux's use
of GNU libc. Impressively, the 0xlab team had analyzed some subtle DMTCP
internals without assistance from the DMTCP team. Simultaneously with
the work of 0xlab, Michael Coppola had developed a port of MTCP (the
lower layer of DMTCP) from Linux to Android --- work which we hope to
hear discussed in a future seminar. This talk describes the sociology
as well as the technical details of how ideas from two communities
(Android and checkpoint-restart) can lead to powerful synergies.
-
Nov. 15, 2012
Title: Extending DMTCP to support "attach" for Running Processes
Speaker: Jiajun Cao
-
This talk discusses progress in extending DMTCP to support an "attach"
mode similar to GDB's "attach" command. At this time, we are close to
fully supporting this behavior for MTCP (single processes, multi-threaded).
Later, the work will be extended to supporting DMTCP (multiple processes).
This work is joint with Gene Cooperman and Dick Fickling.
-
Nov. 8, 2012
Title: DMTCP Plugins and Aspect-Oriented Programming: A Comparison
Speaker: David Lorenz, Kapil Arya, and Gene Cooperman
-
The standard definition of an aspect is a cross-cutting concern.
A standard example occurs in a multi-threaded program. A thread may create
a new thread, employ a mutex to synchronize between the two threads,
and then join with the new thread on exit. These three pieces
of the program are scattered across the program. An aspect-oriented
program moves these three excerpts of code into a single piece of
aspect-oriented code, which interacts with the rest of the program.
DMTCP plugins have many of the same features as aspects. Kapil Arya
discussed DMTCP plugins in the talk of Oct. 18. David Lorenz is an
expert on aspects. He is a professor at the Open University (Israel).
The three speakers will together lead a discussion trying to understand
better the relationship between aspects and DMTCP plugins.
-
Oct. 25, 2012
Title: A Seminar on Kernel Exploitation
Speaker: Michael Coppola
-
Root is the ultimate goal of hackers. And what better way to obtain
it than by exploiting the kernel itself? In this presentation, we'll
discuss common bug classes in the Linux kernel and learn how to exploit
kernel vulnerabilities through live demonstrations, turning that $ into #.
-
Oct. 18, 2012
Title: Extensible User-space Process Virtualization
Speaker: Kapil Arya
-
This talk will present some of the problems with existing
checkpoint-restart solutions when dealing with various real-world
artifacts. DMTCP, one of the most widely used user-space
checkpoint-restart solutions, was recently updated to provide plugin
support. In the new architecture, the core checkpoint-restart model
assumes that all processes and threads are completely independent of
each other and of the outside world. Target applications are insulated from
the outside world through process virtualization. Communication with
the external world is handled through plugins. The checkpoint-restart
system grows organically by defining a plugin for each new abstraction
(pseudo-terminal devices, SysV shared memory, message queues, semaphores,
etc.).
-
Oct. 11, 2012
Title: Introductory Session
Speaker: Gene Cooperman
-
In this continuation of the previous week, students will continue
to give impromptu short talks.
-
Oct. 4, 2012
Title: Introductory Session
Speaker: Gene Cooperman
-
This first seminar of the Fall will review the various systems projects
underway in our group. There will be a special emphasis on introducing
those projects for anyone new to our seminar.
-
Apr. 19, 2012
Title: Overview of Adapting Cilk to FReD
Speaker: Samaneh Kazemi
-
Cilk is a multithreaded language which extends C/C++ by introducing two
keywords, "spawn" and "sync". A typical bug in multithreaded programs is
a data race, which is hard to debug. Using FReD lets one execute Cilk
programs in the backward direction, and thereby debug them more easily
and quickly. However, Cilk uses inline assembly instructions to implement
read/write memory barriers. The first issue is to reimplement these using
pthread locks, so that one can use FReD wrappers for replaying the barriers.
A second issue for Cilk concerns its use of the rdtsc assembly instruction
to get a timestamp. FReD's use of DMTCP checkpoint-restart will not
virtualize 'rdtsc' properly: since this instruction returns the number of
clock cycles counted by the hardware, its value will be inconsistent at
the time of restart. Replacing 'rdtsc' with the gettimeofday() function is
the second approach. After modifying the Cilk source code along these lines,
Cilk works successfully with reverse-step and reverse-next for
some simple Cilk examples. Current work is concentrating on improving
FReD so that its reverse-watch and other commands can be applied to more
sophisticated programs.
-
Apr. 12, 2012
Title: Checkpointing a Virtual Machine and Its Implications
Speaker: Rohan Garg
-
Taking snapshots of the state of an operating system (OS) is one of the
de-facto features in most of the available Virtual Machine Monitors. In
this talk, we present the ability to checkpoint and restore the state
of a guest operating system running in a virtual machine monitor. This
gives us the benefit of having an almost "real-time" and transparent
record-replay log for a complete OS and any processes within it. This can,
potentially, have a significant impact on the techniques where virtual
machines have been employed for doing malware analysis and debugging.
We extend DMTCP, a Linux-based tool for transparent checkpoint-restart,
to enable checkpointing of Qemu running in user space. Evaluations
with checkpointing of different Linux distributions and with checkpointing
of Microsoft Windows XP, each at different states of operation, show
promising results. Future work to extend this functionality for malware
analysis is also presented.
-
Mar. 15, 2012
Title: Migrating Parallel Algorithms to Hybrid CPU-GPU Clusters
Speaker: Vlad Slavici and Raghu Varier
-
For abstract, see talk below from Feb. 9, 2012.
-
Mar. 7, 2012 (2:30 p.m., 366 WVH) (unusual time)
Title: Migrating Parallel Algorithms to Hybrid CPU-GPU Clusters
Speaker: Vlad Slavici and Raghu Varier
-
For abstract, see talk below from Feb. 9, 2012.
-
Mar. 1, 2012
Title: DMTCP, checkpointing, and the ARM CPU
Speaker: Gene Cooperman
-
The ARM CPU is becoming widely used for general purpose Linux-based
computing. The widely used DMTCP package in Linux, for transparent
checkpoint-restart, has recently been ported from x86/x86-64 to ARM.
The port of DMTCP was made easier because DMTCP stays close to the
POSIX API. This explicit port of DMTCP will provide insights into which
are the few parts of DMTCP that are CPU architecture-specific.
-
Feb. 23, 2012
Title: User-Space Process Virtualization and DMTCP
Speaker: Kapil Arya
-
Virtualization in the context of Virtual Machines (VM) is a well-known
and proven technology. The virtual machine approach provides a
full-featured solution, but it is also a relatively heavyweight
approach, requiring large storage and checkpoint times on the scale of
a minute. For many applications, a lightweight solution to
virtualization is desired for better performance and finer-grained
control. In this talk we will discuss lightweight process-level
virtualization. We will discuss the architecture needed to support
this kind of virtualization, along with an implementation, DMTCP.
-
Feb. 16, 2012
Title: Data Quality and Query Cost in Pervasive Sensing Systems
Speaker: David Yates (Bentley University)
-
This research is motivated by large-scale pervasive sensing
applications. We examine the benefits and costs of caching data for such
applications. We propose and evaluate several approaches to querying for,
and then caching data in sensor field data servers. We show that for
some application requirements (i.e., when delay drives data quality),
policies that emulate cache hits by computing and returning approximate
values for sensor data yield a simultaneous quality improvement and
cost saving. This win-win arises because, when system delay is sufficiently
important, the benefit to both query cost and data quality achieved by
using approximate values outweighs the negative impact on quality due
to the approximation. In contrast, when data accuracy drives quality,
a linear trade-off between query cost and data quality emerges. We also
identify caching and lookup policies for which the sensor field query rate
is bounded when servicing an arbitrary workload of user queries. This
upper bound is achieved by having multiple user queries share the cost
of a single sensor field query. Finally, we demonstrate that our results
are robust to the manner in which the environment being monitored changes
using models for two different sensing systems.
Bio:
David is an Associate Professor of Computer Information Systems at Bentley
University. David's research areas include computer networking, data
communications, sensor networks, embedded systems, operating systems,
and computer architecture. With various colleagues, he holds several
U.S. patents for processes and systems related to computer networking,
content management, and mobile computing. David earned a PhD and MSc
from the University of Massachusetts. He also holds a BSc from Tufts
University.
-
Feb. 9, 2012
Title: Migrating Parallel Algorithms to Hybrid CPU-GPU Clusters
Speaker: Vlad Slavici
-
Hybrid CPU-GPU clusters are the fastest growing technology in
supercomputing today. It is expected that in a few months
the Titan CPU-GPU hybrid from Oak Ridge National
Laboratory (ORNL) will be the world's fastest supercomputer
(~20 PetaFLOPS). The main reasons for the fast adoption of GPUs
in HPC are the high 'performance to power' and 'performance to
cost' ratios.
We present a Programming Model that allows efficient migration
of compute-intensive parallel algorithms from CPU clusters to
Hybrid CPU-GPU clusters by increasing the flexibility of data
and control flow and allowing the runtime environment to adjust
the flow depending on the target architecture (in this case GPUs).
The programming model is demonstrated on MADNESS, a
high-performance scientific framework.
The challenges behind migrating legacy algorithms to the emerging
CPU-GPU architecture will be described, together with current and
proposed solutions. This work is joint with Raghu Varier.
-
Feb. 2, 2012
Title: No talk; instead, see the systems talk by Tony Skjellum at a similar time, below:
-
Gibraltar: GPU-Accelerated Software RAID for HPC
Anthony Skjellum, University of Alabama at Birmingham
Date: Thursday, Feb 2
Time: 2:30 PM
Location: 442 Dana, ECE Dept.
In this talk, we present the jointly developed UAB-Sandia Gibraltar
library, which commoditizes RAID by first choosing software RAID, then
performing middleware acceleration on that RAID using NVIDIA-based GPUs.
The ability to compute parity stripes at full rate with 2, 3, 4, or
more parity disks allows storage to become survivable for long periods
of time, and in particular can address triple-disk failure and batch
correlated failures. Gibraltar RAID is discussed, and the error rates that
motivate this work are discussed as well. We are continuing to develop
and extend this technology, which initially was designed to serve blocks
via iSCSI and iSER. Current and future plans, including local striping
for multi-cloud-vendor storage, are mentioned. Current estimates are that
practical petabytes of storage can be constructed for under $200,000,
including a large amount of excess computational power, to enable
"computation close to the storage." The general area of middleware
acceleration through GP-GPU computing is also discussed.
-
Jan. 26, 2012
Title: State of the Lab
Place: Probably 370 WVH; 164 WVH if 370 WVH is occupied
Speaker: Gene Cooperman
-
This talk will present a state-of-the-lab description
of where we are and where we are going. As always, this is a public
seminar, and everybody is welcome.
-
Dec. 6, 2011
Title: Source-Level Transformation of Legacy Sequential Program into Scalable
Thread-Parallel Software
Speaker: Xin Dong
-
This dissertation will focus on code transformation for legacy
sequential programs toward thread-level parallelism on many-core
computers. It highlights a semi-automatic method: Scalable Task-Oriented
Code Transformation (STOCT). STOCT introduces thread parallelism for
sequential code via a series of code transformation steps. STOCT's
goal is: Given a large C/C++ software program, transform the source
code to replace several independent copies of a sequential process
with an equivalent single process consisting of several threads.
This takes advantage of many-core computers in a memory-efficient
and scalable manner.
The STOCT methodology is implemented in the parallelization work for
three widely varying applications: (i) a simulation toolkit Geant4;
(ii) a linear system solver based on ILU(k) preconditioning; and
(iii) a large-scale data analysis tool AliRoot. STOCT bridges the gap
from sequential programming to concurrent programming by decomposing the
parallelization into four subgoals: to extract more parallelism, to share
more data, to guarantee correctness, and to achieve scalability.
STOCT has the following features: (i) to pursue thread safety via
maximal parallelism, in which only a minimal shared data set is allowed;
(ii) to reduce the memory footprint by sharing some user-level data
whose value no longer changes after initialization; (iii) to
guarantee runtime correctness in a weaker sense that is nevertheless
strong enough for production runs; and (iv) to achieve scalability by
eliminating two serious performance bottlenecks specific to shared-memory
computation: memory allocation/deallocation and cache coherence misses.
-
Nov. 29, 2011
Title: Temporal Meta-Programming: Treating Time as a Spatial Dimension
Speaker: Ana-Maria Visan
-
Reversible debuggers have existed since the early 1970s. However, they are not
widely used, with the possible exception of GDB; and even GDB's reverse
execution is practical for only a few seconds of a program's run. We
hypothesize that the lack of success of reversible
debuggers is due to the fact that executing backwards in search of an
underlying bug often requires manual execution of thousands of reverse
commands, and is therefore often impractical.
The challenges of building a practical reversible debugger will be
discussed, first for the simpler case of a single-threaded application
and then for multithreaded applications. Reversibly debugging
multithreaded applications on today's multi-core hardware is a
daunting task due to the lack of determinism. Two reversible
debuggers, URDB and FReD will demonstrate the feasibility of our
approach.
We also propose a new generation of tools on top of a reversible
debugging platform. The tools will automate the temporal search for
the earlier underlying bug. A particular tool, reverse expression
watchpoint, will be emphasized as a particularly useful key to this
automation.
-
Nov. 22, 2011
Title: A Programming Model for hybrid CPU-GPU Clusters
Speaker: Vlad Slavici
-
In the past few years, Graphical Processing Units (GPUs) have become
increasingly present in the High-Performance Computing (HPC) community. Three
of the Top 10 supercomputers revealed last week at Supercomputing (SC11) are
hybrid CPU-GPU clusters (the most performant at #2), whereas the top spot
is retained by K (Japan), with a measured performance of ~10 PFLOPS on
Linpack.
It is expected that by next year the top spot will be held by a CPU-GPU hybrid
(the upcoming Titan from Oak Ridge National Laboratory (ORNL), which will be
using NVIDIA Tesla, Fermi generation, is expected to peak at 20 PFLOPS).
The main reasons for the fast adoption of GPUs in HPC are the
high 'performance to power' and 'performance to cost' ratios.
However, the migration to hybrid CPU-GPU clusters brings with it a significant
challenge: both new and legacy software and applications have to be (re)written to
comply with GPU programming models.
This talk will present a general programming model for hybrid CPU-GPU clusters
that is currently being employed by the speaker and Raghu Varier (ECE,
Northeastern U.) to efficiently adapt legacy code in
MADNESS
(a high-performance scientific library and framework) for CPU-GPU clusters.
-
Nov. 15, 2011 (10:30 a.m.: note unusual time)
Title: Deferred in favor of external speaker from Symantec
Speaker: Marc Dacier, Symantec (host: Engin Kirda)
-
There is a talk at 10:30 a.m., hosted by Engin Kirda. This talk
is not part of the systems seminar series, but because its topic
is of close interest to the systems seminar, there will
be no systems seminar this week. The title and abstract are being
forwarded to the systems seminar mailing list.
-
Nov. 8, 2011 (note: may start a little after 1:30)
Title: Green HPC: A Systems Approach to Energy Efficient Datacenters
Speaker: Kurt Keville (MIT)
-
Green HPC is the new standard for High Performance Computing (HPC).
This has now become the primary interest among HPC hardware
researchers because of a renewed emphasis on Total Cost of Ownership
(TCO) and the pursuit of higher performance. In this seminar, we
analyze the problem holistically and describe best practice fixes to
solve the problems of energy-inefficient HPC.
-
Nov. 1, 2011
Title: Feedback on FReD
Speaker: Gene Cooperman
-
As a result of presenting FReD at Indiana University and Boston University,
this speaker received a certain amount of feedback. This talk will include
an abbreviated presentation of the FReD talk that was given, along with some
examples of useful feedback and potential collaborations. As time permits,
we will also jointly review the current status of the FReD project.
-
Oct. 25, 2011
Title: Research Exchange with Mathematics Department
Speaker: Gene Cooperman, Alex Suciu, Jerzy Weyman
-
Professors and students will present computational research projects
from each of the two departments (Computer Science and Mathematics).
Professors Suciu and Weyman and their students will be speaking
from the Mathematics Department. Each of the students in the
High Performance Computing group should also be prepared to give
a short, informal presentation.
-
Oct. 11, 2011 (2:30 p.m. -- NOTE UNUSUAL TIME)
Title: URDB: A Universal Reversible Debugger Based on Decomposing Debugging
Histories
Speaker: Ana-Maria Visan
-
Reversible debuggers have existed since the early 1970s. A novel approach,
URDB (Universal Reversible DeBugger), is presented based on checkpoint,
restart, re-execute and decomposition of debugging histories. URDB adds
reversibility to a debugger, while still placing the end user within the
familiar environment of their preferred debugger. The URDB software layer
currently includes modes that understand the syntax for four debuggers: GDB
for C/C++/Java/Fortran, Python (pdb), MATLAB, and Perl (perl -d). It does
so by adding a thin URDB software layer on top of the DMTCP checkpoint-restart
package (through checkpointing the Linux ptrace system call). URDB passes
native debugging commands between the end user and the underlying debugging
session. URDB models the four common debugging primitives that form the basis
for most debuggers: step, next, continue, break. For example, given a debugging
history of the form [step, next, step], URDB's reverse-step produces a new
history, [step, next] (this is equivalent to an undo command). Further, subtle
algorithms are described for the more interesting cases that are not undo
commands (e.g. [continue, next], followed by a reverse-step).
-
Oct. 4, 2011
Title: State of the High Performance Computing Laboratory
Speaker: Gene Cooperman
-
This talk will be an overview of the activities of the High Performance
Computing Laboratory. Items of interest will include:
- ongoing work with FReD (Fast Reversible Debugger);
- collaboration with Oak Ridge National Laboratory
(Madness
software);
- cooperation with NeuroDebian
(easy checkpointing built into NeuroDebian for thousands of users);
- mature multi-threaded Geant4MT (plans for imminent distribution);
- checkpointing of Cilk
using DMTCP (anyone interested?);
- checkpointing of Qemu
using DMTCP (anyone interested?);
- issues in making Condor
work with DMTCP and compiled Matlab;
- the status of the DMTCP module system (now robust, record-replay module,
many new module events defined);
- creating the missing DMTCP wrappers (epoll, inotify, and eventfd);
- DMTCP and OpenMP
(a more robust tid virtualization);
- thread-safe trampolines for DMTCP;
- DMTCP-based process migration between Linux kernel versions
(subtle issues; strong interest at Cadence
corporation (anyone interested?));
- other opportunities (DMTCP "attach" based on new ptrace feature allowing
injection of new threads); (DMTCP and checkpointing
OpenGL 3-D graphics)
- DMTCP and the distros (DMTCP in Debian testing;
DMTCP in OpenSuseFactory;
better packaging as candidate for Fedora);
- issues of DMTCP checkpointing and supercomputing (IBM Blue Gene, upcoming
Cray Titan at ORNL).
Issues concerned with FReD include:
making memory-accurate record-replay robust (now working);
making FReD work with Firefox and Apache (in addition to MySQL); and
extending the reverse-watch algorithm to cases where an auxiliary
thread (not the primary thread) causes the main expression to change.
Issues concerned with the Madness software include:
a new abstraction for "Big Data" for Madness;
checkpointing over Infiniband; and
reversibility for real-world software (Madness).
-
Sept. 13, 2011
Title: Compression of Program State Checkpoints
Speaker: Artem Y. Polyakov
-
The complexity of modern distributed systems raises new issues, and fault
tolerance is one of them. Checkpointing of distributed program state
is one of the main approaches to achieving fault tolerance. The major
drawback of this technology is the heavy I/O load that occurs during
creation of a checkpoint. This work addresses that issue using
a compression approach: both delta-compression and compression by
traditional programs (e.g. gzip) are used for this purpose.
Results and a comparison with existing approaches are given.
-
Apr. 21, 2011
Title: Computational Kernels for Space-Limited Applications: Computing
with Parallel Secondary Storage
Speaker: Vlad Slavici
-
-
Apr. 14, 2011
Title: Solving the Double Swap Problem: How Multiple
Virtual Machines Can Share the RAM Better
Speaker: Kapil Arya
-
In a virtual machine in which a guest OS is running applications,
a well-known problem is that of double-swapping. Given ten virtual
machines on a host, most customers will not provision the host with ten
times as much physical memory. So, in order to accommodate more and more
virtual machines on a single host, most modern hypervisors overcommit
the available RAM. They virtualize the physical RAM to make each virtual
machine believe that it has the full amount of physical RAM. Two standard
techniques for accommodating this are page sharing and ballooning.
But when the memory runs really low, the hypervisor has to fall back on
transparently swapping (a.k.a. paging) out to disk the virtual machines'
memory. However, the guest OS, unaware of hypervisor swapping, may
overcommit its own physical memory and resort to swapping out that
same memory. Now the hypervisor needs to swap the memory back in to
allow the guest OS to swap it out. This results in a double swap.
In this talk, I will present the work done under the supervision of
Alex Garthwaite at VMware over the last three summers: the
first fully working solution to the double swap problem that does not
require para-virtualization.
-
Apr. 7, 2011
Title: So You Replayed your Threads and Now They're Hanging --- What Next?
Speaker: Xin Dong
-
The recent work on a reversible debugger ("FReD: A Fast Reversible
Debugger", by Visan et al.) is able to record most system calls
and synchronization operations during the execution of multi-threaded
programs, and to replay them while enforcing the original order of
execution of those events. This leads to a style of weak determinism that
suffices to facilitate debugging and testing for the majority of
parallel programs. For example, the previous work had demonstrated
deterministic replay for 10 of the 12 programs of the Splash-2 test suite, and
found that an assumption of weak determinism sufficed to replay 8 of
the 12 programs. However, some parallel computations (such as
the remaining 4 SPLASH-2 programs) include intentional data races,
in which FReD may fail to reproduce the recorded execution.
This talk illustrates some of the benign synchronization patterns that
nevertheless lead to replay failure for the remaining 4 programs. This
work also discusses where a programmer can enforce replay determinism,
in some cases by adding additional locks to the source code.
The modified code led to successful replay of 3-1/2 of the
remaining 4 programs of the SPLASH-2 test suite.
-
Mar. 10, 2011
Title: Trip Report: Portland State University
Speaker: Gene Cooperman
-
Portland State University is one
of three major universities in Oregon. They have on their faculty
a Turing award winner (Ivan Sutherland, technical head of the Asynchronous Research Center), and
two ACM Fellows, David Maier and Bryant York. They also have a large group
of programming language researchers interested in Haskell (many of whom
came from the old Oregon Graduate Institute, OGI). I spoke there on FReD
(Fast Reversible Debugger). This talk will review some of the discussions
and insights from the trip.
-
Mar. 3, 2011
Title: Caracal: Dynamic Translation of Runtime Environments for GPUs
Speaker: Rodrigo Dominguez
-
Graphic Processing Units have become the platform of choice for
accelerating a large range of data parallel and task parallel
applications. Both AMD and NVIDIA have developed GPU implementations
targeted at the high performance computing market. The rapid adoption of
GPU computing has been greatly aided by the introduction of high-level
programming environments such as NVIDIA's CUDA C and Khronos's
OpenCL. Given the fact that CUDA C has been on the market for a
number of years, a large number of applications have been developed in
CUDA by developers in the HPC community.
In this presentation we describe Caracal, our implementation of a
dynamic translation framework that allows CUDA C programs to run on
alternative GPU and CPU platforms. In this presentation we target the AMD
Evergreen family of GPUs. We discuss the challenges of compatibility
and correctness faced by the translator using specific examples. We
also compare the quality of the code generated by our framework with
that produced by the AMD OpenCL library. Our dynamically translated
code performs comparably to the OpenCL library and provides a vehicle
for evaluating compiler optimizations in the future.
-
Feb. 17, 2011
Title: New Ideas for the Roomy-Based Binary Decision Diagram Package
Speaker: Vlad Slavici
-
The Roomy-based binary decision diagram (BDD) package has demonstrated
that operations with very large BDDs can be efficient in a
parallel external memory environment. Operations like
apply, SAT-count or any-SAT are already implemented in the package.
However, previous attempts at developing efficient reachable sets
programs or dynamic variable reordering for very large BDDs resulted
in algorithms that were too slow for typical use cases.
This talk will briefly review these previous attempts and then
describe new ideas for algorithms which could make these two
operations efficient for parallel disk-based computing.
-
Feb. 10, 2011
Title: Checkpointing OpenMPI with MTCP
Speaker: Alex Brick
-
OpenMPI is a message-passing
infrastructure that is used by many distributed and high-performance
computations. However, in addition to communicating between processes
and nodes in a computation, it provides other facilities, including
the ability of computations to be checkpointed. This talk will discuss
how OpenMPI currently handles checkpointing, and what has been done in
order to allow OpenMPI to use MTCP transparently. MTCP is the
single-process component of the larger DMTCP software package. The talk
will also include a discussion of the problems that were encountered
in interfacing with the OpenMPI system and some of the design questions
that it has raised for the future of MTCP.
-
Feb. 2, 2011
Title: Measuring and Predicting SSD Performance
Speaker: Peter Desnoyers
-
The talk presents the architecture of current NAND Flash SSDs, along with
common algorithms for the Flash Translation Layer (FTL)
and opportunities for improvement through an OS-based Flash filesystem.
-
Jan. 27, 2011
Title: Trip Report: Oak Ridge National Laboratory
Speaker: Gene Cooperman
-
Oak Ridge National Laboratory is
a national research laboratory that receives much of its funding
from the U.S. Department of Energy. It is a center of high-end
computational science in areas ranging from climate modelling
to neutron science to bioinformatics. I recently gave a talk there
on FReD (Fast Reversible Debugger), and also discussed the needs of
HPC for checkpointing. Most of our work at Northeastern to date has
concentrated on low-end and mid-range computational science. The needs of
high-end computing for better debugging support are at least as large,
but are reflected through different technologies. For example, at the
extreme high-end, one may prefer application-level checkpointing of only
the needed data, and tricks like narrowing (converting double precision
to single precision) to reduce disk bandwidth needs. To what extent
could nonvolatile semiconductor memory ease the checkpointing problem
for exascale computing? In addition, GPU computing is growing fast.
How does one reversibly debug (or even just checkpoint) a GPU computation?
What are the high-end applications needing large intermediate storage
(e.g., needing Roomy)?
-
Jan. 20, 2011
Title: Algorithms for Large Finite State Automata
Speaker: Vlad Slavici
-
Finite state automata (FSA) are ubiquitous in computer science. Network
packet filtering, pattern matching, speech recognition and lexical
analysis are just a few areas in which FSA are used. FSA are so widespread
because they express regular expressions/regular languages. The fact
that the class of deterministic finite state automata (DFA) and the
class of non-deterministic finite state automata (NFA) are equivalent is
part of the foundation of theory of computation. Converting an NFA to
a DFA (by the well-known subset construction) leads to a combinatorial
explosion of the number of states. This talk will focus on the case when
the combinatorial explosion of states cannot be kept in RAM. Roomy was
used to store the states on the aggregate disks of a commodity cluster.
The resulting DFA was then minimized by a forward refinement technique
adapted for disk. The technique is based on RAM-based parallel
algorithms from the 1990s. The parallel disk-based computations extended
previous RAM-based results by four new cases, and provided strong
experimental evidence for a new conjecture about a theoretical
sorting problem expressed as a series of token-passing networks.
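As a reminder of where the state explosion comes from, here is a minimal
in-RAM sketch of the textbook subset construction (not the Roomy-based
disk version discussed in the talk).

```python
from collections import deque

def subset_construction(nfa, start, alphabet):
    """Convert an NFA to a DFA.  nfa maps (state, symbol) to a set of
    successor states; each DFA state is a frozenset of NFA states,
    which is where the combinatorial explosion of states comes from."""
    dfa = {}                       # frozenset -> {symbol: frozenset}
    queue = deque([frozenset([start])])
    while queue:
        cur = queue.popleft()
        if cur in dfa:
            continue
        dfa[cur] = {}
        for a in alphabet:
            nxt = frozenset(t for q in cur for t in nfa.get((q, a), ()))
            dfa[cur][a] = nxt
            if nxt not in dfa:
                queue.append(nxt)
    return dfa
```

In the worst case the DFA has one state per subset of NFA states, i.e.
2^n states for an n-state NFA, which is exactly when the frontier no
longer fits in RAM and the disk-based approach becomes relevant.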
-
Jan. 13, 2011
Title: State of the High Performance Computing Lab
Speaker: Gene Cooperman
-
As we start a new calendar year, this seems to be a good time to
take stock of where we are, and the roadmap going forward. I will
lead a discussion of the status and future roadmap for
DMTCP / Roomy / Geant4MT (multi-threaded), and my informal plan is for
Peter Desnoyers to similarly lead a discussion for those projects
that he is leading.
-
Dec. 9, 2010
Title: Roomy: A New Approach to Parallel Disk-based Computation
Speaker: Daniel Kunkle
-
Disk-based computation provides a major new application of disks in addition to
the three traditional uses: file systems, databases, and virtual memory.
Disk-based computation uses the many disks already available in a computational
cluster as main memory. In doing so, disk is elevated to a level normally
reserved for RAM, and applications are provided with several orders of
magnitude more working space for the same price.
Unfortunately, developing efficient parallel disk-based applications
is often a difficult task. To solve this problem, this talk introduces
Roomy, a new programming language extension and open source library for
writing parallel disk-based applications.
The core of this research is in developing a library of general algorithms and
larger applications on top of Roomy. Applications considered include: a proof
that 26 moves suffice for Rubik's Cube; general breadth-first search
algorithms; and a package for very large binary decision diagrams.
In developing these applications, the two central questions addressed are:
- What is the class of applications for which parallel disk-based computing
is practical?
- How can existing sequential algorithms and software be adapted to take
advantage of parallel disk-based computing?
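The breadth-first search style used in such disk-based work can be
sketched in RAM. The sketch below is only illustrative: it uses Python
sets, whereas the point of a system like Roomy is to replace per-state
membership probes with batched, sequential passes over parallel disks.

```python
def bfs_delayed_dedup(start, neighbors):
    """Breadth-first search with delayed duplicate detection: generate
    the entire next frontier with duplicates, then remove duplicates in
    one batch pass.  In RAM the batch pass is a set; on parallel disks
    it becomes an external sort or hash over sequential files."""
    visited = {start}
    frontier = [start]
    levels = [[start]]
    while frontier:
        # Generate all successors, duplicates included (sequential writes).
        candidates = [n for s in frontier for n in neighbors(s)]
        # Batch deduplication against earlier levels.
        frontier = [s for s in set(candidates) if s not in visited]
        visited.update(frontier)
        if frontier:
            levels.append(sorted(frontier))
    return levels
```

Deferring duplicate detection to one batch per level is what lets the
frontier live on disk: every read and write is a streaming pass.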
-
Dec. 2, 2010
Title: Using Disks-As-RAM Systems for Pattern Search?
Speaker: Mirek Riedewald
-
We will discuss the properties of a class of pattern search techniques,
focusing on data size and general access patterns. These techniques first
train data mining models from a given large data set, then go through
multiple iterations of creation and ranking of huge collections of model
summaries. The current implementation runs on the Hadoop MapReduce system,
and our goal is to explore whether disks-as-RAM approaches like Roomy would
be a good fit as well.
-
Nov. 4, 2010
Title: Functionality in the New DMTCP 1.2.0 Release
Speaker: Kapil Arya
-
DMTCP is Distributed MultiThreaded CheckPointing.
This release is a semi-major release: from
DMTCP 1.1.9 to 1.2.0. As part of the change from 1.1 to 1.2,
we will include support for the well-known GNU 'screen', which is commonly
used as a framework for all of the processes of a single programming
session. It will continue to checkpoint such software as OpenMPI,
Matlab, etc. In order to complete this work, the design of many of the
subsystems of DMTCP had to be reviewed. In one outcome, we now envision
for the 1.2.x series a new model of connection objects --- with single
unique connection ids for each connection. Many wrappers around socket
calls will be replaced by lazy creation of connection objects at the time
of a fork. (Fork and dup are the primary places where a program can create
shared file or socket descriptors.)
-
Oct. 27, 2010
Title: A Novel Approach to Prime Factorization (tentative)
Speaker: Bryant York
-
The originally scheduled talk by Kapil Arya was delayed one week due
to the surprise visit of Prof. Bryant York of Portland State University.
The title above is our own, but hopefully it fits the spirit of the talk.
-
Oct. 20, 2010
Title: Stable Parallel ILU(k) Preconditioning for Iterative Linear Solvers
Speaker: Xin Dong
-
Level-based incomplete LU factorization (ILU(k)) is a preconditioner
used for many stable iterative linear solvers. Its parallelization is
interesting for two reasons: domain-decomposition algorithms dominate even
though they are unstable; and stable parallel ILU(k) preconditioners were
not well-explored since their appearance in the 1990s. Current parallel
implementations may succeed with a small number of CPU cores, yet
fail when given more cores to work with. Our work targets
this problem. Even though this problem seems trivial at first glance,
the line of work continued off and on over five years. In this research,
we take advantage of task-oriented parallel C (TOP-C) and produce the
first efficiently parallelized ILU(k) preconditioner that maintains the
stability property.
The new algorithm is called TPILU(k). Normally, TPILU(k) returns exactly
the same preconditioned matrix as the original sequential ILU(k)
preconditioner would have returned, thus directly guaranteeing stability.
As an optimization, TPILU(k) attempts
to use a new incomplete inverse method as a fast approximation for the
original ILU(k) preconditioned matrix. This allows TPILU(k)-based solvers
to compete with fast, unstable parallel solvers, such as Euclid parallel
ILU(k) and parallel Block Jacobi ILU(k). This work is joint with
Gene Cooperman.
-
Oct. 14, 2010
Title: A Parallel Disk-based BDD Package
Speaker: Vlad Slavici
-
A Binary Decision Diagram (BDD) is a data structure which is widely
used in formal verification for model checking. Real-world computations
with BDDs often exhaust available RAM on one machine or even a cluster
of computers. Extensive BDD-related research in the past 20 years has
focused on minimizing the size of BDDs and/or making use of distributed
RAM or a single local disk to increase the available storage space for
computations. Our group has taken a new approach to developing a BDD
package: using parallel disks as main memory instead of RAM for large
computations with BDDs. The algorithms that sit at the foundation of
this BDD package will be discussed. This is joint work with Daniel
Kunkle and Gene Cooperman.
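For readers unfamiliar with BDDs, the two rules that keep them compact
(drop redundant tests, share identical subgraphs) fit in a short sketch.
This toy, RAM-resident version is our own illustration, not the package
from the talk, and it omits the memoization a real apply operation uses.

```python
class BDD:
    """Toy reduced ordered BDD.  A node is a (var, low, high) tuple;
    the terminals are the integers 0 and 1."""
    def __init__(self):
        self.unique = {}   # unique table: canonicalizes equal triples

    def node(self, var, low, high):
        if low == high:                 # rule 1: drop a redundant test
            return low
        key = (var, low, high)
        return self.unique.setdefault(key, key)  # rule 2: share equal nodes

    def var(self, i):
        return self.node(i, 0, 1)

    def apply_and(self, u, v):
        # Terminal cases.
        if u == 0 or v == 0:
            return 0
        if u == 1:
            return v
        if v == 1:
            return u
        (uv, ul, uh), (vv, vl, vh) = u, v
        if uv == vv:
            return self.node(uv, self.apply_and(ul, vl),
                                 self.apply_and(uh, vh))
        if uv < vv:   # expand only the node with the earlier variable
            return self.node(uv, self.apply_and(ul, v), self.apply_and(uh, v))
        return self.node(vv, self.apply_and(u, vl), self.apply_and(u, vh))
```

The recursive, depth-first apply shown here is exactly what is hard to
run out of core, which motivates the breadth-first, disk-friendly
algorithms of the talk.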
-
Oct. 7, 2010
Title: Status Update on DMTCP, and Reversible Debugging
Speaker: Kapil Arya and the rest of the team
-
There continue to be interesting new developments both in DMTCP
and reversible debugging. This will be a shorter talk, providing
a status update.
-
Sept. 30, 2010
Title: Hybrid main memory with DRAM and Flash
Speaker: Xiaohai Yu
-
Recent memory technology trends have enabled the use of hybrid
DRAM / non-volatile (e.g. flash) main memory systems. Current
operating systems lack support for hybrid memory, however, and will
require new algorithms in order to make effective use of them. Our work
focuses on software-level memory management support for such a hybrid
model. We have implemented a simulation model in the current Linux kernel,
and based on it we are now proposing new algorithms for the hybrid model.
-
Sept. 23, 2010
Title: The past, present and future of URDB (Universal Reversible Debugger)
Speaker: URDB team (Ana Visan, Kapil Arya, Gene Cooperman,
Tyler Denniston, and Xin Dong)
-
URDB is a reversible debugger that will soon support deterministic replay
on multi-core hardware. The main points of novelty of URDB are:
(1) the ability to do this at near native speeds on real world software.
(2) debugging compiled languages via gdb, as well as built-in debuggers
for interpreted languages such as MATLAB, python, and perl.
(3) a "reverse-watchpoint" capability that automatically uses binary
search over the process life time to locate bugs.
(4) employing a novel implementation based on an extension of the
transparent, user-level, fast checkpointing package, DMTCP (Distributed
MultiThreaded Checkpointing Package). To our knowledge, DMTCP is the first
package to checkpoint ptrace-based debuggers such as gdb.
This talk will cover the history of the URDB project:
how the project started, what were its original goals and
how these goals evolved over time. Special emphasis will be given
to the latest work on implementing determinism for multi-core architectures.
-
Sept. 16, 2010
NON-STANDARD PLACE; 166 West Village H
Title: Status and Overview of DMTCP
Speaker: Gene Cooperman
-
DMTCP (Distributed MultiThreaded
CheckPointing) is a transparent checkpointing package. It saves the
state of a running computation (which may include multi-threaded
processes on distributed computers) to a checkpoint image file.
It operates on such diverse applications as Matlab, Maple, MySQL, OpenMPI
(for parallel programming), parallel Geant4 (used in CERN large hadron
collider experiments), Gap
(Groups, Algorithms and Programming),
Sage
(distributed algebraic computation),
Gcl (GNU Common Lisp), python,
perl, bash, emacs, vim and others (and very soon
including
screen).
This talk will review the status, and provide a brief overview of DMTCP.
There are about ten currently active DMTCP developers, including two
other universities. Some of the current activities include:
- the ongoing collaboration with
the Condor
project (high throughput batch facility for pools of computers);
- collaboration with some Condor users to support Condor/DMTCP/Matlab
(which is in demand, since Condor's built-in checkpointing does not
support this configuration);
- the use of the NMI Build
and Test Lab to automate testing under a wide variety of
different Linux platforms;
- ongoing collaboration with the
OpenMPI project to support DMTCP-based checkpointing
(MPI is "Message Passing Interface", a standard for parallel computing);
- porting of DMTCP to provide checkpointing
of Infiniband programs
(a newer interconnection network widely used for high performance
computation);
- development of checkpointing for X-Windows graphics programs,
and possibly OpenGL in the future (This application is of special
interest for the
SCIRun project, a problem solving environment for modeling,
simulation and visualization of scientific problems, and particularly medical
imaging problems);
- past collaborations for checkpointing with several user groups
(a million CPU-hour application on a Cray XT supercomputer),
the use of DMTCP for the
iPlant Tree of Life project (see here
for more information), the use of DMTCP in
HOL Light
(Higher Order Logic: mathematical theorem proving), and other users.
-
Apr. 22, 2010
Title: Transforming a Checkpoint Image into an ELF Executable
Speaker: Amruta Chougule
-
DMTCP is a checkpointing application. When a user executes a checkpoint command,
a checkpoint thread associated with the process writes the process state to a
file. The DMTCP restart routine is used to restart the process using the
checkpoint file. This checkpoint image can be converted into an ELF binary,
so that one can directly run the resulting binary as a self-extracting
executable to resume the process later on.
This work is joint with Shobhit Agarwal.
-
Apr. 8, 2010
Title: The Role of the Flash Translation Layer in NAND Flash
Speaker: Yerong Chen
-
NAND Flash has been enjoying wide popularity due to its beneficial
characteristics, e.g. lighter weight, lower power consumption, shock
resistance, faster access time and non-volatility. But unlike traditional
block disks, data in flash cannot be overwritten. Instead, a time-consuming
erase operation on the used blocks must precede any rewrite.
To hide these internal characteristics and increase the performance and
lifespan of flash, the Flash Translation Layer (FTL) is introduced. My talk will
compare different existing FTLs according to the address mapping scheme they
employ. I will analyze their performance for certain operations and see what
we can learn from their design.
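The simplest of the address mapping schemes, page-level mapping, can be
illustrated with a toy model (our own sketch, not any production FTL):
every overwrite of a logical page is redirected to a fresh physical page,
and the mapping table hides the redirection from the host.

```python
class PageMappedFTL:
    """Toy page-level FTL.  Flash pages cannot be overwritten in place,
    so each write to a logical page goes to the next free physical page
    and the mapping table is updated; the old physical page becomes
    garbage for a (not modeled here) garbage collector to erase."""
    def __init__(self, num_physical_pages):
        self.mapping = {}                        # logical page -> physical page
        self.flash = [None] * num_physical_pages
        self.next_free = 0

    def write(self, lpage, data):
        if self.next_free >= len(self.flash):
            raise RuntimeError("out of free pages: GC/erase needed")
        self.flash[self.next_free] = data        # program a fresh page
        self.mapping[lpage] = self.next_free     # remap; old page is now stale
        self.next_free += 1

    def read(self, lpage):
        return self.flash[self.mapping[lpage]]
```

Block-level and hybrid schemes trade the large RAM footprint of this
per-page table for more complicated merge operations, which is the
performance trade-off the talk compares.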
-
Mar. 25, 2010
Title: A Parallel Disk-based Binary Decision Diagram Package
(and short talk on: "Converting a DMTCP Checkpoint Image to
a Binary Executable")
Speaker: Dan Kunkle and Vlad Slavici
(and Amruta Chougule for short talk)
-
A Binary Decision Diagram (BDD) is a data structure which allows
representing boolean formulas in a compact way. BDDs are mainly used in
formal verification (model checking) as well as in other areas. Donald
Knuth's upcoming Volume 4 of "The Art of Computer Programming" is mainly
dedicated to BDDs (and their close relatives).
Even though more compact than other boolean formula representations, BDDs
often grow to sizes too large for one RAM subsystem or even parallel
RAM. Furthermore, BDD operations have proved difficult to parallelize.
We propose a parallel disk-based BDD package which aims to overcome the
space limitations, as well as taking a different approach to parallelism
compared to existing parallel BDD packages. The approach is based on
BFS algorithms.
SHORT TALK: Amruta Chougule will present a brief overview of the steps
in using a knowledge of ELF to convert a DMTCP checkpoint image into
an ELF executable. Questions are welcome. This is preparatory
to a longer talk later in the semester, which will provide further details.
This work is joint with Shobhit Agarwal.
-
Mar. 11, 2010
Title: Transparent Access to Remote GPUs using Virtualization
Speaker: Dana Schaa (ECE at N.U.)
-
A major issue for multi-GPU development is gaining access to the
desired GPU hardware. Factors such as available bus slots, power
supply, and cost all limit GPU resources on a workstation or laptop.
The current GPU programming paradigm needs to be considered as well,
and has ramifications on alternatives related to distributed
programming. To alleviate this problem, we are planning a
virtualization framework to provide transparent access to remote GPUs.
The framework will allow an OS running on any system (even a laptop)
to believe that virtually limitless GPU resources are directly
available to it, while being completely transparent to the programmer
and programming model.
This is a new research project. So questions and suggestions would be
highly appreciated.
-
Feb. 18, 2010
NON-STANDARD PLACE; 147 Ryder Hall
Title: Write Endurance in Flash Drives: Measurements and Analysis
Speaker: Marcela Simona Boboila
-
In recent years flash memory has entered widespread use, in embedded
media players, photography, portable drives, and solid-state disks
(SSDs) for traditional computing storage. Flash memory has become the
first competitor to magnetic disk storage to gain significant
commercial acceptance. However, it differs from disk in many
characteristics; one of particular importance for the design of
storage systems is its limited write endurance: bits in a flash chip
fail after a limited number of writes, eventually rendering the device
unusable.
In this work, we investigate internal characteristics of flash
devices. The focus is not only measurement results, but the underlying
factors at the level of chips, algorithms in use, and achievable
algorithms which determine the endurance of a device. We
reverse-engineered a few flash drives, and found a close correlation
between measured whole-device endurance, and predictions based on chip
endurance values and internal algorithms. We present methods based on
analysis of operation latency which provide a non-intrusive mechanism
for determining internal parameters. We discuss implications for
storage systems, and in particular the necessity of new scheduling
algorithms designed specifically for flash, to extract maximum
performance.
-
Feb. 4, 2010
Title: URDB: Universal Reversible Debugger
Speaker: Ana Visan and Kapil Arya
-
This week's talk will consist of two shorter talks. In the first
half, Ana will talk about URDB. In the second half, Kapil will
describe the current set of DMTCP bugs (all presumed to be relatively small)
prior to having DMTCP validated as the checkpointer for
Condor. A full
abstract follows for the first half.
URDB is a universal reversible debugger that adds reversibility to almost any
debugger through a python-based wrapper. URDB is still experimental. URDB
supports multi-threaded and multi-process applications, as well as distributed
computations. It currently has four personality modules, which support: gdb;
MATLAB; python (pdb); and perl (perl -d). URDB gains its reversibility through
the use of DMTCP (Distributed MultiThreaded CheckPointing), a fast
checkpoint-restart package.
URDB also supports reverse expression watchpoints, a form of temporal search
within a process lifetime. For example, the current value of a user specified
expression indicates a bug. The user asks URDB/gdb to go back in time to a
statement where the expression is about to take on its current value. This uses
binary search: if n statements have been executed, URDB finds the point in time
using only log(n) evaluations of the expression.
The experimental software is available at:
http://urdb.sourceforge.net/.
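The reverse-expression-watchpoint search can be sketched abstractly.
In this illustration (our own, not URDB source code), eval_at(t) stands
for "restart from a checkpoint, re-execute to statement t, and evaluate
the expression"; the monotonicity assumption (false before the bug,
true from it onward) is what makes binary search apply.

```python
def reverse_watchpoint(eval_at, n):
    """Binary search over a process lifetime of n statements.
    Assuming the watched expression is false before some statement t*
    and true from t* on, this finds t* with only O(log n) calls to
    eval_at, each of which re-executes part of the program."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if eval_at(mid):
            hi = mid          # expression already true at mid
        else:
            lo = mid + 1      # still false: the change is later
    return lo
```

With n statements executed, roughly log2(n) re-executions suffice,
which is why fast checkpoint and restart times make the search practical.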
-
Jan. 28, 2010
Title: Fast Multiplication of Large Permutations
for 2-level Memory Hierarchies
Speaker: Vlad Slavici
-
Permutation multiplication (or permutation composition) is an operation
which allows for a very simple, traditional implementation:
for i in {0 ... N-1} do Z[i] = Y[X[i]]
For 2-level memory hierarchies such as RAM/disk, RAM/flash, or
parallel RAM/parallel disk, this algorithm is infeasible for large permutations,
because of the penalty incurred by a random access to an array stored on the
lower-level memory. New algorithms for permutation multiplication
that are feasible for the 2-level memory hierarchies mentioned above will be
presented.
Surprisingly, the traditional algorithm in its multi-threaded version
is not the fastest one for large permutations (at least 128 MB), even
when the 2-level memory hierarchy is cache/RAM on a many-core machine.
A new multi-threaded algorithm will be presented, which outperforms the
traditional
algorithm when running on a machine with at least 8 cores. In a special
regime, the new algorithm outperforms the traditional one for permutations
as small as 16 MB. Strong evidence that for future commodity machines
the new algorithm will be faster than the traditional one will also be
presented.
(This is joint work with Xin Dong, Daniel Kunkle and Gene Cooperman.)
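One standard way to make the loop above friendly to a 2-level hierarchy is
to bucket the random accesses so that every pass over the data is
sequential. The sketch below is our own illustration of that access-pattern
transformation in RAM, not the new algorithm from the talk.

```python
def multiply_naive(X, Y):
    # Z[i] = Y[X[i]]: each iteration makes a random access into Y.
    return [Y[x] for x in X]

def multiply_bucketed(X, Y, num_buckets=4):
    """Two sequential passes instead of one random-access pass:
    first bucket the requests (i, X[i]) by which contiguous block of Y
    they hit, then satisfy each bucket while visiting one block of Y.
    On a RAM/disk hierarchy, each pass streams the data sequentially."""
    n = len(X)
    block = (n + num_buckets - 1) // num_buckets  # block size of Y
    buckets = [[] for _ in range(num_buckets)]
    for i, x in enumerate(X):                 # pass 1: sequential scan of X
        buckets[x // block].append((i, x))
    Z = [0] * n
    for reqs in buckets:                      # pass 2: one block of Y at a time
        for i, x in reqs:
            Z[i] = Y[x]                       # access stays within one block
    return Z
```

In a real external-memory version each bucket would be a sequential file
and each block of Y would be loaded once; here Python lists stand in for
both, so only the access pattern is faithful.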
-
Jan. 21, 2010
Title: Organizational Meeting for Spring, 2010
Speaker: Gene Cooperman (and others)
-
This will be an organizational meeting in which we can each
report what we are currently working on. Gene will lead the
seminar.
I apologize that I am most familiar with the work of my own students.
Hence the list of short topics below is missing the important
status and contributions of Peter Desnoyers's students
(e.g. congratulations on the FAST publication), and whoever else attends.
Among the short topics in my own group are:
ISSAC work (symb. alg.), potential collaboration w/ Moreno Maza and Xie.
Future of project w/ Mueller (J_4)
Success of Geant4MT (multi-threaded project), and future
Plans for URDB/DMTCP
Roomy has now gone beta
Other??
Other groups
Other topics
-
Friday, Dec. 18, 2009 (3:00 p.m.: NON-STANDARD TIME; 366 WVH)
Title: Brief Report on How to Obtain Full Scalability for Multi-Threaded Geant4;
and Brief Report on Dagstuhl seminar
Speaker: Xin Dong and Gene Cooperman
-
This Friday at 3:00 p.m. (note unusual time) we will have two brief reports
to mark the end of the semester for those who are still on campus.
Xin Dong will report on his final solution to obtain full linear
scalability for multi-threaded Geant4, and how he found it. Gene Cooperman
will report on the Dagstuhl meeting, Search Graph Engineering.
-
Nov. 11, 2009
Title: Computing Cylindrical Algebraic Decomposition via
Triangular Decomposition
Speaker: Marc Moreno Maza (U. Western Ontario)
(Joint work with Changbo Chen, Bican Xia and Lu Yang)
-
Cylindrical algebraic decomposition is one of the most
important tools for computing with semi-algebraic sets,
while triangular decomposition is among the most important
approaches for manipulating constructible sets. In this talk,
for an arbitrary finite set $F \subset \mathbb{R}[y_1, \ldots, y_n]$
we apply comprehensive triangular decomposition in order to obtain an
$F$-invariant cylindrical decomposition of
the $n$-dimensional complex space, from which we extract
an $F$-invariant cylindrical
algebraic decomposition of the $n$-dimensional real space.
We report on an implementation of this new approach for constructing
cylindrical algebraic decompositions.
-
Nov. 4, 2009
Title: Leveraging Social Networks in Information Systems
Alan Mislove
-
Recently, online social networking sites have exploded in popularity;
numerous sites are dedicated to exchanging end-user generated content,
such as photos on Flickr, videos on YouTube, and status updates on
Twitter. These sites represent a new type of information system,
where links between users (as opposed to links between content items)
are the primary manner in which information is structured. But, the
enabling of all users to become publishers is resulting in a deluge of
data that only makes the fundamental problems of finding relevant and
trustworthy content even harder. The social networks that these sites
are based on, however, offer a potential solution, because the
relationships among users in the social network may indicate the
relevance and trustworthiness of content.
In this talk, I provide an overview of my research in this space.
First, I detail my work on the measurement and analysis of real-world
networks, focusing on techniques to infer hidden attributes of social
network users. Second, I describe approaches leveraging social
networks to solve some of the challenges facing information systems.
I present Ostra, a novel system that prevents unwanted communication
by leveraging the relationships among users in a social network. I
conclude with a brief overview of ongoing research projects.
-
Oct. 29, 2009
Title: The URDB reversible debugger: Visions for the Future
Gene Cooperman
-
URDB (Universal Reversible Debugger) has now been released on
sourceforge. That URL
also has a pointer to a technical report at arxiv.org. While this
pre-alpha release of URDB still has many bugs, it already opens
up a vision for a future with temporal debugging, time-traveling
program-based introspection, and program-controlled speculative execution.
This talk will discuss these future possibilities.
-
Oct. 22, 2009
Title: Parallel Dense Polynomial Arithmetic on Multi-Cores
Speakers: Marc Moreno Maza (U. Western Ontario) and Yuzhen Xie (MIT)
Contact info (usernames): moreno and yxie
Contact info (hostnames): csd.uwo.ca and csail.mit.edu
-
We aim at multicore-enabling FFT-based
dense polynomial arithmetic over finite fields.
We show that balanced input data can maximize
parallel speedup and minimize cache complexity
for bivariate multiplication.
We present effective techniques to reduce
multivariate (and univariate) multiplication to
balanced bivariate multiplication.
This parallel multiplication also provides an efficient kernel
for fast division and normal form computation
so that composition of multiple levels of parallelism succeeds.
Our implementation in Cilk++ demonstrates good speedup
on multi-cores.
-
Oct. 15, 2009
Title: Roomy, One Year Later: A Language for Using High-Latency Storage
as Main Memory
Daniel Kunkle
-
A complete overview of Roomy was first presented in this seminar almost
exactly a year ago, with an update almost exactly half a year ago.
Roomy is now moving toward a formal beta release.
This talk will present a self-contained overview, and then tell
what's new in the evolution of Roomy. The previous abstract follows.
In our continuing effort to use "Disk as the New RAM", we are
developing a new programming language: Roomy. Implemented as a library
for C/C++, Roomy provides several highly scalable data structures,
which can grow to many terabytes in size. These data structures can be
stored on many disks, across a cluster of computers, but the
complexities of parallelism and of efficient disk-based operations
are hidden from the programmer.
This talk will cover: some of the underlying structure of Roomy; the
high-level data structures and API; and an example application using
Roomy.
-
Oct. 8, 2009
Title: Empirical Measurement of NAND Flash (followed by a separate
short talk)
Peter Desnoyers; and short talk by Vlad Slavici, Xin Dong and Gene Cooperman
-
Everyone knows that flash memory wears out. But how quickly?
The only people who know aren't talking, so we tested a bunch of chips
and have some surprising results.
The separate short talk will be led by:
Vlad Slavici, Xin Dong and Gene Cooperman. They will discuss their
efforts to find a faster permutation multiplication algorithm,
with striking successes on one machine, and a disappointing failure
on another machine. The audience will be invited to help diagnose
the performance bug on the second machine.
-
Oct. 1, 2009
TITLE: URDB: A Universal Reversible Debugger
Ana Visan
-
While there is a long history of reversible debuggers, such examples
were always specific to some particular debugger. This talk
presents URDB (A Universal Reversible Debugger), which adds reversibility
to almost any debugger by wrapping it. Source code for the
debugger is not required. Reversible debuggers have been demonstrated for
four widely used debuggers: MATLAB, python, perl, and gdb. Each new reversible
debugger requires modifying 200 lines of code to specify the target
debugger commands and output format. URDB is based on DMTCP
(Distributed MultiThreaded CheckPointing). URDB was made practical
by DMTCP's ability to checkpoint a target debugging session in about 1.5
seconds, and to restart in about 0.5 seconds.
This talk will also cover our experience of working
in a team of six people under deadline pressure.
-
Sept 24, 2009
TITLE: Permutation Multiplication and High-Performance Computing
Vlad Slavici
-
The naive way to compose (multiply) two permutations is to represent
the first one as an array X, the second one as an array Y and write this
piece of code:
for (i = 0; i < N; i++){
Z[i] = Y[X[i]];
}
Depending on whether the permutations fit or don't fit in RAM, on the
architecture (single-core, multi-core or cluster), on the length of
the L2 cache line (and very likely other parameters which we are not
going to discuss), the naive algorithm is sometimes outperformed by an
algorithm that uses sequential access patterns and is aware of cache/RAM
locality. Other times, the naive algorithm is more efficient.
We discuss the ways in which this simple exercise brings out important
properties of current architectures and how the Roomy library (created
by Daniel Kunkle) provides a natural translation between the naive and
advanced algorithm with no effort from the programmer.
-
Sept 17, 2009
TITLE: Organizational Meeting
Gene Cooperman
-
We will discuss the organization of the seminar for the
Fall, and Gene will then lead a discussion of current activities,
interests, and goals of the seminar members.
-
May 20, 2009
TITLE: DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop
Jason Ansel
-
This is a special seminar. Jason Ansel will give a presentation based
on his talk to be given at IPDPS in Rome on Tuesday, May 26.
This is joint work with Kapil Arya and Gene Cooperman.
-
Apr. 2, 2009
TITLE: Roomy enough for Data Mining?
Speaker: Mirek Riedewald
-
We want to explore if Roomy can be used to simplify
implementation of scalable data mining algorithms. This will be a working
seminar. We will start with a brief summary of existing sequence mining
algorithms (Mirek) and a summary of Roomy's features (Dan, Gene). Then we
will explore using Roomy for sequence mining and maybe other mining tasks.
-
Mar. 26, 2009
TITLE: Permanent denial of service attacks on flash: Forcing fast wear out
Speaker: Simona Boboila
-
Flash memory is a non-volatile, electrically erasable programmable read-only
memory. Its main restriction is the limited number of write/erase cycles
(typically 10,000 - 1,000,000) that it can sustain, after which it wears out,
becoming unreliable. To cope with this limitation, most flash devices have an
intermediate software layer, called Flash Translation Layer, that maps logic
locations to empty physical locations transparently. The mapping is used to
distribute the erase operations among all physical sectors and thus increase
the lifetime of each sector. We are investigating methods to carry out a
permanent denial-of-service attack on a flash device by bypassing the Flash
Translation Layer and wearing out the device.
-
Mar. 19, 2009
TITLE: The Compressed Skycube
Speaker: Donghui Zhang
-
This talk presents the SIGMOD'06 paper on the Compressed Skycube (CSC).
The CSC supports the skyline queries on an arbitrary set of attributes,
e.g. considering (#points, #rebounds, #assists), who are the best NBA
players? Here a player is one of the best NBA players if there does not
exist another player who is better in all dimensions. Before CSC, one
would either sacrifice space and update cost (pre-compute and maintain all
skylines), or sacrifice query cost (compute skylines at query time).
The CSC presents a solution that is efficient in both aspects.
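For concreteness, here is the brute-force skyline test that the CSC is
designed to avoid recomputing at query time (a sketch of the standard
dominance definition, assuming larger values are better, as in the NBA
example above).

```python
def skyline(points):
    """Return the points not dominated by any other point, where q
    dominates p if q is at least as good in every dimension and
    strictly better in at least one (larger is better here)."""
    def dominates(q, p):
        return (all(a >= b for a, b in zip(q, p)) and
                any(a > b for a, b in zip(q, p)))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

This all-pairs check is quadratic in the number of points and must be
repeated for every subset of attributes a query may name, which is the
query-time cost that the compressed, precomputed structure amortizes.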
-
Mar. 12, 2009
TITLE: Parallel Disk-Based Coset Enumeration: Problems and Possible Solutions
Speaker: Vlad Slavici
-
Coset enumeration is a problem related to group enumeration. The
idea is to enumerate a group whose elements have no known unique
representation. Instead, the input is given as a set of relations between
generators which hold for all elements of the group (each relation
describes a cycle in the final directed graph of the group). The output
is a unique representation for the elements of the group (in this case
a permutation representation).
The algorithm that we have now presents some problems which make it
inefficient. Generating new elements in breadth-first order requires
a very large number of steps to reach a reasonable size (hundreds of
gigabytes), since the real branching factor seems to be close to 1
(maybe 1.1). Moreover, each time a new level is generated by the BFS,
a slow external sort is required.
We are currently investigating the rate at which duplicates are generated,
and how many steps are necessary until all duplicates from one step are
processed. Duplicate detection is layered - each layer takes long to process.
There are some potential solutions which will be presented as well.
-
Feb. 26, 2009
Title: TOPOFIT-DB: A Collaboration in Bioinformatics
Speaker: Xin Dong
-
This talk describes some of the real-world issues that one encounters
in collaborating with researchers in bioinformatics. In particular,
the application drives the research, and not computer science techniques.
This is because the goal is scientific results, not new computer science.
For example, missing some cases is fine, as long as the database collects
a large sample of the interesting cases. Better to find only some
interesting cases, but find them first --- rather than find all the
interesting cases, but find them second. Or, as
my collaborator in bioinformatics has said:
I have a fishing boat, and I don't want you to build a rocket on
top of it. I just want something that will help me
catch more fish.
-
Feb. 19, 2009
Title: Scalable techniques for data management and analysis
Mirek Riedewald
-
Most sciences already are producing an abundance of data, and analyzing this
rapidly growing wealth of information has become a major challenge. This
provides an exciting opportunity for developing novel approaches that will have
an impact both in computer science as well as in the domain sciences. I am
interested in developing techniques for distributed data analysis, for mining
observational data, for making realistic scientific simulations feasible
through data mining, and for real-time processing of massive data streams.
-
Feb. 12, 2009
Title: Results on checkpointing ptrace with applications to GDB
Speaker: Ana M. Visan
-
Abstract: We are now close to being able to checkpoint ptrace.
We will discuss this architecture and the implications for
checkpointing gdb.
-
Feb. 5, 2009
Title: Progress Report on DMTCP
Kapil Arya, Gene Cooperman, Dan Kunkle, Praveen Solanki, Ana Visan
-
This week will concentrate on an overview and discussion of the current
status and progress on DMTCP (http://dmtcp.sourceforge.net). Some of the
areas of interest are:
checkpointing X-Windows (via TightVNC), checkpointing ptrace (which
opens up checkpointing of gdb), checkpointing of the various shells
(tcsh, bash, dash), pid virtualization (coming, at last), checkpointing
of screen, testing of DMTCP in Condor, testing of DMTCP with SCIRun,
and other topics as they come up.
-
Jan. 29, 2009
Title: Discussions with Marc Snir
Marc Snir
-
Marc Snir will give the Distinguished Speaker Series talk on the same day
in the afternoon.
In addition, he will be available for informal discussions from
10:30 - 12:30 in 366 WVH.
-
Jan. 15, 2009
Title: State of the Lab
Gene Cooperman, Peter Desnoyers, and others
-
We will review the ongoing research projects, recent successes,
and future challenges.
-
Dec. 4, 2008
Title: Group-theoretic Algorithms for Matrix Multiplication (cont.)
Emanuele Viola and Gene Cooperman
-
This continuation of the previous lecture will be jointly led by
Emanuele and Gene. It will be a working session, in which we try
to gain intuition about what types of groups are most likely to be
useful for a faster matrix multiplication algorithm. Recall that we
wish to find groups G with sets S_1, S_2 and S_3 (indexing rows and
columns in our matrices) satisfying the
triple product property for correctness. We also require for efficiency
that |S_1||S_2||S_3| >> \sum d_i^3.
The sum of the cubes of the character degrees d_i of a group is given
by the GAP function:
MySumOfCubes := group -> Sum(CharacterDegrees(group), pair->pair[2]*pair[1]^3);
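For small candidate sets, the triple product property itself can be checked by brute force. A hedged Python sketch for the cyclic group Z_n, written additively (purely illustrative: the real searches use GAP, and abelian groups are known not to yield fast matrix multiplication):

```python
from itertools import product

def tpp_holds(n, S1, S2, S3):
    """Brute-force check of the triple product property in the
    cyclic group Z_n: for quotients q_i = s - s' with s, s' in
    S_i, q1 + q2 + q3 = 0 must force q1 = q2 = q3 = 0."""
    def quotients(S):
        return {(a - b) % n for a in S for b in S}
    for q1, q2, q3 in product(*map(quotients, (S1, S2, S3))):
        if (q1 + q2 + q3) % n == 0 and (q1, q2, q3) != (0, 0, 0):
            return False
    return True
```

For example, `tpp_holds(3, {0, 1}, {0, 1}, {0, 1})` fails because the quotients 1 + 1 + 1 sum to 0 mod 3 without all being zero.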
Gene will add some group-theoretic background to the discussions.
For those wishing to "read ahead", one can look at:
Character theory, and
Fourier Transforms on Finite Groups.
Both topics will be covered during the seminar. One can also gain
further background by reading the chapter on fast polynomial multiplication
in Cormen, Leiserson, Rivest, and Stein.
-
Nov. 20, 2008
Title: Group-theoretic Algorithms for Matrix Multiplication
Emanuele Viola
-
We present the group-theoretic approach
to matrix multiplication by Cohn and Umans, later
joint with Kleinberg and B. Szegedy. This approach
gives algorithms for matrix multiplication that
run in time n^2.41, which is close to the
current record of n^2.376 by Coppersmith-Winograd.
Under certain group-theoretic conjectures, the
approach gives algorithms that run in the optimal
time n^2.
-
Nov. 13, 2008
Title: Multi-Threaded Geant4: A General Approach to Converting Sequential
Processes to Multi-Threaded Processes
Xin Dong
-
Our latest experiments show that our newest implementation of
multi-threaded Geant4 reduces per thread memory consumption
greatly. Copy-on-write semantics (child processes forked from
a single parent) uses 380MB total for one master and three worker
processes. Multi-threaded fullCMS now consumes 240MB for one master and
three worker threads. Copy-on-write uses 70MB per worker process and
multi-threaded fullCMS uses 22MB per worker thread.
After a brief review of Geant4, we discuss our ideas for a general
approach to converting many single-threaded processes into a single
multi-threaded process with good data sharing. This is important
for future many-core CPUs to avoid thrashing between CPU cache
and RAM.
-
Nov. 6, 2008
Title: Using Parallel Disks for Coset Enumeration
Vlad Slavici
-
Coset enumeration is a mathematical algorithm similar to the construction
of certain finite automata. It is used to construct large
permutation groups. As one scales the algorithm, the memory usage
grows well beyond the available RAM. We discuss how to extend the algorithm
by using parallel disks to construct very large permutation groups.
-
Oct. 30, 2008
Title: Roomy: A Language for Using High-Latency Storage as Main Memory
Daniel Kunkle
-
In our continuing effort to use "Disk as the New RAM", we are
developing a new programming language: Roomy. Implemented as a library
for C/C++, Roomy provides several highly scalable data structures,
which can grow to many terabytes in size. These data structures can be
stored on many disks, across a cluster of computers, but the
complexities of parallelism and efficient disk-based operations
are hidden from the programmer.
This talk will cover: some of the underlying structure of Roomy; the
high-level data structures and API; and an example application using
Roomy.
Because Roomy is still in the early stages of development, feedback
and suggestions during the presentation will be valuable and
encouraged.
-
Oct. 23, 2008
Title: An Architecture for Checkpointing GDB Using DMTCP (cont.)
Praveen Solanki and Ana Visan
-
This is a continuation from the previous week.
-
Oct. 16, 2008
Title: An Architecture for Checkpointing GDB Using DMTCP
Praveen Solanki and Ana Visan
-
This will be a working session in which we all try to look for
advantages and pitfalls of the architecture that they propose.
This will allow them to refine their blueprint for checkpointing gdb.
-
Oct. 9, 2008
Title: Checkpoint-Restart: Progress Report
Kapil Arya, Gene Cooperman, Praveen Solanki, and Ana Visan
-
The DMTCP checkpoint-restart package is now becoming more mature.
It successfully checkpoints and restarts a set of 20 common
interactive applications, as well as computation-intensive packages
such as MPI.
As one example, while we cannot yet checkpoint gdb (see below),
one can now attach restarted processes using gdb.
This is important in cases where a bug appears only after many
minutes, hours or days. One can checkpoint a long-running process
every minute, and then restart under control of gdb from the last
checkpoint before the bug. This speeds up the debug cycle during
repeated gdb sessions.
As a second example, there is progress toward the important
goal of being able to directly checkpoint ongoing gdb sessions.
This will provide a platform for treating gdb sessions similarly
to version control systems, including multiple snapshots of the
state of a gdb debugging session, while maintaining different branches.
It also provides what is colloquially called a "reversible debugger".
-
Oct. 2, 2008
Title: Flash: New Directions in Storage
Peter Desnoyers
-
This talk will highlight important research opportunities in flash storage.
-
Sept. 25, 2008
Title: Overview of ongoing projects (cont.)
Various speakers
-
Series of 5 or 10-minute overviews of ongoing projects.
-
Sept. 18, 2008
Title: Overview of ongoing projects
Various speakers
-
Series of 5 or 10-minute overviews of ongoing projects.
-
May 8, 2008
Title: Summing up for this academic year
Gene Cooperman
-
We will use this to sum up where we are as a working group, and how
to organize best for the remaining opportunities.
-
May 1, 2008
Title: Graceful Degradation of Compact Hash Arrays
Peter Dillinger
-
I'm working on using lossless sets of hash values to
over-approximate sets. I use these rather than inherently
probabilistic structures so that I can rehash the sets to accommodate
more elements in the same amount of space. Further, I want to be able
to rehash within the same space itself (rather than allocating a new
copy and copying all the elements).
A data structure by John G. Cleary[1] is clearly capable of this, but
making the structure reasonably fast but also compact and flexible
while supporting these kinds of rehashing presents some interesting
challenges.
A data structure of mine, the Hash Pyramid, appears to be capable of
this kind of rehashing except that (1) it would be much slower to
rehash and (2) it's not clear whether it can rehash in place to a
k=1 Bloom filter, as the Cleary structure can.
I plan to discuss these challenges and get some brainstorming going.
[1] John G. Cleary: Compact Hash Tables Using Bidirectional Linear
Probing. IEEE Trans. Computers 33(9): 828-834 (1984)
http://ieeexplore.ieee.org/iel5/12/35240/01676499.pdf
-
Apr. 23, 2008
Title: Report on Trip to CERN
Gene Cooperman
-
In the visit to CERN, Xin Dong and I proposed a software collaboration
based on: (1) automatic multi-threaded parallelism; and
(2) checkpointing. This was part of the Workshop on Multi-Core
Computing and Virtualization in the LHC. Those two topics are
WP8 and WP9 in the planned software efforts at CERN. We also heard
a presentation from the CMS group, in which they described a 2 million
line program with about 1,000 packages, and 900 shared libraries
(of which 400 are present in a given application). At CERN, they
have discovered new limits on software scalability not typically
seen elsewhere (although they anecdotally report that Firefox/Mozilla
is having similar experiences). Two issues are: (1) the time
to link so many shared libraries; and (2) a 1.5 MHz malloc
rate due to use of STL/C++ code from non-C.S. specialists.
-
Apr. 17, 2008
Title: Dry Run of Thesis Defense
Eric Robinson
-
I'll be doing a dry run of my thesis defense at the systems seminar this
week, April 17th, 1:30-2:30PM. I'd appreciate as much feedback as I can get.
-
Mar. 27, 2008
Title: Linear Algebra for Graph Algorithms: Betweenness Centrality
Eric Robinson
-
With the advances that have been made with modern mathematical software and
specifically with sparse matrix representation and storage, representing a
graph as an adjacency matrix is no longer space-inefficient. It is now
possible to effectively compute over large graphs using software such as
Matlab, Maple, Mathematica, etc.
To demonstrate this, the ease of implementing a few of the basic graph
algorithms using an adjacency matrix representation is shown. After this, a
more complicated algorithm, betweenness centrality, is examined.
Betweenness centrality seeks to compute the importance of each node in a
graph by determining the number of shortest paths that it lies on. The
algorithm's implementation using an adjacency matrix graph representation
and only basic matrix and vector operations, such as matrix multiplication
and element-by-element operations, is presented.
The ability to represent complex graph algorithms such as this as linear
algebra problems in Matlab, etc. provides a reduced development time in many
cases. The resulting code is significantly smaller. In addition, it
provides access to "free" parallel versions of the algorithms with software
such as pMatlab, starP, and parallel Matlab.
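The path-counting core of the matrix formulation follows directly from the adjacency matrix: entry (i, j) of A^k counts walks of length k, and any walk whose length equals the distance must be a shortest path. A minimal pure-Python illustration (not the talk's Matlab implementation; full betweenness centrality additionally requires a dependency-accumulation pass over these counts):

```python
def shortest_path_counts(A):
    """Distances and shortest-path counts from the adjacency
    matrix alone: entry (i, j) of A^k counts walks of length k,
    and a walk whose length equals the distance is shortest."""
    n = len(A)
    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]
    dist = [[-1] * n for _ in range(n)]
    count = [[0] * n for _ in range(n)]
    P = [[int(i == j) for j in range(n)] for i in range(n)]  # P = A^0
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if P[i][j] > 0 and dist[i][j] < 0:
                    dist[i][j] = k       # first power reaching j
                    count[i][j] = P[i][j]
        P = matmul(P, A)
    return dist, count
```

In Matlab or pMatlab the inner matrix product would be a single sparse multiply, which is where the "free" parallelism comes from.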
-
Mar. 20, 2008
Title: A New Approach to Direct Condensation using Parallel Disk as Primary
Storage
Ana Maria Visan
-
Condensation is a method that reduces the size of the object to be
analyzed, while guaranteeing that enough information is retained to
remain useful.
Condensation is a critical technique for the Modular Atlas project,
which analyzes modular matrix representations for finite groups.
For large simple groups, these representations are more efficient
computationally than the next alternative: permutation representations.
A new approach to direct condensation is presented that uses parallel
disk as primary storage. The running time of the new approach is O(SN),
where N is the size of the orbit being
condensed and S the size of the generating set. Previous algorithms had
a running time of O(LSN), where L is the landmark ratio; thus the new
algorithm is an improvement over its predecessors.
Using the new disk-based approach, we produce condensation matrices for
the sporadic simple Janko Group J4 on an orbit of size 131,358,148,251.
Each point of the orbit is a 4-dimensional subspace in a vector space of
dimension 112 over GF(2).
The new algorithm demonstrates a condensation of a group whose orbit is
10 times larger than the largest previous such group for which direct
condensation was applied.
-
Mar. 12, 2008; 12 noon (unusual time, place and day)
212 West Village H
Endomorphism rings and their character tables
Jürgen Müller
-
Starting points are the notions of distance-transitivity and -regularity,
as well as the Ramanujan property, in algebraic graph theory. These are
closely connected to the action of the graph automorphism group. From
that point of view, graphs appear as orbital graphs associated with a
permutation action of a given finite group.
Given a permutation module, representation theory tells us to consider
its endomorphism ring, which similar to the group case turns out to have
a character table. The latter finally contains information about purely
graph theoretic properties, like the ones mentioned above, of the
associated orbital graphs.
Together with T. Breuer we have built a database containing the
character tables of the endomorphism rings of the multiplicity-free
permutation actions of the sporadic simple groups. We report on this,
as well as on the computations necessary to compile the data. In
particular, to tackle the large cases, together with M. Neunhoeffer
and R. Wilson we have invented a new general technique to enumerate
very large orbits.
-
Mar. 6, 2008
"Proving 25 Moves Suffice for Rubik's Cube; Disk-based Computation Wins Again
(PART TWO)"
Daniel Kunkle
-
This is a continuation of last week's talk. Having heard some of the
background, we will now hear about the algorithmic improvements needed to show
that 25 moves suffice.
-
Feb. 28, 2008
"Proving 25 Moves Suffice for Rubik's Cube; Disk-based Computation Wins Again"
Daniel Kunkle
-
Last year, we proved that 26 moves suffice to solve Rubik's cube,
beating the previous best upper bound on solution lengths of 27. This
was accomplished with two separate computations: a disk-based breadth
first search over 1.4 trillion states, which proved 29 moves
sufficient; and a large trivially parallel RAM-only computation to
refine the upper bound to 26.
This presentation will describe our continuing work to further refine
the upper bound to 25 (expected to complete in the next one to two
months). To accomplish this, we needed to design a new disk-based
version of the existing RAM-based refinement method. This example of
transforming an existing RAM-based computation into a disk-based
computation is representative of many such efforts in our research
group, and will serve as a general guide for the process.
-
Feb. 21, 2008
"Introduction to Distributed Multi-Threaded Checkpointing (DMTCP)"
Kapil Arya
-
DMTCP (Distributed Multi-Threaded CheckPointing) is a user
level transparent checkpointing tool for distributed, multithreaded
applications. DMTCP can also be used for debugging processes
as well as for process migration. It requires no modification of the
kernel or of application binaries, and is based on TCP
sockets, which gives it full generality.
There are many subtleties to designing a package with broad
coverage over such varied parallel libraries as OpenMPI, MPICH-2,
SciPy (python), along with parallel applications such as
ParGeant4 (a Monte Carlo simulation), etc.
Some of the issues to be discussed are: the overall checkpointing
procedure, including checkpointing of pseudo-terminals, sockets,
shared memory, etc.; and scalability.
-
Feb. 14, 2008
"Overview of the NVIDIA G8 Architecture and Design Issues"
Gene Cooperman
-
This talk will review the NVIDIA G8 architecture with an eye toward
deducing the tradeoffs and design issues that were faced by the
designers of the NVIDIA G8 architecture.
-
Jan. 30, 2008, (11:45 a.m., 366 WVH -- note unusual time)
"Multi-Core Computing"
Gene Cooperman
-
While the PhD admissions committee has our time slot, we continue
to have a floating systems seminar. :-)
This Wednesday (tomorrow), we will hold a joint PL/Systems seminar
at:
11:45 a.m., Wed., 366 WVH
I will be talking informally (white board) about multi-core computing.
The goal is to find common issues affecting both systems and PL.
Part of the talk will be review, and part of the talk will be
speculation about the future.
-
Jan. 18, 2008 (1:45 p.m. -- note unusual time)
"Alternative Designs for Automatic Transformations
to Implement Thread Parallelism in the Presence of C++ Inheritance"
Xin Dong
-
Xin Dong will speak
on the issue of implementing automated thread parallelism in the
presence of object inheritance in C++. Xin will briefly present
the issues and then ask the seminar attendees to help make suggestions
and debug design possibilities. Xin already has a simple transformation
for the common case encountered in Geant4 code. In the presence
of inheritance, his simple transformation becomes harder to
implement (since it must take account of the base class), and
Xin is looking for an audience to discuss alternative designs.
-
Jan. 10, 2008, 1:30 p.m.
"Experience in Implementing Thread Parallelism for Geant4"
Xin Dong
-
Xin Dong will present his experience in implementing thread parallelism
for Geant4. Some of the key ideas were simple transformations for
singleton classes, and the careful use of the relatively new
__thread keyword, a compiler extension for thread-local storage.
-
Nov. 8, 2007 (1:30 p.m. -- note unusual time)
"Disk-Based Parallel Coset Enumeration" (cont.)
Vlad Slavici
-
Vlad will continue the discussion by explaining his parallel,
disk-based strategy for coset enumeration. If time permits,
Gene will give the background for BDDs (Binary Decision Diagrams) ---
another pointer-chasing problem. BDDs are used in verification,
and a similar parallel, disk-based strategy may be available
for this algorithm.
-
Nov. 1, 2007 (1:30 p.m. -- note unusual time)
"Disk-Based Parallel Coset Enumeration"
Vlad Slavici
-
First, a short description of the standard way to convert an NFA to a
DFA. If the NFA has n states, the DFA will have at most 2^n states, and
each state of the DFA can be represented by an n-bit string.
Generating a new state takes at most O(n) time using bit-string
operations. Pointer chasing is an issue because the next state may be
owned by any node. Still, all the generated states are valid, so the
disk space will be continually increasing until reaching the final
configuration.
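The construction just described can be sketched with each DFA state encoded as an n-bit mask over the NFA's states (a hedged illustration; the transition-table representation is an assumption for clarity, not the actual code):

```python
from collections import deque

def nfa_to_dfa(n, delta, start, alphabet):
    """Subset construction: DFA states are n-bit masks over the
    NFA's states. delta maps (state, symbol) to a set of states."""
    def step(mask, a):
        out = 0
        for s in range(n):
            if mask >> s & 1:
                for t in delta.get((s, a), ()):
                    out |= 1 << t
        return out
    start_mask = 1 << start
    trans, seen = {}, {start_mask}
    queue = deque([start_mask])
    while queue:                  # BFS over reachable subsets
        m = queue.popleft()
        for a in alphabet:
            t = step(m, a)
            trans[(m, a)] = t
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return trans, seen
```

In the disk-based setting, `seen` and `queue` are exactly the structures that outgrow RAM, and the successor mask `t` may be "owned" by any node of the cluster, which is the pointer-chasing problem mentioned above.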
Then a problem where pointer chasing visibly dominates the computation:
Coset enumeration is the problem of converting a presentation
representation of a group into a
permutation representation. Many problems require the permutation
representation of a group as input. In order to obtain the result - a
complete coset table - all known algorithms define additional cosets,
most of which turn out to be duplicates. Basically, there are two
approaches to this problem:
HLT, which defines as many cosets as necessary to fill a relator loop
(a circular sequence of cosets whose length equals the order of a
generator) and then tries to collapse them; and
Felsch, which tries to collapse cosets as soon as it defines new ones.
HLT is easier to implement, but since previous approaches used only RAM,
Felsch was preferred for using less space. For HLT, the intermediate
space required overflowed RAM, or even a single disk.
Our approach uses the HLT strategy with parallel disks, so that the
space needed for intermediate states is no longer an issue. The
strategy will be presented, along with rough estimates of the number
of disk accesses for a given input size.
-
Oct. 25, 2007 (3:00 p.m. -- note unusual time)
"Twenty-Six Moves Suffice for Rubik's Cube"
Daniel Kunkle
-
The minimum number of moves sufficient to solve any configuration of
Rubik's Cube has been a matter of long-standing conjecture -- since
the puzzle appeared over 25 years ago. This number is sometimes known
as "God's Number". This talk will describe the methods we have used to
show that God's Number is no more than 26.
To achieve this result, we used new techniques in two areas: group
theoretic methods for efficiently representing and computing over the
possible configurations of Rubik's Cube; and, parallel disk-based
algorithms for enumerating large search spaces. This talk will
primarily focus on the disk-based search and enumeration techniques we
have introduced.
-
Oct. 18, 2007
No systems seminar
-
The systems seminar is preempted by a faculty meeting at the same time.
-
Oct. 11, 2007 (3:00 p.m. -- note unusual time)
"Distributed Disk-Based Search and Enumeration: Discovering the
Shortest Path to a Thesis"
Eric Robinson
-
With the introduction of multi-core computing, Moore's Law continues
to hold. Problems that were previously infeasible due to computation
time are now being re-examined. For many of these problems, however,
the issue at hand is no longer CPU time, but rather available
storage. This is especially true in the case of search and
enumeration problems, which are typically characterized as being on
the order of $O(N)$, but may require the storage of billions or
trillions of ``values''.
Large search and enumeration is examined, and the case for using
distributed disk is made. Fundamentally, disk is on the order of 10
times slower than RAM when streamed. This slowdown can be
compensated for through the use of a cluster of disks or a RAID
system. In exchange, the application programmer gains access to on the
order of 100 times more storage. Many techniques for search and
enumeration are presented, including one method developed in-house,
and a ``fair'' method for comparing these techniques is given.
The talk shall also follow my research as it developed through my
5.13 years (currently) at Northeastern, and will demonstrate some of
the goals and pitfalls of being a Ph.D. student. These will be
presented through a look at the progress, or lack thereof, of my
research during my time at Northeastern.
-
Oct. 4, 2007
No meeting this week.
-
-
Sept. 27, 2007
"Parallel algorithms in numerical algebraic geometry"
Anton Leykin,
Institute for Mathematics and its Applications (IMA),
U. Minnesota
-
Numerical algebraic geometry (NAG) is a young area encompassing
numerical methods that deal with systems of polynomial equations.
The main engine of NAG, the polynomial homotopy continuation
procedure, is easily parallelizable. This makes NAG algorithms
suited for parallel and distributed computing better than those
based on classical symbolic techniques such as Gröbner bases.
This talk will start with an introduction to NAG and continue with
sketches of two parallel algorithms. One is a core algorithm for
parallel monodromy breakup, the other is an application to pure
mathematics; it computes Galois groups of simple Schubert problems.
Time permitting we will discuss the recent algorithm for numerical
primary decomposition, which enables us to solve the ideal
membership problem numerically.
-
Sept. 20, 2007
"Structures for potentially probabilistic state enumeration in RAM"
Peter Dillinger
-
Part 1: Classical approaches to representing visited lists can require
much more memory than is theoretically required to store subsets of
some finite universe of elements. In particular, when the subset is
neither particularly dense nor particularly sparse in the universe,
discrete information theory dictates that classical approaches are far
from optimal. For example, it is possible to represent 200,000
arbitrary 32-bit values (such as IP addresses) using less than 16 bits
per value. My data structure, the "hash pyramid," shares enough structure
between elements to use only 17.5 bits per value in this case. A
nearly-forgotten structure by Cleary is similarly compact, but only
when made quite slow.
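The information-theoretic bound invoked in Part 1 is log2 of the binomial coefficient C(universe, m), divided by m. A small Python check (the function name is illustrative; log-gamma avoids computing the huge factorials exactly):

```python
import math

def bits_per_element(universe, m):
    """Information-theoretic lower bound for storing an arbitrary
    m-element subset of a universe: log2(C(universe, m)) / m."""
    log2_binom = (math.lgamma(universe + 1)
                  - math.lgamma(m + 1)
                  - math.lgamma(universe - m + 1)) / math.log(2)
    return log2_binom / m

print(bits_per_element(2**32, 200_000))
```

For a 2^32-value universe and m = 200,000 this evaluates to roughly 15.8 bits per value, consistent with the "less than 16 bits" figure above, and shows how close the 17.5-bit hash pyramid comes to optimal.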
Part 2: When an exact visited list cannot fit in RAM, it is useful in
many applications to resort to an over-approximated visited list.
Because the elements are hashed, there is a predictable probability
of whether a new state will erroneously be considered visited. Once
again, discrete information theory can tell us how much space is
required to achieve certain probabilities. One underacknowledged
structure, the "hash compact" table, is nearly optimal, but only if
the number of reachable elements is known in advance. In contrast,
the popular Bloom filter is more generally mediocre, meaning it is
likely to perform relatively well when the number of reachable elements
is not well known. Pete Manolios and I were the first to contrast
these structures in this way.
Part 3: Structures for exact sets can also be used as
over-approximations, by exactly storing a set of hashes. The hash
pyramid and Cleary structure are quite competitive when used in this
way, and they have a property that neither the hash compact table nor
the Bloom filter has: they can be rehashed to accommodate more entries.
The Cleary structure in particular can be rehashed in place with linear
scans.
Part 4: All of these schemes depend on hashing. Two hash functions in
particular are quite useful in explicit-state verification; these have
been implemented in our 3Spin and 3Murphi tools based on Spin and
Murphi. The first is the Jenkins LOOKUP2 function, which is
heuristically accurate and predictably fast. The second is H3, which
has "universal hashing" accuracy guarantees, and can use a
differential optimization to make it heuristically fast. Thus, these
two hash functions are nice counterparts.
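The H3 family mentioned above is easy to sketch: choose one random r-bit row per key bit, and hash a key by XORing together the rows selected by its set bits (a hedged Python illustration, not the 3Spin/3Murphi implementation):

```python
import random

def make_h3(key_bits, hash_bits, seed=0):
    """H3 universal hashing: one random hash_bits-wide row per
    key bit; the hash of x is the XOR of the rows selected by
    the set bits of x."""
    rng = random.Random(seed)
    rows = [rng.getrandbits(hash_bits) for _ in range(key_bits)]
    def h(x):
        out = 0
        for i in range(key_bits):
            if x >> i & 1:
                out ^= rows[i]
        return out
    return h
```

Because the hash is XOR-linear in the key bits, two keys differing in only a few bits have hashes differing by the XOR of just the changed rows; caching that incremental update is the kind of differential optimization alluded to above.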
-
Sept. 13, 2007
"Organizational Session"
Gene Cooperman
-
We will have an organizational session in which Gene will give a short
report on interesting applications he found on his recent travels,
and others will give a five minute report on the current status of
their research projects. As with all systems seminars, additional
participants/listeners are welcome.
-
Mar. 30, 2006
"How Does one Solve Rubik's Cube Optimally?"
Dan Kunkle
-
A classic search problem is finding shortest solutions to Rubik's cube.
Researchers have looked at this problem with a single processor
and external memory (disk-based algorithms), and with parallel processors.
We ask how to use all of the weapons available (parallel processors
and external memory) to do better than previously. This talk will
describe some building blocks of such an algorithm. In the remainder
of the talk, methods to put together the building blocks will be examined
as a group activity, with audience participation.
-
Mar. 15, 2006 (Wed.)
"Symmetry, Search, and Research Problems in
Applications to Sudoku and Latin Squares"
Ravi Sundaram, Dan Kunkle, and Gene Cooperman
-
For this week, only, we hold the systems seminar on Wednesday.
We will employ the often-used triple speaker format --- choosing
the variation where lectures are given in sequence rather than in parallel.
Gene will present the formalisms of how to use computational
group theory to efficiently support symmetries for pruning large
search problems. Ravi Sundaram will present results on Sudoku
and Latin Squares from the Theory Community, while Dan Kunkle
will present results and open problems from outside the
Theory Community.
-
Mar. 9, 2006
"The Issues of Choosing a Good Thesis Topic"
Dan Kunkle
-
I first considered applying a novel methodology in top-down style,
in order to demonstrate it on a concrete problem (numerical analysis).
I am now considering the opposite, bottom-up style:
starting with a concrete "challenge problem" in search and applying
novel out-of-core techniques to solve it. I would then use
the eventual solution to enunciate a general methodology.
-
Mar. 2, 2006
"Task Oriented Parallel ILU(k) Factorization and Its Optimization"
Ye Wang
-
This is a continuation of the previous week's talk.
-
Feb. 23, 2006
"Task Oriented Parallel ILU(k) Factorization and Its Optimization"
Ye Wang
-
I will briefly introduce the classic (sequential) ILU factorization
algorithm and the task-oriented parallel ILU(k) algorithm. Based on
the features of the algorithm and of the sparse matrices, I will
optimize the TOP-ILU algorithm.
-
Dec. 7, 2005
"Checkpointing Master-Worker Parallel Computations"
Jason Ansel
-
I will explore a system-level transparent method of checkpointing
in which a snapshot is taken only of one node. In addition I will talk
about utilizing system core dumps as a way to checkpoint a process and
some of the issues encountered on the way.
-
Nov. 30, 2005
"Developing Task-Oriented Parallelism for Linear Algebra" (continued)
Ye Wang
-
This is a continuation of the previous talk in this seminar.
-
Nov. 16, 2005
"Developing Task-Oriented Parallelism for Linear Algebra"
Ye Wang
-
This talk will focus on solving important linear algebra
problems, such as ILU preconditioner, using a new tool TOP-C.
I will introduce the "Classic" algorithm for ILU preconditioner as well as
my new task-oriented algorithm, and make comparison of the two algorithms.
-
Nov. 9, 2005
"Computational Algorithms for Finding Short Solutions for Rukik's Cube"
Gene Cooperman
-
This lecture will be a "fun" lecture --- no prior experience
necessary. Rubik's cube has long held a fascination for computer
scientists ranging, for example, from Korf (in A.I.) to Shamir (in Theory).
Finding good solutions (or the best solution) requires enormous
computational resources in both space and time. This makes it a wonderful
testbed for systems: stressing all of network, CPU, disk and bandwidth
of RAM.
We will review a variety of algorithms, including: Sims's algorithm
for finding some solution; Korf's modification for finding a
reasonable solution (within a factor of two); Shamir's algorithm
for finding the best solution (computationally feasible for
solutions up to about 18 or 19 moves); my own enumeration of a subset
of states of Rubik's cube using only two bits (0.25 bytes) per state
in a Cayley graph from which shortest moves in the subset can be recovered;
algebraic bounds on the maximum solution length over all states of Rubik's
cube; and the "shape" of any breadth-first enumeration of states, in terms
of the size of each successive level.
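The two-bits-per-state trick mentioned above works on any graph: store only the BFS depth mod 3 per state, and note that a neighbor's true depth differs by at most one, so depth mod 3 still identifies which neighbors are one level closer to the start. A hedged Python sketch on a toy graph (not the actual Rubik's cube code):

```python
from collections import deque  # deque unused here; lists suffice

UNSEEN = 3  # 2 bits per state: depth mod 3, or UNSEEN

def two_bit_bfs(n, neighbors, start):
    """BFS storing only depth mod 3 per state (2 bits each)."""
    mod3 = [UNSEEN] * n
    mod3[start] = 0
    frontier, depth = [start], 0
    while frontier:
        depth += 1
        nxt = []
        for s in frontier:
            for t in neighbors(s):
                if mod3[t] == UNSEEN:
                    mod3[t] = depth % 3
                    nxt.append(t)
        frontier = nxt
    return mod3

def walk_back(mod3, neighbors, state):
    """Recover one shortest path to the start using only mod-3
    depths: repeatedly move to any neighbor one level closer."""
    path = [state]
    while True:
        d = mod3[path[-1]]
        prev = [t for t in neighbors(path[-1])
                if mod3[t] == (d - 1) % 3]
        if not prev:
            return path  # only the start has no closer neighbor
        path.append(prev[0])
```

The termination test is sound because neighbors of a state at depth d lie at depths d-1, d, or d+1, which are distinct mod 3; only the start state has no neighbor at depth d-1.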
-
Nov. 2, 2005
Automatic Algorithm Selection
Dan Kunkle
-
I will describe my proposed automatic algorithm selection
system: a system for determining the right algorithm and its
parameters given a specific problem to solve. This method is
particularly useful when there are many possible algorithm variants for
a single problem, the appropriate algorithm is dependent on the actual
problem instance being solved, there are no simple rules mapping
problem instance to algorithm variant, and the algorithms are
dependent on implementation and computing architecture. The system
involves methods from multiobjective optimization, case based
reasoning, and supervised learning. I will highlight the process of
constructing the automatic algorithm selection system with an
illustrative example: high performance solving of linear systems of
equations.
-
Oct. 26, 2005
"Distributed Framework for Computation of Very Large Trees"
Eric Robinson
-
This talk focuses on the computation of trees too large to be computed on
a single machine. For trees such as this, it is useful to have an
abstract framework to change a local tree computation into a distributed
one.
We provide such a framework and show how it can be used to compute over
two different trees: one used in the computation of matrix groups, and
the other used to compute chess endgame trees.
Talks from Previous Semester (Spring, 2005)
-
Feb. 3, 2005
"Ischemic Stroke and a new approach in Machine Learning"
Viet Ha Nguyen
-
For the introductory lecture this semester, Viet Ha Nguyen will talk on
his work in learning and ischemic stroke. He will informally describe how
his application-oriented thesis led to a new approach in machine learning.
Ha has worked closely with a medical research group at MGH (Mass. General
Hospital) for six months. At the end of six months, he had solved
the problem that the "doctors" had assigned him: improve the accuracy
of a model predicting which parts of the brain were at risk of dying
after ischemic stroke.
However, this striking success in a particular medical application
still needed to be translated into a general computer science technique,
suitable for a thesis. Ha will describe both his original result, and
how he generalized that result, so as to apply to more than a single
application.
-
Feb. 18, 2005 (note, Friday)
"Fast Query Processing Using Cooperative CPU Caching for Index Structures"
Xiaoqin Ma
-
Data-intensive applications on clusters often require that requests
be sent quickly to the node managing the desired data. In many
applications, one must search a sorted tree structure to
determine the node responsible for accessing or storing the data.
Examples include sensor networks, network routing tables, routing in
publish-subscribe middleware, and database indexes. When the tree
structure is larger than the CPU cache, the standard implementation
potentially incurs a cache miss for a lookup at each successive
level of the tree. This significantly degrades the performance
because a single cache miss of one word will force an entire cache
line to be loaded (e.g. 128 bytes in the case of the Pentium 4). As
the CPU-RAM gap grows, this performance degradation will only become
worse in the future.
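The standard remedy is to pack many keys into each node, in the spirit of a B-tree or cache-sensitive search tree, so that one cache-line fetch answers several comparisons. The following is a minimal sketch of that idea (not the speaker's implementation): a sorted array is cut into fixed-size blocks, and a small top-level index of block-leading keys directs each lookup, so a search touches the (likely cache-resident) index plus a single block of the large array.

```python
import bisect

BLOCK = 16  # e.g. 16 eight-byte keys fill one 128-byte cache line

def build_index(sorted_keys):
    # Leading key of each block; this small index tends to stay in cache.
    return sorted_keys[::BLOCK]

def blocked_search(sorted_keys, index, key):
    # Locate the candidate block via the index, then search only that
    # block: roughly one cache line of the big array is touched.
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return False  # key precedes every stored key
    lo = i * BLOCK
    hi = min(lo + BLOCK, len(sorted_keys))
    j = bisect.bisect_left(sorted_keys, key, lo, hi)
    return j < len(sorted_keys) and sorted_keys[j] == key
```

Contrast this with a pointer-based binary tree larger than cache, where each of the log2(n) levels may cost a separate cache miss.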
-
Mar. 10, 2005
"The Conquering of the Baby Monster: A Problem of Resource Balancing"
Eric Robinson
-
The Baby Monster is a permutation group in a well-known ladder of
challenge problems, the sporadic simple groups. Solving for this group is
difficult not only in terms of the computation time, but also the
space. We provide a distributed disk-based solution and show that this
solution is only marginally slower than an optimal solution given infinite
disk and memory resources. In addition, we generalize this solution and
show its application in the area of large tree discovery.
-
Mar. 31, 2005
"Some Research Problems in Systems"
Gene Cooperman
-
I will discuss some open research problems in systems. This will
be an interactive talk. We will learn to propose research problems
"by doing it". Some of the topics below are meant simply to be
seeds for thought.
They include semi-automatic parallelization (one of my continuing
favorites). The TOP-C model should make it exceptionally easy to
parallelize for/while loops. There are standard programs for
transforming fragments of C, such as CIL. What other constructs could
be parallelized?
The MBRAM model can be used as the basis for choosing
among compiler optimizations. How can one do that effectively?
This could be a very hot area.
What about extending MBRAM to SMP or to distributed memory? For a
dual-core chip, what can you say?
The clusters of tomorrow will probably be multiple CPUs at each node,
just because of the trend toward dual and quad-core CPU chips.
How does one reorganize a parallel computation (TOP-C or other)
to take advantage of this?
Can the search methods of computational algebra be repackaged for other
kinds of search? In Eric's case, chess endgames had been suggested.
In the area of neural nets, what would be an efficient memory-access
pattern if you have millions of neural-net nodes at each layer?
The current rules of efficient memory access dictate different kinds
of sorting algorithms. Can one produce a faster integer sorting
algorithm than quicksort or mergesort? (Note that one can test one's
ideas without coding by using the MBRAM model.)
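One classic answer for integer keys, offered here only as a sketch of the kind of algorithm the question has in mind, is LSD radix sort: it replaces the O(n log n) comparisons of quicksort or mergesort with a constant number of passes, each of which reads the input sequentially and distributes keys into a small number of buckets, so the working set of bucket heads stays cache-resident.

```python
def radix_sort(nums):
    """LSD radix sort for non-negative integers, radix 256: each pass
    streams through the data once, appending to one of 256 buckets."""
    if not nums:
        return []
    out = list(nums)
    shift = 0
    while max(nums) >> shift:  # one pass per byte of the largest key
        buckets = [[] for _ in range(256)]
        for x in out:
            buckets[(x >> shift) & 0xFF].append(x)
        out = [x for bucket in buckets for x in bucket]
        shift += 8
    return out
```

Correctness relies on each pass being stable (list appends preserve the order established by earlier passes); whether such a sort actually beats a tuned comparison sort is exactly the kind of question the MBRAM model can help answer on paper.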
Groebner bases are an interesting construct in computational algebra,
which, in some sense, "search" for solutions to systems of
polynomial equations. Can that search be executed in a more cache-efficient
manner?
Cayley graphs are a kind of deterministic automaton. There are interesting
types of graphs that are most easily constructed using Cayley graphs.
Can one search for a good graph in such a space of graphs?
-
Apr. 7, 2005
"Towards a Mathematical Justification for Spatial Preconditioning in Spatially
Distributed Feature Learning"
Gene Cooperman
-
This Thursday, I will talk on the current work of Viet Ha Nguyen (since
he's in Japan). Ha had a nice applied solution to a problem in brain scans.
By generalizing, he developed a new heuristic: spatial preconditioning.
The idea is:
- learn the spatial influence first
- then do feature learning locally, using the modified features based
on the spatial influence of nearest neighbors.
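As a loose illustration of the first step only (the details are Ha's; the 1-D grid and the single blending weight below are invented simplifications), spatial preconditioning might blend each site's feature with its nearest neighbors before any local learner sees it:

```python
def spatially_precondition(features, weight=0.5):
    """Blend each site's value with the mean of its nearest neighbors
    on a 1-D grid. `weight` is a hypothetical smoothing parameter,
    not taken from the talk."""
    n = len(features)
    out = []
    for i in range(n):
        nbrs = [features[j] for j in (i - 1, i + 1) if 0 <= j < n]
        if not nbrs:  # isolated site: nothing to blend
            out.append(float(features[i]))
            continue
        out.append((1 - weight) * features[i] + weight * sum(nbrs) / len(nbrs))
    return out
```

The second step, local feature learning, would then run on the preconditioned values instead of the raw features.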
The idea has been demonstrated to work nicely in experiments. We now
need to look for a mathematical justification. The talk will discuss
the first attempts in that direction. - Gene