Mirek Riedewald

Associate Professor
Northeastern University

Khoury College of Computer Sciences, 202 West Village H
360 Huntington Avenue
Boston, MA 02115

Phone: +1-617-373-4766


Cloud computing, distributed big-data management and analysis, data-stream processing, data-driven science

Research collaborations: I have been collaborating with industrial partners and with scientists from various disciplines since 1999. While specific challenges vary, the common theme is always the same: everybody is collecting and generating an ever-increasing amount of data. In this world of big data and data-driven science, groundbreaking discoveries depend on the ability to efficiently analyze and process these massive amounts of data. We have been designing scalable data management and analysis techniques for neuroscience, discovery and linking of personal information (e.g., as mandated by GDPR), ornithology, ecology, rocket science (really!), astronomy, and high-energy physics, to name a few.


Research vision: Create algorithms that scale in the size and complexity of data, with a focus on analysis problems motivated by grand challenges in Open Data and data-driven science.

What our PhD students do: design novel algorithms; prove lower bounds, upper bounds, and optimality; build big-data systems; publish results in the premier CS and domain-science venues.

Prof. Riedewald is co-founder and co-leader of the DATA Lab @ Northeastern. Currently he focuses on the development of novel techniques for large-scale distributed data analysis, data management, and data mining. His research agenda is driven by collaborations with domain scientists and industry, with the goal of producing results that are publishable both in premier computer science venues and in those of the application domain.


Current Projects

Distributinator: Scalable Big-Data Analytics

How do we effectively and efficiently use many machines in a cluster or in a cloud to solve a big-data-analysis challenge? What is the best way to partition a dataset so that running time of the distributed computation is minimized? How do we abstract a complex distributed computation so that we can learn a mathematical model of how running time depends on parameters affecting data partitioning?

Table-as-Query: Unifying Data Discovery and Alignment

Fueled by advances in information extraction and societal trends that value institutional openness and transparency, structured data are being produced and shared at an overwhelming speed. Open-data sharing is central to supporting institutional transparency, but transparency is not achieved if shared data cannot be found and effectively aligned with other data being studied by data scientists, journalists, and others. This project will fundamentally contribute to the new science of open-data sharing by laying the theoretical foundations of data discovery and by designing a system that solves the problem at scale.

NCTracer Web

How do we turn 20,000 3D image stacks (10 terabytes per mouse brain) taken by a high-resolution light microscope into a coherent 3D image of the brain? How do we extract from this massive dataset a graph representing the neurons captured in the image? And how do we analyze this graph efficiently? Can we extend this approach to include other brain data, e.g., from fMRI and electron microscopes? And can we generalize our techniques to graph problems in other domains such as social network analysis?

Any-k: Optimal Ranked Enumeration for Conjunctive Queries

When a query on big data produces huge output, can we quickly return the "most important" results without even computing the entire output? If the notion of importance is difficult to define, can we return the top-ranked results so quickly that the user can try out different options (nearly) interactively? For what types of queries and data can this functionality be supported? And what are the best time and space guarantees we can provide?


Selected Past Projects

Scolopax: Making Analysis of Scientific Data Fast and Easy

Cayuga: A Scalable System for Data Stream Processing

Additive Groves Prediction Technique and Automatic Interaction Detection