Summary | Rationale | Powerpoint |
JPG Slide | JPG Image |
Credits | Comments | All Nuggets
Algorithms for understanding data on a massive scale
Computers can search data, but can they understand it and make discoveries of their own?
Summary
During the last decade we have witnessed a dramatic increase in the
amount of generated data. The web, network traffic, scientific,
health, financial and marketing data sets are a few examples of the massive and
heterogeneous data being produced in unprecedented volume. The data
often contains information of great value. For example, computers could cross-examine genetic and clinical health data in order to design cures for diseases. However, extracting knowledge
from terabytes of raw data requires extremely efficient algorithms,
posing a great challenge to algorithm designers.
Rationale
Answering that challenge requires: (1) Developing computational
models that capture the essential aspects of computing over massive
data sets, yet are expressive enough to support solutions to
important problems and (2) creating novel algorithmic techniques
capable of analyzing and extracting value from such immense inputs.
Theoretical computer science has begun to address these issues on
both fronts. Streaming (computing using a single pass over the data), sketching (extracting succinct data representations sufficient for solving a problem) and sub-linear time/space computing are
examples of models that led to the development of a vast array of
efficient algorithms. The framework of computational learning
theory, especially the notion of amplifying classification accuracy via boosting, yielded efficient and
effective algorithms for a variety of machine learning problems. The
notions of approximate and randomized algorithms were introduced to
handle problems too difficult to be solved exactly and
deterministically. Techniques such as random sampling, random
projections, Fourier sampling and locality-sensitive hashing were
developed and utilized for a variety of problems of practical
importance.
Many of the aforementioned developments are based on novel mathematical interpretations of data. Insights from geometry, spectral analysis and topology have been a source of inspiration for designing efficient algorithms. It is likely that such interactions hold the potential for even greater impact in the future.
Contributors and Credits
Petros Drineas, Piotr Indyk, Sampath Kannan, James Lee, Luis Rademacher, Madhu Sudan
Image Ideas
Computer wearing a lab coat ("computer as scientist"). Here is a bad example. A better example would have just a computer in a lab coat,
probably without a human body.
Comments
- to give feedback on this nugget, just add another bullet to this list
- Add "visual data" to the list of data types (photos, video, medical..). -- Avi
- Add Rationale, to include application examples of the mysterious "sketching, streaming" as well as local error correction. Even Boosting is nonobvious and can be elaborated on. Indeed, it may deserve a separate nugget! -- Avi
- Added explanations on streaming, sketching and boosting. - Piotr
- Presumably the whole text after the 1st paragraph can be moved to the Rationale section - Piotr
- We thought error correction would be a better fit with the "unreliable world" nugget. But maybe there is a good way to incorporate it here as well - Piotr
- The nugget focuses on past achievements. It may be useful to add a concrete challenge, perhaps the difficulty of computing k-way joins or some such thing. -- Udi
- At the end of first para, "efficient algorithms" -> "smart and efficient algorithms"? Else the subtitle "developing algos to understand the data" loses some of its effect: it feels like we're saying we already have algorithms that understand the data, just no fast algorithms that do so. --Anupam
- "Novel algorithmic techniques" -> "new, faster, smarter, better algorithms" (or something like that), shorter words might indicate more excitement. Also, last para is not for high-school students/congress lobbyists --- maybe it should move into rationale? --Anupam
- I agree with Udi that the future should be mentioned in a more concrete way. --Luis
- "understanding data on a massive scale" might be rewritten as "data understanding on a massive scale". the issue here is that in the first formulation a twisted mind will parse it so that massive-scale applies to understanding and not to data, so the result will be like: "Algorithms for massive understanding of data," which is kind of funny, i guess... Switching the order between understanding and data alleviates the problem somewhat. - Bernard
- Note from designer Elaine Park: I love the suggestion of "computer as scientist," as well as the idea of computational models that are "expressive." Here, that is brought to life through a futuristic computer/brain processing streams of data and generating an idea.
- We made a few changes, addressing some of the comments. In particular, we now have a Rationale section, and the last paragraph speculates (broadly) about future directions - Piotr and James
- I am agnostic on the "understanding data" vs "data understanding" front. Both phrases sound equally good in Polish-English :) - Piotr
- It may be better to move the first par of the rationale into the summary, and end the par with a sentence like "A rich and ongoing body of work in theoretical computer science addresses both of these issues." (This makes it sound less like a "completed" research direction than the sentence "TCS has already addressed...", which could then be dropped.) - Salil