Streaming algorithms are a set of techniques for computing on data streams in which the stream of data is presented as a sequence of elements with each element only being processed once. Additionally, these methods are only allowed a constant memory footprint, independent of the size of the input stream. These constraints often lead to required tradeoffs. Solutions are often correct to a threshold, with approximations being required. Understanding the vast application of these methods and how to best integrate them into their software should be a core competency in the modern engineers toolbox.
This talk will weave together
real-world applications of streaming algorithms encountered here at Socrata,
the types and data structures (with open-source implementations in Scala) used internally to implement our streaming algorithm data pipeline,
the theory and data structures behind approximation algorithms including the popular Count-Min Sketch and HyperLogLog as well as other lesser-known techniques, and
references to and examples of integrating JVM-based implementations of these methods found in Twitter’s Algebird and Clearspring’s stream-lib into the working engineer’s data pipeline.