Events


Video recording and production done by Enthought

SciPy 2016 Schedule

July 13 - 15, 2016

(92 available presentations)
Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

In the last 40 years over a petabyte of publicly available earth observation imagery has been produced. In the near future, many petabytes of imagery per year will become available from a combination of public satellite missions and private satellite constellations. At the same time, commercial cloud providers are competing to provide the lowest cost alternative to on-premise compute capabilities. By combining the dramatic rise in available imagery with the low cost of high-performance storage, network, and compute capabilities, we have a unique opportunity to bring together analysis techniques from remote sensing, machine learning algorithms, and scalable compute infrastructure. Combined, they allow for global-scale investigations into how our planet is changing.

Here we will report on how we leverage the commercial cloud to generate a tiled spatio-temporal mosaic of the Earth and how it enables fast iteration for the development of both traditional model based predictions and machine learning algorithms. As part of our effort, we have processed, in less than 24 hours, over a petabyte of compressed raw data from the combination of the US Landsat and MODIS programs, totalling nearly 3 petapixels. We will detail the challenges and benefits to moving from traditional remote sensing workbenches to the commercial cloud, with particular emphasis on the benefits for researchers.

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

In this talk, I will introduce the audience to the emerging area of computational social science, focusing on how machine learning for social science differs from machine learning in other contexts. I will present two related models -- both based on Bayesian Poisson tensor decomposition -- for uncovering latent structure from count data. The first is for uncovering topics in previously classified government documents, while the second is for uncovering multilateral relations from country-to-country interaction data. Finally, I will talk briefly about the broader ethical implications of analyzing social data.

Hanna Wallach is a Senior Researcher at Microsoft Research New York City and an Adjunct Associate Professor in the College of Information and Computer Sciences at the University of Massachusetts Amherst. She is also a member of UMass's Computational Social Science Institute. Hanna develops machine learning methods for analyzing the structure, content, and dynamics of social processes. Her work is inherently interdisciplinary: she collaborates with political scientists, sociologists, and journalists to understand how organizations work by analyzing publicly available interaction data, such as email networks, document collections, press releases, meeting transcripts, and news articles. To complement this agenda, she also studies issues of fairness, accountability, and transparency as they relate to machine learning.

Hanna's research has had broad impact in machine learning, natural language processing, and computational social science. In 2010, her work on infinite belief networks won the best paper award at the Artificial Intelligence and Statistics conference; in 2014, she was named one of Glamour magazine's "35 Women Under 35 Who Are Changing the Tech Industry"; in 2015, she was elected to the International Machine Learning Society's Board of Trustees; and in 2016, she was named co-winner of the 2016 Borg Early Career Award. She is the recipient of several National Science Foundation grants, an Intelligence Advanced Research Projects Activity grant, and a grant from the Office of Juvenile Justice and Delinquency Prevention.

Hanna is committed to increasing diversity and has worked for over a decade to address the underrepresentation of women in computing. She co-founded two projects---the first of their kind---to increase women's involvement in free and open source software development: Debian Women and the GNOME Women's Summer Outreach Program. She also co-founded the annual Women in Machine Learning Workshop, which is now in its eleventh year. Hanna holds a BA in computer science from the University of Cambridge, an MSc in cognitive science and machine learning from the University of Edinburgh, and a PhD in machine learning from the University of Cambridge.

Links to materials and sites referenced in Hanna's talk:
http://dirichlet.net/
https://github.com/hannawallach/pytho...
https://github.com/hannawallach/cmpsc...
http://mallet.cs.umass.edu/topics.php
https://radimrehurek.com/gensim/
http://scikit-learn.org/stable/module...
https://github.com/aschein/bptf
http://www.fatml.org/

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list....

Rating: Everyone
Viewed 14 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Current plotting tools are inadequate for revealing the distributions of large, complex datasets, both because of technical limitations and because the results vary dramatically depending on the dataset itself. Avoiding these problems requires either prior knowledge of the distribution or tedious trial-and-error parameter adjustment, neither of which is necessarily feasible for the data now being collected. The new datashader library (https://github.com/bokeh/datashader) makes it practical to work with data at a large scale, easily and interactively visualizing millions or billions of points. In this talk, we'll demonstrate how datashader provides a flexible pipeline for data processing that allows automatic or custom-defined algorithms at every stage. Datashader makes it easier to reveal the underlying structure of the dataset and to focus on the specific aspects of interest.
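
For readers who want a feel for the pipeline, a minimal datashader sketch might look like this (illustrative only; the CSV of x/y points is hypothetical):

```python
import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

df = pd.read_csv("points.csv")              # hypothetical file with 'x' and 'y' columns
canvas = ds.Canvas(plot_width=800, plot_height=600)
agg = canvas.points(df, 'x', 'y', agg=ds.count())  # aggregate points onto a grid
img = tf.shade(agg, how='log')              # map per-pixel counts to colors
```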

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

The web is becoming an increasingly important place to publish research findings, but JavaScript is a language that is broken by design, and Pythonistas seem particularly repelled by the language.

Flexx is a tool to create web apps, for which the client side is completely implemented in Python and transpiled to JavaScript. It's easy to extend Flexx's functionality by writing Python classes, which will be demonstrated in this talk.
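
As a flavor of the style of code involved, here is a minimal sketch based on Flexx's current `flx` API (which differs from the 2016-era interface shown in the talk):

```python
from flexx import flx

class Example(flx.Widget):
    def init(self):
        # child widgets created inside init() are parented implicitly
        self.button = flx.Button(text='Greet')
        self.label = flx.Label(text='')

    @flx.reaction('button.pointer_click')
    def greet(self, *events):
        self.label.set_text('Hello world!')  # runs in the browser, written in Python

flx.launch(Example)  # open the app in a browser window
flx.run()
```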

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

What is the best way to develop analysis code in the Jupyter notebook while managing complex dependencies between analyses? In this talk, I will introduce nbflow, a project that integrates a Python-based build system (SCons) with the Jupyter notebook, enabling researchers to easily build sophisticated, complex analysis pipelines entirely within notebooks while still maintaining a "one-button workflow" in which all analyses can be executed, in the correct order, from a single command. I will show how nbflow can be applied to existing analyses and how it can be used to construct an analysis pipeline stretching the entire way from data cleaning, to computing statistics, to generating figures, and even to automatically generating LaTeX commands that can be used in publications to format results without the risk of copy-and-paste errors.
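
For orientation, a rough sketch of the mechanism as I recall it from the nbflow README; treat the exact names (`__depends__`, `__dest__`, `nbflow.scons.setup`) as assumptions to be checked against the repository:

```python
# First code cell of analyses/clean_data.ipynb -- nbflow (as I recall) scans
# notebooks for these two special variables to infer the dependency graph:
__depends__ = ["../data/raw.csv"]
__dest__ = ["../results/clean.csv"]

# SConstruct at the project root (Environment is provided by SCons itself):
# import os
# from nbflow.scons import setup
# env = Environment(ENV=os.environ)
# setup(env, ["analyses"])   # execute the notebooks in dependency order
```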

More info on nbflow here: https://github.com/jhamrick/nbflow

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Recent years have seen a widespread adoption of machine learning in industry and academia, impacting diverse areas from advertisement to personal medicine.

As more and more areas adopt machine learning and data science techniques, the question arises of how much expertise is needed to successfully apply machine learning, data science, and statistics.

Not every company can afford a data science team, and if you are doing your PhD in biology, no one can expect you to also have PhD-level expertise in computer science and statistics.

This talk will summarize recent progress in automating machine learning and give an overview of the tools currently available. It will also point out areas where the ecosystem needs to improve in order to allow wider access to inference using data science techniques. Finally, we will point out some open problems regarding the assumptions and limitations of what can be automated.
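
To make the idea concrete, here is a minimal sketch using auto-sklearn, one of the AutoML tools in this space (dataset and time budget are placeholders, not from the talk):

```python
import autosklearn.classification
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# the search over preprocessors, models, and hyperparameters is automated
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300)   # seconds to spend searching
automl.fit(X_train, y_train)
print(automl.score(X_test, y_test))
```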

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list....

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Tutorial materials may be found here: https://github.com/barbagroup/numba_t...

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Tutorial information may be found at https://github.com/scikit-image/skima...

Across domains, modalities, and scales of exploration, images form an integral subset of scientific measurements. Despite a deep appeal to human intuition, gaining understanding of image content remains challenging, and often relies on heuristics. Even so, the wealth of knowledge contained inside of images cannot be overstated.

Scikit-image is an image processing library, built on top of SciPy, that provides researchers, practitioners, and educators access to a strong foundation upon which to build algorithms and applications.

In this tutorial, aimed at intermediate users of scientific Python, we introduce the library, give practical, real-world examples of applications, and briefly explore its use in the context of machine learning. Throughout, attendees are given the opportunity to learn through hands-on exercises.

Prerequisites: a working knowledge of NumPy arrays.
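
To give a flavor of the library (a generic sketch, not the tutorial's notebooks), a few typical operations on a bundled sample image:

```python
from skimage import data, filters, measure

image = data.coins()                      # sample image shipped with scikit-image
edges = filters.sobel(image)              # edge magnitude via a Sobel filter
thresh = filters.threshold_otsu(image)    # global threshold from the histogram
labels = measure.label(image > thresh)    # connected-component labeling
print(labels.max(), "regions found")
```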

Rating: Everyone
Viewed 2 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Materials for this tutorial may be found here: https://github.com/pydy/pydy-tutorial...

In this tutorial, attendees will learn how to derive, simulate, control, and visualize the motion of a multibody dynamic system with Python tools. These methods and techniques play an important role in the design and understanding of robots, vehicles, spacecraft, manufacturing machines, human motion, etc. In particular, the attendees will develop code to simulate the motion of a human balancing while standing.

This is an advanced, domain-specific tutorial, but we have found that a broad audience enjoys the topic. Attendees should be familiar with the basics of the SciPy stack, in particular NumPy, SciPy, SymPy, and IPython, and have some familiarity with classical mechanics.

Details

In this tutorial, attendees will learn how to derive, simulate, and visualize the motion of a multibody dynamic system with Python tools. The tutorial will demonstrate an advanced symbolic and numeric pipeline for a typical multibody simulation problem. By the end, the attendees will have developed code to simulate the uncontrolled and controlled motion of a human balancing while standing.

We will highlight the derivation of realistic models of motion with the SymPy Mechanics package. Then we will cover code generation to create fast numerical functions that can be used to simulate the system. The simulation results will be plotted and visualized with a 3D WebGL browser based tool. Finally, we will use packages for optimal control to develop a controller that mimics human standing and visualize these results.
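
For a taste of the symbolic step, here is a minimal sketch with sympy.physics.mechanics: a 1-DOF mass-spring-damper hanging in gravity, not the tutorial's standing-human model (note that the argument order of kanes_equations has changed across SymPy versions):

```python
from sympy import symbols
import sympy.physics.mechanics as me

m, c, k, g = symbols('m c k g')
x, v = me.dynamicsymbols('x v')      # generalized coordinate and speed
t = me.dynamicsymbols._t

N = me.ReferenceFrame('N')
O = me.Point('O')
O.set_vel(N, 0)
P = me.Point('P')
P.set_pos(O, x * N.x)
P.set_vel(N, v * N.x)

forces = (-k * x - c * v + m * g) * N.x   # spring, damper, gravity
mass = me.Particle('mass', P, m)

kane = me.KanesMethod(N, q_ind=[x], u_ind=[v], kd_eqs=[v - x.diff(t)])
fr, frstar = kane.kanes_equations([mass], [(P, forces)])
print(fr + frstar)   # the symbolic equations of motion
```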

Rating: Everyone
Viewed 8 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Materials for this tutorial may be found here: https://github.com/enthought/Numpy-Tu...

This course introduces the fundamental concepts for numerical calculation with NumPy. It provides scientists, engineers, and analysts with a solid foundation for writing their own analyses and simulations in Python.

NumPy provides Python with a powerful array processing library and an elegant syntax that is well suited to expressing computational algorithms clearly and efficiently. We'll introduce basic array syntax and array indexing, review some of the available mathematical functions in NumPy, and discuss how to write your own routines. Along the way, we'll learn just enough of matplotlib to display results from our examples.
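
A few of the idioms covered, in a generic sketch (not the tutorial's own materials):

```python
import numpy as np

a = np.arange(12.0).reshape(3, 4)   # a 3x4 array
print(a[1:, ::2])                   # slicing: rows 1-2, every other column
print(a[a > 5])                     # boolean ("fancy") indexing
print(a.mean(axis=0))               # reduce along an axis: column means
print(np.sin(a))                    # elementwise ufuncs
```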

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 3 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Dask is a pure Python library for parallel and distributed computing. Last year Dask parallelized NumPy and Pandas computations on multi-core workstations. This year we discuss using Dask to design custom algorithms and execute those algorithms efficiently on a cluster. This talk discusses Pythonic APIs for parallel algorithm development as well as strategies for intuitive and efficient distributed computing. We discuss recent results in machine learning and novel scientific applications.
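
For a flavor of the custom-algorithm API, a minimal dask.delayed sketch (illustrative only; with dask.distributed, the same graph can be submitted to a cluster through a Client):

```python
from dask import delayed

@delayed
def load(i):
    return list(range(i))

@delayed
def process(block):
    return sum(block)

@delayed
def combine(results):
    return max(results)

# build a task graph lazily, then execute it in parallel
total = combine([process(load(i)) for i in range(10)])
print(total.compute())
```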

Rating: Everyone
Viewed 3 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

The analysis of time series data is a fundamental part of many scientific disciplines, but there are few resources meant to help domain scientists to easily explore time course datasets: traditional statistical models of time series are often too rigid to explain complex time domain behavior, while popular machine learning packages deal almost exclusively with 'fixed-width' datasets containing a uniform number of features. Cesium is a time series analysis framework, consisting of a Python library as well as a web front-end interface, that allows researchers to apply modern machine learning techniques to time series data in a way that is simple, easily reproducible, and extensible.
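
To give a flavor of the featurization API (a sketch based on my reading of the cesium docs; the feature names are illustrative and should be checked against the library's feature list):

```python
import numpy as np
from cesium import featurize

t = np.linspace(0, 10, 200)
v = np.sin(t) + 0.1 * np.random.randn(200)

# turn an arbitrary-length time series into a fixed-width feature vector
fset = featurize.featurize_time_series(
    times=t, values=v, errors=None,
    features_to_use=["amplitude", "median", "std"])
print(fset)
```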

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Tutorial materials may be found at https://github.com/rouseguy/scipyUS20...

The current state-of-the-art technique for image recognition is deep learning. This workshop covers some of the common deep learning architectures for image recognition, their advantages and concerns, along with hands-on implementation using the latest deep learning libraries in Python. The main topics are multi-layer perceptrons, deep convolutional networks, and autoencoders. The workshop introduces artificial neural networks and deep learning; the building blocks of neural networks are discussed in detail. Attendees are introduced to learning with ANNs along with the backpropagation algorithm. A preliminary multi-layer perceptron model is implemented to get a feel for the model structure and the deep learning library Keras.

The workshop then proceeds to introduce state-of-the-art convolutional neural networks. The building blocks of CNNs are explained and implemented on the dataset to train an image recognition model, which is then tested on unseen data. Overfitting is a big issue in deep learning; some ways to overcome it are discussed and implemented.

We'll also show how GPUs affect the computation. Unsupervised learning using autoencoders is introduced and implemented.
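
For a feel of the Keras Sequential API used in the MLP portion (a generic sketch, not the workshop's actual notebook; MNIST-style 784-dimensional inputs assumed):

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))                          # regularization against overfitting
model.add(Dense(10, activation='softmax'))       # 10-class output
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()
# model.fit(X_train, y_train, batch_size=128, epochs=20)  # with one-hot labels
```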

Rating: Everyone
Viewed 4 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Our aim in this talk is to present the new features available in the next major version of Spyder, the Scientific PYthon DEvelopment EnviRonment. This version (Spyder 3.0) represents almost two years of development and brings important characteristics that we would like to introduce to the SciPy community. Among them: the ability to create and install third-party plugins, improved project support, syntax highlighting and code completion for all programming languages supported by Pygments, a new file switcher (similar to the one present in Sublime Text), code folding for the Editor, Emacs keybindings for the entire application, a graphical NumPy array builder, and a fully dark theme for the interface.

Rating: Everyone
Viewed 11 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Tutorial materials for the Time Series Analysis tutorial including notebooks may be found here: https://github.com/AileenNielsen/Time...

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list....

Rating: Everyone
Viewed 7 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Developer #lifehacks for the Jupyter Data Scientist.

Materials for this tutorial may be found here: https://github.com/drivendata/data-sc...

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 6 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Tutorial materials may be found here: https://github.com/bokeh/bokeh-notebooks

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list....

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Data clustering is a powerful tool for data analysis. It can be particularly useful in exploratory data analysis, helping to summarize and give intuition about a dataset. Despite its power, clustering is used for this task far less frequently than it could be. A plethora of clustering algorithms exists, and we will provide a survey of some of the more popular options, discussing their strengths and weaknesses, particularly with regard to exploratory data analysis. Our focus, however, is on a relatively new algorithm that appears to be the best equipped to meet the needs of exploratory data analysis: HDBSCAN* has the strengths of density-based algorithms, has a small, robust set of parameters, and with a suitable implementation can be made highly scalable to large datasets. We will discuss how the algorithm works from a few different perspectives and explain the techniques used for a high-performance implementation. Finally, we'll discuss ways to extend the algorithm, drawing on ideas from topological data analysis.
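
In practice the library follows the scikit-learn conventions; a minimal sketch on synthetic data (not from the talk):

```python
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=5, random_state=42)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)   # the one parameter that matters most
labels = clusterer.fit_predict(X)                  # -1 marks noise points
print(np.unique(labels))
```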

More info on HDBSCAN here: https://github.com/lmcinnes/hdbscan.

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 8 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Materials to follow along with the tutorial may be found at http://www.labri.fr/perso/nrougier/te... and at github.com/rougier/matplotlib-tutorial

After reviewing the main concepts of scientific figure creation (based on the "Ten Simple Rules for Better Figures" article), we will explore the matplotlib library specifically, which provides many different types of high-quality figures with only a few lines of code. We'll go through the creation of a simple but carefully crafted figure and see the main concepts of the library along the way. Then we'll go through an animation example showing the last 50 earthquakes on the planet, and we'll finish the tutorial with a set of exercises covering the main types of plots. Last, we'll have a look at available resources for advanced uses.
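
In the spirit of the "carefully crafted figure" exercise, a small generic sketch (not the tutorial's own code):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 256)
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, np.sin(x), linewidth=2.0, label='sine')
ax.plot(x, np.cos(x), linewidth=2.0, label='cosine')
ax.spines['top'].set_visible(False)     # trim chart junk, per the "ten rules"
ax.spines['right'].set_visible(False)
ax.legend(frameon=False)
plt.show()
```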

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

yt is a Python package designed for domain-specific inquiry of volumetric data, licensed under the BSD license and available at yt-project.org. Utilizing numerous components of the scientific Python ecosystem, it is able to ingest data from numerous sources in domains such as astrophysics, nuclear engineering, weather and climate, oceanography, and seismology. Building on top of a parallelized framework for data selection, analysis, processing, and visualization, inquiry can be driven based on relevant physical quantities rather than those specific to data formats. I will describe recent advances in the yt 3.0 series, including support for particle, octree, patch, and unstructured mesh datasets; interactive and batch volume rendering using both software and OpenGL backends; semantically rich ontologies of fields, derived quantities, and affiliated units (powered by sympy); user-defined kernel estimates for density; support for visualization in non-Cartesian domains; and a flexible chunking system for data IO. I will describe some of the non-astrophysics domains that yt has been applied to, and the infrastructure implemented to support that. Finally, I will describe the community-driven approach taken to designing, developing, and implementing new features, and describe some of the challenges this has presented for scientific software developers.
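
To show what "inquiry by physical quantity" looks like, a minimal sketch (assuming the standard IsolatedGalaxy sample dataset from yt-project.org has been downloaded):

```python
import yt

ds = yt.load("IsolatedGalaxy/galaxy0030/galaxy0030")
ad = ds.all_data()
print(ad.quantities.total_mass())            # query physics, not file layout
slc = yt.SlicePlot(ds, "z", ("gas", "density"))
slc.save()                                   # write a publication-quality slice image
```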

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Materials for this tutorial are found here: https://www.eiseverywhere.com/file_up...

SymPy is a pure Python library for symbolic mathematics. It aims to become a full-featured computer algebra system (CAS) while keeping the code as simple as possible in order to be comprehensible and easily extensible. SymPy is written entirely in Python and does not require any external libraries.

This tutorial is intended to cover the basics as well as touch on more advanced topics. We will start by showing how to install and configure this Python module. Then we will proceed to the basics of constructing and manipulating mathematical expressions in SymPy. We will also discuss the most common issues and differences from other computer algebra systems, and how to deal with them. In the remaining part of this tutorial we will show how to solve mathematical problems with SymPy.
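
A few of the basics in question, as a generic sketch (not the tutorial's materials):

```python
import sympy as sp

x = sp.symbols('x')

print(sp.expand((x + 1)**4))                           # polynomial expansion
print(sp.factor(x**2 - 2*x + 1))                       # (x - 1)**2
print(sp.limit(sp.sin(x) / x, x, 0))                   # 1
print(sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo))) # sqrt(pi)
print(sp.solveset(x**2 - 2, x))                        # {-sqrt(2), sqrt(2)}
```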

This knowledge should be enough for attendees to start using SymPy for solving mathematical problems and hacking SymPy's internals (though hacking core modules may require additional expertise).

Rating: Everyone
Viewed 5 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

HyperSpy is an open-source Python library that aims to ease the task of visualizing, analyzing, accessing, and storing multi-dimensional signals. Such data structures arise in many scientific and engineering fields, from astronomy to electron microscopy. In addition, we will present common non-linear optimization problems and our suggested new solutions.

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Both Python and R boast large data science communities. Each has developed a fantastic collection of packages, from reading/writing data to plotting and visualization. Unfortunately, some tools are only available in one language or the other, but not both. Python and R provide relatively simple mechanisms for interacting with C, C++, and Fortran, and many tools take advantage of this interoperability. While not a simple matter, developing data science tools in these low-level languages and providing Python and R wrappers allows code reuse between languages, quite apart from any speed benefits. In this talk we will discuss strategies and lessons learned from porting existing packages from R to Python and writing cross-language tools from scratch.
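
As a tiny illustration of how low the barrier to C is on the Python side, a standard-library-only sketch (the library name assumes a typical Linux system):

```python
import ctypes

libm = ctypes.CDLL("libm.so.6")          # the C math library
libm.cos.restype = ctypes.c_double       # declare the C signature
libm.cos.argtypes = [ctypes.c_double]
print(libm.cos(0.0))                     # 1.0
```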

Rating: Everyone
Viewed 8 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

This tutorial aims to provide an introduction to machine learning and scikit-learn "from the ground up". We will start with core concepts of machine learning, some example uses of machine learning, and how to implement them using scikit-learn. Going in detail through the characteristics of several methods, we will discuss how to pick an algorithm for your application, how to set its parameters, and how to evaluate performance.
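
The core loop of the tutorial (fit, tune, evaluate) in a generic sketch using recent scikit-learn releases (the 2016 materials used the older sklearn.cross_validation module):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# pick parameters by cross-validated grid search, then evaluate on held-out data
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```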

Tutorial materials found here: https://github.com/amueller/scipy-201...

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 6 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Materials to follow along with the tutorial are available at: https://github.com/amueller/scipy-201...

This tutorial aims to provide an introduction to machine learning and scikit-learn "from the ground up". We will start with core concepts of machine learning, some example uses of machine learning, and how to implement them using scikit-learn. Going in detail through the characteristics of several methods, we will discuss how to pick an algorithm for your application, how to set its parameters, and how to evaluate performance.

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Students will walk away with a high-level understanding of both parallel problems and how to reason about parallel computing frameworks. They will also walk away with hands-on experience using a variety of frameworks easily accessible from Python.

For the first half, we will cover basic ideas and common patterns encountered when analyzing large data sets in parallel. We start by diving into a sequence of examples that require increasingly complex tools. Starting from the most basic parallel API, map, we will cover general asynchronous programming with Futures, high-level APIs for large data sets such as Spark RDDs and Dask collections, and streaming patterns. For the second half, we focus on traits of particular parallel frameworks, including strategies for picking the right tool for your job. We will finish with some common challenges in parallel analysis, such as debugging parallel code when it goes wrong, as well as deployment and setup strategies.

Part one: We dive into common problems with a variety of tools

1. Parallel Map
2. Asynchronous Futures
3. High Level Datasets
4. Streaming

Part two: We analyze common traits of parallel computing systems.

1. Processes and Threads. The GIL, inter-worker communication, and contention
2. Latency and overhead. Batching, profiling.
3. Communication mechanisms. Sockets, MPI, Disk, IPC.
4. Stuff that gets in the way. Serialization, Native v. JVM, Setup, Resource Managers, Sample Configurations
5. Debugging async and parallel code / Historical perspective

We intend to cover the following tools: concurrent.futures, multiprocessing/threading, joblib, IPython parallel, Dask, Spark
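
As a taste of the first two patterns (parallel map and asynchronous futures) using only the standard library:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def simulate(seed):
    # stand-in for an expensive, independent computation
    return sum(i * seed for i in range(10**6))

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(simulate, range(8)))    # 1. parallel map
        futures = [pool.submit(simulate, s) for s in range(8)]
        for fut in as_completed(futures):               # 2. asynchronous futures
            print(fut.result())
```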

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Brian Granger is an Associate Professor of Physics at Cal Poly State University in San Luis Obispo, CA. He has a background in theoretical physics, with a Ph.D from the University of Colorado. His current research interests include quantum computing, parallel and distributed computing and interactive computing environments for scientific computing and data science. He is a leader of the IPython project, co-founder of Project Jupyter and is an active contributor to a number of other open source projects focused on data science in Python. He is a board member of the NumFocus Foundation and a fellow at Cal Poly’s Center for Innovation and Entrepreneurship. He is @ellisonbg on Twitter and GitHub.

Announcement of Altair. Altair is a declarative statistical visualization library for Python, developed by Brian Granger and Jake Vanderplas in close collaboration with the UW Interactive Data Lab.

With Altair, you can spend more time understanding your data and its meaning. Altair's API is simple, friendly and consistent and built on top of the powerful Vega-Lite JSON specification. This elegant simplicity produces beautiful and effective visualizations with a minimal amount of code.
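
A minimal sketch in Altair's current API (the original 2016 release imported Chart directly; vega_datasets supplies the example data):

```python
import altair as alt
from vega_datasets import data

cars = data.cars()

# declare what to plot, not how to draw it
alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
)
```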

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Notebooks and other materials for this tutorial may be found here:
https://github.com/jonathanrocher/pan...

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Project Jupyter provides building blocks for interactive and exploratory computing. These building blocks make science and data science reproducible across over 40 programming languages (Python, Julia, R, etc.). Central to the project is the Jupyter Notebook, a web-based interactive computing platform that allows users to author data- and code-driven narratives - computational narratives - that combine live code, equations, narrative text, visualizations, interactive dashboards and other media.

While the Jupyter Notebook has proved to be an incredibly productive way of working with code and data interactively, it is helpful to decompose notebooks into more primitive building blocks: kernels for code execution, input areas for typing code, markdown cells for composing narrative content, output areas for showing results, terminals, etc. The fundamental idea of JupyterLab is to offer a user interface that allows users to assemble these building blocks in different ways to support interactive workflows that include, but go far beyond, Jupyter Notebooks.

JupyterLab accomplishes this by providing a modular and extensible user interface that exposes these building blocks in the context of a powerful work space. Users can arrange multiple notebooks, text editors, terminals, output areas, etc. on a single page with multiple panels, tabs, splitters, and collapsible sidebars with a file browser, command palette and integrated help system. The codebase and UI of JupyterLab is based on a flexible plugin system that makes it easy to extend with new components.

In this talk, we will demonstrate the JupyterLab interface, its codebase, and describe how it fits within the overall roadmap of the project.

Slides for this talk: http://archive.ipython.org/media/SciP...
Blog post: http://blog.jupyter.org/2016/07/14/ju...
Installation command (correcting a typo from the talk): conda install -c conda-forge jupyterlab

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list....

Rating: Everyone
Viewed 0 times
Recorded at: July 13, 2016
Date Posted: December 5, 2016

Tutorial materials may be viewed at http://darribas.org/gds_scipy16/
Installation instructions are found at http://darribas.org/gds_scipy16/conte...

This two-part tutorial will first provide participants with a gentle introduction to Python for geospatial analysis, and an introduction to PySAL 1.11 and the related ecosystem of libraries that facilitate common tasks for geographic data scientists.

The first part will cover munging geo-data and exploring relations over space. This includes importing data in different formats (e.g. shapefile, GeoJSON), visualizing, combining, and tidying them up for analysis, using libraries such as `pandas`, `geopandas`, `PySAL`, and `rasterio`. The second part will provide a gentle overview of several techniques for extracting geospatial insight from data, including spatial clustering, spatial regression, and point pattern analysis, using libraries such as `PySAL`, `scikit-learn`, and `clusterpy`.
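
A minimal sketch of the munging half with `geopandas` (the file name is hypothetical):

```python
import geopandas as gpd

tracts = gpd.read_file("tracts.geojson")        # also reads shapefiles
tracts = tracts.to_crs(epsg=3857)               # reproject to web mercator
tracts["area_km2"] = tracts.geometry.area / 1e6
tracts.plot(column="area_km2", legend=True)     # quick choropleth
```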

Rating: Everyone
Viewed 3 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

Jupyter notebooks are JSON documents containing a combination of code, prose, and output. These outputs may be rich media, such as HTML or images. The use of JSON and including output can present challenges when working with version control systems and code review. The JSON structure significantly impedes the readability of diffs, and simple line-based merge tools can produce invalid results. nbdime aims to provide diff and merge tools specifically for notebooks. For diffs, nbdime will show rendered diffs of notebooks, so that the content can be compared efficiently, rather than the raw JSON. Merges performed with nbdime will guarantee a valid notebook as a result, even in the event of conflicts. nbdime integrates with existing tools, such as git, so you shouldn't need to change how you work. We hope to make the experience of collaborating on notebooks less painful and more fun.

Rating: Everyone
Viewed 9 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

Using Jupyter notebooks and scikit-learn, you’ll predict whether a movie is likely to win an Oscar or be a box office hit. I’ll walk through the most important steps of creating an effective dataset using information that you find on the Internet: asking a question your data can answer, writing a web scraper, and answering those questions using nothing but Python libraries and data from the Internet. To illustrate how these steps fit together, I walk through building a dataset from IMDB data and use it to predict [what makes a winning Oscar movie](http://oscarpredictor.github.io/).
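
A hedged sketch of the scraping step (the URL and CSS selector are hypothetical; IMDB's markup changes and its terms of use apply):

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://www.imdb.com/list/example-oscar-list/")  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")
titles = [a.get_text(strip=True) for a in soup.select("h3 a")]       # placeholder selector
print(titles[:10])
```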

Rating: Everyone
Viewed 4 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

Rating: Everyone
Viewed 2 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

This talk showcases SymPy’s code generation capabilities. SymPy is a Python library that enables symbolic manipulation of mathematical expressions. Code generation is useful across a wide variety of domains. SymPy supports generating code for C, Fortran, Matlab/Octave, Python, Julia, Javascript, LLVM, Rust, and Theano. The code generation system is easily extensible to any language. Code generation supports a wide variety of expressions, including matrices. Code generation allows users to deal only with the high level mathematics of a problem, avoids mathematical errors and typos, makes it possible to deal with expressions that would otherwise be too large to write by hand, and opens possibilities to perform mathematical optimizations of expressions. SymPy’s code generation is used by libraries such as PyDy, chemreac, and sympybotics.
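
A minimal sketch of the codegen machinery (a generic example, not one from the talk):

```python
from sympy import symbols, sin
from sympy.utilities.codegen import codegen

x, y = symbols('x y')
# generate a complete, compilable C function (plus header) for f(x, y)
[(c_name, c_code), (h_name, c_header)] = codegen(('f', sin(x) * y**2), 'C')
print(c_code)
```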

Rating: Everyone
Viewed 3 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

How do we communicate fundamental concepts in a reproducible, actionable form? How do we put numerical simulation tools in the hands of undergraduate students? These are questions we have been exploring in the development of GeoSci.xyz, a web-based resource in geophysics that leverages the geophysical software package SimPEG, Sphinx documentation, Jupyter notebooks and Binders to make examples and explanations that are reproducible and interactive.

Slides are available here: https://docs.google.com/presentation/...

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list....

Rating: Everyone
Viewed 7 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

In 2000 David Scherer created Visual, a python package with a simple interface for drawing 3D objects to the screen. Visual abstracted away calls to OpenGL vertex drawing, textures, and transformations, and allowed primitive geometric objects to be placed on the screen in an intuitive manner. Visual was subsequently adopted by many researchers and university instructors, primarily in physics, to visualize scientific results and simulation assignments. The original Visual module used C++ to make OpenGL calls, and compatibility with Linux was never implemented. Python and OpenGL have come a long way in the past sixteen years, and now Visual can be made cross-platform by removing the C++ backend. I will present my attempt to re-implement Visual using pyglet, as well as demonstrate its API and usage in education and research visualization.
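
For readers unfamiliar with Visual, its API looked roughly like this (a from-memory sketch of the classic interface, not the new pyglet-based implementation):

```python
from visual import sphere, vector, color, rate

ball = sphere(pos=vector(0, 2, 0), radius=0.5, color=color.red)
ball.velocity = vector(0, 0, 0)
dt = 0.01
while ball.pos.y > ball.radius:
    rate(100)                                  # cap the animation at 100 updates/s
    ball.pos = ball.pos + ball.velocity * dt   # objects re-render automatically
    ball.velocity.y -= 9.8 * dt
```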

Rating: Everyone
Viewed 7 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

Data exploration typically involves code for analyzing and transforming a dataset together with separate code used for visualization. This back and forth between tools creates a serious bottleneck to getting a real grasp of the data. In this talk we will demonstrate how the HoloViews library lets you wrap datasets of any complexity and size, making the data instantly visualizable in Jupyter Notebooks to allow interactive exploration via widgets and various plotting backends, including matplotlib and bokeh.
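
A minimal sketch in HoloViews' current API (the widget appears automatically in a notebook):

```python
import numpy as np
import holoviews as hv
hv.extension('bokeh')   # or 'matplotlib'

xs = np.linspace(0, 10, 200)
# a HoloMap keyed on frequency gets a slider widget for free
curves = hv.HoloMap({f: hv.Curve((xs, np.sin(f * xs)))
                     for f in [1, 2, 3]}, kdims='frequency')
curves
```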

Rating: Everyone
Viewed 2 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

It's surprisingly easy to modify the CPython interpreter in some very useful ways. This talk will cover a simple assembly technique that allows live, hot-patched, `pip`-installable modifications to the CPython interpreter, along with five case studies of such modifications and how they can be used to extend the capabilities of Python, parsimoniously model abstract problems, and, in some cases, "unravel" complex APIs. The talk will cover:
- adding ast-literals to CPython, and how they can be used to model first-class-computation frameworks like `numexpr`
- decoupling evaluation scope from binding scope, and how this can add static correctness guarantees and user-defined literals
- embedding CPython interpreters within themselves, and how this can be used for same-process multiprocessing
- adding read watches, and how these can be used for lazy mechanisms, indirection mechanisms, and improved debugging
- adding the print statement back to CPython 3, and how this can be used to stop people from complaining about silly things

Rating: Everyone
Viewed 3 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

Rating: Everyone
Viewed 5 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

Multi-processing parallelism in Python might be unacceptable due to cache-inefficiency and memory overhead. On the other hand, multi-threaded parallelism with Python suffers from the GIL, but when it comes to numeric computations, most of the time is spent in native code where the GIL can easily be released. This is why modules such as Dask and Numba use multi-threading to greatly speed up computations. But when they are used together in a nested way, e.g. when a Dask task calls Numba's threaded ufunc, there end up being more active software threads than available hardware resources. This situation is called over-subscription, and it leads to inefficient execution due to frequent context switches, thread migration, and broken cache-efficiency, and finally to load imbalance, where some threads have finished their work while others are still running, stalling overall progress.

Another example is NumPy/SciPy when accelerated with the Intel Math Kernel Library (MKL), like the builds shipped as part of the Intel Distribution for Python. MKL is usually threaded using OpenMP, which is known for not co-existing easily even with itself. In particular, OpenMP threads keep spin-waiting after the work is done, which is usually necessary to reduce work distribution overhead for the next possible parallel region. But this plays badly with another thread pool, because while the OpenMP workers keep consuming CPU time spin-waiting, other parallel work like Numba's ufunc cannot start until the OpenMP threads stop spinning or are preempted by the OS.

And the worst case is also connected to usage of OpenMP when a program starts multiple parallel tasks and each of these tasks ends up executing an OpenMP parallel region. This is quadratic over-subscription which ruins multi-threaded performance.

Our approach to solving these co-existence problems is to share one thread pool among all the necessary modules and native libraries, so that a single task scheduler takes care of this composability issue. The Intel Threading Building Blocks (TBB) library works as that task scheduler in our solution. TBB is a widespread and recognized C++ library for enabling multi-core parallelism that was designed for composability, nested-parallelism support, and avoidance of over-subscription from its early days. Thus we implemented a Python module that integrates TBB with Python; it is already available as part of the Intel Distribution for Python and on the Intel channel for conda users. I will show how to enable it for NumPy/SciPy, Dask, Numba, Joblib, and other threaded modules, and demonstrate the performance benefits it brings.
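
To make the problem concrete, here is a minimal sketch of nested over-subscription: each Python thread calls into threaded BLAS/LAPACK, multiplying the worker counts (the `python -m tbb` invocation is how the talk's module is enabled, as I recall; the module name may vary by distribution):

```python
from multiprocessing.pool import ThreadPool
import numpy as np

def task(n):
    a = np.random.rand(1500, 1500)
    # LAPACK spawns its own threads inside this call
    return np.linalg.eigvalsh(a @ a.T).max()

pool = ThreadPool(8)   # 8 Python threads x N BLAS threads each = over-subscription
print(max(pool.map(task, range(8))))
# with the Intel TBB module installed, `python -m tbb script.py` routes both
# levels of parallelism through one composable scheduler
```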

Rating: Everyone
Viewed 5 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

This talk will focus on the use of Python, scikit-learn, NumPy, SciPy, and pandas in data science and machine learning, with a focus on cyber anomaly detection. The presentation will show how Python facilitates all stages of such analysis, including data gathering, analytics, and scaling to large data sets.
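
A hedged sketch of one common anomaly-detection approach with scikit-learn (the features below are random stand-ins for network-flow measurements, not the talk's data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)                         # "normal" traffic features
X = np.vstack([X, 6 + rng.randn(10, 5)])       # inject a few outliers

clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)                        # -1 flags anomalies
print((labels == -1).sum(), "anomalies flagged")
```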

Rating: Everyone
Viewed 3 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

Geophysical inversions are tools for constructing models of the subsurface (images) given a finite amount of data. SimPEG (http://simpeg.xyz) is an effort to synthesize geophysical forward and inverse methodologies into a consistent framework. We will show seven geophysical methods based around a diamond exploration case study, combining the results to drive a more informed decision. Slides may be found here: https://docs.google.com/presentation/...

Rating: Everyone
Viewed 10 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

In this talk, we present an approach for combining natural language processing with machine learning in order to explore the relationship between free text self-descriptions and demographics in OkCupid profile data. We discuss feature representation, clustering and topic modeling approaches, as well as feature selection and modeling strategies. We find that we can predict a user's demographic makeup based on their user essays, and we conclude by sharing some unexpected insights into deception.
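For flavor, a toy sketch of the essay-clustering idea (the talk's actual pipeline and parameters are in the linked repositories):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

essays = ["i love hiking and live music",
          "looking for someone to cook and travel with",
          "software engineer who hikes on weekends"]
X = TfidfVectorizer(stop_words='english').fit_transform(essays)  # sparse tf-idf
km = KMeans(n_clusters=2, random_state=0).fit(X)
print(km.labels_)
```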

Additional talk materials here:
https://github.com/juanshishido/okcupid
https://github.com/juanshishido/scipy...

Rating: Everyone
Viewed 3 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

At the Minnesota Supercomputing Institute we are exploring ways to provide the immediacy and flexibility of interactive computing within the batch-scheduled, tightly controlled world of traditional cluster supercomputing. As Jupyter Notebook has gained in popularity, the steps needed to use it within such an environment have proven to be a barrier to entry even as increasingly powerful Python tools have developed to take advantage of large computational resources. JupyterHub to the rescue! Except out of the box, it doesn't know anything about resource types, job submission, and so on. We developed BatchSpawner and friends as a general JupyterHub backend for batch-scheduled environments. In this talk I will walk through how we have deployed JupyterHub to provide a user-friendly gateway to interactive supercomputing.
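
To show how little glue is required on the admin side, a configuration sketch based on the batchspawner README (resource values are illustrative):

```python
# jupyterhub_config.py
c.JupyterHub.spawner_class = 'batchspawner.TorqueSpawner'
c.TorqueSpawner.req_nprocs = '2'          # resources requested per user session
c.TorqueSpawner.req_memory = '4gb'
c.TorqueSpawner.req_runtime = '12:00:00'  # walltime for the batch job
```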

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 5 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

Tutorial materials available at https://github.com/ericmjl/Network-An...

Graph analytics are an exciting new frontier in data science, and this tutorial will help you get up to speed on the basics.

In this tutorial, I will show you how you can model your data as a network and use graph analysis methods to gain a rich understanding of it. By the end of the tutorial, you will be equipped to think through network problems and have enough familiarity with the networkx API to hack at them on your own. You will also gain broad exposure to different examples where network properties (statistics & structures) can be useful for gaining insight into data problems.
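
A small taste of the networkx API (a generic sketch, not the tutorial's notebooks):

```python
import networkx as nx

G = nx.erdos_renyi_graph(n=100, p=0.05, seed=42)     # a random graph to play with
print(nx.number_connected_components(G))
bc = nx.betweenness_centrality(G)                    # a node-importance statistic
hubs = sorted(bc, key=bc.get, reverse=True)[:5]
print("most central nodes:", hubs)
```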

Rating: Everyone
Viewed 9 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

Most scientists carefully collect data and select data resources. In a perfect world, we would have pristine, complete datasets. Yet, we are frequently challenged by incomplete and missing data. We are often taught to "ignore" missing data. In practice, however, ignoring the wrong types of data may build biases into our datasets, invalidating our conclusions. Here, we discuss three types of missing data (data missing completely at random, missing at random, and missing not at random) and heuristics for identifying and dealing with each type. Then we delve into an example, where we impute missing data for a simulator that utilizes reinforcement learning to predict effective HIV treatments. When we finish, you will know how to identify each of the three types of missing data and how to deal with each in your own projects.
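
Two of the simplest strategies discussed (deletion and mean imputation) in a generic pandas sketch; which one is valid depends on the missingness type described above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cd4": [350, np.nan, 410, 290],
                   "viral_load": [1e4, 2e4, np.nan, 5e3]})
print(df.isna().sum())                        # where are values missing?
print(df.dropna())                            # listwise deletion: only safe if MCAR
print(df.fillna(df.mean(numeric_only=True)))  # mean imputation: one simple strategy
```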

Rating: Everyone
Viewed 4 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

The goal of [SymEngine](https://github.com/symengine/symengine) is to be the fastest C++ symbolic manipulation library (open source or commercial), compatible with SymPy, that can be used from many languages (Python, Ruby, Julia, ...). We will present the current status of development, how things are implemented internally, why we chose C++, benchmarks, and examples of usage from Python (SymPy and Sage), Ruby, and Julia.
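
From Python, the bindings mirror basic SymPy names; a minimal sketch (assuming the symengine Python package is installed):

```python
from symengine import symbols, expand

x, y = symbols("x y")
expr = (x + y) ** 10
print(expand(expr))   # same result as SymPy, computed by the C++ core
```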

Rating: Everyone
Viewed 7 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

GT-Py is a newly developed just-in-time compiler that can offload NumPy code to hardware accelerators with relatively little programming effort. It lets programmers add pragmas to a Python program to specify what needs to be offloaded, without writing the actual offloading code. By generating OpenCL code, GT-Py can run on a variety of accelerators, including GPUs from different vendors, multicore CPUs, and potentially FPGAs. Experimental results demonstrate that significant performance gains, as much as 9000x over interpreted Python execution, can be obtained by adding only a couple of pragmas to the NumPy program. GT-Py supports both Python 2.7 and Python 3.4+, and will be available for public use free of charge.

Rating: Everyone
Viewed 1 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

In this talk, we show how Python, Numba, and Dask can be used for GPU programming that easily scales from your workstation to a cluster, and can be controlled entirely from a Jupyter notebook. We will describe how the Numba JIT compiler can be used to create and compile GPU calculations entirely from the Python interpreter, and how the Dask task scheduling system can be used to farm these calculations out to a GPU cluster. Using an image processing example application, we will show how these two projects make it easy to iterate and experiment with algorithms on large data sets. Finally, we will conclude with tips and tricks for working with GPUs and distributed computing.
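
A minimal sketch of a GPU ufunc with Numba (requires a CUDA-capable GPU and toolkit; scaling out with Dask's Client is beyond this snippet):

```python
import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32)'], target='cuda')
def rel_diff(a, b):
    # compiled to a CUDA kernel by Numba, called like a NumPy ufunc
    return (a - b) / (a + b)

x = np.arange(1, 10**6, dtype=np.float32)
y = x + 1.0
print(rel_diff(x, y)[:5])   # computed on the GPU, returned as a NumPy array
```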

Rating: Everyone
Viewed 1 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

Slides available at http://bit.ly/berkeley-ds-scipy-2016

At the University of California, Berkeley, an exciting new Data Science Education Program (http://databears.berkeley.edu/) is running full steam ahead. This presentation will provide an overview of the program, focusing on the Python-based Foundations of Data Science (DATA 8, http://data8.org/) course aimed at any and all freshmen. This material will be of interest to folks thinking about data science education, using Jupyter notebooks in the classroom, and/or deploying and scaling JupyterHub. In this presentation, we'll highlight student-facing content and provide an overview of our JupyterHub deployment. All these materials are publicly available on GitHub (https://github.com/data-8/).

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 1 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

The Mission Analysis, Operations, and Navigation Toolkit Environment (MONTE) is the Jet Propulsion Laboratory's (JPL) signature astrodynamic computing platform. It was built to support JPL's deep space exploration program, and has been used to fly robotic spacecraft to Mars, Jupiter, Saturn, Ceres, and many solar system small bodies. At its core, MONTE consists of low-level astrodynamic libraries that are written in C++ and presented to the end user as an importable Python language module. These libraries form the basis on which Python-language applications are built for specific astrodynamic applications, like trajectory design and optimization, orbit determination, flight path control, and more. This talk gives a brief history of the project, shows some examples of MONTE in action, and relates the stories of its greatest successes.

Rating: Everyone
Viewed 1 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

Pip, wheels, and setuptools are the standard tools for installing, distributing, and building Python packages -- which means that if you're a user or package author then you're probably using them at least some of the time, even though when it comes to handling scientific packages, they've traditionally been a major source of pain. Fortunately, things have been getting better! In this talk, I'll describe how members of the scientific Python community have been working with upstream Python to solve some of the worst issues, and show you how to build and distribute binary wheels for Linux users, build Windows packages without MSVC, use wheels to handle dependencies on non-Python libraries like BLAS or libhdf5, plus give the latest updates on our effort to drive a stake through the heart of setup.py files and replace them with something better.

Rating: Everyone
Viewed 0 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

GR is a plotting package for the creation of two- and three-dimensional graphics in Python or Julia, offering unique plotting functions to visualize static or dynamic data with minimal overhead. In addition, GR can be used as a backend for other plotting interfaces or wrappers, in particular when being used in interactive notebooks. This presentation shows how visualization applications with special performance requirements can be designed on the basis of simple and easy-to-use functions as known from the MATLAB plotting library. Using quick practical examples, this talk is going to present the special features and capabilities provided by the GR framework both as a self-contained graphics library or as a fast backend for other packages. Slides may be found here: http://pgi-jcns.fz-juelich.de/pub/doc...

Rating: Everyone
Viewed 0 times
Recorded at: July 14, 2016
Date Posted: December 5, 2016

Data-parallel programming plays a significant role in HPC, both for the numerous applications that can leverage it and for the many parallel architectures that provide high performance for it. Literally, high performance computing means measuring, understanding, and improving performance as part of a scientific process, in which Python can be immensely helpful. Two key ingredients for this are just-in-time compilation, which enables run-time code generation, and transformation-based programming. After briefly exploring available programming models and abstractions, I will introduce and demonstrate PyOpenCL and Loopy, two complementary tools that help with all parts of this process. Unlocking good performance means experimenting with different algorithms, data layouts, and approaches to parallelization. Conventionally, each of these requires a near-rewrite of the code under consideration. Loopy, by being based on transformations, entirely avoids this problem. Moreover, it separates application concerns from performance concerns, allowing the mathematical objective and its performant implementation to be expressed cleanly and separately. I will close with some examples that demonstrate the effectiveness of the approach.

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...
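
For orientation, a minimal PyOpenCL sketch (device selection is interactive or environment-driven; Loopy's transformation API is a larger topic):

```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cla

ctx = cl.create_some_context()            # pick an OpenCL device
queue = cl.CommandQueue(ctx)

a = cla.to_device(queue, np.random.rand(10**6).astype(np.float32))
b = cla.to_device(queue, np.random.rand(10**6).astype(np.float32))
c = (2 * a + b).get()                     # kernel generated and run on the device
print(c[:5])
```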

Rating: Everyone
Viewed 0 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

MDAnalysis (http://mdanalysis.org) is an object-oriented library for structural and temporal analysis of molecular dynamics (MD) simulation trajectories and individual protein structures. MD simulations of biological molecules have become an important tool for elucidating the relationship between molecular structure and physiological function. Simulations are performed with highly optimized software packages on HPC resources, but most codes generate output trajectories in their own formats, so the development of new trajectory analysis algorithms is confined to specific user communities, and widespread adoption and further development are delayed.

The MDAnalysis library addresses this problem by abstracting access to the raw simulation data and presenting a uniform object-oriented Python interface to the user. It thus enables users to rapidly write code that is portable and immediately usable in virtually all biomolecular simulation communities. The user interface and modular design work equally well in complex scripted workflows, as foundations for other packages, and for interactive and rapid prototyping work in IPython/Jupyter notebooks, especially together with molecular visualization provided by nglview [1] and time series analysis with pandas [2]. MDAnalysis is written in Python and Cython and uses NumPy arrays for easy interoperability with the wider scientific Python ecosystem. It is widely used and forms the foundation for more specialized biomolecular simulation tools. MDAnalysis is available under the GNU General Public License v2.

[1] https://github.com/arose/nglview
[2] http://pandas.pydata.org/
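
A minimal sketch of the uniform interface (using the sample trajectories shipped with the separate MDAnalysisTests package):

```python
import MDAnalysis as mda
from MDAnalysisTests.datafiles import PSF, DCD   # requires MDAnalysisTests

u = mda.Universe(PSF, DCD)                       # topology + trajectory, any format
calphas = u.select_atoms("protein and name CA")  # CHARMM-style selection language
for ts in u.trajectory:
    print(ts.frame, calphas.radius_of_gyration())
```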

Slides for this talk are available here: https://github.com/MDAnalysis/scipy-2016

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Rating: Everyone
Viewed 0 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

We present the initial alpha release of QIIME 2, a Python 3 framework supporting interactive analysis and visualization of microbiomes on diverse high-performance computing resources; arbitrary interface development and platform integration; and a plugin system with automatic decentralized provenance tracking.

Rating: Everyone
Viewed 0 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

We harnessed the power of three different computing platforms, Spark, Impala, and scientific python, to perform geospatial analysis on mobile phone users. We will discuss data processing techniques for comparing billions of user locations per day with millions of places of interest, easily extractible insights, and methodologies for estimating impacts of treatment on these movement patterns. Our workflow has potential for application for other use cases involving geospatial movement of populations.

Rating: Everyone
Viewed 0 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

Rating: Everyone
Viewed 0 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

SPH (Smoothed Particle Hydrodynamics) is a general purpose technique to numerically compute the solutions to partial differential equations. The method is grid-free and uses particles to discretize the various properties of interest. The method is Lagrangian and particles are moved with the local velocity. The method was originally developed for astrophysical problems (compressible gas-dynamics) but has since been extended to simulate incompressible fluids, solid mechanics, free-surface problems and a variety of other problems.

The SPH method is relatively easy to implement. This has resulted in a large number of schemes and implementations proposed by various researchers. It is often difficult to reproduce published results due to the variety of implementations. While a few standard packages (SPHysics, DualSPHysics, JOSEPHINE, etc.) exist, they are usually tailor-made for particular applications and are not general purpose.

Our group has been developing PySPH (http://pysph.bitbucket.org) for over five years. PySPH is open source and distributed under the new BSD license.
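
To give a flavor of the method (a generic sketch, not PySPH's internal API), properties are interpolated as weighted sums over neighboring particles using a smoothing kernel; the standard 2-D cubic spline kernel used in many SPH schemes can be written as:

    import numpy as np

    def cubic_spline_2d(r, h):
        # Standard 2-D cubic spline kernel W(r, h) with smoothing length h;
        # sigma is the 2-D normalization constant.
        q = r / h
        sigma = 10.0 / (7.0 * np.pi * h ** 2)
        w = np.where(q < 1.0, 1.0 - 1.5 * q ** 2 * (1.0 - 0.5 * q),
                     np.where(q < 2.0, 0.25 * (2.0 - q) ** 3, 0.0))
        return sigma * w

    # A field A at position x is then approximated as
    #   A(x) ~ sum_j (m_j / rho_j) * A_j * cubic_spline_2d(|x - x_j|, h)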

Capture2 thumb
Rating: Everyone
Viewed 0 times
Recorded at: July 14, 2016
Date Posted: December 6, 2016

The FOSSEE (Free Open Source Software for Science and Engineering Education)
project (http://fossee.in) is a large project funded by the Ministry of Human Resource Development (MHRD, http://mhrd.gov.in) of the Government of India. The project was started in 2009 as a pilot and is part of the MHRD's National Mission on Education through ICT (NMEICT). The NMEICT is a $1 billion initiative to improve the quality of education in India, and it has spawned several initiatives. One sterling example is the NPTEL project, which provides content for over 900 graduate and post-graduate courses (400 web-based and 500 video-based) online; these are proving to be extremely useful across the country. Other projects include the Spoken Tutorial project (http://spoken-tutorial.org), which was previously presented at SciPy 2014. FOSSEE is one such project that is the outcome of the NMEICT funding.

The FOSSEE project is based at IIT Bombay, and its goal is to eliminate the use of proprietary tools in the college curriculum. To this end, efforts are focused on training students and teachers to use FOSS tools in their curricular activities. This also requires development effort, either to enhance existing projects or to fill in areas where FOSS tools are lacking. More than 10 PIs are actively involved in various sub-projects. Some of the most active projects are Scilab, Python,
eSim (an EDA tool), OpenFOAM, Osdag (open source design of steel structures), etc. The website (http://fossee.in) has more details on each of these.

In this talk I will discuss the efforts of the Python group
(http://python.fossee.in). The Python group currently works on the following major activities:

- Creating Python Spoken tutorials that allow students to teach themselves Python.

- Supporting the creation of textbook companions. A student picks a standard textbook used in her course and works through all its solved examples in Python, in the form of IPython notebooks that we host. An honorarium is provided to the students. We have more than 350 textbook companions with many more on the way. These can be seen here:
http://tbc-python.fossee.in/ The IPython notebooks can also be edited
online by users.

- We have developed a simple online testing tool that allows an instructor to
set up programming tests online: https://github.com/FOSSEE/online_test

Capture2 thumb
Rating: Everyone
Viewed 6 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

We discuss a means by which item response theory (IRT), originally developed as a psychometric tool for assessing a
person's intellectual or academic ability given their
performance on a standardized test, can be used as a
data quality tool. Assuming that a dataset has an
underlying "ability" to train predictive models (where
the ability is specific to the type of dependent variable
being predicted), we build many models on top
of a variety of datasets to simultaneously assess the
best dataset for a given dependent variable as well
as which cases are the most "difficult" for a dataset
to predict correctly. The product of this work is an
understanding of both which predictions are the "hardest"
to get correct for any dataset, as well as which dataset
is expected to give the best predictions on a new
dependent variable.

The first step in this study is to build a laboratory in which many related models can be trained and validated, reproducibly and
in a self-documenting way. By running many models that
look at related dependent variables, for example, a number
of variables meant to predict different aspects of political
behavior, we can characterize a baseline expected performance
for any new model similar to those already built.
We call this suite of related models a market basket, after the
terminology and methodology used by economists to summarize
the state of a market.

Then, when we investigate new data sources or formats, we
have a well-defined process for determining whether the
new data makes the models better--we re-build our market
basket, and compare the results with the new data to the
results without it (performance, model build time,
data storage constraints) to assess the quality of our
data in a way that is driven by the models and data itself.

An interesting question is how to assess whether a given
dataset or feature is "better" for a given basket of models. An interesting idea comes to us from the field of psychometrics, which uses a set of tools called item response theory to assess exams (such as the SAT and GRE) and use exams to rank students by intellectual or academic ability.

Borrowing the terminology of IRT, we draw the analogy that a dataset is like a student (it has an inherent capability to accomplish certain tasks, like building good models), a single model prediction is a test question, and a full set of test predictions is an exam. IRT parameterizes both the (unknown) student ability and the (also unknown) test question difficulty, and uses the EM algorithm to simultaneously solve for both sets of parameters. This allows a researcher both to know how "smart" a dataset is for solving a given basket of models and to rank-order "exam questions" (model predictions) by difficulty. The result is a single methodology with applications for both data quality and assessing the difficulty of making a given prediction (useful for, e.g., outlier identification).
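
As a minimal sketch of the underlying idea, here is the simple one-parameter Rasch model, fit by gradient ascent rather than the EM procedure the talk describes (all names are illustrative):

    import numpy as np

    def fit_rasch(x, n_iter=500, lr=0.05):
        # x: binary outcome matrix; rows are "students" (datasets),
        # columns are "questions" (individual predictions).
        n_students, n_items = x.shape
        theta = np.zeros(n_students)    # latent abilities
        b = np.zeros(n_items)           # latent difficulties
        for _ in range(n_iter):
            # Rasch model: P(correct) = sigmoid(ability - difficulty).
            p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
            resid = x - p
            theta += lr * resid.sum(axis=1)   # dlogL/dtheta_i = sum_j (x_ij - p_ij)
            b -= lr * resid.sum(axis=0)       # dlogL/db_j = -sum_i (x_ij - p_ij)
            b -= b.mean()                     # pin the scale; Rasch is only identified up to a shift
        return theta, b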

Capture2 thumb
Rating: Everyone
Viewed 3 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

DyND is a library for dynamic, multidimensional arrays, with Python and C++ as its initial first-class targets. Representing dynamic function calls is one of its cornerstones, and it is the part of the library that has gone through the most design iterations. In this talk, we will cover both the high-level and low-level details of the DyND callable, the object which encapsulates function calls.

Capture2 thumb
Rating: Everyone
Viewed 3 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Capture2 thumb
Rating: Everyone
Viewed 3 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

We address the cybersecurity problem of supply chain risk management in open source software: how does one detect high-risk components in a deployed software system that includes many open source components? As a complement to software assurance approaches based on static source code analysis, we propose a technique based on an analysis of the entire open source ecosystem, inclusive of its technical products and contributor activity. We show how dependency topology, community activity, and exogenous vulnerability and exposure information can be integrated to detect high-risk "hot spots" requiring additional investment. We demonstrate this technique using the Python dependency topology extracted from PyPI and data from GitHub. We will discuss how our analysis prototype has been implemented with SciPy tools.
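
As a hedged illustration of the idea (not the authors' exact prototype): with the dependency graph in hand, flagging hot spots can be as simple as combining a centrality measure with per-package vulnerability counts (the graph and counts below are toy data):

    import networkx as nx

    # Toy dependency graph: an edge A -> B means package A depends on B.
    g = nx.DiGraph([("app", "requests"), ("requests", "urllib3"),
                    ("app", "numpy"), ("pandas", "numpy")])

    # Packages that many others (transitively) depend on score high.
    centrality = nx.pagerank(g.reverse())

    # Hypothetical per-package vulnerability counts (e.g., from CVE feeds).
    vulns = {"urllib3": 2, "requests": 1, "numpy": 0, "pandas": 0, "app": 0}

    risk = {pkg: centrality[pkg] * (1 + vulns.get(pkg, 0)) for pkg in g}
    print(sorted(risk, key=risk.get, reverse=True)[:3])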

Capture2 thumb
Rating: Everyone
Viewed 0 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Jupyter notebooks and the Python ecosystem provide a unique opportunity for interactive, web-based teaching of content that has not traditionally leveraged scientific computing resources. We discuss the design and implementation of a new biological signal processing course at Harvard, ES155, which fuses wearable technology and cloud-based data analysis. ES155 bridges the gap that has traditionally existed between Electrical Engineering and Computer Science education, in a framework that we term "Labs in the Wild". In the process of designing the course, we have had to solve the problem of serving Jupyter notebooks on the cloud reliably using AWS EC2 instances. This is a challenging problem because a successful approach must be scalable, cost-effective, reliable, and address the privacy concerns associated with cloud-based technologies. We describe our system in this talk, perform a live demo of how students in our class interact with it, and give examples of ingenious final projects put together by students. Being cloud-based, our system lowers the barrier of entry for students to begin using Python for scientific computing.

Capture2 thumb
Rating: Everyone
Viewed 5 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

In science the file system often serves as a de facto database, with directory trees being the zeroth-order scientific data structure. But it can be tedious and error prone to work directly with the file system to retrieve and store heterogeneous data sets. datreant makes working with directory structures and files Pythonic with Treants: specially marked directories with distinguishing characteristics that can be discovered, queried, and filtered. Treants can be manipulated individually and in aggregate, with mechanisms for granular access to the directories and files in their trees. Disparate data sets stored in any format (CSV, HDF5, NetCDF, Feather, etc.) scattered throughout a file system can thus be manipulated as meta-data sets of Treants. datreant is modular and extensible by design to allow specialized applications to be built on top of it, with MDSynthesis as an example for working with molecular dynamics simulation data. http://datreant.org/
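
A short sketch of the kind of interaction the library aims for; paths are placeholders, and the exact calls are assumptions drawn from the project docs at http://datreant.org/:

    import datreant.core as dtr

    # A Treant is a specially marked directory that carries metadata.
    sim = dtr.Treant("sims/run1")
    sim.tags.add("equilibration", "NPT")     # searchable tags
    sim.categories["temperature"] = 300      # key-value metadata

    # Discover and filter Treants scattered across a file tree.
    for treant in dtr.discover("sims/"):
        if "equilibration" in treant.tags:
            print(treant.abspath)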

Capture2 thumb
Rating: Everyone
Viewed 5 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

We present a novel technique for identifying earthquake focal mechanism type and fault plane orientation, using robust classification rather than the least-squares-based HASH algorithm. The goal was to support a system capable of automatically classifying earthquakes, for applications such as microseismic monitoring. In this context, classification of both shear and/or tensile failure (mixed double-couple and CLVD sources) was required, so a generalized system was developed. More generally, we see applications of this algorithm in hazard monitoring, particularly for early classification of tsunamigenic events. The project was implemented in Python; the classification was made easy using scikit-learn and SciPy special functions, and 3-D visualization was done with Mayavi.

Capture2 thumb
Rating: Everyone
Viewed 6 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

The Psi4 open-source quantum chemistry project was written from the ground up in Python and C++ and will be used as an example of how to modernize and modularize programs that are typically decades-old HPC Fortran codes. The Python interface allows novice users to quickly create complex instructions through common Python syntax. In addition, developers gain access to tailored C++ libraries that allow entirely new methodologies to be written using only Python, while simultaneously having the ability to run the computation efficiently on tens to hundreds of thousands of processors.
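
For flavor, a minimal input in the Python API looks roughly like the following (the molecule, method, and basis strings here are illustrative choices, not a prescription):

    import psi4

    # Define a water molecule with a Z-matrix style input.
    h2o = psi4.geometry("""
    O
    H 1 0.96
    H 1 0.96 2 104.5
    """)

    psi4.set_options({"basis": "cc-pvdz"})
    energy = psi4.energy("scf")   # Hartree-Fock energy in hartrees
    print(energy)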

Capture2 thumb
Rating: Everyone
Viewed 4 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Tutorial materials may be found here: https://github.com/mmckerns/tuthpc

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Capture2 thumb
Rating: Everyone
Viewed 5 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

We introduce “MPCite” which enables the continuous request, validation, and dissemination of Digital Object Identifiers (DOIs) for all inorganic materials currently available in the Materials Project (www.materialsproject.org). It provides our users with the necessary software infrastructure to achieve a new level of reproducibility in their research: It allows for the convenient and persistent citation of our materials data in online and print publications and facilitates sharing amongst collaborators. We also demonstrate how we extend the use of MPCite to non-core database entries such as theoretical and experimental data contributed through "MPContribs" or suggested by the user for calculation via the “MPComplete” service. We expect MPCite to be easily extendable to other scientific domains where the number of data records demands high-throughput and continuous allocation of DOIs.

Capture2 thumb
Rating: Everyone
Viewed 5 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Binder (http://mybinder.org) is a service that bundles GitHub repositories with code, Jupyter notebooks, and data into reproducible, executable environments that can be launched instantaneously in the browser with the click of a button. Under the hood, Binder uses simple and flexible dependency specifications to build Docker images on demand, and then launches and schedules them across a public Kubernetes cluster. In this talk, I’ll describe in detail how Binder works, and highlight some exciting use cases. I’ll then describe several future directions for the project, including handling larger datasets, lowering barriers for environment specification, and supporting custom deployments with user-provided computing resources.
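
For example, a repository can often be made Binder-ready simply by adding a dependency file at its root; a minimal pip-style specification might read as follows (package pins are illustrative; a conda environment.yml works similarly):

    # requirements.txt at the repository root (pip syntax)
    numpy>=1.11
    matplotlib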

Capture2 thumb
Rating: Everyone
Viewed 2 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Capture2 thumb
Rating: Everyone
Viewed 5 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Friction plays a crucial role in a broad spectrum of natural and technological applications ranging from earthquakes to materials handling. Researchers working to understand frictional dynamics often develop their own software to solve specific problems with constitutive laws that include history and strain rate dependence, which has limited interdisciplinary comparison and community standards. We address these shortcomings with a Python implementation of the rate-and-state friction constitutive laws, including tools to handle multiple state variables, dynamic instability, and variations in friction rate dependence with slip velocity.
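
The canonical rate-and-state formulation can be integrated directly with SciPy; the following is a generic sketch using the Dieterich aging law with illustrative parameter values, not necessarily the package's own API:

    import numpy as np
    from scipy.integrate import odeint

    # Illustrative parameters: direct effect a, evolution effect b,
    # reference friction mu0 at velocity v0, critical slip distance dc.
    a, b, mu0, v0, dc = 0.005, 0.01, 0.6, 1.0, 10.0

    def aging_law(theta, t, v):
        # Dieterich "aging" evolution law: d(theta)/dt = 1 - v*theta/dc
        return 1.0 - v * theta / dc

    def friction(v, theta):
        # Rate-and-state friction: mu = mu0 + a*ln(v/v0) + b*ln(v0*theta/dc)
        return mu0 + a * np.log(v / v0) + b * np.log(v0 * theta / dc)

    # Response to a velocity step from v0 to 10*v0, starting at steady state
    # (theta_ss = dc/v0).
    v_new = 10.0
    t = np.linspace(0.0, 100.0, 1000)
    theta = odeint(aging_law, dc / v0, t, args=(v_new,)).ravel()
    mu = friction(v_new, theta)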

Capture2 thumb
Rating: Everyone
Viewed 1 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

We theoretical physicists love paper and blackboard, but computational analysis is also a good friend of us. I will guide through my journey during a project in string theory, from the formulation of a physics problem toward building a Python program with a web front-end, hoping that this will illustrate how theoretical physicists interested in Python programming and Python developers interested in physical and mathematical science can help each other.

Capture2 thumb
Rating: Everyone
Viewed 4 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Tensor computations are an important kernel in many high-performance domains such as quantum chemistry, statistics, machine learning, and others. We follow the example of the successful BLAS interface for matrix operations in defining a simple, low-level interface for tensor contraction and other operations, while providing a high-performance implementation using the BLIS framework. In this talk, the proposed "BLAS-like" tensor interface is discussed in the context of existing tensor and matrix abstractions, and performance data for tensor contraction is presented. Our tensor contraction implementation achieves similar performance to matrix multiplication and does not require any explicit tensor transposition or additional workspace, while also incorporating multithreading at several levels. These traits make our implementation ideal for layering underneath higher-level interfaces such as NumPy.
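
In NumPy terms, the operation in question is what einsum expresses; the proposed interface provides the same contraction semantics at BLAS-like speed. A small sketch for illustration:

    import numpy as np

    # A typical contraction: C[a,b,i,j] = sum_k A[a,b,k] * B[k,i,j]
    A = np.random.rand(4, 5, 6)
    B = np.random.rand(6, 7, 8)
    C = np.einsum("abk,kij->abij", A, B)

    # When the contracted index happens to be contiguous, the same operation
    # is a single matrix multiply after reshaping; a BLAS-like tensor backend
    # achieves this mapping in general without explicit transposition.
    C2 = (A.reshape(4 * 5, 6) @ B.reshape(6, 7 * 8)).reshape(4, 5, 7, 8)
    assert np.allclose(C, C2)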

Capture2 thumb
Rating: Everyone
Viewed 9 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Conduit (http://software.llnl.gov/conduit) is a new open source project from Lawrence Livermore National Laboratory. It provides an intuitive model for describing hierarchical scientific data in C++, C, Fortran, and Python. Conduit supports in-core data coupling between packages, serialization, and I/O tasks.

Conduit leverages ideas from JSON and NumPy to provide a cross-language data access API that simplifies sharing data in the HPC ecosystem. For SciPy 2016, an important focus of our talk will be Python support in Conduit and how positive experiences using Python motivated our approach to build a sane cross-language data description solution.
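
The core abstraction is the Node, a hierarchical key/value container. A rough sketch of the Python usage follows; treat the details as assumptions and consult the project documentation for the authoritative API:

    import numpy as np
    import conduit

    n = conduit.Node()
    # Hierarchical paths build the tree; leaves hold scalars or NumPy arrays.
    n["grid/dims"] = np.array([10, 10, 5], dtype=np.int32)
    n["grid/fields/energy"] = np.zeros(500)
    print(n)   # Nodes render themselves in a human-readable, JSON-like form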

Capture2 thumb
Rating: Everyone
Viewed 0 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Writing is notoriously hard, even for the best writers, and it's not for lack of good advice — a tremendous amount of knowledge is strewn across usage guides, dictionaries, technical manuals, essays, pamphlets, websites, and the hearts and minds of great authors and editors. But this knowledge is trapped, waiting to be extracted and transformed.

We built Proselint, a Python-based linter for prose that identifies violations of expert style and usage guidelines. Proselint is open-source software released under the BSD license and works with Python 2 and 3. It runs as a command-line utility or editor plugin (e.g., Sublime Text, Atom, Vim, Emacs) and outputs advice in standard formats (e.g., JSON). Though in its infancy (perhaps 2% of what it could be), Proselint already includes modules addressing redundancy, jargon, illogic, clichés, sexism, misspelling, inconsistency, misuse of symbols, malapropisms, oxymorons, security gaffes, hedging, apologizing, and pretension. Proselint can be seen as both a language tool for scientists and a tool for language science: on the one hand, it includes modules that promote clear and consistent prose in science writing; on the other, it measures language usage and explores the factors relevant to creating a useful linter.
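
From Python, the linter can also be driven directly; a small sketch (the module path is taken from the project's documentation, but treat it as an assumption):

    from proselint.tools import lint

    text = "At the end of the day, the solution was very unique."
    for suggestion in lint(text):
        # Each suggestion identifies the check, a message, and the
        # offending span within the text.
        print(suggestion)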

Capture2 thumb
Rating: Everyone
Viewed 0 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Combustion simulations require detailed chemical kinetic models to predict fuel oxidation, heat release, and pollutant emissions.

These models are typically validated using qualitative rather than quantitative comparisons with limited sets of experimental data.

This work introduces PyTeCK, an open-source Python-based package for automatic testing of chemical kinetic models. Given a model of interest, PyTeCK automatically parses experimental datasets encoded in an XML format, validates the self-consistency of each dataset, and performs simulations for each experimental datapoint. It then reports a quantitative metric of the model's performance, based on the discrepancy between experimental and simulated values and weighted by experimental variance. The initial version of PyTeCK supports shock tube and rapid compression machine experiments that measure autoignition delay. PyTeCK relies on several packages in the SciPy stack and greater scientific Python ecosystem. In addition to providing an easy-to-use, automated tool for evaluating chemical kinetic model performance, a secondary objective of PyTeCK is to encourage greater openness and reproducibility in combustion research.
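
One plausible form of such a variance-weighted metric, shown here purely as an illustration and not necessarily PyTeCK's exact definition, is an average squared discrepancy of log ignition delays scaled by the experimental uncertainty:

    import numpy as np

    def model_error(tau_sim, tau_exp, sigma_exp):
        # Discrepancy of simulated vs. experimental ignition delays,
        # in log space and weighted by the experimental standard deviation.
        z = (np.log(tau_sim) - np.log(tau_exp)) / sigma_exp
        return np.sqrt(np.mean(z ** 2))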

Capture2 thumb
Rating: Everyone
Viewed 0 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Workflows in computational chemistry must manage large quantities of expensively computed and metadata-laden data, each to be accessed in its own right or recycled into complex methodologies. QCDB manages such datasets, supporting their collection, the application of recognized and exploratory work-up procedures through pandas, visualization through matplotlib, and open access. We demonstrate it applied to a set of nonbonded structures from the Protein Data Bank (PDB).

Capture2 thumb
Rating: Everyone
Viewed 0 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Emperor is a web-enabled tool for manipulating and visualizing large, high-dimensional datasets through ordination plots, allowing the user to quickly compare samples from diverse environments, such as those from healthy subjects and C. difficile patients. Here we introduce Emperor's new and improved Python API, which tightly integrates with the scientific Python stack and places particular emphasis on the Jupyter notebook environment. This integration allows interactive manipulation and visualization of microbiome datasets.

Capture2 thumb
Rating: Everyone
Viewed 1 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

For a data scientist building predictive models, the following are important:

1. How good is the model?
2. How good is it compared to competing/alternate models?
3. Is there a way to identify what worked in the models built so far, to leverage it to build something even better?

The stakeholder/end-user who finally uses the output from the model, for whom the ML process is mostly a black box, is concerned with the following:

1. How to trust the model output?
2. How to understand the drivers?
3. How to do what-if analysis?

The unifying theme that answers most of these questions is visualization. The biggest challenge is finding ways to visualize the model, the model-fitting process, and the impact of drivers. This talk summarizes lessons learned and key takeaways from communicating model results.

Capture2 thumb
Rating: Everyone
Viewed 4 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

We present the architecture of HistomicsTK, an open-source library for the processing and analysis of biomedical microscopy images. HistomicsTK leverages multiple SciPy libraries, and contains an innovative parameter serialization model to expose arbitrary CLI algorithms via a web GUI.

Capture2 thumb
Rating: Everyone
Viewed 4 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Seismic Full-Waveform Inversion (FWI) is a field with decades-old academic codes. The program Fullwv started life in 1989, written in Fortran 77 by numerous researchers. We detail how we extended Fullwv and embedded a new Python-based FWI package called Zephyr. We override or replace portions of the Fortran source code with Python, to enable rapid development of new algorithms and methods.

Capture2 thumb
Rating: Everyone
Viewed 0 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

In this talk, a Python-based simulation framework is described that implements the waveform-level signal processing needed to acquire and track the ranging signal of a global positioning system (GPS) satellite. The framework was developed in Fall 2015 as an end-of-semester project for a digital signal processing course taken by electrical engineers. By design, GPS signals lie on top of one another but are separable by virtue of a unique, nearly orthogonal code assigned to each satellite. The key to position determination is the time difference of arrival (TDOA) of each of the satellite signals at the user receiver; a high-precision clock maintains timing accuracy among the satellites. One of the most important tasks of the user receiver is to acquire and track the ranging code of three or more satellites in view at a given time.

The framework allows the user to first explore a receiver for a single satellite signal. Object-oriented Python then makes it easy to extend the receiver to process multiple satellite signals in parallel. The source of signals used in the framework is either simulation or a low-cost (~$20) software-defined radio dongle known as the RTL-SDR. With the RTL-SDR, signals are captured from a GPS patch antenna, fed to the RTL-SDR, and then captured via USB into Python as a complex ndarray.

The computer simulation project that utilizes the framework has students perform a variety of simulation tasks, starting from a single-channel receiver and building up to a four-channel receiver with signal impairments present. As developed in Fall 2015, the project is entirely simulation based, but the ability to use real signals captured from the RTL-SDR opens additional capability options. Making use of these signals is non-trivial, as additional signal processing is needed to estimate the Doppler frequency error, and if the data bits are to be recovered, the L1 signal carrier phase must be tracked.
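
As a sketch of the acquisition step described above (generic, not the course framework's API): searching for a satellite's C/A code in a captured block reduces to circular correlation, done efficiently with FFTs:

    import numpy as np

    def acquire(block, ca_code):
        # Circular cross-correlation of a received block with a local C/A
        # code replica (same length as the block), computed via FFTs;
        # a strong peak marks the code phase used for TDOA.
        corr = np.fft.ifft(np.fft.fft(block) * np.conj(np.fft.fft(ca_code)))
        power = np.abs(corr) ** 2
        return power.argmax(), power.max() / power.mean()

    # In a full receiver this search is repeated over a grid of Doppler
    # bins, mixing the block with exp(-2j*pi*f_d*t) before correlating.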

Capture2 thumb
Rating: Everyone
Viewed 3 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

In recent years Python has evolved into the primary language used in astronomy, astrophysics, and cosmology. The flexible and dynamic nature of the language, as well as its wide range of scientific libraries, allows for fast prototyping and simple development of new applications. Unfortunately, Python programs can be slower than natively compiled languages such as C++ or Fortran by orders of magnitude, limiting Python's applications in astronomical surveys where complex analyses must be performed on large data sets.

In this talk I will present HOPE, a specialized Python just-in-time compiler for astrophysical applications that combines the ease of Python and the speed of C++. I lay out the architecture, concept and implementation of the package and show how we have significantly improved the performance of our simulations for astronomical surveys.
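
Usage is decorator-based; a small hedged sketch (the function body is illustrative, and the exact dispatch behavior is as documented by the project):

    import numpy as np
    import hope

    @hope.jit
    def add_arrays(x, y, out):
        # On first call HOPE translates this function to C++, compiles it,
        # and caches the module; subsequent calls dispatch to native code.
        for i in range(x.shape[0]):
            out[i] = x[i] + y[i]

    x, y = np.ones(1000), np.ones(1000)
    out = np.empty(1000)
    add_arrays(x, y, out)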

Capture2 thumb
Rating: Everyone
Viewed 5 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Error and uncertainty are a fact of life; however, they often go unaccounted for in geospatial models that rely on Digital Elevation Models (DEMs). Our work focuses on mitigating that problem as it relates to SRTM and ASTER GDEM. This is a discussion about how Python works within a broader system that enables researchers to work around this uncertainty without being geostatistics experts. The talk will cover how we use Python for advanced raster processing (including a vertical datum correction), calculating error statistics, and interacting with our spatial database, as well as what we're doing with the end product, including a live demo.

Capture2 thumb
Rating: Everyone
Viewed 0 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

The ignition delay of a fuel/air mixture is an important quantity in designing combustion devices, and these data are also used to validate computational kinetic models for combustion. One of the typical experimental devices used to measure the ignition delay is called a Rapid Compression Machine (RCM). This work presents UConnRCMPy, an open-source Python package to process experimental data from the RCM at the University of Connecticut. Given an experimental measurement, UConnRCMPy computes the thermodynamic conditions in the reactor of the RCM during an experiment along with the ignition delay. UConnRCMPy relies on several packages from the SciPy stack and the broader scientific Python community. UConnRCMPy implements an extensible framework, so that alternative experimental data formats can be incorporated easily. In this way, UConnRCMPy improves the consistency of RCM data processing and enables reproducible analysis of the data.

Capture2 thumb
Rating: Everyone
Viewed 5 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

MetPy is an open-source Python package for meteorology, providing domain-specific tools for reading data, performing calculations, and visualizing results; it has recently entered a period of active development. To keep the code working and in good shape, and to minimize the accrual of technical debt, the project developers have leaned on cloud services to automate development tasks: static code analysis, continuous integration testing, automated documentation builds, and automated releases. We will present our experiences of how these services have helped with development, as well as the challenges that presented themselves, with the goal of encouraging others to use such services to support their own development efforts.

Slides for this talk are here: https://github.com/metpy/MetPy/blob/m...

See the complete SciPy 2016 Conference talk & tutorial playlist here: https://www.youtube.com/playlist?list...

Capture2 thumb
Rating: Everyone
Viewed 4 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

A Python-based platform for processing and analyzing data from core CT scans will be presented. Core CT scans provide high-resolution 3-D compositional and textural information, but in raw form the data contain artifacts and are unsuitable for analysis. Once cleaned, the data can be processed to detect features such as beds, laminae, and dip angle, and combined with high-resolution core photographs and well logs. Machine learning algorithms can then use the CT data as a feature set to perform facies classification.

Capture2 thumb
Rating: Everyone
Viewed 0 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

We introduce a method to launch Python applications at near-native speed on large high-performance computing systems.
The Python runtime and other dependencies are bundled and delivered to compute nodes via a broadcast operation. The interpreter is instructed to use the local copies of the files on each compute node, removing the shared file system as a bottleneck during application start-up.
Our method can be added as a preamble to a traditional job script, improving the performance of user applications in a non-invasive way. Furthermore, it allows us to implement a three-tier system for an application's supporting components, reducing the overhead of runs during the development phase. The method has been used for applications on Cray XC30 and Cray XT systems up to full machine capability, with an overhead of typically less than 2 minutes. We expect the method to be portable to similar applications in Julia or R, and we hope the three-tier system for supporting components provides some insight for container-based solutions for launching applications in a development environment. The full source code of an implementation is available at https://github.com/rainwoodman/python... . Given that large-scale Python applications can be launched extremely efficiently on state-of-the-art supercomputing systems, it is time for the high-performance computing community to seriously consider building complicated computational applications at large scale with Python.

Capture2 thumb
Rating: Everyone
Viewed 4 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016

Many open source, volunteer-driven projects begin with a small, tight-knit group of collaborators, but then expand far faster than anyone expects or plans for. I discuss cases of governance growing pains in Wikipedia, which hold many lessons for running open source software projects, and describe how Wikipedians have dealt with various issues as they have become one of the largest volunteer-based open collaboration projects, including the project's growing bureaucracy and controversies between volunteers and professional staff.

Capture2 thumb
Rating: Everyone
Viewed 4 times
Recorded at: July 15, 2016
Date Posted: December 6, 2016