NumPy is the most fundamental package for scientific computing with Python. It adds to the Python language a data structure (the NumPy array) that has access to a large library of mathematical functions and operations, providing a powerful framework for fast computations in multiple dimensions. NumPy is the basis for all SciPy packages, which vastly extend the computational and algorithmic capabilities of Python, as well as for many visualization tools such as Matplotlib, Chaco, and Mayavi.
This tutorial will teach students the fundamentals of NumPy, including fast vector-based calculations on numpy arrays, the origin of its efficiency and a short introduction to the matplotlib plotting library. In the final section, more advanced concepts will be introduced including structured arrays, broadcasting and memory mapping.
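As a rough illustration of three of the topics listed above (vectorized calculation, broadcasting, and structured arrays), a minimal sketch follows. The array contents and the field names `name` and `mag` are made up for the example and are not taken from the tutorial material.

```python
import numpy as np

# Vectorized arithmetic: one expression operates on every element at once.
x = np.arange(5, dtype=float)          # [0., 1., 2., 3., 4.]
squares = x ** 2                       # no explicit Python loop

# Broadcasting: a (3, 1) column combines with a (4,) row to give a (3, 4) grid.
col = np.arange(3).reshape(3, 1)
row = np.arange(4)
grid = col * 10 + row

# Structured array: named, typed fields per record (field names are illustrative).
stars = np.array([("Vega", 0.03), ("Sirius", -1.46)],
                 dtype=[("name", "U10"), ("mag", "f8")])

# A smaller magnitude means a brighter star, so argmin picks the brightest.
brightest = stars[np.argmin(stars["mag"])]["name"]
```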
Machine Learning has been getting a lot of buzz lately, and many software libraries have been created which implement these routines. scikit-learn is a python package built on numpy and scipy which implements a wide variety of machine learning algorithms, useful for everything from facial recognition to optical character recognition to automated classification of astronomical images. This tutorial will begin with a crash course in machine learning and introduce participants to several of the most common learning techniques for classification, regression, and visualization. Building on this background, we will explore several applications of these techniques to scientific data -- in particular, galaxy, star, and quasar data from the Sloan Digital Sky Survey -- and learn some basic astrophysics along the way. From these examples, tutorial participants will gain knowledge and experience needed to successfully solve a variety of machine learning and statistical data mining problems with python.
HDF5 is a hierarchical, binary database format that has become a de facto standard for scientific computing. While the specification may be used in a relatively simple way (persistence of static arrays) it also supports several high-level features that prove invaluable. These include chunking, ragged data, extensible data, parallel I/O, compression, complex selection, and in-core calculations. Moreover, HDF5 bindings exist for almost every language - including two Python libraries (PyTables and h5py).
This tutorial will discuss tools, strategies, and hacks for really squeezing every ounce of performance out of HDF5 in new or existing projects. It will also go over fundamental limitations in the specification and provide creative and subtle strategies for getting around them. Overall, this tutorial will show how HDF5 plays nicely with all parts of an application making the code and data both faster and smaller. With such powerful features at the developer's disposal, what is not to love?!
This tutorial is targeted at a more advanced audience which has a prior knowledge of Python and NumPy. Knowledge of C or C++ and basic HDF5 is recommended but not required.
Matplotlib is one of the main plotting libraries in use within the scientific Python community. This tutorial covers advanced features of the Matplotlib library, including many recent additions: laying out axes, animation support, Basemap (for plotting on maps), and other tweaks for creating aesthetic plots. The goal of this tutorial is to expose attendees to several of the chief sub-packages within Matplotlib, helping users make full use of the library's capabilities. Additionally, attendees will be walked through a 'grab-bag' of tweaks that help to increase the aesthetic appeal of created figures. Attendees should be familiar with creating basic plots in Matplotlib as well as basic use of NumPy for manipulating data.
IPython notebooks used in the tutorial
This tutorial will give users an overview of the capabilities of statsmodels, including how to conduct exploratory data analysis, fit statistical models, and check that the modeling assumptions are met.
The use of Python in data analysis and statistics is growing rapidly. It is not uncommon now for researchers to conduct data cleaning steps in Python and then move to some other software to estimate statistical models. Statsmodels, however, is a Python module that attempts to bridge this gap and allow users to estimate statistical models, perform statistical tests, and conduct data exploration in Python. Researchers in fields ranging from economics and the social sciences to finance and engineering may find that statsmodels meets their needs for statistical computing and data analysis in Python.
All examples in this tutorial will use real data. Attendees are expected to have some familiarity with statistical methods.
With this knowledge attendees will be ready to jump in and use Python for applied statistical analysis and will have an idea how they can extend statsmodels for their own needs.
This tutorial is targeted at the intermediate-to-advanced Python user who wants to extend Python into High-Performance Computing. The tutorial will provide hands-on examples and essential performance tips every developer should know for writing effective parallel Python. The result will be a clear sense of possibilities and best practices using Python in HPC environments.
Many of the examples you often find on parallel Python focus on the mechanics of getting the parallel infrastructure working with your code, and not on actually building good portable parallel Python. This tutorial is intended to be a broad introduction to writing high-performance parallel Python that is well suited to both the beginner and the veteran developer.
We will discuss best practices for building efficient high-performance Python through good software engineering. Parallel efficiency starts with the speed of the target code itself, so we will first look at how to evolve code from for-loops to list comprehensions and generator comprehensions to using Cython with NumPy. We will also discuss how to optimize your code for speed and memory performance by using profilers.
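The first steps of that progression can be sketched with the standard library alone. The function names below are illustrative and not taken from the tutorial, and the Cython/NumPy stage is not shown.

```python
import sys

def sum_squares_loop(n):
    """Explicit for-loop accumulating into a list, then summing."""
    values = []
    for i in range(n):
        values.append(i * i)
    return sum(values)

def sum_squares_listcomp(n):
    """Same computation; the list comprehension avoids repeated append calls."""
    return sum([i * i for i in range(n)])

def sum_squares_genexpr(n):
    """Generator expression: values are produced lazily, so no list is stored."""
    return sum(i * i for i in range(n))

n = 10_000
results = {
    f.__name__: f(n)
    for f in (sum_squares_loop, sum_squares_listcomp, sum_squares_genexpr)
}

# The generator object stays small regardless of n; the list grows with n.
list_bytes = sys.getsizeof([i * i for i in range(n)])
gen_bytes = sys.getsizeof(i * i for i in range(n))
```

All three variants return the same value; they differ in how much intermediate storage they need, which is the kind of difference a memory profiler would surface.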
The tutorial will cover some of the common parallel communication technologies (multiprocessing, MPI, and cloud computing) and introduce the use of parallel map and map-reduce.
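A minimal sketch of a parallel map followed by a reduce is shown below, using a word count as the example problem (an assumption, not the tutorial's own exercise). It uses the thread-backed `multiprocessing.dummy.Pool`, which exposes the same `map()` API as `multiprocessing.Pool`, so that the snippet runs unchanged in any environment; swapping in `multiprocessing.Pool` would give process-based parallelism.

```python
from functools import reduce
from multiprocessing.dummy import Pool  # thread-backed Pool with the same map() API

def count_words(line):
    """Map step: turn one line of text into a partial word-count dict."""
    counts = {}
    for word in line.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge_counts(left, right):
    """Reduce step: combine two partial word-count dicts into a new dict."""
    merged = dict(left)
    for word, n in right.items():
        merged[word] = merged.get(word, 0) + n
    return merged

lines = ["the quick brown fox", "the lazy dog", "the fox"]

with Pool(3) as pool:
    partial_counts = pool.map(count_words, lines)       # parallel map

word_counts = reduce(merge_counts, partial_counts, {})  # sequential reduce
```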
At the end of the tutorial, participants should be able to write simple parallel Python scripts, make use of effective parallel programming techniques, and have a framework in place to leverage the power of Python in High-Performance Computing.
In this tutorial, I'll give a brief overview of pandas basics for new users, then dive into the nuts and bolts of manipulating time series data in memory. This includes such common topics as date arithmetic, alignment and join / merge methods, resampling and frequency conversion, time zone handling, and moving window functions like moving mean and standard deviation. A strong focus will be placed on working with large time series efficiently using array manipulations. I'll also illustrate visualization tools for slicing and dicing time series to make informative plots. There will be several example data sets taken from finance, economics, ecology, web analytics, or other areas.
The target audience for the tutorial includes individuals who already work regularly with time series data and are looking to acquire additional skills and knowledge as well as users with an interest in data analysis who are new to time series. You will be expected to be comfortable with general purpose Python programming and have a modest amount of experience using NumPy. Prior experience with the basics of pandas's data structures will also be helpful.
IPython provides tools for interactive and parallel computing that are widely used in scientific computing. We will show some uses of IPython for scientific applications, focusing on exciting recent developments, such as the network-aware kernel, web-based notebook with code, graphics, and rich HTML, and a high-level framework for interactive parallel computing.
This talk presents an overview and update of PySAL. PySAL is designed to support the development of high level applications in exploratory spatial data analysis and geocomputation. The library includes a comprehensive suite of modules that cover the entire spatial data analysis research stack from geospatial data processing and integration, to exploratory spatial data analysis, spatial dynamics, regionalization, and spatial econometrics. A selection of these modules are illustrated drawing on research in spatial criminology, epidemiology and urban inequality dynamics. A number of geovisualization packages that have been implemented using PySAL as an analytical core are also demonstrated. Future plans for additional modules and enhancements are also discussed.
In this talk, I'll describe the workings of my personal hobby project - a self-driving Lego Mindstorms robot! The body of the robot is built with Lego Mindstorms. An Android smartphone is used to capture the view in front of the robot. A user first teaches the robot how to drive; this is done by making the robot go around the track a small number of times. The image data, along with the user action, is used to train a neural network. At run-time, images of what is in front of the robot are fed into the neural network and the appropriate driving action is selected. This project showcases the power of Python's libraries, as they enabled me to put together a sophisticated working system in a very short amount of time. Specifically, I made use of the Python Imaging Library to downsample images, as well as the PyBrain neural network library. The robot was controlled using the nxt-python library. A high-level description and videos are available here: http://slowping.com/2012/self-driving-lego-mindstorms-robot/
In this talk I'll discuss major developments in pandas over the last year related to time series handling and processing. This includes the integration of the new NumPy datetime64, implementation of rich and high performance resampling methods, better visualization, and a generally cleaner, more intuitive and productive API. I will also discuss how functionality from the defunct scikits.timeseries project has been integrated into pandas, thus providing a unified, cohesive set of time series tools for many different problem domains. Lastly, I'll give some details about the pandas development roadmap and opportunities for more people to get involved.
MapReduce has become one of two dominant paradigms in distributed computing (along with MPI). Yet many times, implementing an algorithm as a MapReduce job - especially in Python - forces us to sacrifice efficiency (BLAS routines, etc.) in favor of data parallelism.
In my work, which involves writing distributed learning algorithms for processing terabytes of Twitter data at SocialFlow, I've come to advocate a form of "vectorized MapReduce" which integrates efficient numerical libraries like numpy/scipy into the MapReduce setting, yielding both faster per-machine performance and reduced I/O, which is often a major bottleneck. I'll also highlight some features of Disco (a Python/Erlang MapReduce implementation from Nokia) which make it a very compelling choice for writing scientific MapReduce jobs in Python.
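One way to picture the "vectorized MapReduce" idea is a mapper that consumes a whole block of records with matrix operations and emits only a small summary, so that little data crosses the network. The sketch below is an assumed example (least squares via sufficient statistics) written with plain NumPy and `functools.reduce`; it does not use the Disco API, and it is not the speaker's own algorithm.

```python
import numpy as np
from functools import reduce

def map_block(block):
    """Map step over a whole block of rows: emit small sufficient statistics.

    X.T @ X and X.T @ y are BLAS-backed matrix products, and the size of the
    output depends only on the number of features, not on the block size.
    """
    X, y = block
    return X.T @ X, X.T @ y

def reduce_stats(a, b):
    """Reduce step: sufficient statistics from different blocks simply add."""
    return a[0] + b[0], a[1] + b[1]

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(3000, 3))
y = X @ true_w                      # noiseless targets, so the result is checkable

# Split into blocks, as if each block lived on a different worker.
blocks = [(X[i:i + 1000], y[i:i + 1000]) for i in range(0, 3000, 1000)]

xtx, xty = reduce(reduce_stats, map(map_block, blocks))
w = np.linalg.solve(xtx, xty)       # least-squares weights from the reduced statistics
```

Each mapper here emits a 3x3 matrix and a length-3 vector regardless of how many rows it processed, which is where the reduced I/O comes from.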
The multi-layer shallow water equations are an active topic for researchers in geophysical fluid dynamics looking for ways to increase the validity of shallow water modeling techniques without using a fully three-dimensional model, which may be too costly for the domain size being looked at. In this talk we will step through the effort needed to convert a Fortran-based solver to one using the PyClaw framework, a Python framework targeted at the solution of hyperbolic conservation laws. Once the application is converted, implementing parallel and other solver strategies is greatly simplified. We will also discuss how this is accomplished, the design decisions involved, and future extensions to PyClaw.
In addition to bringing efficient array computing and standard mathematical tools to Python, the NumPy/SciPy libraries provide an ecosystem where multiple libraries can coexist and interact. This talk describes a success story where we integrate several libraries, developed by different groups, to solve our research problems. A brief description of our research and how we use these components follows.
Our research focuses on using Reinforcement Learning (RL) to gather information in domains described by an underlying linked dataset. For instance, we are interested in problems such as the following: given a Wikipedia article as a seed, finding other articles that are interesting relative to the starting point. Of particular interest is to find articles that are more than one-click away from the seed, since these articles are in general harder to find by a human.
In addition to the staples of scientific Python computing (NumPy, SciPy, Matplotlib, and IPython), we use the libraries RL-Glue/RL-Library, NetworkX, Gensim, and scikit-learn.
Reinforcement Learning considers the interaction between a given environment and an agent. The objective is to design an agent able to learn a policy that allows it to maximize its total expected reward. We use the RL-Glue/RL-Library libraries for our RL experiments. These libraries provide the infrastructure to connect an environment and an agent, each one described by an independent Python program.
We represent the linked datasets we work with as graphs. For this we use NetworkX, which provides data structures to efficiently represent graphs together with implementations of many classic graph algorithms. We use NetworkX graphs to describe the environments implemented in RL-Glue/RL-Library. We also use these graphs to create, analyze and visualize graphs built from unstructured data.
One of the contributions of our research is the idea of representing the items in the datasets as vectors belonging to a linear space. To this end, we build a Latent Semantic Analysis (LSA) model to project documents onto a vector space. This allows us, in addition to being able to compute similarities between documents, to leverage a variety of RL techniques that require a vector representation. We use the Gensim library to build the LSA model. This library provides all the machinery to build, among other options, the LSA model. One place where Gensim shines is in its capability to handle big data sets, like the entire Wikipedia, that do not fit in memory. We also attach the vector representation of each item as a property of the corresponding NetworkX node.
Finally, we also use the manifold learning capabilities of scikit-learn, like the ISOMAP algorithm, to perform some exploratory data analysis. By reducing the dimensionality of the LSA vectors obtained using Gensim from 400 to 3, we are able to visualize the relative position of the vectors together with their connections.
In summary, this talk shows, by combining a variety of libraries to solve our research problems, that the NumPy/SciPy ecosystem has become the lingua franca of scientific Python computing.
As scientific computing pushes towards extreme scales, the programming wall is becoming more apparent. For algorithms to scale on new architectures, they often must be rewritten accounting for completely different performance characteristics. A handful of the community's fastest codes have already turned to automatic code generation to tackle these issues. Code generation gives a user the ability to use the expressiveness of a domain-specific language and promises better portability as architectures rapidly change.
In this presentation, I will show Ignition, a project for creating numerical code generators. Python and SymPy make exceptional languages for developing these code generators in a way that domain experts can understand and manipulate. I show examples of how Ignition can generate several different parts of geophysical simulations.
The Unlock Project aims to provide brain-computer interface (BCI) technologies to individuals suffering from locked-in syndrome, the complete or near-complete loss of voluntary motor function. While several BCI techniques have been demonstrated as feasible in a laboratory setting, limited effort has been devoted to translating that research into a system for viable home use. This is in large part due to the complexity of existing BCI software packages which are geared toward clinical use by domain experts. With Unlock, we have developed a Python-based modular framework that greatly simplifies the time and programming expertise needed to develop BCI applications and experiments. Furthermore, the entire Unlock system, including data acquisition, brain signal decoding, user interface display, and device actuation, can run on a single laptop, offering exceptional portability for this class of BCI.
In this talk, I will present the Unlock framework, starting with a high-level overview of the system then touching on the acquisition, communication, decoding, and visualization components. Emphasis will be placed on the app developer API with several examples from our current work with steady-state visually evoked potentials (SSVEP).
Developers and users of astronomical software have long bemoaned the absence of shared efforts in the field. While there are well known, free software tools available for astronomy, most have been developed by large institutions, and the past few decades have seen comparatively little progress in fostering a community-based set of software tools. There is hope that this is changing now. The continuing growth of Python in astronomy has led to an increasing awareness of needless duplication of efforts within the community and the need to make existing packages work better with each other; such discussions came to a head on the astropy email list in the spring of 2011, leading to the formation of the astropy project. The first coordination meeting was held in the fall of 2011, and significant progress has been made in setting up a community repository of core astronomical packages. We will describe the general goals of astropy and the progress that has been made to date.
Copperhead is a data parallel language embedded in Python, which aims to provide both a productive programming environment as well as excellent computational efficiency on heterogeneous parallel hardware. Copperhead programs are written in a small, restricted subset of Python, using standard constructs like map and reduce, along with traditional data parallel primitives like scan and sort. Copperhead programs are written in standard Python modules and interoperate with existing Python numerical and visualization libraries such as NumPy, SciPy, and Matplotlib. The Copperhead runtime compiles Copperhead programs to target either CUDA-enabled GPUs or multicore CPUs using OpenMP or Threading Building Blocks. On several example applications from Computer Vision and Machine Learning, Copperhead programs achieve between 45-100% of the performance of hand-coded CUDA code, running on NVIDIA GPUs. In this talk, we will discuss the subset of Python that forms the Copperhead language, the open source Copperhead runtime and compiler, and selected example programs.
Julia is a dynamic language designed for technical applications and high performance. Its design is based on a sophisticated but unobtrusive type system, type inference, multiple dispatch instead of class-based OO, and a code generator based on LLVM. These features work together to run high-level code efficiently even without type declarations. At the same time, the type system provides useful expressiveness for designing libraries, enables forms of metaprogramming not traditionally found in dynamic languages, and creates the possibility of statically compiling whole programs and libraries. This combination of high performance and expressiveness makes it possible for most of Julia's standard library to be written in Julia itself, with an interface to call existing C and Fortran libraries.
We will discuss some ways that Python and Julia can interoperate, and compare Julia's current capabilities to Python and NumPy.
Python is currently being adopted as the language of choice by many astronomical researchers. A prominent example is in the Large Synoptic Survey Telescope (LSST), a project which will repeatedly observe the southern sky 1000 times over the course of 10 years. The 30,000 GB of raw data created each night will pass through a processing pipeline consisting of C++ and legacy code, stitched together with a python interface. This example underscores the need for astronomers to be well-versed in large-scale statistical analysis techniques in python. We seek to address this need with the AstroML package, which is designed to be a repository for well-tested data mining and machine learning routines, with a focus on applications in astronomy and astrophysics. It will be released in late 2012 with an associated graduate-level textbook, 'Statistics, Data Mining and Machine Learning in Astronomy' (Princeton University Press). AstroML leverages many computational tools already available in the python universe, including numpy, scipy, scikit-learn, pymc, healpy, and others, and adds efficient implementations of several routines more specific to astronomy. A main feature of the package is the extensive set of practical examples of astronomical data analysis, all written in python. In this talk, we will explore the statistical analysis of several interesting astrophysical datasets using python and astroML.
ALGES is a laboratory that develops tools applied to geostatistics. We've been using Python for a while and it has brought us very good results. Its ease of use and portability allow us to rapidly offer practical solutions to problems. Along with a brief introduction to the laboratory, we cover two particular projects we are currently working on. One project is an application for multivariate geostatistical analysis. Most available applications provide analysis for a single variable at a time and either ignore how variables relate to one another or make it really difficult to consider any relationship. Our proposal provides an interface that is both easy to use for beginners and open to fine tuning by experienced users. The other project covers a problem in geological modeling and resource estimation. Commonly, when modeling geological volumes, continuity in the data is assumed. This is often not true, as there are different kinds of faults that break this continuity, which is very hard to incorporate when modeling. We propose a solution that restores the original continuous volume for better modeling and then maps the result back to the real distorted volume, providing a better estimation overall. Both projects involve heavy computations and no shortage of input data. We take this as a challenge to build fast processing solutions, so we take advantage of both the ease of a Python interface and the speed of C/C++ code.
Recent years have provided a wealth of projects showing that using Python for scientific applications outperforms even popular choices such as Matlab. A major factor driving these successes is the efficient utilization of multi-core CPUs, GPUs for general-purpose computation, and the scaling of computations to clusters.
However, often these advances sacrifice some of the high-productivity features of Python by introducing new language constructs, enforcing new language semantics and/or enforcing explicit data types. The result is that the user will have to rewrite existing Python applications to use the Python extension.
In order to use GPGPUs in Python, a popular approach is to embed CUDA/OpenCL code kernels directly in the Python application. This approach is more productive and more readable than writing C/C++ applications, but it is still inferior to native Python code. Furthermore, the approach enforces hardware-specific programming and thus requires intimate knowledge of the underlying hardware and the CUDA/OpenCL programming model.
Copenhagen Vector Byte Code (cphVB) strives to provide a high-performance back-end for Numerical Python (NumPy) without reducing the high productivity of Python/NumPy. Without any involvement of the user, cphVB will transform regular sequential Python/NumPy applications into high-performance applications. The cphVB runtime system is capable of utilizing a broad range of computing platforms efficiently, e.g. multi-core CPUs, GPGPUs and clusters of such machines.
cphVB consists of a bridge that translates NumPy array operations into cphVB vector operations. The bridge will send these vector operations to a Vector Engine that performs the actual execution of the operations. cphVB comes with a broad range of Vector Engines optimized to specific hardware architectures, such as multi-core CPUs, GPGPUs and clusters of said architectures. Thus, cphVB provides a high-productivity, high-performance framework that supports legacy NumPy applications without changing a single line of code.
SciPy Sparse Graphs, Jake Vanderplas.
Animation for Traits and Chaco, Corran Webster.
Pynthantics, Jon Roland.
State of the Numba, Jon Riehl.
Pipe-o-matic call, Walker Hale.
A Command ND-Array, Frédéric Bastien.
Travis Oliphant (Continuum Analytics), Kurt Smith (Enthought) and Jeff Bezanson (MIT, Julia author) discuss Python performance issues. Andy Terrel (UT/TACC) is the moderator.
The most common programming paradigm for scientific computing, SPMD (Single Program Multiple Data), catastrophically interacts with the loading strategies of dynamically linked executables and network-attached file systems on even moderately sized high performance computing clusters. This difficulty is further exacerbated by "function-shipped" I/O on modern supercomputer compute nodes, preventing the deployment of simple solutions. In this talk, we introduce a two-component solution: collfs, a set of low-level MPI-collective file operations that can selectively shadow file system access in a library, and walla, a set of Python import hooks for seamlessly enabling parallel dynamic loading scalable to tens of thousands of cores.
While Numpy/Scipy is an attractive implementation platform for many algorithms, in some cases C++ is mandated by a customer. However, a foundation of numpy's behavior is the notion of reference-counted instances, and implementing an efficient, cross-platform mechanism for reference counting is no trivial prerequisite.
The reference counting mechanisms already implemented in the Qt C++ toolkit provide a cross-platform foundation upon which a numpy-like array class can be built. In this talk one such implementation is discussed, QNDArray. In fact, by mimicking the numpy behaviors, the job of implementing QNDArray became much easier, as the task of "defining the behavior" became "adopting the behavior," to include function names.
In particular, the following aspects of the implementation were found to be tricky and deserve discussion in this presentation:
slicing multidimensional arrays given the limitations of operator in C++,
implicit vs. explicit data sharing in Qt
QNDArray has been deployed in scientific research applications and currently has the following features:
bit-packed boolean arrays
nascent masked array support
unit test suite that validates QNDArray behavior against numpy behavior
bounds checking with Q_ASSERT() (becomes a no-op in release mode)
memmap()ed arrays via QFile::map()
easily integrated as a QVariant value, leading to a natural mapping from QVariantMap to Python dict.
float16 implementation including in-place compare
The author has approval from his management to submit the source code for QNDArray to the Qt Project and plans to have it freely available for download via http://qt.gitorious.org/ before the SciPy conference begins.
The usage of the high-level scripting language Python has enabled new mechanisms for data interrogation, discovery and visualization of scientific data. We present yt ( http://yt-project.org/ ), an open source, community-developed astrophysical analysis and visualization toolkit for both post-processing and in situ analysis of data generated by high-performance computing (HPC) simulations of astrophysical phenomena. We report on successes in astrophysical computation through development of analysis tasks, visualization, cross-code compatibility, and community building.
We introduce CnC-Python (CP), an approach to implicit multicore parallelism for Python programmers based on a high-level macro data-flow programming model called Concurrent Collections (CnC). With the advent of the multi-core era, it is clear that improvements in application performance will primarily come from increased parallelism. Extracting parallelism from applications often involves the use of low-level primitives such as locks and threads. CP is implicitly parallel and enables programmers to achieve task, data and pipeline parallelism in a declarative fashion while only being required to describe the program as a coordination graph with serial Python code for individual nodes (steps). Thus, CP makes parallel programming accessible to a broad class of programmers who are not trained in parallel programming. The CP runtime requires that Python objects communicated between steps be picklable, but imposes no restriction on the Python idioms used within the serial code. Most data structures of interest to the SciPy community, including NumPy arrays, are included in the class of picklable data structures in Python.
The CnC model is especially effective in exploiting parallelism in scientific applications in which the dependences can be represented as arbitrary directed acyclic graphs ("dag parallelism"). Such applications include, but are not limited to, tiled implementations of iterative linear algebra algorithms such as Cholesky decomposition, Gauss-Jordan elimination, Jacobi method, and Successive Over-Relaxation (SOR). Rather than using explicit threads and locks to exploit parallelism, the CnC-Python programmer decomposes their algorithm into individual computation steps and identifies data and control dependences among the steps to create such computation DAGs. Given the DAG (in the form of declarative constraints), it is the responsibility of the CP runtime to extract parallelism and performance from the application. By liberating the scientific programmer, who is not necessarily trained to write explicitly parallel programs, from the nuances of parallel programming, CP provides a high-productivity path for scientific programmers to achieve multi-core parallelism in Python.
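The dependence-driven execution described above can be sketched with the standard library (Python 3.9+ for `graphlib`). This is not the CnC-Python API: `run_dag`, the step names, and the `deps` mapping are hypothetical, and threads stand in for the CP runtime purely to show that a step runs as soon as its inputs exist.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

def run_dag(steps, deps, max_workers=4):
    """Run each step once all of its dependencies have produced a result.

    steps: name -> function taking a dict of dependency results
    deps:  name -> set of step names it depends on
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    results, pending = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every step whose dependencies are all complete.
            for name in sorter.get_ready():
                inputs = {d: results[d] for d in deps.get(name, ())}
                pending[pool.submit(steps[name], inputs)] = name
            # Block until at least one running step finishes.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                name = pending.pop(future)
                results[name] = future.result()
                sorter.done(name)   # may unlock downstream steps
    return results

# Serial Python code for the individual steps (illustrative only).
steps = {
    "load": lambda _: [1, 2, 3, 4],
    "square": lambda r: [v * v for v in r["load"]],
    "negate": lambda r: [-v for v in r["load"]],
    "combine": lambda r: sum(r["square"]) + sum(r["negate"]),
}
deps = {
    "load": set(),
    "square": {"load"},
    "negate": {"load"},
    "combine": {"square", "negate"},
}
results = run_dag(steps, deps)
```

Here `square` and `negate` have no dependence on each other, so the scheduler is free to run them concurrently; the programmer only declared the graph.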
LINKS: CnC-Python: http://cnc-python.rice.edu Concurrent Collections: http://habanero.rice.edu/cnc
As astronomical software development expands from historical data reduction platforms towards more sophisticated Python applications, non-technically-focused users can struggle with installing and maintaining a large number of heterogeneous dependencies. PyRAF has successfully bridged the gap between IRAF and Python, but managing dependencies falls outside its scope. A few existing Python distributions make installation easy, but don't cater for specific needs (such as dependence on IRAF). STScI and Gemini have therefore developed a prototype, easy-to-install software distribution for Linux and OSX known provisionally as the 'Unified Release' (UR).
Currently the UR includes STScI Python and its dependencies (e.g. Python, NumPy, IRAF 2.15), as well as Matplotlib & Tk, SciPy, a number of IRAF packages, DS9, X11IRAF and some testing and documentation tools. Its scope extends to complementary non-Python/IRAF software, but we do not intend to produce a comprehensive (Scisoft-like) distribution of tools for astronomy, nor to satisfy every installation preference. Our focus is on providing a simple way to run key tools, for users with minimal support resources and who may not have administrative privileges. Unlike most comparable distributions, our approach includes basic provision for in-place software additions and updates.
Recently we have completed a first internal version of the UR for both Linux and OSX, which we shall briefly demonstrate. We plan to make our first public release during the coming months.
In most large-scale computations, systems of equations arise in the form Au=b, where A is a linear operation to be performed on the unknown data u, producing the known right-hand side, b, which represents some constraint of known or assumed behavior of the system being modeled. Since u can have many millions to billions of elements, direct solution is too slow. A multigrid solver solves partially at full resolution, and then solves directly only at low resolution. This creates a correction vector, which is then interpolated to full resolution, where it corrects the partial solution.
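The cycle described above can be sketched for a small model problem. The following is a minimal two-grid illustration, assuming a 1D Poisson matrix, weighted Jacobi as the partial solver, full-weighting restriction and a Galerkin coarse operator; none of these choices or function names come from OpenMG itself.

```python
import numpy as np

def poisson_1d(n):
    # Model operator A: tridiagonal (-1, 2, -1) matrix of size n x n.
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def jacobi(A, u, b, sweeps=2, omega=2.0 / 3.0):
    # Weighted Jacobi: the "partial solve" at full resolution.
    d = np.diag(A)
    for _ in range(sweeps):
        u = u + omega * (b - A @ u) / d
    return u

def restriction(n):
    # Full-weighting restriction from n fine points to (n - 1) // 2 coarse points.
    nc = (n - 1) // 2
    R = np.zeros((nc, n))
    for i in range(nc):
        R[i, 2 * i:2 * i + 3] = [0.25, 0.5, 0.25]
    return R

def two_grid_cycle(A, u, b):
    R = restriction(A.shape[0])
    P = 2.0 * R.T                              # linear interpolation (prolongation)
    u = jacobi(A, u, b)                        # solve partially at full resolution
    r = b - A @ u                              # residual of the partial solution
    e_c = np.linalg.solve(R @ A @ P, R @ r)    # solve directly at low resolution
    u = u + P @ e_c                            # interpolate the correction and apply it
    return jacobi(A, u, b)                     # smooth again after the correction

n = 31
A = poisson_1d(n)
u_true = np.sin(np.linspace(0.0, np.pi, n))
b = A @ u_true
u = np.zeros(n)
for _ in range(10):
    u = two_grid_cycle(A, u, b)
error = np.max(np.abs(u - u_true))
```

A full multigrid solver would apply the same cycle recursively instead of solving the coarse problem directly after a single level of coarsening.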
This project aims to create an open-source multigrid solver library, written only in Python. The existing PyAMG multigrid implementation (a highly versatile, highly configurable, black-box solver) is fully sequential, and is difficult to read and modify due to its C core. OpenMG is a pure Python experimentation environment for developing multigrid optimizations, not a new production solver library. By making the code simple and modular, we make the algorithmic details clear. We thereby create an opportunity for education and experimental optimization of the partial solver (Jacobi, Gauss-Seidel, SOR, etc.), the restriction mechanism, the prolongation mechanism, and the direct solver, using GPGPU, multiple CPUs, MPI, or grid computing. The resulting solver is tested on an implicit pressure reservoir simulation problem with satisfactory results.
SymPy is a symbolic algebra package for Python. In SymPy.Stats we add a stochastic variable type to this package to form a language for uncertainty modeling. This allows engineers and scientists to symbolically declare the uncertainty in their mathematical models and to make probabilistic queries. We provide transformations from probabilistic statements like $P(XY > 3)$ or $E(X*2)$ into deterministic integrals. These integrals are then solved using SymPy's integration routines or through numeric sampling.
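A minimal sketch of such probabilistic queries, assuming the present-day `sympy.stats` interface (`Die`, `Normal`, `P`, `E`); the random variables chosen here are illustrative and are not taken from the talk.

```python
from sympy import Rational
from sympy.stats import Die, Normal, P, E

# Two fair six-sided dice: queries reduce to finite sums over outcomes.
X = Die("X", 6)
Y = Die("Y", 6)
p_sum = P(X + Y > 10)      # probability that the two dice sum to more than 10
e_square = E(X**2)         # expectation of the square of one die

# A standard normal variable: the query becomes a symbolic integral.
Z = Normal("Z", 0, 1)
e_affine = E(2 * Z + 1)
```

In each case the statement about random variables is rewritten as a deterministic sum or integral, which SymPy then evaluates exactly.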
This talk touches on a few rising themes:
The rise in interest in uncertainty quantification and
The use of symbolics in scientific computing
Intermediate representation layers and multi-stage compilation
Historically, solutions to uncertainty quantification problems have been expressed by writing Monte Carlo codes around individual problems. By creating a symbolic uncertainty language we allow the expression of the problem-to-be-solved to be written separately from the numerical technique. SymPy.stats serves as an interface layer. The statistical programmer doesn't need to think about the details of numerical techniques and the computational methods programmer doesn't need to think about the particular domain-specific questions to be solved.
We have implemented multiple computational backends including purely symbolic (using SymPy's integration engine), sampling, and code generation.
In the talk we discuss these ideas with a few illustrative examples taken from basic probability and engineering. The following is one such example
Numba is a Python bytecode to LLVM translator that allows creation of fast machine code from Python functions. The Low Level Virtual Machine (LLVM) project is rapidly becoming a hardware-industry standard for the intermediate representation (IR) of compiled codes. Numba's high-level translator to the LLVM IR provides Python the ability to take advantage of the machine code generated by the hardware manufacturers' contributions to LLVM. Numba translates a Python function written in a subset of Python syntax to machine code using simple type inference and the creation of multiple machine-code versions. In this talk, I will describe the design of Numba, illustrate its applications to multiple domains and discuss the enhancements to NumPy and SciPy that can benefit from this tool.
Current parallel programming models leave a lot to be desired and fail to maintain pace with improvements in hardware architecture. For many scientific research groups these models only widen the gap between equations and scalable parallel code. The Resilient Optimizing Flow Language (ROFL) is a data-flow language designed with the purpose of solving the problems of both domain abstraction and efficient parallelism. Using a functional, declarative variant of the Python language, ROFL takes scientific equations and optimizes for both scalar and parallel execution.
ROFL is closely tied to Python and the SciPy libraries. ROFL uses Python expression syntax, is implemented in Python, and emits optimized Python code. ROFL's implementation in Python allows ROFL to be embedded in Python. Using Python as a target language makes ROFL extensible and portable. By removing imperative loop constructs and focusing on integration with the NumPy and SciPy libraries, ROFL both supports and encourages data parallelism.
In this presentation, we introduce the ROFL language, and demonstrate by example how ROFL enables scientists to focus more on the equations they are solving, and less on task and data parallelism.
Python has been adopted by many disciplinary communities, showing its adaptability to many problems. Scientific computing and web development are two examples of such communities. These might, at first glance, seem to share few common interests, especially at the level of algorithms and libraries. However, at the level of integrated practice in time-constrained academic environments, where framework development is less valued than research and teaching productivity, ease of adoption of tools from each of these communities can be tremendously valuable.
Using examples from the recently-deployed West Texas Lightning Mapping Array, which is processed and visualized in real-time, this paper will argue that a shared sense, among disciplinary communities, of the essence of how one deploys Python for specific problems is beneficial for continuation and growth of Python's status as a go-to language for practitioners in academic settings.
FLASH is a high-performance computing (HPC) multi-physics code which is used to perform astrophysical and high-energy density physics simulations. It runs on the full range of systems from laptops to workstations to 100,000-processor supercomputers - such as the Blue Gene/P at Argonne National Laboratory.
Historically, FLASH was born from a collection of unconnected legacy codes written primarily in Fortran and merged into a single project. Over the past 13 years major sections have been rewritten in other languages. For instance, I/O is now implemented in C. However, building, testing, and documentation are all performed in Python.
FLASH has a unique architecture which compiles simulation-specific executables for each new type of run. This is aided by an object-oriented-esque inheritance model that is implemented by inspecting the file system's directory hierarchy. This allows FLASH to compile to faster machine code than a compile-once strategy. However, it also places a greater importance on the Python build system.
To run a FLASH simulation, the user must go through three basic steps: setup, build, and execution. Canonically, each of these tasks is independently handled by the user. However, with the recent advent of flmake - a Python workflow management utility for FLASH - such tasks may now be performed in a repeatable way.
Previous workflow management tools have been written for FLASH. (For example, the "Milad system" was implemented entirely in Makefiles.) However, none of the prior attempts have placed reproducibility as their primary concern. This is in part because fully capturing the setup metadata requires alterations to the build system.
The development of flmake started by rewriting the existing build system to allow FLASH to be run outside of the main line subversion repository. It separates out project and simulation directories independent of the FLASH source directory. These directories are typically under their own version control.
Moreover, for each of the important tasks (setup, build, run, etc.), a sidecar metadata description file is either written or appended to. This is a simple dictionary-of-dictionaries JSON file which stores the environment of the system and the state of the code when each flmake command is run. This metadata includes the version information of both the FLASH main line and project repositories. However, it also may include all local modifications since the last commit. A patch is automatically generated using the Python standard library difflib module and stored directly in the description.
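The description-file idea can be sketched with the standard library alone. This is an illustrative sketch, not flmake's code: the function names, the dictionary keys and the file name `sim.py` are all assumptions made for the example.

```python
import difflib
import json
import platform
import sys
import uuid

def make_patch(committed, working, path):
    # Unified diff between the last committed text and the working copy.
    return "".join(difflib.unified_diff(
        committed.splitlines(keepends=True),
        working.splitlines(keepends=True),
        fromfile="a/" + path,
        tofile="b/" + path,
    ))

def record(description, command, patch):
    # One entry per command: a dictionary of dictionaries, ready for JSON.
    description[command] = {
        "id": str(uuid.uuid4()),
        "python": platform.python_version(),
        "platform": sys.platform,
        "patch": patch,
    }
    return description

committed = "x = 1\ny = 2\n"     # hypothetical file contents at the last commit
working = "x = 1\ny = 3\n"       # hypothetical local modification
patch = make_patch(committed, working, "sim.py")
description = record({}, "setup", patch)
text = json.dumps(description, indent=2)
```

Because the patch is stored as plain text inside the JSON document, the recorded state can later be restored by checking out the recorded revision and re-applying the patch.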
Along with universally unique identifiers, logging, and Python run control files, the flmake utility may use the description files to fully reproduce a simulation by re-executing each command in its original environment and state. While flmake reproduce makes a useful debugging tool, it fundamentally increases the scientific merit of FLASH simulations.
The methods described above may be used whenever source code itself is distributed. While this is true for FLASH (uncommon amongst compiled codes), most Python packages also distribute their source. Therefore the same reproducibility strategy is applicable and highly recommended for Python simulation codes. Thus flmake shows that reproducibility - which is notably absent from most computational science projects - is easily attainable using only version control and standard library modules.
The drive to publish often leaves scientists working with old, inflexible, poorly documented dead-end software. Even operational systems can end up being a mash of legacy systems cobbled together. As the Atmospheric Radiation Measurement (ARM) Climate Facility brings its 30+ cloud and precipitation sensitive radars into operation, a concerted effort is under way to modernize, modularize and adapt existing code and to write new code to retrieve geophysical parameters from the remotely sensed signals. Due to its open nature, active development community and lack of licensing issues, Python is a natural development environment choice. This presentation will outline the challenges involved in retrieving model-comparable geophysical parameters from scanning weather radars, introduce the framework behind the Python ARM Radar Toolkit (Py-ART) and discuss the challenges involved in building high performance code while maintaining portability, readability and ease of use.
We present a toolbox for the creation and study of controllers for hybrid systems. It contains modules for
working with n-dimensional polytopes,
refining continuous state space partitions to satisfy reachability properties,
synthesizing, manipulating, and visualizing finite automata as winning strategies for a class of temporal logic-based games,
simulating hybrid executions, and
reading and writing problem solutions to an XML format.
The toolbox is named TuLiP (for "Temporal Logic Planning") and written almost entirely in Python, making critical use of NumPy, SciPy, CVXOPT, and matplotlib. While software for hybrid systems research is commonly written in Matlab scripts or otherwise requires the end-user to build from source for her particular platform, TuLiP requires neither. For a standard scientific Python environment, the only additional library may be CVXOPT. Code (re)use and experimentation are easy, and because of this, TuLiP has provided a natural basis for further research and development.
Source code and documentation are currently available at http://tulip-control.sourceforge.net
In this talk we will describe the problem domain addressed by TuLiP, various use cases, and lessons learned in the Python implementation. We shall include a full example making use of all components and show ways that individual modules are useful more broadly. Major items of the talk will be
related work, and the paucity of Python use in hybrid control research, which we argue is a matter of inheritance rather than best practices;
overview of the type of hybrid systems represented in TuLiP and relevance to other fields;
summary of the major steps going from problem statement to solution;
using only the "polytope computations" module;
using only "discrete reactive synthesis" related modules, with a brief description about temporal logic synthesis to provide background for those not working on computer aided verification;
snippets about recent research using and building on TuLiP; and
discussion about the Python-based implementation and lessons learned.
For the last item, we will describe challenges faced while developing TuLiP, given its role of "stitching together" several external tools, e.g., Gephi for large graph visualization and gr1c for game solving. We will also touch on liberation from a Matlab-only tool (Multi-Parametric Toolbox; see http://control.ee.ethz.ch/~mpt/), achieved by creating our own Python module for working with polytopes, using NumPy and CVXOPT for computations and matplotlib for visualization.
A tool paper describing an earlier version of TuLiP was presented at the conference Hybrid Systems: Computation and Control (HSCC) in April 2011. There have since been substantial additions and improvements. Furthermore, a broader audience can be reached at SciPy 2012, with a new opportunity to address design issues likely shared by other scientific Python developers.
Development of TuLiP has been supported in part by the AFOSR through the MURI program, the Multiscale Systems Center (MuSyC) and the Boeing Company.
We present QuTiP, an object-oriented open-source framework for solving the dynamics of open quantum systems. The QuTiP framework is written in a combination of Python and Cython, and uses SciPy, NumPy and matplotlib to provide an environment for computational quantum mechanics that is easy and efficient to use. Arbitrary quantum systems, including time-dependent systems, may be built up from operators and states defined by a quantum object class, and then passed on to a choice of unitary and dissipative evolution solvers. We give an overview of the basic structure for the framework and the techniques used in its implementation. We also present a few selected examples from contemporary research on quantum mechanics that illustrate the strengths of the framework, and the types of calculation that can be performed. The framework described here is particularly well suited to the fields of quantum optics, superconducting circuit devices, nanomechanics, and trapped ions, while also being ideal as an educational tool.
For more information see http://qutip.googlecode.com.
VisIt is an open source, turnkey application for scientific data analysis and visualization that runs on a wide variety of platforms from desktops to petascale class supercomputers. This talk will provide an overview of Python’s role in VisIt with a focus on use cases of scripted rendering, data analysis, and custom application development.
Python is the foundation of VisIt’s primary scripting interface, which is available from both a standard python interpreter and a custom command line client. The interface provides access to all features available through VisIt’s GUI. It also includes support for macro recording of GUI actions to python snippets and full control of windowless batch processing.
While Python has always played an important scripting role in VisIt, two recent development efforts have greatly expanded VisIt’s python capabilities:
We recently enhanced VisIt by embedding python interpreters into our data flow network pipelines. This provides fine grained access, allowing users to write custom algorithms in python that manipulate mesh data via VTK’s python wrappers and leverage packages such as numpy and scipy. Current support includes the ability to create derived mesh quantities and execute data summarization operations.
We now support custom GUI development using Qt via PySide. This allows users to embed VisIt’s visualization windows into their own python applications. This provides a path to extend VisIt’s existing GUI and for rapid development of streamlined GUIs for specific use cases.
The ultimate goal of this work is to evolve Python into a true peer to our core C++ plugin infrastructure.
This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-ABS-552316).
The National Oceanic and Atmospheric Administration's (NOAA) Hazardous Weather Testbed (HWT) is a facility jointly managed by NOAA's National Severe Storms Laboratory (NSSL), the NOAA National Weather Service's (NWS) Storm Prediction Center (SPC), and the NOAA NWS Oklahoma City/Norman Weather Forecast Office (OUN) within the National Weather Center building on the University of Oklahoma South Research Campus. The HWT is designed to accelerate the transition of promising new meteorological insights and technologies into advances in forecasting and warning for hazardous weather events throughout the United States. The HWT facilities include a combined forecast and research area situated between the operations rooms of the SPC and OUN, and a nearby development laboratory. The facilities support enhanced collaboration between research scientists and operational weather forecasters on specific topics that are of mutual interest.
The cornerstones of the HWT are the yearly Experimental Forecast Program (EFP) and Experimental Warning Program (EWP), which take place every spring. In each of those programs, forecasters, researchers, and developers come together to participate in a real-time operational forecasting or warning environment with the purpose of testing and evaluating cutting-edge tools and methods for forecasting and warning. In the EFP, between 5 and 10 TB of meteorological data are processed for evaluation over the course of a 5-week period. These data come from a variety of sources and in a variety of formats, each requiring different processing.
This talk will discuss how the data flow and data creation processes of the EFP are accomplished in a real-time setting through the use of Python. The utilization of Python ranges from simple shell scripting, to speeding up algorithm development (and runtimes) with Numpy and Cython, to creating new, open source data-visualization platforms, such as the Skew-T and Hodograph Analysis and Research Program in Python, or SHARPpy.
There exist two very powerful geometric surface subdivision schemes that do not yet exist for Python users: Catmull-Clark subdivision surfaces, and Nira Dyn's Butterfly subdivision surface scheme. These schemes are useful in creating C2-continuous (under ideal conditions) surfaces from a control mesh of points. The latter scheme interpolates the control mesh points, which makes it useful for scientific applications.
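To illustrate what "interpolates the control mesh points" means, here is a minimal sketch of the four-point interpolatory scheme for closed curves, the one-dimensional relative of the Butterfly scheme. It is a simplified stand-in chosen for brevity, not the surface algorithm discussed in the talk, and the function name is hypothetical.

```python
def four_point_subdivide(points):
    """One refinement round on a closed polygon given as (x, y) tuples."""
    n = len(points)
    refined = []
    for i in range(n):
        p_prev, p0, p1, p_next = (points[(i + k) % n] for k in (-1, 0, 1, 2))
        # New point on the edge p0-p1, with weights (-1, 9, 9, -1) / 16.
        new = tuple((-a + 9 * b + 9 * c - d) / 16.0
                    for a, b, c, d in zip(p_prev, p0, p1, p_next))
        refined.append(p0)      # the old control point is kept unchanged
        refined.append(new)
    return refined

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
once = four_point_subdivide(square)
twice = four_point_subdivide(once)
```

Every original control point survives each round, so the refined curve passes through the input data, which is the property that makes interpolatory schemes attractive for measured scientific data.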
We plan on providing some background on the schemes, detailing use cases and visualizing the results. We also plan on discussing the various techniques we use to overcome performance bottlenecks (numpy/cython/etc.).
Scalable Python, Travis Oliphant.
Big Data in the Cloud with Python, Chris Cope.
CMake and Cython, Matt McCormick.
Psychometric Python, Mark Moulton.
Evolutionary Comp. in Python, Alan Lockett.
Generative Art with Neural Networks, Byron Galbraith.
Cellulose Based Serialization, Matt Terry.
NumFocus, Fernando Perez.
Software Carpentry, Matt Davis.
As a scientific Python application grows, it can be increasingly difficult to use and maintain, because of implicit assumptions made when writing each component. Users can pass any possible data type for any argument, so code either fills up with assertions and tests to see what type of data has been supplied, or else has undefined behavior for some datatypes or values. Once software is exchanged with other users, obscure error messages or even incorrect results are the likely outcome. Programming languages that require types to be declared alleviate some of these issues, but are inflexible and difficult to use, both in general and when specifying details of types (such as ranges of allowed values). Luckily, Python metaobjects make it possible to extend the Python language to offer flexible declarative typing, offering the best of both worlds.
The Param module provides a clean, low-dependency, pure-Python implementation of declarative parameters for Python objects and functions, allowing library and program developers to specify precisely what types of arguments or values are allowed. A Parameter is a special type of class attribute that supports type declarations (based on subtypes of a specified class, support for specified methods (duck typing), or any other criterion that can be tested), ranges, bounds, units, constant values, and enumerations. A Parameter has a docstring (visible at the command line or in generated documentation), inherits its default value, documentation, etc. along the class hierarchy, and can be set to dynamic values that generate a stream of numbers for use in controlling scientific code. In essence, a Parameter is a Python attribute extended to support clean, simple, robust, maintainable, and declarative scientific programming.
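The idea of a declarative, validated class attribute can be sketched with a plain Python descriptor. The code below is a simplified illustration only: the `Number` class, its arguments and the example classes are hypothetical and do not reproduce Param's implementation or API.

```python
class Number:
    """Hypothetical declarative numeric attribute with bounds and documentation."""

    def __init__(self, default=0.0, bounds=(None, None), doc=""):
        self.default = default
        self.bounds = bounds
        self.__doc__ = doc

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self          # class-level access returns the declaration itself
        return obj.__dict__.get(self.name, self.default)

    def __set__(self, obj, value):
        # Reject non-numeric values (bool is excluded on purpose).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{self.name} must be a number, got {type(value).__name__}")
        low, high = self.bounds
        if (low is not None and value < low) or (high is not None and value > high):
            raise ValueError(f"{self.name}={value} is outside bounds {self.bounds}")
        obj.__dict__[self.name] = value


class Neuron:
    threshold = Number(default=0.5, bounds=(0.0, 1.0), doc="Firing threshold.")


class FastNeuron(Neuron):
    pass                         # default, bounds and doc are inherited


n = FastNeuron()
n.threshold = 0.9                # accepted: a number within bounds
```

Assigning `n.threshold = 2.0` raises `ValueError` and assigning a string raises `TypeError`, so invalid values are caught at the point of assignment rather than deep inside a computation.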
Param has been under continuous development and use since 2002 as part of the Topographica simulator (topographica.org), but is now being released as a separate package due to demand from users who want similar functionality in their own code. Param is very similar in spirit to the Enthought Traits library, despite having been developed independently, and offers much of the same functionality. Param is particularly useful for people who find that Traits is difficult to integrate into their workflow, since it consists of only two pure Python files with no dependencies outside the standard library. Param is also useful for people building Tk applications, and provides an optional Tk property-sheet interface that can automatically generate a GUI window for viewing and editing an object's Parameters.
Param is freely available under a BSD license from: http://ioam.github.com/param/
Enaml is a new domain specific declarative language for specifying user interfaces in Python applications. Its syntax, a strict superset of the Python language, provides a clean and compact representation of UI layout and styling, and uses dynamic expressions to bind a view's logic with an application's underlying computational model.
A number of considerations were weighed during the design of Enaml, with the ultimate goal being the creation of a dynamic UI framework that has a low barrier to entry and can scale in complexity and capability according to the needs of the developer.
Influence: Enaml improves upon existing technologies and ideas for specifying user interfaces. Much of Enaml's inspiration comes from Qt's QML, a declarative UI language derived from ECMAScript and designed specifically for developing mobile applications with the Qt toolkit. In contrast, Enaml is designed for the development of scientific and enterprise level applications, and makes use of a Python derived syntax and standard desktop-style widget elements. For layout, Enaml raises the bar by providing a system based on symbolic constraints. The underlying technology is the same one that powers the Cocoa Auto-Layout system in OSX 10.7; in Enaml, however, the constraints are exposed in a friendly Pythonic fashion.
Toolkit Independence: In large projects, the costs of changing infrastructure can be extremely high. Instead of forcing an application to be tied to a single underlying toolkit, Enaml is designed to be completely toolkit agnostic. This decoupling provides the benefit of being able to migrate an entire project from one GUI library to another by changing only a single line of code or setting an environment variable. Enaml currently supports both Qt (via PySide or PyQt4) and WxPython backends with plans for HTML 5 in the future. The authoring of new toolkit backends has been designed to be a simple affair. Adding new or custom widgets to an existing toolkit is trivial.
Extensibility: A good framework should be usable by a wide variety of audiences and should be able to adapt to work with technologies not yet invented. Enaml can provide the UI layer for any Python application, with few limitations placed on the architecture of the underlying computational model. While Enaml understands Enthought's Traits based models by default, it provides simple hooks that the developer can use to extend its functionality to any model architecture that provides some form of notification mechanism. Possibilities include, but are not limited to, models built upon databases, sockets, and pub-sub mechanisms.
Continuity: No matter how easy it is to get started with a new framework, it will not be adopted if the cost of switching is exceedingly high. Enaml is positioned to become the next generation of TraitsUI, the user interface layer of the Traits library. Enaml can both include existing TraitsUI views in an application and itself be embedded within a TraitsUI. Enaml also interacts seamlessly with the Chaco plotting library, allowing easy integration of interactive graphics. Enaml cleanly exposes the toolkit specific objects that it manages, allowing a user with a large amount of toolkit specific code to continue to use that code with little or no change. This provides a path forward for both TraitsUI and non-TraitsUI applications.
Cellular populations in biology are often heterogeneous, and aggregate assays such as expression arrays can obscure the small differences between these populations. Examples where these differences can be highly significant include the identification of antigen-specific immune cells, stem cells and circulating cancer cells. As the frequency of such cells in the blood can be vanishingly small, assays to detect signals at the single cell level are essential. Flow cytometry is probably the best established single cell assay, and has been an integral tool in immunology and biology for decades, able to measure cellular marker levels for individual cells, as well as population statistics over millions of cells.
Recent technological innovations in flow cytometry have increased the number of cell markers capable of being resolved simultaneously, and visual analysis (gating) is difficult and error prone with increasing data dimensionality. Hence there is increasing demand for tools to automate the analysis and management of flow data, so as to increase accuracy and reproducibility. However, essentially all software used by flow cytometry laboratories is commercial and based on the visual analysis paradigm. With the exception of the R BioConductor project, we are not aware of any other full-featured open source tools for analyzing flow data. The few open source flow software modules that exist simply extract data from FCS (flow cytometry standard) files into tabular/csv format, losing all metadata associated with the file, and provide no additional tools for analysis. We therefore decided to develop the fcm library in Python to provide a foundation for flow cytometry data management and analysis.
The fcm library provides functions to load fcs files, apply spectral compensation, and perform standard log and log-like transforms for visualization. The library also provides objects and methods for traditional gating-based analysis, including standard polygon, threshold, interval, and quadrant gates. Using fcm and other common python libraries, one can quickly write scripts for doing large scale batch analysis. In addition to gating-based analysis, fcm provides methods to do model-based analysis, utilizing GPU-optimized statistical models to identify cell subsets. These statistical models provide a data-driven way to construct generative probability models that scale well with the increasing dimensionality of flow data and do not require expert input to identify cell subsets. High performance computational routines to fit statistical models are optimized using cython and pycuda. More specialized tools for the analysis of flow data include the use of a novel information measure to optimize reagent panels and analysis strategies, and optimization methods for automatic determination of positivity thresholds.
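What gating-based analysis computes can be sketched in plain Python. The functions, channel names and event values below are hypothetical illustrations and are not the fcm API; the polygon test uses the standard ray-casting rule.

```python
def interval_gate(events, channel, low, high):
    # Keep events whose value on one channel lies within [low, high].
    return [e for e in events if low <= e[channel] <= high]

def point_in_polygon(x, y, vertices):
    # Ray casting: count polygon edges crossed by a ray going right from (x, y).
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def polygon_gate(events, xchan, ychan, vertices):
    # Keep events that fall inside a polygon drawn on two channels.
    return [e for e in events if point_in_polygon(e[xchan], e[ychan], vertices)]

# Made-up events with forward scatter, side scatter and one marker channel.
events = [
    {"FSC": 100.0, "SSC": 50.0, "CD4": 900.0},
    {"FSC": 400.0, "SSC": 300.0, "CD4": 20.0},
    {"FSC": 120.0, "SSC": 80.0, "CD4": 15.0},
]
scatter_gate = [(50, 20), (200, 20), (200, 150), (50, 150)]
lymphocytes = polygon_gate(events, "FSC", "SSC", scatter_gate)
cd4_positive = interval_gate(lymphocytes, "CD4", 500.0, 1e4)
```

Chaining gates in a script like this is what makes batch analysis over many files repeatable, in contrast to drawing each gate by hand.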
We are currently using the fcm library for the analysis of tetramer assays for cancer immunotherapy, as well as intracellular expression of effector molecules in the NIAID-sponsored External Quality Assurance Policy Oversight Laboratory (EQAPOL) program to standardize flow cytometry assays in HIV studies. An illustrative example is the use of fcm in building a pipeline for the Cytostream application to automate the analysis of 459 FCS files from 12 laboratories, reducing the analysis time from one month to a single evening.
One of the principal goals of the Janelia Farm Research Campus is the reconstruction of complete neuronal circuits. This involves 3D electron-microscopy (EM) volumes many microns across with better than 10 nm resolution, resulting in gigavoxel scale images. From these, individual neurons must be segmented out. Although image segmentation is a well-studied problem, these data present unique challenges in addition to scale: neurons have an elongated, irregular branching structure, with processes as thin as 50 nm but hundreds of micrometers long; one neuron looks much like the next, with only a thin cellular boundary separating densely packed neurons; and internal neuronal structures can look similar to the cellular boundary. The first problem in particular means that small errors in segment boundary predictions can lead to large errors in neuron shape and neuronal network connectivity.
Our segmentation workflow has three main steps: a voxelwise edge classification, a fine-grained segmentation into supervoxels (which can reasonably be assumed to be atomic groups of voxels), and hierarchical region agglomeration.
For the first step, we use Ilastik, a pixel-level interactive learning program. Ilastik uses the output of various image filters as features to classify voxels as labeled by the user. We then use the watershed algorithm on the resulting edge probability map to obtain supervoxels. For the last step, we developed a new machine learning algorithm (Nunez-Iglesias et al, in preparation).
Prior work has used the mean voxel-level edge-probability along the boundaries between regions to agglomerate them. This strategy works extremely well because boundaries get longer as agglomeration proceeds, resulting in ever-improving estimates of the mean probability. We hypothesized that we could improve agglomeration accuracy by using a classifier (which can use many more features than the mean). However, a classifier can perform poorly because throughout agglomeration we may visit a part of the feature space that has not yet been sampled. In our approach, we use active learning to ensure that we have examples from all parts of the space we are likely to encounter.
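The mean-probability baseline described above can be sketched as a greedy merge loop. This is an illustrative sketch, not Ray's implementation: the function name, the dictionary-based region graph and the example probabilities are all assumptions.

```python
def agglomerate(boundaries, threshold):
    """Greedily merge regions across the boundary with the lowest mean edge probability.

    boundaries maps a pair of region labels to the list of voxel-level edge
    probabilities along their shared boundary.
    """
    boundaries = {frozenset(pair): list(p) for pair, p in boundaries.items()}
    merges = []
    while boundaries:
        pair = min(boundaries, key=lambda k: sum(boundaries[k]) / len(boundaries[k]))
        mean = sum(boundaries[pair]) / len(boundaries[pair])
        if mean >= threshold:
            break                              # every remaining boundary looks real
        keep, drop = sorted(pair)
        del boundaries[pair]
        merges.append((keep, drop, mean))
        # Boundaries of the absorbed region now belong to the surviving region;
        # pooling their voxels lengthens the boundary and refines its mean.
        for old_pair in [k for k in boundaries if drop in k]:
            (other,) = old_pair - {drop}
            voxels = boundaries.pop(old_pair)
            boundaries.setdefault(frozenset((keep, other)), []).extend(voxels)
    return merges

supervoxel_boundaries = {
    (1, 2): [0.1, 0.2],
    (2, 3): [0.9, 0.8],
    (1, 3): [0.7],
    (3, 4): [0.3, 0.1],
}
merges = agglomerate(supervoxel_boundaries, threshold=0.5)
merged_pairs = [(keep, drop) for keep, drop, _ in merges]
```

After regions 1 and 2 merge, the boundary between the merged region and region 3 is estimated from three voxels instead of one or two, which is the lengthening effect the paragraph refers to. A classifier-based variant would replace the mean with a learned score computed from richer boundary features.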
We implemented our algorithm in arbitrary dimensions in an open-source, MIT-licensed Python library, Ray (https://github.com/jni/ray). Ray combines leading scientific computing Python libraries, including NumPy, SciPy, NetworkX, and scikit-learn, to deliver state of the art segmentation accuracy in Python.
Nuclear magnetic resonance (NMR) spectroscopy is a key analytical technique in the biomedical field, finding uses in drug discovery, metabolomics, and imaging as well as being the primary method for the determination of the structures of biological macromolecules in solution. In the course of a modern NMR structural or dynamic study of proteins and other biomolecules, experiments typically generate multiple gigabytes of 2D, 3D and even 4D data sets which must be collected, processed, analyzed, and visualized to extract useful information. The field has developed a number of software products to perform these functions, but few software suites exist that can perform all of the tasks which a typical scientist requires. For example, it is not uncommon for NMR data to be collected using software provided by the spectrometer vendor, processed and visualized using software from the NIH, and analyzed using software from a University, collaborator or developed in house. Complicating this process is the lack of a standard format for storing NMR data; each software program typically uses its own format for data storage.
nmrglue is an open source Python module for working with NMR data which acts as the "glue" to tie together existing NMR programs, and can be used to rapidly develop new NMR processing, analysis or visualization methods. With nmrglue, spectral data from a number of common NMR file formats can be accessed as numpy arrays. This data can be sliced, rearranged or modified as needed and written out to any of the supported file formats for later use in existing NMR software programs. In this way, nmrglue can act as the "glue" to tie together NMR workflows which employ existing NMR software.
In addition, nmrglue can be used in conjunction with other scientific Python libraries to rapidly test, prototype, and develop new methods for processing, analyzing, and visualizing NMR data. The nmrglue package provides a number of common NMR processing functions, as well as implementations of scientific routines which may be of interest to other Python projects, including peak pickers, multidimensional lineshape fitting routines, linear prediction functions, and a bounded least squares optimization. These functions, together with the ability to read, write and convert between a number of common file formats, allow developers to harness nmrglue for established routines while focusing on the novel portion of the new method being created. In addition, the numerical routines in NumPy and SciPy can be used to further speed this process. If these packages are used with the IPython shell and matplotlib, a robust, interpreted environment for exploring and visualizing NMR data can be created using only open source software.
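Because spectral data is exposed as NumPy arrays, new analysis routines can be prototyped with plain array operations. The sketch below is a deliberately simple 1D peak picker written only with NumPy; it is not nmrglue's peak-picking routine, and the function name, the threshold rule, and the synthetic two-peak spectrum are assumptions made for illustration.

```python
import numpy as np

def pick_peaks(spectrum, threshold):
    """Return indices of points that exceed `threshold` and are strictly
    greater than both of their immediate neighbors."""
    y = np.asarray(spectrum, dtype=float)
    interior = y[1:-1]
    is_peak = (interior > y[:-2]) & (interior > y[2:]) & (interior > threshold)
    # +1 converts positions in the interior slice back to full-array indices.
    return np.flatnonzero(is_peak) + 1

# Synthetic spectrum: two Lorentzian lines centered at points 30 and 70.
x = np.arange(100)
spectrum = 1.0 / (1.0 + ((x - 30) / 2.0) ** 2) + 0.5 / (1.0 + ((x - 70) / 2.0) ** 2)
peaks = pick_peaks(spectrum, threshold=0.1)
```

A real spectrum would normally need noise handling (for example a noise-based threshold) and multidimensional support, which this sketch leaves out.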
nmrglue is distributed under the New BSD license. Documentation, tutorials, examples, and downloadable install files and source code are available at http://code.google.com/p/nmrglue/. Despite a limited exposure in the scientific field, nmrglue is already used in a number of university research labs and portions of the package have been adapted for use in VeSPA, a software suite for magnetic resonance spectroscopy.
Luban differs in philosophy from existing web frameworks: it provides a generic specification "language" for describing user interfaces, and a luban specification of a user interface can be automatically rendered into web or native user interfaces using media-specific languages.
Luban is focused on providing a simple, easy-to-understand syntax to describe user interfaces, and hence allows users to focus more on the business logic needed behind user interfaces.
In this talk I will discuss recent developments of luban and some of its applications.
Life Technologies relies heavily on Python for product development. Here we present examples of using Python with the NumPy/SciPy/Matplotlib stack at Life Technologies for sequencing analysis, Bayesian estimation, mRNA complexity study, and customer survey analysis. We also display our use of Django for developing scientific web tools in Python. These applications, taken together, demonstrate scientific Python’s vital position in Life Technologies’ tool chain.
Bokeh is a new plotting framework for Python that natively understands the relationships in multidimensional datasets, uses a Protovis-like expression syntax scheme for creating novel visualizations, and is designed from the ground up to be used on the web.
Although it can be thought of as "ggplot for Python", the goals of Bokeh are much more ambitious. The Grammar of Graphics primarily addresses the mapping of pre-built aesthetics and layouts to a particular data schema and tuples of measure variables. It has limited facility for expressing data interactivity, and its graph types (aka "geoms" or glyphs) are limited both in number and in the ways they can be combined with one another.
On the flip side, most existing Python plotting frameworks adopt a "tell me how" instead of a "tell me what" approach. Thus, user plotting code can frequently become mired in what amounts to details of the rendering system.
In our talk, we will show various features of Bokeh, and talk about future development. We will also go into some detail about how Bokeh unifies the tasks of describing data mapping, building data-driven layout, and composing novel visualizations using a single, multi-purpose scene and data graph.
IPython started as a better interactive Python interpreter in 2001, but over the last decade it has grown into a rich and powerful set of interlocking tools aimed at enabling an efficient, fluid and productive workflow in the typical use cases encountered by scientists in everyday research.
Today, IPython consists of a kernel executing user code and capable of communicating with a variety of clients, using ZeroMQ for networking via a well-documented protocol. This enables IPython to support, from a single codebase, a rich variety of usage scenarios through user-facing applications and an API for embedding:
An interactive, terminal-based shell with many capabilities far beyond the default Python interactive interpreter (this is the default application opened by the ipython command that most users are familiar with).
A Qt console that provides the look and feel of a terminal, but adds support for inline figures, graphical calltips, a persistent session that can survive crashes of the kernel process, and more.
A web-based notebook that can execute code and also contain rich text and figures, mathematical equations and arbitrary HTML. This notebook presents a document-like view with cells where code is executed but that can be edited in-place, reordered, mixed with explanatory text and figures, etc.
A high-performance, low-latency system for parallel computing that supports the control of a cluster of IPython engines communicating over ZeroMQ, with optimizations that minimize unnecessary copying of large objects (especially numpy arrays).
In this talk we will show how IPython supports all stages in the lifecycle of a scientific idea: individual exploration, collaborative development, large-scale production using parallel resources, publication and education. In particular, the IPython Notebook supports multiuser collaboration and allows scientists to share their work in an open document format that is a true "executable paper": notebooks can be version controlled, exported to HTML or PDF for publication, and used for teaching. We will demonstrate the key features of the system,
Purple sea urchins (Strongylocentrotus purpuratus or Sp) are invertebrates that share more than 7,000 genes with humans, more than other common model invertebrate organisms like fruit flies and worms. In addition, the innate immune system of sea urchins demonstrates unprecedented complexity. These factors make the sea urchin a very interesting organism for investigations of immunology. Of particular interest is the set of proteins in Sp that contain C-type lectin (CLECT) domains, functional regions of a protein that recognize sugars. Proteins containing CLECTs may be particularly important to immune system robustness because of sugars that are present on pathogens.
The primary goals of this research project are first to identify all the CLECT-containing proteins in the Sp genome, and then to predict their function based on similarity to characterized proteins in other species (protein homology or similarity). The latter goal is particularly challenging and requires new and creative analysis methods.
From an informational viewpoint, proteins are represented by a unique sequence of letters, each letter corresponding to an amino acid. For example G-A-V indicates the sequence glycine, alanine and valine. Commonality between proteins is usually measured by sequence alignments; that is, by directly comparing the sequence of letters between two proteins. Algorithms and tools for these alignments are among the most standardized and available tools in bioinformatics.
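To make the idea of comparing letter sequences concrete, here is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence). The flat match, mismatch, and gap scores are assumptions chosen for readability; production alignment tools use amino-acid substitution matrices and more refined gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences `a` and `b` under a
    simple match/mismatch/gap scoring scheme."""
    # prev[j] holds the best score for aligning the processed prefix of
    # `a` with the first j letters of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, letter_a in enumerate(a, start=1):
        curr = [i * gap]
        for j, letter_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if letter_a == letter_b else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

identical = global_alignment_score("GAV", "GAV")   # three matches
one_gap = global_alignment_score("GAV", "GV")      # G and V match, A faces a gap
```

Higher scores indicate more similar sequences; as sequences diverge through mutation, the score drops even when the proteins still share a function.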
Sequence similarity between homologous proteins can degrade over long evolutionary timescales. This is in part because some mutations at the sequence level can occur without compromising a protein's overall function. This is akin to the evolution of a language, e.g., modern English and Middle English, which initially appear to be separate languages due to spelling differences. Because domains are regions of a protein which can function semi-independently, they are less prone to accommodate mutations. By comparing proteins based on the ordering of their domains, or their "domain architecture", it becomes possible to identify homology, or similarities in domain order, separated by extensive evolution.
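A domain architecture can be treated as an ordered list of domain names, so an order-aware list comparison gives a crude similarity measure. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; it is an illustration only, not the domain alignment method used in this project, and the domain labels other than CLECT are hypothetical placeholders.

```python
from difflib import SequenceMatcher

def architecture_similarity(domains_a, domains_b):
    """Similarity in [0, 1] between two ordered lists of domain names,
    based on the blocks of domains that match in the same order."""
    matcher = SequenceMatcher(None, domains_a, domains_b, autojunk=False)
    return matcher.ratio()

# Hypothetical architectures, listed from N-terminus to C-terminus.
query = ["CLECT", "EGF", "TM"]
duplicated = ["CLECT", "EGF", "EGF", "TM"]   # same order, one domain repeated
reversed_order = ["TM", "EGF", "CLECT"]      # same domains, opposite order

same_order_score = architecture_similarity(query, duplicated)
reversed_score = architecture_similarity(query, reversed_order)
```

Because the measure depends on order, the architecture with a repeated domain scores much higher against the query than the one with the same domains in reverse order.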
Alignment tools based on domain architecture are promising, but are still in their infancy. Consequently, very few researchers utilize both sequence and domain alignment methodologies corroboratively. Using Python scripts in tandem with various web tools and databases, we have identified the top alignment candidates for the CLECT-containing Sp proteins using both methods. With the help of the Enthought Tool Suite, we have created a simple visualization tool that allows users to examine the sequence alignments side-by-side with two types of domain alignments. The information provided by these three results together is much more informative with respect to predicting protein function than any single method alone. Finally, we have developed a systematic set of heuristic rules to allow users to make objective comparisons among the three sets of results. The results can later be parsed using Python scripts to make quantitative and qualitative assessments of the dataset. We believe that these new comparison and visualization techniques will apply in general to computational proteomics.
The LaTeX document preparation system is frequently used to create scientific documents and presentations. This process is often inefficient. The user must switch back and forth between the document and external scientific software that is used for performing calculations and creating figures. PythonTeX is a LaTeX package that allows Python code to be entered directly within a LaTeX document. The code is automatically executed and its output is included within the original document. The code may also be typeset within the document with syntax highlighting provided by Pygments.
PythonTeX is fast and user-friendly. Python code is separated into user-defined sessions, and each session is only executed when its code is modified. When code is executed, sessions run in parallel. The contents of stdout and stderr are synchronized with the LaTeX document, so that printed content is easily accessible and error messages have meaningful line numbering.
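One common way to re-run a session only when its code changes is to store a hash of the code and compare it on each build. The sketch below shows that idea in isolation; it is not PythonTeX's implementation, it omits parallel execution and stdout/stderr capture, and the `SessionCache` class and the `result` variable convention are assumptions made for this example.

```python
import hashlib

class SessionCache:
    """Execute a session's code only when the code differs from the last run."""

    def __init__(self):
        self._hashes = {}    # session name -> hash of the code last executed
        self._outputs = {}   # session name -> cached value of `result`
        self.runs = 0        # number of real executions, for demonstration

    def run(self, session, code):
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        if self._hashes.get(session) != digest:
            namespace = {}
            exec(code, namespace)  # each session gets its own namespace
            self._outputs[session] = namespace.get("result")
            self._hashes[session] = digest
            self.runs += 1
        return self._outputs[session]

cache = SessionCache()
first = cache.run("main", "result = 2 + 2")    # executed
second = cache.run("main", "result = 2 + 2")   # unchanged, served from cache
third = cache.run("main", "result = 2 + 3")    # modified, executed again
```

Keeping one hash per session means that editing the code of one session does not force unrelated sessions to run again.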
PythonTeX greatly simplifies scientific document creation with LaTeX. For example, SymPy can be used to automatically solve and typeset step-by-step mathematical derivations. It can also be used to automate the creation of mathematical tables. Plots can be created with matplotlib and then easily customized in place. Python code and its output can be typeset side by side. The full power of Python is conveniently available for programming LaTeX macros and customizing and automating LaTeX documents.
The Object Oriented Finite-Element project at NIST is a Python and C++ tool designed to bring sophisticated numerical modeling capabilities to users in the field of Materials Science. The software provides numerous tools for constructing finite-element meshes from microstructural images, and for implementing material properties from a very broad class which includes elasticity, chemical and thermal diffusion, and electrostatics. The current series of releases has a robust interface for defining new nonlinear properties, and provides both first and second order time-dependence in the equations of motion. The development team is currently working on a fully-3D version of the code, as well as expanding the scope of available properties to include surface interactions, such as surface tension and chemical reactions, and inequality constraints, such as arise in mechanical surface contact and plasticity. The software is a hybrid of Python and C++ code, with the high level user interface and control code in Python, and the heavy numeric work being done in C++. The software can be operated either as an interactive, GUI-driven application, as a scripted command-line tool, or as a supporting library, providing useful access to users of varying levels of expertise. At every level, the user-interface objects are intended to be familiar to the materials-science user. This presentation will focus on an interesting example of a nonlinear property, called Ramberg-Osgood elasticity, and the process for incorporating this feature into the OOF architecture.
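For readers unfamiliar with Ramberg-Osgood elasticity, a widely used uniaxial form of the relation gives strain as a nonlinear function of stress: strain = stress/E + alpha * (stress/E) * (|stress|/sigma0)^(n-1). The sketch below evaluates that form and inverts it by bisection. It is a scalar illustration only and says nothing about how the property is implemented in OOF; the parameter names and numerical values are assumptions.

```python
import math

def ramberg_osgood_strain(stress, E, sigma0, alpha, n):
    """Uniaxial strain for a given stress: a linear elastic term plus a
    power-law term that dominates above the reference stress sigma0."""
    linear = stress / E
    return linear + alpha * linear * (abs(stress) / sigma0) ** (n - 1)

def ramberg_osgood_stress(strain, E, sigma0, alpha, n, iterations=200):
    """Invert the relation by bisection. Strain grows monotonically with
    stress, and |stress| can never exceed E * |strain|."""
    target = abs(strain)
    lo, hi = 0.0, E * target
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if ramberg_osgood_strain(mid, E, sigma0, alpha, n) < target:
            lo = mid
        else:
            hi = mid
    return math.copysign(0.5 * (lo + hi), strain)

# Illustrative parameters (stress in MPa); not taken from any specific material.
params = dict(E=200000.0, sigma0=400.0, alpha=1.0, n=5)
strain_at_sigma0 = ramberg_osgood_strain(400.0, **params)
recovered_stress = ramberg_osgood_stress(ramberg_osgood_strain(300.0, **params), **params)
```

A finite-element code needs the stress as a function of strain (and its derivative), which is why the inverse is computed numerically here; the forward relation has no closed-form inverse for general n.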
Interactivity is an important part of computer visualization of data, but all too often the user interfaces to control the visualization are far from optimal. This talk will show how you can use Enable and Chaco to build interactive visualization widgets which give much better user feedback than sliders or text fields.
Chaco is an open-source interactive 2D plotting library that is part of the Enthought Tool Suite. It is built upon the Enable interactive 2D drawing library, and both are compatible with PyQt, wxPython, Pyglet and VTK. These libraries are written in Python and are key tools that Enthought uses to deliver scientific applications to our clients.
This talk will show how to use these tools to build UI widgets that can be used to control visualizations interactively. Rather than building a complex, monolithic control, the approach that we will demonstrate builds the control out of many smaller interactions, each controlling a small piece of the overall state of a visualization, with a high level of reusability.
As a simple but useful case study, we'll show how we built an interactive histogram widget that can be used to adjust the brightness, contrast, gamma and other attributes of an image in real time. We'll also discuss some of the tricks we used to keep the user interactions responsive in the face of having to visualize larger images.
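The numerical side of such a widget can be sketched independently of any GUI toolkit. In the example below, the lower and upper limits stand in for what a user might drag on the histogram (together they set brightness and contrast), and gamma bends the mapping between them. This is an assumed formulation for illustration with plain NumPy, not the code behind the widget described here, and the function name and sample image are hypothetical.

```python
import numpy as np

def adjust_image(image, low=0.0, high=1.0, gamma=1.0):
    """Map intensities so that [low, high] spans [0, 1], clip values
    outside that window, then apply a gamma curve."""
    scaled = (np.asarray(image, dtype=float) - low) / (high - low)
    return np.clip(scaled, 0.0, 1.0) ** gamma

# Tiny grayscale image with intensities in [0, 1].
image = np.array([[0.0, 0.25],
                  [0.5, 1.0]])

# The histogram is what the widget would display for the user to interact with.
counts, edges = np.histogram(image, bins=4, range=(0.0, 1.0))

# Narrowing the window raises contrast; gamma > 1 darkens the mid-tones.
high_contrast = adjust_image(image, low=0.25, high=0.5)
darker_midtones = adjust_image(image, gamma=2.0)
```

In an interactive setting, each drag event would update `low`, `high`, or `gamma` and re-run `adjust_image`, so keeping this function cheap matters for responsiveness.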