Video recording and production done by SciPy
Authors: Kastner, Kyle, Southwest Research Institute
Track: Machine Learning
This talk will be an introduction to the root concepts of machine learning, starting with simple statistics, then working into parameter estimation, regression, model estimation, and basic classification. These are the underpinnings of many techniques in machine learning, though it is often difficult to find a clear and concise explanation of these basic methods.
Parameter estimation will cover Gaussian parameter estimation of the following types: known variance, unknown mean; known mean, unknown variance; and unknown mean, unknown variance.
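As a taste of that first topic, the maximum-likelihood estimators for the three Gaussian cases can be sketched in a few lines of NumPy (the data, seed, and true parameters below are illustrative choices, not material from the talk):

```python
import numpy as np

# Synthetic data: samples from a Gaussian with mean 2.0, std 1.5.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

# Unknown mean, known variance: the ML estimate of the mean is
# simply the sample average.
mu_hat = x.mean()

# Known mean, unknown variance: average squared deviation from the
# known mean.
var_hat_known_mu = np.mean((x - 2.0) ** 2)

# Unknown mean, unknown variance: estimate the mean first, then the
# (biased) ML variance estimate around it.
var_hat = np.mean((x - mu_hat) ** 2)

print(mu_hat, var_hat)
```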
Regression will cover linear regression, linear regression using alternate basis functions, Bayesian linear regression, and Bayesian linear regression with model selection.
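The "alternate basis functions" idea can likewise be sketched: the model stays linear in the weights even when the features are, say, Gaussian bumps, so ordinary least squares still applies. Everything here (data, seed, centers, widths) is an arbitrary illustration, not the talk's material:

```python
import numpy as np

# Synthetic target: a noisy sine wave.
rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

# Design matrix of Gaussian radial basis functions: nonlinear in x,
# but linear in the weights w.
centers = np.linspace(0, 2 * np.pi, 10)
Phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.8) ** 2)

# Solve the least-squares problem Phi @ w ~= y.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
print(np.mean((y_hat - np.sin(x)) ** 2))   # small fitting error
```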
Classification will extend the topic of regression, exploring k-means clustering, linear discriminants, logistic regression, and support vector machines, with some discussion of relevance vector machines for "soft" decision making.
Starting from simple statistics and working upward, I hope to provide a clear grounding of how basic machine learning works mathematically. Understanding the math behind parameter estimation, regression, and classification will help individuals gain an understanding of the more complicated methods in machine learning. This should help demystify some of the modern approaches to machine learning, leading to better technique selection in real-world applications.
Presenters: Gaël Varoquaux, Jake Vanderplas, Olivier Grisel
Machine Learning is the branch of computer science concerned with the development of algorithms which can learn from previously-seen data in order to make predictions about future data, and has become an important part of research in many scientific fields. This set of tutorials will introduce the basics of machine learning, and how these learning tasks can be accomplished using Scikit-Learn, a machine learning library written in Python and built on NumPy, SciPy, and Matplotlib. By the end of the tutorials, participants will be poised to take advantage of Scikit-learn's wide variety of machine learning algorithms to explore their own data sets. The tutorial will comprise two sessions, Session I in the morning (intermediate track), and Session II in the afternoon (advanced track). Participants are free to attend either one or both, but to get the most out of the material, we encourage those attending in the afternoon to attend in the morning as well.
Session I will assume participants already have a basic knowledge of using numpy and matplotlib for manipulating and visualizing data. It will require no prior knowledge of machine learning or scikit-learn. The goals of Session I are to introduce participants to the basic concepts of machine learning, to give a hands-on introduction to using Scikit-learn for machine learning in Python, and give participants experience with several practical examples and applications of applying supervised learning to a variety of data. It will cover basic classification and regression problems, regularization of learning models, basic cross-validation, and some examples from text mining and image processing, all using the tools available in scikit-learn.
Tutorial 1 (intermediate track)
0:00 - 0:15 -- Setup and Introduction
0:15 - 0:30 -- Quick review of data visualization with matplotlib and numpy
0:30 - 1:00 -- Representation of data in machine learning
Downloading data within scikit-learn
Categorical & Image data
Exercise: vectorization of text documents
1:00 - 2:00 -- Basic principles of Machine Learning & the scikit-learn interface
Supervised Learning: Classification & Regression
Unsupervised Learning: Clustering & Dimensionality Reduction
Example of PCA for data visualization
Flow chart: how do I choose what to do with my data set?
Exercise: Interactive Demo on linearly separable data
Regularization: what it is and why it is necessary
2:00 - 2:15 -- Break (possibly in the middle of the previous section)
2:15 - 3:00 -- Supervised Learning
Example of Classification: hand-written digits
Cross-validation: measuring prediction accuracy
Example of Regression: boston house prices
3:00 - 4:15 -- Applications
Examples from text mining
Examples from image processing
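The core Session I workflow (load a built-in dataset, fit a classifier, score it with cross-validation) can be sketched in a few lines; note the module paths below follow current scikit-learn releases rather than the 0.13-era API the tutorial targets:

```python
# Classification of hand-written digits with cross-validated accuracy,
# mirroring the "Supervised Learning" segment of the outline.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()                  # 1797 8x8 digit images
clf = KNeighborsClassifier()
scores = cross_val_score(clf, digits.data, digits.target, cv=5)
print(scores.mean())                    # typically well above 0.9
```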
This tutorial will use Python 2.6 / 2.7, and require recent versions of numpy (version 1.5+), scipy (version 0.10+), matplotlib (version 1.1+), scikit-learn (version 0.13.1+), and IPython (version 0.13.1+) with notebook support. The final requirement is particularly important: participants should be able to run IPython notebook and create & manipulate notebooks in their web browser. The easiest way to install these requirements is to use a packaged distribution: we recommend Anaconda CE, a free package provided by Continuum Analytics: http://continuum.io/downloads.html or the Enthought Python Distribution: http://www.enthought.com/products/epd...
Presenter: Fernando Perez
IPython began its life as a personal "afternoon hack", but almost 12 years later it has become a large and complex project, where we try to think in a comprehensive and coherent way about many related problems in scientific computing. Despite all the moving parts in IPython, there are actually very few key ideas that drive our vision, and I will discuss how we seek to turn this vision into concrete technical constructs. We focus on making the computer a tool for insight and communication, and we will see how every piece of the IPython architecture is driven by these ideas.
I will also look at IPython in the context of the broader SciPy ecosystem: both how the project's user and developer community has evolved over time, and how it maintains an ongoing dialogue with the rest of this ecosystem. We have learned some important lessons along the way that I hope to share, as well as considering the challenges that lie ahead.
Presenter: Christopher Fonnesbeck
This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course is comprised of a 2-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data, while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in Numpy, Scipy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to Bayesian methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.
The target audience for the tutorial includes all new Python users, though we recommend that users also attend the NumPy and IPython session in the introductory track.
Tutorial GitHub repo: https://github.com/fonnesbeck/statist...
Introduction to Pandas (45 min)
Series and DataFrame objects
Indexing, data selection and subsetting
Reading and writing files
Data Wrangling with Pandas (45 min)
Indexing, selection and subsetting
Reshaping DataFrame objects
Data aggregation and GroupBy operations
Merging and joining DataFrame objects
Plotting and Visualization (45 min)
Time series plots
Visualization pro tips
Statistical Data Modeling (45 min)
Fitting data to probability distributions
Time series analysis
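A toy sketch of the GroupBy and merge topics listed above (the dataset is invented purely for illustration):

```python
import pandas as pd

# Toy records: repeated measurements per patient, plus a lookup table.
df = pd.DataFrame({
    "patient": [1, 1, 2, 2, 3],
    "visit": [1, 2, 1, 2, 1],
    "score": [3.1, 2.8, 4.0, 4.2, 1.9],
})
info = pd.DataFrame({"patient": [1, 2, 3], "site": ["A", "B", "A"]})

# Data aggregation with GroupBy: mean score per patient.
means = df.groupby("patient")["score"].mean()

# Merging/joining DataFrame objects: attach site info to each visit.
merged = df.merge(info, on="patient")
print(means)
print(merged.head())
```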
Python 2.7 or higher (including Python 3)
pandas 0.11.1 or higher, and its dependencies
NumPy 1.6.1 or higher
matplotlib 1.0.0 or higher
IPython 0.12 or higher
Authors: Kitchin, John Carnegie Mellon University
Track: Reproducible Science
We will discuss the use of emacs + org-mode + python in enabling reproducible research. This combination of software enables researchers to intertwine narrative and mathematical text with figures and code that is executable within a document, with capture of the output. Portions of the document can be selectively exported to LaTeX, HTML, PDF and other formats. We have used this method to produce technical manuscripts submitted for peer review in scientific journals, in the preparation of two e-books (about 300 pages each) on using python in scientific and engineering applications (http://jkitchin.github.com/pycse), and in using python in the modeling of the properties of materials with density functional theory (http://jkitchin.github.com/dft-book), as well as a python-powered blog at http://jkitchin.github.com. Our experience suggests all three components are critical for enabling reproducible research in practice: an extensible editor, a markup language that separates text, math, data and code, and an effective language such as python. We will show examples of the pros and cons of this particular implementation of the editor/markup/code combination.
Authors: Kridler, Nicholas, Accretive Health
Bits are bits. Whether you are searching for whales in audio clips or trying to predict hospitalization rates based on insurance claims, the process is the same: clean the data, generate features, build a model, and iterate. Better features lead to a better model, but without domain expertise it is often difficult to extract those features. NumPy/SciPy, Matplotlib, Pandas, and scikit-learn provide an excellent framework for data analysis and feature discovery. This is evidenced by high-performing models in the Heritage Health Prize and the Marinexplore Right Whale Detection challenge. In both competitions, the largest performance gains came from identifying better features. This required being able to repeatedly visualize and characterize model successes and failures. Python provides this capability, as well as the ability to rapidly implement and test new features. This talk will discuss how Python was used to develop competitive predictive models based on derived features discovered through data analysis.
Authors: Michael Droettboom
Track: Reproducible Science
This talk will be a general "state of the project address" for matplotlib, the popular plotting library in the scientific Python stack. It will provide an update about new features added to matplotlib over the course of the last year, outline some ongoing planned work, and describe some challenges to move into the future. The new features include a web browser backend, "sketch" style, and numerous other bugfixes and improvements. Also discussed will be the challenges and lessons learned moving to Python 3. Our new "MEP" (matplotlib enhancement proposal) method will be introduced, and the ongoing MEPs will be discussed, such as moving to properties, updating the docstrings, etc. Some of the more pie-in-the-sky plans (such as styling and serializing) will be discussed. It is hoped that this overview will be useful for those who use matplotlib, but don't necessarily follow its mailing list in detail, and also serve as a call to arms for assistance for the project.
Authors: Beaumont, Christopher, U. Hawaii; Robitaille, Thomas, MPIA; Borkin, Michelle, Harvard; Goodman, Alys
Modern research projects incorporate data from several sources, and new insights are increasingly driven by the ability to interpret data in the context of other data. Glue (http://glueviz.org) is a graphical environment built on top of the standard Python science stack to visualize relationships within and between data sets. With Glue, users can load and visualize multiple related data sets simultaneously. Users specify the logical connections that exist between data, and Glue transparently uses this information as needed to enable visualization across files. This functionality makes it trivial, for example, to interactively overplot catalogs on top of images.
The central philosophy behind Glue is that the structure of research data is highly customized and problem-specific. Glue aims to accommodate and to simplify the "data munging" process, so that researchers can more naturally explore what their data has to say. The result is a cleaner scientific workflow, and more rapid interaction with data.
Presenter: Olivier Grisel
This talk will give an overview of recent trends in Machine Learning, namely Deep Learning, Probabilistic Programming, and Distributed Computing for Machine Learning, and will demonstrate how the SciPy community at large is building innovative tools to follow those trends and sometimes even lead them.
Authors: Johnson, Leif, University of Texas at Austin
Track: Machine Learning
Sparse coding and feature learning have become popular areas of research in machine learning and neuroscience in the past few years, and for good reason: sparse codes can be applied to real-world data to obtain "explanations" that make sense to people, and the features used in these codes can be learned automatically from unsupervised datasets. In addition, sparse coding is a good model for the sorts of data processing that happens in some areas of the brain that process sensory data (Olshausen & Field 1996, Smith & Lewicki 2006), hinting that sparsity or redundancy reduction (Barlow 1961) is a good way of representing raw, real-world signals.
In this talk I will summarize several algorithms for sparse coding (k-means [MacQueen 1967], matching pursuit [Mallat & Zhang 1994], lasso regression [Tibshirani 1996], sparse neural networks [Lee Ekanadham & Ng 2008, Vincent & Bengio 2010]) and describe associated algorithms for learning dictionaries of features to use in the encoding process. The talk will include pointers to several nice Python tools for performing these tasks, including standard scipy function minimization, scikit-learn, SPAMS, MORB, and my own packages for building neural networks. Many of these techniques converge to the same or qualitatively similar solutions, so I will briefly mention some recent results that indicate the encoding can be more important than the specific features that are used (Coates & Ng, 2011).
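For a flavor of the simplest of these algorithms, here is a minimal matching-pursuit sketch in plain NumPy: greedily explain a signal as a sparse combination of dictionary atoms. The dictionary here is random rather than learned, and all sizes and coefficients are illustrative:

```python
import numpy as np

# Random dictionary of 256 unit-norm atoms in 64 dimensions.
rng = np.random.default_rng(2)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)

# Build a signal from 3 atoms, then try to recover a sparse code.
true_idx = [10, 50, 200]
signal = D[:, true_idx] @ np.array([1.5, -2.0, 1.0])

residual = signal.copy()
code = np.zeros(256)
for _ in range(3):
    scores = D.T @ residual             # correlation with each atom
    k = np.argmax(np.abs(scores))       # best-matching atom
    code[k] += scores[k]                # record its coefficient
    residual -= scores[k] * D[:, k]     # remove its contribution

print(np.nonzero(code)[0], np.linalg.norm(residual))
```

Orthogonal matching pursuit refines this by re-fitting all selected coefficients jointly at each step, which typically shrinks the residual faster.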
Authors: Bekolay, Trevor, University of Waterloo
Code that properly tracks the units associated with physical quantities is self-documenting and far more robust to unit conversion errors. Unit conversion errors are common in any program that deals with physical quantities, and have been responsible for several expensive and dangerous software errors, like the Mars Climate Orbiter crash. Support for tracking units is lacking in commonly used packages like NumPy and SciPy. As a result, a whole host of packages have been created to fill this gap, with varying implementations. Some build on top of the commonly used scientific packages, adding to their data structures the ability to track units. Other packages track units separately, and store a mapping between units and the data structures containing magnitudes.
I will discuss why tracking physical quantities is an essential function for any programming language heavily used in science. I will then compare and contrast all of the packages that currently exist for tracking quantities in terms of their functionality, syntax, underlying implementation, and performance. Finally, I will present a possible unification of the existing packages that enables the majority of use cases, and I will discuss where that unified implementation fits into the current scientific Python environment.
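To make the design trade-offs concrete, here is a minimal sketch of one implementation style the talk compares: a value bundled with base-unit exponents, checked on addition and combined on multiplication. Real packages are far more complete; every name below is invented for illustration:

```python
# Hypothetical minimal quantity wrapper; units are a dict mapping
# base-unit symbols to integer exponents, e.g. {"m": 1, "s": -1}.
class Quantity:
    def __init__(self, value, units):
        self.value = value
        self.units = units

    def __add__(self, other):
        # Addition only makes sense for identical dimensions.
        if self.units != other.units:
            raise ValueError("incompatible units")
        return Quantity(self.value + other.value, self.units)

    def __mul__(self, other):
        # Multiplication adds unit exponents; drop zero exponents.
        units = dict(self.units)
        for u, p in other.units.items():
            units[u] = units.get(u, 0) + p
        return Quantity(self.value * other.value,
                        {u: p for u, p in units.items() if p})

speed = Quantity(3.0, {"m": 1, "s": -1})
duration = Quantity(2.0, {"s": 1})
dist = speed * duration
print(dist.value, dist.units)           # 6.0 {'m': 1}
```

The alternative style tracks units in a registry separate from the magnitude arrays, which keeps NumPy operations fast at the cost of looser coupling.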
Presentation of finalists for excellence in plotting using Matplotlib.
Authors: Cordoba, Carlos, The Spyder Project
Track: Reproducible Science
The enormous progress made by the IPython project during the last two years has led many of us in the Python scientific community to think that we are quite close to providing an application that can rival the two big M's of the scientific computing world: Matlab and Mathematica.
However, after following the project on GitHub and its mailing list for almost as long, and especially after reading its roadmap for the next two years, we at Spyder believe that its real focus lies elsewhere. IPython's developers are working hard to build several powerful and flexible interfaces for evaluating and documenting code, but they seem to have some trouble moving from a console application to a GUI one (e.g. see GitHub issues 1747, 2203, 2522, 2974 and 2985).
We believe Spyder can really help solve these issues, by integrating IPython into a richer and more intuitive, yet powerful, environment. After working with the aforementioned M's, most people expect not only a good evaluation interface but also easy access to rich-text documentation, a specialized editor, and a namespace browser, tied to good debugging facilities. Spyder already has all these features and, right now, also the best integration with the IPython Qt frontend.
This shows that Spyder can be the perfect complement to IPython, providing what it is missing and aiming to reach a wider audience (not just researchers and graduate students). As the current Spyder maintainer, I would like to attend SciPy to show the community more concretely what our added value to the scientific Python ecosystem is. We would also like to get in closer contact with the community and receive direct feedback on which features we should add or improve in our next releases.
Series of lightning talks
Authors: Poore, Geoffrey, Union University
Track: Reproducible Science
Writing a scientific document can be slow and error-prone. When a figure or calculation needs to be modified, the code that created it must be located, edited, and re-executed. When data changes or analysis is tweaked, everything that depends on it must be updated. PythonTeX is a LaTeX package that addresses these issues by allowing Python code to be included within LaTeX documents. Python code may be entered adjacent to the figure or calculation it produces. Built-in utilities may be used to track dependencies.
PythonTeX maximizes performance and efficiency. All code output is cached, so that documents can be compiled without executing code. Code is only re-executed when user-specified criteria are met, such as exit status or modified dependencies. In many cases, dependencies can be detected and tracked automatically. Slow code may be isolated in user-defined sessions, which automatically run in parallel. Errors and warnings are synchronized with the document so that they have meaningful line numbers.
Since PythonTeX documents mix LaTeX and Python code, they are less portable than plain LaTeX documents. PythonTeX includes a conversion utility that creates a new copy of a document in which all Python code is replaced by its output. The result is suitable for journal submission or conversion to other formats such as HTML.
While PythonTeX is primarily intended for Python, its design is largely language-independent. Users may easily add support for additional languages.
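A minimal sketch of what a PythonTeX document looks like (the pycode environment and \py macro are part of the package; the computation itself is invented for illustration):

```latex
% Compile with: pdflatex doc.tex && pythontex doc.tex && pdflatex doc.tex
\documentclass{article}
\usepackage{pythontex}
\begin{document}

% Python executed during compilation; results are cached.
\begin{pycode}
radius = 3.0
area = 3.141592653589793 * radius**2
\end{pycode}

The area of the circle is \py{round(area, 2)}.

\end{document}
```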
Authors: Moore, Jason, University of California at Davis; Dembia, Christopher, Stanford
Track: Medical Imaging
The Yeadon human body segment inertia model is a widely used method in the biomechanics field that allows scientists to get quick and reliable estimates of the mass, mass location, and inertia of any human body. The model is formulated around a collection of stadium solids that are defined by a series of width, perimeter, and circumference measurements. This talk will detail a Python software package that implements the method and exposes a basic API for its use within other code bases. The package also includes a text-based user interface and a graphical user interface, both of which will be demonstrated. The GUI is implemented with MayaVi and allows the user to manipulate the joint angles of the human and instantaneously get inertia estimates for various poses. Researchers who need body segment and human inertial parameters for dynamical model development or other uses should find this package useful for quick interactive results. We will demonstrate the three methods of using the package, cover the software design, show how the software can be integrated into other packages, and demonstrate a non-trivial example of computing the inertial properties of a human seated on a bicycle.
Authors: Zinkov, Rob
Track: Machine Learning
Many machine learning problems involve datasets with complex dependencies between the variables we are trying to predict, and even between the data points themselves. Unfortunately, most machine learning libraries are unable to model these dependencies and make use of them. In this talk, I will introduce two libraries, pyCRFsuite and PyStruct, and show how they can be used to solve machine learning problems where modeling the relations between data points is crucial for getting reasonable accuracy. I will cover how these libraries can be used for classifying webpages as spam, named entity extraction, and sentiment analysis.
Authors: Ivanov, Paul, UC Berkeley
In this talk, I will focus on the how of reproducible research, covering specific tools and techniques I have found invaluable in doing research in a reproducible manner. In particular, I will cover the following general topics (with specific examples in parentheses): version control and code provenance (git), code verification (test-driven development, nosetests), data integrity (sha1, md5, git-annex), seed saving (random seed retention), distribution of datasets (mirroring, git-annex, metalinks), and light-weight analysis capture (ttyrec, IPython notebook).
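The data-integrity idea can be made concrete with the standard library alone: record a checksum when a dataset is created, and verify it before each analysis run. The file name and helper below are invented for illustration:

```python
import hashlib

def checksum(path):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Write a toy "dataset", fingerprint it, and verify it later.
with open("data.bin", "wb") as f:
    f.write(b"measurements: 1, 2, 3\n")
recorded = checksum("data.bin")
assert checksum("data.bin") == recorded   # data unchanged since recording
print(recorded)
```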
Authors: Skillman, Samuel, University of Colorado at Boulder; Turk, Matthew, Columbia University
We will describe the development, design, and deployment of the volume rendering framework within yt, an open-source Python library for computational astrophysics. In order to accommodate increasingly large datasets, we have developed a parallel kd-tree construction written using Python, NumPy, and Cython. We couple this parallel kd-tree with two additional levels of parallelism, exposed through image-plane decomposition with mpi4py and individual brick traversal with OpenMP threads, for a total of three levels of parallelism. This framework is capable of handling some of the world's largest adaptive mesh refinement simulations as well as some of the largest uniform grid data (up to 4096^3 at the time of this submission). This development has been driven by the need for both inspecting and presenting our own scientific work, with designs constructed by our community of users. Finally, we will close by examining case studies which have benefited from the user-developed nature of our volume renderer, as well as discussing future improvements to both the user interface and parallel capability.
Authors: Zhang, Zhang, Intel Corporation; Rosenquist, Todd, Intel Corporation; Moffat, Kent, Intel Corporation
The call for reproducible computational results in scientific research areas has increasingly resonated in recent years. Given that a lot of research work uses mathematical tools and relies on modern high performance computers for numerical computation, obtaining reproducible floating-point computation results becomes fundamentally important in ensuring that research work is reproducible.
It is well understood that, generally, operations involving IEEE floating-point numbers are not associative. For example, (a+b)+c may not equal a+(b+c). Different orders of operations may lead to different results. But exploiting parallelism in modern performance-oriented computer systems has typically implied out-of-order execution. This poses a great challenge to researchers who need exactly the same numerical results from run to run, and across different systems.
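The non-associativity in question is easy to demonstrate in plain Python floating point:

```python
# (a + b) + c and a + (b + c) differ because each intermediate
# result is rounded to the nearest representable double.
a, b, c = 1e16, -1e16, 1.0
left = (a + b) + c      # a + b cancels exactly, then + 1.0 -> 1.0
right = a + (b + c)     # b + c rounds back to -1e16, so the sum is 0.0
print(left, right)      # 1.0 0.0
```

A parallel reduction that changes the grouping of terms from run to run can therefore change the result, which is exactly the reproducibility problem the talk addresses.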
This talk describes how to use tools such as Intel® Math Kernel Library (Intel® MKL) and Intel® compilers to build numerical reproducibility into Python based tools. Intel® MKL includes a feature called Conditional Numerical Reproducibility that allows users to get reproducible floating-point results when calling functions from the library. Intel® compilers provide broader solutions to ensure that compiler-generated code produces reproducible results. We demonstrate that scientific computing with Python can be numerically reproducible without losing much of the performance offered by modern computers. Our discussion focuses on providing different levels of controls to obtain reproducibility on the same system, across multiple generations of Intel architectures, and across Intel architectures and Intel-compatible architectures. The performance impact of each level of control is discussed in detail. Our conclusion is that there is usually a certain degree of trade-off between reproducibility and performance. The approach we take gives end users many choices for balancing the requirement of reproducible results with the speed of computing.
This talk uses NumPy/SciPy as an example, but the principles and the methodologies presented apply to any Python tools for scientific computing.
Authors: Davison, Andrew, CNRS (principal developer); Wheeler, Daniel, NIST (speaker)
Track: Reproducible Science
Sumatra is a lightweight system for recording the history and provenance data for numerical simulations. It works particularly well for scientists that are in the intermediate stage between developing a code base and using that code base for active research. This is a common scenario and often results in a mode of development that mixes branching for both code development and production simulations. Using Sumatra avoids this unintended use of the versioning system by providing a lightweight design for recording the provenance data independently from the versioning system used for the code development. The lightweight design of Sumatra fits well with existing ad-hoc patterns of simulation management contrasting with more pervasive workflow tools, which can require a wholesale alteration of work patterns. Sumatra uses a straightforward Django-based data model enabling persistent data storage independently from the Sumatra installation. Sumatra provides a command line utility with a rudimentary web interface, but has the potential to become a full web-based simulation management solution. During the talk, the speaker will provide an introduction to Sumatra as well as demonstrate some typical usage patterns and discuss achievable future goals.
Authors: Signell, Richard, US Geological Survey
Track: Meteorology, Climatology, Atmospheric and Oceanic Science
Coastal ocean modelers are producers and consumers of vast and varied data, and spend significant effort on tasks that could be eliminated by better tools. In the last several years, standardization led by the US Integrated Ocean Observing System Program to use OPeNDAP for delivery of gridded data (e.g. model fields, remote sensing) and OGC Sensor Observation Services (SOS) for delivery of in situ data (e.g. time series sensors, profilers, ADCPs, drifters, gliders) has resulted in significant advancements, making it easier to deliver, find, access and analyze data. For distributing model results, the Unidata THREDDS Data Server and PyDAP deliver aggregated data via OPeNDAP and other web services with low impact on providers. For accessing data, NetCDF4-Python and PyDAP both allow efficient access to OPeNDAP data sources, but do not take advantage of common data models for structured and unstructured grids enabled by community-developed CF and UGRID conventions. This is starting to change with CF-data model based projects like the UK Met Office Iris project. Examples of accessing and visualizing both curvilinear and unstructured grid model output in Python will be presented, including both the IPython Notebook and ArcGIS 10.1.
Authors: Jacob Barhak
MIST stands for Micro-Simulation Tool. It is a modeling and simulation framework that supports computational chronic disease modeling activities. It is a fork of IEST (Indirect Estimation and Simulation Tool), a GPL modeling framework.
MIST removes the complexity associated with the estimation engine, with parameter definitions, and with rule restrictions. This significantly simplifies the system and allows development along the micro-simulation path to proceed less encumbered.
The incentive to split off MIST was to adapt the code to use newer compiler technology to speed up simulations. There is misplaced skepticism in the medical disease modeling community toward using interpreters for simulations, due to performance concerns. The use of advanced compiler technology with Python may remedy this misconception and provide optimized Python-based simulations. MIST is a first step in this direction.
MIST also takes care of a few documented and known issues, and moves to newer scientific Python stacks such as Anaconda and PythonXY as its platform. This improves its accessibility to less sophisticated users, who can now benefit from easier installation.
The Reference Model for disease progression intends to use MIST as its main platform. Yet MIST is equipped with a Micro-simulation compiler designed to accommodate Monte Carlo simulations for other purposes.
Best-practice variant calling pipeline for fully automated high throughput sequencing analysis
Authors: Chapman, Brad; Kirchner, Rory; Hofmann, Oliver; Hide, Winston
bcbio-nextgen is an automated, scalable pipeline for detecting genomic variants from large-scale next-generation sequencing data. It organizes multiple best-practice tools for alignment, post-processing and variant calling into a single, easily configurable pipeline. Users specify inputs and parameters in a configuration file and the pipeline handles all aspects of software and data management. Large-scale analyses run in parallel on compute clusters using IPython and on cloud systems using StarCluster. The goal is to create a validated and community-maintained pipeline for automated variant calling, allowing researchers to focus on answering biological questions.
Our talk will describe the practical challenges we face in scaling the system to handle large whole-genome data for thousands of samples. We will also discuss current work to develop a variant reference panel and associated grading scheme that ensures reproducibility in a research world with rapidly changing algorithms and tools. Finally, we detail plans for integration with STORMseq, a user-friendly Amazon front end designed to make the pipeline available to non-technical users.
The presentation will show how bringing together multiple open-source communities provides infrastructure that bridges technical gaps and moves analysis work to higher-level challenges.
Authors: Panel participants: Sergio Ray (Arizona State U), Shaun Walbridge (ESRI), Andrew Wilson (TWDB)
Track: GIS - Geospatial Data Analysis
Authors: Wilcox, Kyle, Applied Science Associates (ASA); Crosby, Alex, Applied Science Associates (ASA)
Track: GIS - Geospatial Data Analysis
LarvaMap is an open-access larval transport modeling tool. The idea behind LarvaMap is to make it easy for researchers everywhere to use sophisticated larval transport models to explore and test hypotheses about the early life of marine organisms.
LarvaMap integrates four components: an ocean circulation model, a larval behavior library, a python Lagrangian particle model, and a web-system for running the transport models.
An open-source particle transport model was written in python to support LarvaMap. The model utilizes a parallel multi-process architecture. Remote data are cached to a local file in small chunks when a process requires data, and the local data are shared between all of the active processes as the model runs. The caching approach improves performance and reduces the load on data servers by limiting the frequency and total number of web requests as well as the size of the data being moved over the internet.
Model outputs include particle trajectories in common formats (e.g., netCDF-CF and ESRI Shapefile), a web-accessible GeoJSON representation of the particle centroid trajectory, and a stochastic GeoTIFF representation of the probabilities associated with a collection of modeling runs. The common interoperable data formats allow a variety of tools to be used for in-depth analysis of the model results.
lpEdit: An editor to facilitate reproducible analysis via literate programming
Authors: Richards, Adam, Duke University, CNRS France; Kosinski, Andrzej, Duke University; Bonneaud, Camille,
Track: Reproducible Science
There is evidence to suggest that a surprising proportion of published experiments in science are difficult if not impossible to reproduce. The concepts of data sharing, leaving an audit trail and extensive documentation are essential to reproducible research, whether it is in the laboratory or as part of an analysis. In this work, we introduce a tool for documentation that aims to make analyses more reproducible in the general scientific community.
The application, lpEdit, is a cross-platform editor, written with PyQt4, that enables a broad range of scientists to carry out the analytic component of their work in a reproducible manner---through the use of literate programming. Literate programming mixes code and prose to produce a final report that reads like an article or book. A major target audience of lpEdit is researchers getting started with statistics or programming, so the hurdles associated with setting up a proper pipeline are kept to a minimum and the learning burden is reduced through the use of templates and documentation. The documentation for lpEdit is centered around learning by example, and accordingly we use several increasingly involved examples to demonstrate the software's capabilities.
Because it is commonly used, we begin with an example of Sweave in lpEdit; then, just as R may be embedded into LaTeX, we show how Python can be used in the same way. Next, we demonstrate how both R and Python code may be embedded into reStructuredText (reST). Finally, we walk through a more complete example, where we perform a functional analysis of high-throughput sequencing data, using the transcriptome of the butterfly species Pieris brassicae. There is substantial flexibility that is made available through the use of LaTeX and reST, which facilitates reproducibility through the creation of reports, presentations and web pages.
Authors: Pedersen, Brent; University of Colorado
After traditional bioinformatic analyses, we are often left with a set of genomic regions; for example: ChIP-Seq peaks, transcription-factor binding sites, differentially methylated regions, or sites of loss-of-heterozygosity. This talk will go over the difficulties commonly encountered at this stage of an investigation and cover some additional analyses, using python libraries, that can help to provide insight into the function of a set of intervals. Some of the libraries covered will be pybedtools, cruzdb, pandas, and shuffler. The focus will be on annotation, exploratory data analysis and calculation of simple enrichment metrics with those tools. The format will be a walk-through (in the IPython notebook) of a set of these analyses that utilizes ENCODE and other publicly available data to annotate an example dataset.
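The "simple enrichment metrics" mentioned above typically compare how often the intervals of interest overlap a feature set against the overlap expected by chance. The sketch below is a self-contained, pure-Python illustration of that shuffle-and-count idea (the talk itself uses pybedtools and shuffler for this); the interval coordinates and genome length are made up for the example.

```python
import random

def overlaps(iv, features):
    """True if half-open interval iv = (start, end) overlaps any feature."""
    s, e = iv
    return any(s < fe and fs < e for fs, fe in features)

def enrichment(intervals, features, genome_len, n_shuffles=100, seed=0):
    """Observed overlap count vs. the mean count after random shuffling.

    Each shuffle relocates every interval (keeping its length) to a
    uniformly random position, mimicking `bedtools shuffle` + intersect.
    """
    rng = random.Random(seed)
    observed = sum(overlaps(iv, features) for iv in intervals)
    total = 0
    for _ in range(n_shuffles):
        for s, e in intervals:
            length = e - s
            ns = rng.randrange(genome_len - length)
            total += overlaps((ns, ns + length), features)
    expected = total / n_shuffles
    return observed, expected

# Toy data: three intervals that all fall inside a single 1 kb feature.
obs, exp = enrichment([(100, 200), (300, 400), (500, 600)],
                      [(0, 1000)], genome_len=1_000_000, n_shuffles=50, seed=1)
```

With real data, the ratio of observed to expected counts (and an empirical p-value from the shuffled distribution) gives a quick read on whether the intervals are enriched near the annotation of interest.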
The Production of a Multi-Resolution Global Index for Geographic Information Systems
Authors: MacManus, Kytt, Columbia University CIESIN
Track: GIS - Geospatial Data Analysis
In order to efficiently access geographic information at the pixel level, at a global scale, it is useful to develop an indexing system with nested location information. Considering a 1 sq. km image resolution, the number of global pixels covering land exceeds 200 million. This talk will summarize the steps taken to produce a global multi-resolution raster indexing system using the Geospatial Data Abstraction Library (GDAL) 1.9, and NumPy. The implications of presenting this data to a user community reliant on Microsoft Office technologies will also be discussed.
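The abstract does not spell out the indexing scheme used in the talk, but the nesting idea can be illustrated with a simple row-major index in which level 0 uses 1-degree cells and each level doubles the resolution. Everything below (the level convention, the functions, and their names) is a hypothetical sketch, not the system described in the talk; edge coordinates (lat = -90, lon = 180) would need clamping in a production version.

```python
def cell_index(lat, lon, level):
    """Row-major cell number for (lat, lon) at the given resolution level.

    Level 0 = 1-degree cells; each level halves the cell size, so
    level 7 cells are ~1 km across at the equator.
    """
    cells_per_degree = 2 ** level
    ncols = 360 * cells_per_degree
    row = int((90 - lat) * cells_per_degree)   # rows count down from the north pole
    col = int((lon + 180) * cells_per_degree)  # columns count east from -180
    return row * ncols + col

def parent_index(index, level):
    """Index of the enclosing cell one resolution level up."""
    ncols = 360 * 2 ** level
    row, col = divmod(index, ncols)
    return (row // 2) * (360 * 2 ** (level - 1)) + col // 2
```

The key property of such a nested scheme is that any fine-resolution cell maps to its coarser parent by integer arithmetic alone, with no spatial lookup, which is what makes multi-resolution queries over hundreds of millions of pixels tractable.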
The annual SciPy Conference allows participants from academic, commercial, and governmental organizations to showcase their latest Scientific Python projects, learn from skilled users and developers, and collaborate on code development.
The conference consists of two days of tutorials followed by two days of presentations, and concludes with two days of developer sprints on projects of interest to the attendees.
Authors: Simon Ratcliffe SKA South Africa, Ludwig Schwardt SKA South Africa
Track: Astronomy and Astrophysics
The Square Kilometer Array will be one of the prime scientific data generators of the next few decades.
Construction is scheduled to commence in late 2016 and last for the best part of a decade. Current estimates put data volume generation near 1 Exabyte per day with 2-3 ExaFLOPs of processing required to handle this data.
As a host country, South Africa is constructing a large precursor telescope known as MeerKAT. Once complete, this will be the most sensitive telescope of its kind in the world - until dwarfed by the SKA.
We make extensive use of Python from the entire Monitor and Control system through to data handling and processing.
This talk looks at our current usage of Python, and our desire for the entire high-performance processing chain to be able to call itself Pythonic.
We will discuss some of the challenges specific to the radio astronomy environment and how we believe Python can contribute, particularly when it comes to the trade-off between development time and performance.