Ideal Future Data Processing System

Summary

This is a document to assemble ideas for elements of an ideal, next-generation data processing system for seismology.  The intent is to start this document at a SIG during the 2016 IRIS Workshop and expand it in other forums until some convergence is reached.  Most notably, students in the 2016 Data Processing Short Course will have an opportunity to revise this document.

Roots of this Package

Brief Development History

Contributors to this page should document here what they added and when.

The first section was begun by Gary Pavlis and Frank Vernon and presented at the IRIS Workshop in June 2016.   The initial draft presented at that forum will be preserved only in the master git repository.   During the discussion the document will be revised to reflect broader community input.   The differences will be visible on GitHub.

Community

Our dream is that from this document will emerge the seeds of new ideas that will flower into the next generation of seismic processing software.

Data Objects Supported

An ideal system should be as generic as possible, but not so general as to become ponderous.  In modern computing, generic always means abstraction to some degree or another, and that usually means some version of object-oriented programming.  Whether an implementation chooses to use an OOP language is a different issue.  The idea here is that we should define what key data concepts we need to handle and encapsulate.

Below is a list of seismological data objects.   It probably should ultimately be listed in priority order, but for now this list is only loosely ordered.
  1. Scalar (single channel) time series data.   Needs to include a generic, extensible header/metadata (see the sketch after this list). example concept
  2. Three-component seismograms.   Needs to include a generic, extensible header/metadata. example concept
  3. Ensembles of single channel data example concept
  4. Ensembles of three-component seismograms example concept
  5. Seismic event example concept
  6. Seismic station example concept (this example encapsulates only location information, child would define response information)
  7. A catalog of seismic events (an event ensemble) example concept
  8. A seismic array/network (an ensemble of stations) example concept
  9. Continuous single channel data streams 
  10. Continuous three-component data streams 
  11. Moment tensors
  12. Spectral data
  13. Arrival times and related measurements
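
As a concreteness check on items 1 and 2 above, here is a minimal sketch of a scalar time series object with an open-ended metadata container and a three-component child of it.  All class and attribute names are hypothetical illustrations, not a proposed standard.

    import numpy as np

    class TimeSeries:
        """Hypothetical scalar time-series object with an extensible header (item 1)."""
        def __init__(self, data, starttime, delta, **metadata):
            self.data = np.asarray(data, dtype=float)  # sample values
            self.starttime = starttime                 # epoch time of the first sample
            self.delta = delta                         # sample interval in seconds
            self.metadata = dict(metadata)             # open-ended key/value header

        @property
        def endtime(self):
            return self.starttime + self.delta * (self.data.shape[-1] - 1)

    class ThreeComponentSeismogram(TimeSeries):
        """Hypothetical three-component child (item 2): data is a 3 x nsamp array."""
        def __init__(self, data, starttime, delta, orientation=None, **metadata):
            super().__init__(data, starttime, delta, **metadata)
            self.data = self.data.reshape(3, -1)       # one row per component
            self.orientation = orientation             # e.g. component azimuth/dip pairs
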
It is useful to separate out supporting data objects that are common in later-stage data analysis, but which are not seismological sensu stricto and hence likely have more generic implementations in other communities.
  1. Two-dimensional grid structures (used for a whole host of map related objects).  Note this means a generic method to describe a 2D object with discrete data and does not necessarily mean a regular mesh (perhaps should be expanded to a list of supported types).   A well-defined standard already exists in netCDF and probably should just be endorsed (see the sketch after this list).
  2. Three-dimensional grid structures (used for 3D velocity models and other forms of 3D imaging).  Note this means a generic method to describe a 3D object with discrete data and does not necessarily mean a regular mesh (perhaps should be expanded to a list of supported types).
  3. Earth models (radially symmetric to full 3D models of different physical properties)
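
As an illustration of what endorsing netCDF for item 1 could look like in practice, here is a minimal sketch that writes a toy regular latitude/longitude grid with the netCDF4 Python bindings; the file name, dimensions, and field values are placeholders.

    from netCDF4 import Dataset
    import numpy as np

    # Write a toy 2D grid (181 x 361 regular lat/lon mesh) to a netCDF file.
    nc = Dataset("grid.nc", "w", format="NETCDF4")
    nc.createDimension("lat", 181)
    nc.createDimension("lon", 361)
    lat = nc.createVariable("lat", "f4", ("lat",))
    lon = nc.createVariable("lon", "f4", ("lon",))
    val = nc.createVariable("value", "f4", ("lat", "lon"))
    lat.units = "degrees_north"
    lon.units = "degrees_east"
    lat[:] = np.linspace(-90.0, 90.0, 181)
    lon[:] = np.linspace(-180.0, 180.0, 361)
    val[:, :] = np.zeros((181, 361), dtype="f4")   # placeholder field values
    nc.close()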

Data Flow Model

This likely will require a careful design study when there are fewer unknowns.   An ideal system would probably be more in the realm of a "workflow" manager that would allow one to adapt legacy code, run legacy programs like SAC within the workflow, run applications that are not universally available like Matlab and Antelope, and provide a way to self-document the workflow.

An implementation path to get started might be to define Python interfaces between existing components (e.g., SAC, Matlab) and assume the programming (shell script) model for data flow, as sketched below.
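
A minimal sketch of that path, assuming a SAC executable is on the local PATH and can be driven through its normal command-line interface from Python; the file names and filter parameters are only illustrative.

    import subprocess

    def run_sac(commands, sac_binary="sac"):
        """Feed a list of SAC commands to a SAC process via stdin and return its stdout.
        Assumes the SAC binary is on the PATH; the captured output documents the step."""
        script = "\n".join(commands + ["quit"]) + "\n"
        result = subprocess.run([sac_binary], input=script, text=True,
                                capture_output=True, check=True)
        return result.stdout

    # Example workflow step: read a file, remove mean and trend, band-pass, write out.
    log = run_sac([
        "read example.sac",
        "rmean",
        "rtrend",
        "bandpass corner 0.5 2.0 npoles 4",
        "write example_filt.sac",
    ])
    print(log)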

Data Management System

Metadata

Managing metadata has a long history of people walking into dark alleys and never coming out.   Addressing this issue needs to be done with a very pragmatic, engineering approach that builds on the large IT infrastructure related to "big data" problems that already exists or is under development.   It will involve some form of database, but an ideal system should keep the implementation independent of any particular database.   The community need today may be in an area only a librarian could love: an "ontology".
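
One pragmatic way to keep the implementation independent is to hide whatever database is chosen behind a small abstract interface so that processing algorithms never see the backend.  The sketch below is only illustrative; the class and method names are hypothetical, not a proposed API.

    from abc import ABC, abstractmethod

    class MetadataStore(ABC):
        """Hypothetical interface: algorithms talk to this, not to any particular database."""
        @abstractmethod
        def get(self, object_id, key):
            """Return the value of a named attribute for one data object."""
        @abstractmethod
        def put(self, object_id, key, value):
            """Store or update a named attribute for one data object."""

    class DictMetadataStore(MetadataStore):
        """Trivial in-memory backend; a relational or NoSQL backend would expose the same calls."""
        def __init__(self):
            self._store = {}
        def get(self, object_id, key):
            return self._store[object_id][key]
        def put(self, object_id, key, value):
            self._store.setdefault(object_id, {})[key] = value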

The other issue that comes up with metadata is how web services should be folded into a metadata management system.   Web services are ideal for maintaining definitive copies, but are known to create bottlenecks if used in "just in time" mode during data processing.
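
A common way to get the best of both is to treat the web service as the definitive source but cache what is fetched locally, so a processing run hits the service once rather than once per waveform.  The sketch below uses ObsPy's FDSN client; the cache layout and the lack of any refresh policy are simplifications.

    import os
    from obspy import read_inventory
    from obspy.clients.fdsn import Client

    def cached_inventory(network, station, cache_dir="meta_cache"):
        """Fetch station metadata from a web service once, then reuse the local copy.
        The caching policy shown (one file per network.station, never refreshed) is
        only illustrative; a real system needs versioning and expiry rules."""
        os.makedirs(cache_dir, exist_ok=True)
        path = os.path.join(cache_dir, f"{network}.{station}.xml")
        if os.path.exists(path):
            return read_inventory(path)            # local copy: no web-service round trip
        inv = Client("IRIS").get_stations(network=network, station=station,
                                          level="response")
        inv.write(path, format="STATIONXML")       # keep a local copy of the definitive data
        return inv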

Seismograms

Seismology data processing today is largely founded on the concept of files that store waveform data.   How that far more voluminous waveform data ideally interacts with metadata is an open question and a key design concept that will shape any future system's functionality.

Processing Algorithm Concepts

It is an entirely open question how this might best be done.  Initial discussion should focus on which of the following are viable and most likely to be successful and sustainable.
  1. The classic unix pipeline model of seismic unix.
  2. Programming model as used in ObsPy (see the sketch after this list)
  3. Processing module concept used in all commercial seismic reflection processing systems
  4. Middleware client-server model
  5. Cluster computing model for compute intensive tasks (Map/Reduce)
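
Of the options above, number 2 is the easiest to make concrete today.  Here is a minimal sketch of that model using ObsPy; the file names and filter band are placeholders.

    # Model 2: the ObsPy-style programming model, where a script composes
    # processing steps as ordinary method calls on stream objects.
    from obspy import read

    st = read("example.mseed")                        # placeholder miniSEED file
    st.detrend("demean")                              # remove the mean from every trace
    st.filter("bandpass", freqmin=0.5, freqmax=2.0)   # band-pass each trace in place
    st.write("example_filtered.mseed", format="MSEED")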

User Interface

Given the current situation, there is a high probability this will use Python as the primary glue to construct individual workflows.  This could then evolve into specialized tools that are more "user friendly", making the system more approachable for undergraduates and early-career graduate students.

Programming Interface for Extensions

A well-defined application programming interface (API) is an essential component for a research data processing system.   Language support and key components of that API can and should develop naturally.  The KISS (Keep It Simple Stupid) acronym should be the mantra for any API definitions.
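
In that spirit, here is a sketch of about the simplest extension contract one could define: an extension is any callable that accepts a data object plus keyword parameters and returns a data object, made discoverable through a one-line registration.  The registry, decorator, and the TimeSeries object assumed in the example are all hypothetical.

    _REGISTRY = {}

    def register(name):
        """Decorator that makes a processing function visible to workflow tools by name."""
        def wrap(func):
            _REGISTRY[name] = func
            return func
        return wrap

    @register("scale")
    def scale(ts, factor=1.0):
        """Example extension: multiply the samples of a (hypothetical) TimeSeries by a constant."""
        ts.data = ts.data * factor
        ts.metadata["scale_applied"] = factor
        return ts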

Parallel Processing Capability

The next generation of seismic processing system will be transformative only if it has intrinsic support for massively parallel processing and big data concepts.

Shared Memory Processing Support

Small-scale group systems and desktop systems would already benefit from increased use of threading, since most new computers have multiple cores.   GPUs as processing engines on desktops may also emerge as a new force.   An ideal community system will likely need to support a single node with multiple cores as the lowest common denominator, but be tunable to local conditions.
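
A minimal sketch of that lowest common denominator, using only the Python standard library to spread independent channels across the cores of one node.  Because Python's global interpreter lock limits pure-Python threads, the sketch uses a process pool; a thread pool driving GIL-releasing compiled code would look the same apart from the executor class.  The per-channel computation and the worker count are placeholders to be tuned to local conditions.

    import os
    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    def rms(samples):
        """CPU-bound per-channel computation standing in for a real algorithm."""
        x = np.asarray(samples, dtype=float)
        return float(np.sqrt(np.mean(x * x)))

    def process_channels(channels, workers=None):
        """Spread independent channels across the cores of a single node.
        workers defaults to the machine's core count, so the same script runs
        unchanged on a laptop or a large shared-memory server."""
        workers = workers or os.cpu_count()
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(rms, channels))

    if __name__ == "__main__":
        fake_channels = [np.random.randn(10000) for _ in range(8)]
        print(process_channels(fake_channels))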

Cluster Computing Support

All "supercomputers" today are large farms of multiple nodes, and the fastest use GPUs on each node.   How seismology data processing per se can benefit from state-of-the-art supercomputers is a topic for discussion.  How the broadest part of the community can benefit from this type of resource is an even harder question.

Extensibility

Extensibility should be an important design goal of any significant development effort.   That means a system should have no intrinsic walls that limit the size of the cluster of machines that can be brought to bear on a problem.   This is an active area of research in computer science and is thus very volatile.