Ideal Future Data Processing System
Summary
This is a document to assemble ideas for elements of an ideal, next-generation seismic processing system for seismology. The intent is to start this document at a SIG during the 2016 IRIS Workshop and expand it in other forums until some convergence is reached. Most notably, students in the 2016 Data Processing Short Course will have an opportunity to revise this document.
Roots of this Package
Brief Development History
Any contributors to this page should document here what they added and when. The first section was begun by Gary Pavlis and Frank Vernon and presented at the IRIS Workshop in June 2016. The initial draft of this document from that forum will be preserved only in the master git repository. During the discussion the document will be revised to reflect the broader community input. The differences will be visible on github.
Community
Our dream is that from this document will emerge the seeds of new ideas that will flower into the next generation of seismic processing software.
Data Objects Supported
An ideal system should be as generic as possible, but not so general as to become ponderous. In modern computing, generic always means abstraction to some degree or other, and that usually means some version of object-oriented programming. Whether an implementation chooses to use an OOP language is a different issue. The idea here is that we should define what key data concepts we need to handle and encapsulate.
Below is a list of seismological data objects. It probably should ultimately be listed in priority order, but for now it is only loosely ordered. A minimal sketch of the core waveform objects follows the list.
It is useful to separate out supporting data objects that are common in later-stage data analysis, but which are not seismological sensu stricto and hence likely have more generic implementations in other communities.
- Scalar (single channel) time series data. Needs to include a generic, extensible header/metadata. (example concept)
- Three-component seismograms. Needs to include a generic, extensible header/metadata. (example concept)
- Ensembles of single channel data (example concept)
- Ensembles of three-component seismograms (example concept)
- Seismic event (example concept)
- Seismic station (example concept; this encapsulates only location information, a child class would define response information)
- A catalog of seismic events (an event ensemble) (example concept)
- A seismic array/network (an ensemble of stations) (example concept)
- Continuous single channel data streams
- Continuous three-component data streams
- Moment tensors
- Spectral data
- Arrival times and related measurements
- Two-dimensional grid structures (used for a whole host of related objects). Note this means a generic method to describe a 2D object with discrete data and does not necessarily mean a regular mesh (perhaps should be expanded to a list of supported types). A well-defined standard already exists in netCDF and probably should just be endorsed.
- Three-dimensional grid structures (used for 3D velocity models and other forms of 3D imaging). Note this means a generic method to describe a 3D object with discrete data and does not necessarily mean a regular mesh (perhaps should be expanded to a list of supported types).
- Earth models (radially symmetric to full 3D models of different physical properties)
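To make the core waveform objects concrete, here is a minimal sketch in Python. All class and attribute names (Metadata, TimeSeries, Seismogram, Ensemble, t0, dt, md) are illustrative assumptions, not a proposed standard; the point is the generic, extensible header paired with the sample data.

```python
# Minimal sketch of core waveform objects; all names are illustrative.
import numpy as np


class Metadata(dict):
    """Generic, extensible header: an open-ended key-value store."""


class TimeSeries:
    """Scalar (single channel) time series with an extensible header."""

    def __init__(self, data, t0, dt, metadata=None):
        self.data = np.asarray(data, dtype=float)  # sample values
        self.t0 = t0                               # start time (epoch seconds)
        self.dt = dt                               # sample interval (seconds)
        self.md = Metadata(metadata or {})         # arbitrary header entries


class Seismogram:
    """Three-component seismogram stored as a 3 x nsamp sample matrix."""

    def __init__(self, data, t0, dt, metadata=None):
        self.data = np.asarray(data, dtype=float).reshape(3, -1)
        self.t0 = t0
        self.dt = dt
        self.md = Metadata(metadata or {})


class Ensemble:
    """Ensemble of either waveform type: members plus group metadata."""

    def __init__(self, members, metadata=None):
        self.members = list(members)
        self.md = Metadata(metadata or {})
```

Note that an ensemble is just a container of members plus ensemble-level metadata, which is why the event-catalog and array/network objects in the list above can reuse the same pattern.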
Data Flow Model
This will likely require a careful design study when there are fewer unknowns. An ideal system would probably be more in the realm of a "workflow" manager that would allow one to adapt legacy code, run legacy programs like SAC within the workflow, run applications on systems not universally available like Matlab and Antelope, and provide a way to self-document the workflow.
An implementation path to get started might be to define python interfaces between existing components (i.e. SAC, Matlab, etc.) and assume the shell-script programming model for data flow.
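As one illustration of that path, the sketch below drives a batch SAC run from Python. The function name run_sac is hypothetical, and the sketch assumes a sac executable is on the path; only the subprocess mechanism is standard Python.

```python
# Sketch: run legacy SAC from python as one step of a workflow.
# Assumes a "sac" executable is on the path; names are illustrative.
import subprocess


def run_sac(macro_text):
    """Feed a SAC command script to a batch SAC process.

    Returns SAC's stdout so a workflow manager can capture it for
    self-documentation.
    """
    result = subprocess.run(
        ["sac"],
        input=macro_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example step: bandpass filter one file and save the result.
macro = """
read infile.sac
bp corner 0.1 1.0 npoles 4
write filtered.sac
quit
"""
# log = run_sac(macro)
```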
Data Management System
Managing metadata has a long history of people walking into dark alleys and never coming out. Addressing this issue needs to be done with a very pragmatic engineering approach that builds on the large "big data" IT infrastructure that already exists and/or is under development. It will involve some form of database, but an ideal system should make this implementation-independent. The community need today may be in the area only a librarian could love called an "ontology".
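One reading of "implementation-independent" is an abstract interface that concrete database backends implement, so application code never names the database. The sketch below shows the idea using Python's abc module; all class and method names are hypothetical.

```python
# Sketch of an implementation-independent metadata layer; names are
# hypothetical. Relational, NoSQL, or flat-file backends would subclass
# MetadataStore without any change to application code.
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Abstract metadata access layer that hides the database choice."""

    @abstractmethod
    def get(self, object_id):
        """Return the metadata dict for one data object."""

    @abstractmethod
    def put(self, object_id, metadata):
        """Store or update metadata for one data object."""

    @abstractmethod
    def query(self, **constraints):
        """Return ids of objects whose metadata match all constraints."""


class InMemoryStore(MetadataStore):
    """Trivial reference backend, useful for tests and prototyping."""

    def __init__(self):
        self._db = {}

    def get(self, object_id):
        return self._db[object_id]

    def put(self, object_id, metadata):
        self._db[object_id] = dict(metadata)

    def query(self, **constraints):
        return [oid for oid, md in self._db.items()
                if all(md.get(k) == v for k, v in constraints.items())]
```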
The other issue that comes up with metadata is how web services should be folded into a metadata management system. Web services are ideal for maintaining definitive copies, but are known to create bottlenecks if used in "just in time" mode in data processing.
All seismology data processing today is largely founded on the concept of files that store waveform data. How that far more voluminous waveform data ideally interacts with metadata is an open question and a key design concept that will shape any future system's functionality.
Processing Algorithm Concepts
How this might best be done is an entirely open question. Initial discussion should focus on which of the following are viable and most likely to be successful and sustainable (a sketch of the processing-module concept follows the list).
- The classic unix pipeline model of seismic unix.
- Programming model as used in ObsPy
- Processing module concept used in all commercial seismic reflection processing systems
- Middleware client-server model
- Cluster computing model for compute intensive tasks (Map/Reduce)
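To make the processing-module concept concrete, here is a sketch in Python under the assumption that a module is any callable taking a data object and returning a data object; a flow is then just an ordered list of modules, as in commercial reflection systems. All names are illustrative.

```python
# Sketch of the processing-module concept: each module maps a data
# object to a data object, so a flow is an ordered list of modules.
# All names are illustrative.
import numpy as np


class Trace:                       # stand-in for a real waveform object
    def __init__(self, data):
        self.data = np.asarray(data, dtype=float)


def demean(d):
    """Module: remove the mean from the samples."""
    d.data = d.data - d.data.mean()
    return d


def scale(factor):
    """Parameterized module: returns a module that multiplies by factor."""
    def module(d):
        d.data = d.data * factor
        return d
    return module


def run_flow(data, modules):
    """Apply each module in order, like stages in a reflection system."""
    for module in modules:
        data = module(data)
    return data


trace = run_flow(Trace([1.0, 2.0, 3.0]), [demean, scale(2.0)])
```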
User Interface
Given the current situation, there is a high probability this will use python as the primary glue to construct individual workflows. This could then evolve to specialized tasks made more "user friendly" so that systems are more approachable to undergraduates and early career graduate students.
Programming Interface for Extensions
A well-defined application programming interface (API) is an essential component for a research data processing system. Language support and key components of that API can and should develop naturally. The KISS (Keep It Simple, Stupid) principle should be the mantra for any API definitions.
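As one reading of that mantra, an extension API could be as small as a single convention plus a registry: an extension is any function from data object to data object, registered under a name a workflow can look up. The decorator, registry, and taper function below are hypothetical illustrations, and the taper assumes the data object holds a numpy float array in a data attribute.

```python
# Sketch of a deliberately small extension API. The registry and
# decorator are hypothetical; an extension is any function mapping a
# data object to a data object.
import numpy as np

EXTENSIONS = {}


def extension(name):
    """Decorator registering a user-written processing function by name."""
    def register(func):
        EXTENSIONS[name] = func
        return func
    return register


@extension("cosine_taper")
def cosine_taper(d, fraction=0.05):
    """Example extension: cosine taper both ends of d.data.

    Assumes d.data is a numpy float array.
    """
    n = len(d.data)
    ntaper = max(1, int(n * fraction))
    ramp = 0.5 * (1.0 - np.cos(np.linspace(0.0, np.pi, ntaper)))
    d.data[:ntaper] *= ramp
    d.data[-ntaper:] *= ramp[::-1]
    return d


# A workflow engine can then look extensions up by name:
# result = EXTENSIONS["cosine_taper"](my_trace, fraction=0.1)
```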
Parallel Processing Capability
The next generation of seismic processing system will be transformative only if it has intrinsic support for massively parallel processing and big data concepts.
Shared Memory Processing Support
Small-scale group systems and desktop systems would already benefit from increased use of threading, since most new computers have multiple cores. GPUs as processing engines on desktops may also emerge as a new force. An ideal community system will likely need to support a single node with multiple cores as the lowest common denominator, but be tunable to local conditions.
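As a sketch of that lowest common denominator, the standard library's concurrent.futures can spread an embarrassingly parallel per-waveform step across the cores of a single node; the worker count is the knob for tuning to local conditions. The processing function here is a trivial stand-in.

```python
# Sketch: single-node parallelism with the python standard library.
# One waveform per task; max_workers defaults to the core count, which
# is the knob for tuning to local conditions.
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def process_one(samples):
    """Trivial stand-in per-waveform task: demean one trace."""
    samples = np.asarray(samples, dtype=float)
    return samples - samples.mean()


def process_ensemble(ensemble, max_workers=None):
    """Apply process_one to every ensemble member in parallel."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, ensemble))


if __name__ == "__main__":
    fake_ensemble = [np.random.randn(1000) for _ in range(16)]
    results = process_ensemble(fake_ensemble)
```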
Cluster Computing Support
All "supercomputers" today are large farms of multiple nodes. The fastest today use GPUs on each node. How seismology data processing per se can benefit from state-of-the-art supercomputers is a topic for discussion. How the broadest part of the community can benefit from this type of resource is an even harder question.
Extensibility
Extensibility should be an important design goal of any significant development effort. That means a system should have no intrinsic walls that limit the size of the cluster of machines that can be brought to bear on a problem. This is an active area of research in computer science and is thus very volatile.