There is a complex relationship among raw experimental data, derived data and experimental models used in structural biology. Strong collaborative efforts help to achieve coherence and consistency in the nomenclature and representation of derived data and of experimental models. There are also many existing efforts in the management of raw data, and therein lies a problem. Each vendor of data collection equipment defines its own data acquisition protocols and data formats. Each synchrotron beamline development group layers its own data acquisition protocols and formats on top of, and sometimes in place of, a variety of vendor formats. Multiple collaborations have developed to reduce the complexity of raw data management protocols in structural biology. For image data in synchrotron-based protein crystallography we have both imgCIF/CBF and NeXus from collaborations as well as multiple vendor image formats, with not only different formats for different detectors, but even different formats for the same type of detector. If we do not bring the imgCIF and NeXus collaborations together with some significant number of vendors to establish clean, well-documented relationships among the formats, then the standards, instead of producing coherence, may add to the chaos as poorly documented variants of "standards" emerge. We are not certain that this risk can be, or even that it should be, avoided completely. Perfect standardization could suppress creativity and scientific development.
We are creating a new consortium on the Management of Experimental Data in Structural Biology (MEDSBIO) not to enforce standardization on a single data management protocol, but to document clearly the interfaces among protocols, so that individual experimental efforts working in the intersection of multiple protocols can function as efficiently as possible and so that the competition among standards can be resolved as an open competition of ideas to the betterment of the science involved, rather than as a political exercise.
The goals of the MEDSBIO consortium are to collaboratively resolve the interface issues among multiple structural biology data management protocols, including imgCIF, NeXus, vendor data formats, instrument control and signaling protocols, local and remote experiment control protocols, etc., with the objective of making the collection, transfer and archiving of data for experiments in structural biology as efficient as practicable; to maintain an archive of documentation on standards and proposals for ontologies, software, hardware specifications, web templates and other documentation related to such protocols; to maintain an archive of open source software and links to closed source software related to such protocols; to maintain an archive of samples and test cases related to such protocols; to run annual workshops on issues relating to such protocols; to contribute open source software to fill gaps in the infrastructure related to such protocols; and to gather and, where necessary, create curricular material to assist in training experimenters in issues related to such protocols.
These efforts are primarily focused on the fine details of data acquisition and of managing raw data in hardware and software in ways that conserve resources. These are issues that users of this data often gloss over or do not consider at all. For the users, data derived from the raw data, e.g. structure factors derived from pixel-by-pixel photon counts, are the primary data, to be provided by "black-box" systems. MEDSBIO is concerned with issues in the innards of those black boxes. There is a strong relationship between these internal issues and the issues that users must confront. They are connected by the data and by the representations of the derived data required by the users. Thus if a particular user community were to standardize on, say, imgCIF for their "raw" data in a synchrotron environment using, say, NeXus for its overall data management, working with detectors using an idiosyncratic detector element coordinate system, the users might well wish to be isolated from NeXus and the oddities of the detector coordinate system, but the beamline designers need to have a detailed, well-documented understanding of how to interface among all the messy innards that the users never wish to deal with. If this is not done well, and done in a consistent manner at multiple beamlines, then, instead of imgCIF providing a standard, it will exist in multiple, difficult-to-translate dialects.
End users and developers have a lot in common and are tied together by the data itself as it is transformed from raw images, photon counts, axis settings, etc. It is therefore important that there be collegial collaboration between people working on problems at both ends of the data stream. It is equally important to allow the technical issues on the raw data side to be fully discussed and explored without being swamped by the equally demanding discussions needed on the derived data side. For this reason the developer community needs a collaborative consortium that is neither focused on a single data management protocol nor dominated by discussions of user-level issues in derived data.
The MEDSBIO consortium formalizes several existing collaborations and introduces a new level of coordination and cooperation in working with raw experimental data of importance in structural biology, complementing well-established efforts in working with the data derived from this raw data. We hope this will produce a better understanding of the data upon which much experimental work in structural biology is based, and of the issues that affect the quality and reliability of that data. By clarifying and codifying the parameters of the information streams that interact to produce the raw data, we hope to bring a new level of consistency and coherence to the presentation of scientific results of the experiments that depend upon this data, thereby facilitating reliable intercomparisons among experiments and analysis based upon the results of multiple experiments.