Modern science relies upon researchers sharing their work so that their peers can check and verify success or failure. But most scientists still don't share one crucial piece of information — the source codes of the computer programs driving much of today's scientific progress.
Such secrecy comes at a time when many researchers write their own source codes — human-readable instructions for how computer programs do their work — to run simulations and analyze experimental results. Now, a group of scientists is arguing for new standards that require newly published studies to make their source codes available. Otherwise, they say, the scientific method of peer review and reproducing experiments to verify results is basically broken.
"Far too many pieces of code critical to the reproduction, peer-review and extension of scientific results never see the light of day," said Andrew Morin, a postdoctoral fellow in the structural biology research and computing lab at Harvard University. "As computing becomes an ever larger and more important part of research in every field of science, access to the source code used to generate scientific results is going to become more and more critical."
Missing source codes mean extra headaches for scientists who want to closely follow up on new studies or check for errors. The unavailability of source codes can also let more bad science slip through the cracks: unreleased and irreproducible codes played a part in a Duke University case that led to study retractions, scientist resignations and canceled clinical drug trials for lung and breast cancer in 2010.
But of the 20 most-cited science journals in 2010, only three require computer source codes to be made available upon publication. Morin and six colleagues from universities across the U.S. proposed making such policies universal in a policy forum paper that appears in today's (April 12) issue of the journal Science. (Science is one of the three top journals that require the availability of source codes.)
Public funding or policy-setting agencies should throw their weight behind the idea of sharing source codes openly, researchers said. They also proposed that research institutions and universities should use open-source software licenses to allow for source-code sharing while protecting the commercial rights to possible innovation spinoffs from research.
"The encouraging thing is that all of the proposals we have made have already been implemented by various journals, funding agencies and research institutions in one form or another — so there is not a lot of innovation required," Morin told InnovationNewsDaily.
Many scientists have learned to write computer code without formal training, and so they may simply not know of the open-source software culture of sharing such codes, Morin and his colleagues said. Others may simply be embarrassed by the "ugly" code they write for their own research.
But even one-off computer code scripts written for a single study should undergo examination and peer review, Morin said. He has often ended up sharing, reusing or adapting code he had originally written with the intention of a single use.
"If I knew there was a publication requirement for my code, I probably would have done things like comment it better, kept better track of it, and generally put a bit more thought and effort into my code — which would have certainly helped me and others later on when I inevitably tried to reuse or share it, even if just with others in my own research group," Morin said.
Copyright 2012 InnovationNewsDaily, a TechMediaNetwork company. All rights reserved. This material may not be published, broadcast, rewritten or redistributed.





49 Comments
'much of today's science progress' ???
'Science' is a big world.
I've been doing geological research for almost 35 years and have no idea what this article is about. It would have helped to actually provide examples.
If you use a computer to do your research, the software you use should be free--meaning it is unrestricted and freely studied and modified. In science, people need to be able to see what you did, and that includes any method you used to collect or manipulate your data, such as your source code.
Eh that's what the article basically talks about...
The other side of this issue is that, to be independently verified, results derived from the original data should be independently analyzed, meaning new analysis source code should be written. I would hate to have results vetted by code analysis. The FDA certainly doesn't allow this for medical product validation. I think this is an unprecedented requirement.
Marketing is a much more socially valuable skill than science. The REAL problem (yawn) is as old as thought- greed. Money. Power. Me/notYou. Get used to it, everyone since Adam has had to adjust. Science better not get in the way of profit! Peel my code from my cold, dead hands; or my patent agent's, whatever.
Agreed. The validity of any 'new' achievement in any branch of science should be able to be reproduced in a way that may or may not duplicate the original method. If the thought or idea stands up, it will be able to stand alone.
This is a little simplistic.
Knowing the algorithm and having access to the data should be sufficient for replication, having the original code is not necessary.
In fact, being able to reproduce the algorithm yourself, i.e. writing your own code, is a better test than merely inspecting the original code.
I agree, to an extent, that the algorithm to produce the data is the most important thing. However, I think that it is important to have the source code independently validated as well. The algorithm can be perfect but be imperfectly coded.
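A minimal Python sketch of that last point, using an example that is an assumption of this edit rather than anything from the article. Both functions below implement the same textbook definition of population variance, yet the one-pass version loses nearly all of its precision in double-precision arithmetic when the values are large relative to their spread. The function names and the data are hypothetical.

```python
def variance_one_pass(values):
    """Population variance computed as mean(x^2) - mean(x)^2.

    Algebraically correct, but the two terms are huge and nearly equal,
    so their difference is dominated by rounding error.
    """
    n = len(values)
    total = 0.0
    total_sq = 0.0
    for x in values:
        total += x
        total_sq += x * x
    mean = total / n
    return total_sq / n - mean * mean


def variance_two_pass(values):
    """Population variance computed as the mean squared deviation."""
    n = len(values)
    total = 0.0
    for x in values:
        total += x
    mean = total / n
    acc = 0.0
    for x in values:
        d = x - mean
        acc += d * d
    return acc / n


# Hypothetical measurements: large offset, small spread.
# The exact population variance of these four numbers is 22.5.
readings = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("one-pass:", variance_one_pass(readings))
print("two-pass:", variance_two_pass(readings))
```

A reviewer who is given only the formula has no way to tell which of these two implementations produced a published number; a reviewer who is given the code does.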
A problem would arise if you can't replicate the result. It's in bad taste to refute results if you can't show why they are wrong.
Bad science is the goal of pharma.....success
So much for everybody who insisted that "science" can be trusted because they always provide their data, their data is unquestionably valid because "scientists" can never lie, and they utilize standard, proven mathematical techniques to collate and analyze the data! So many people making so many high pronouncements about "science" when, in fact, they didn't know the first thing about what's going on! The "source code" they are referring to is the specific mathematical analysis used to "draw" "conclusions" from the "data". Anyone who withholds that basically is admitting the methods they used were engineered to get the result they wanted no matter what the data really said. And so few ask how much "science" really can be trusted.
This seems to confuse the notion of intellectual rights in the private sector vs peer review in the public sector. I assume some private research is based on publicly funded basic science by gov't and university research. I would assume the adage of "follow the money" would determine how open the methodology should be when the goal is accreditation, or at least publication, rather than profit.
The code refers to how the computer is instructed to correlate complex statistical input and equations, especially as multidimensional variables are involved; and thus how it generates results and confidence levels.
Yes, they are algorithms. This is in answer to geojellyroll's query above.
I am sure that many geological experiments must use some strong math skills in developing computerized evaluation of density/seismic studies, not to mention the complexity introduced there by hydrological variables.
Physics might be tougher than statistical tools, but I don't know.
In reply to priddseren.
The "Warmists" have very convincing raw data indeed, and data from the "antiwarmists" is not convincing at all - I wonder if the industry is behind antiwarmists - no surely not, eh?
But back to the main track - Yes, software must be open to scrutiny, as well as the available raw data. Otherwise an independent verification (or not) is impossible.
I agree. Being able to articulate the algorithm is more important than being able to code it. Further, different languages implement an algorithm differently. There is little value inherent in the inevitable criticisms and rebuttals of “source code validity”.
I don't understand how knowing an algorithm would allow anyone to reproduce the results obtained by often very complex normalizations of many disparate data sets.
Even if the applied algorithms were known, often there are many ways to correctly, or incorrectly, arrive at an intended solution using several alternative algorithms.
Any source code, be it as humble as an "if ... else ... end-if" is an expression of an algorithm, a set of rules for solving a problem in a finite number of steps.
xingo had stated:
"Knowing the algorithm and having access to the data should be sufficient for replication, having the original code is not necessary."
All references in preceding comments refer to "the algorithm" that, when applied to a set of data, produced the correct solution. If the entirety of logical transformations applied to the entirety of all requisite data (i.e., the entire program) can be referred to as "the algorithm", then, technically, you may be correct. However, my interpretation would be that a very large number of discrete algorithms would be necessary to produce the often complex data analyses required to adequately evaluate the results of many scientific research projects.
I suspect that most scientists do include the discrete mathematical algorithms applied to specific data as a prerequisite for acceptance by peer reviewed research journals but, as this article suggests, the published algorithms are not sufficient to reproduce a comparable analysis. Inconsistent processing of intermediate data and results that are not published would likely produce conflicting results.
"xingo had stated:
"Knowing the algorithm and having access to the data should be sufficient for replication, having the original code is not necessary." ..."
xingo is correct.
Or simply ignore your results.
What data? After they came out with the U.N. report that started the debate, if I recall, they said there was too much data to store and it was not needed any more. I think it was to cover up for some thing.
"... they said there was too much data to store and it was not needed any more ..."
Interesting. I had not read that. What are the particulars?
"... I think it was to cover up for some thing."
Yes, Caesar's wife must be above suspicion, but that assumes something to cover up. What might that be?
"xingo had stated:
"Knowing the algorithm and having access to the data should be sufficient for replication, having the original code is not necessary." ..."
You stated:
"xingo is correct."
Fine - I'd be most interested in a demonstration.
Please reproduce the conclusions of a recently referenced research report: Bonfils et al, (2011), "The HARPS search for southern extra-solar planets XXXI. The M-dwarf sample", http://arxiv.org/abs/1111.5019
The published report contains several specific algorithms and several pages of tabular data to support your reanalyses.
I suspect you will find this to be a daunting assignment - one that would be exceedingly difficult, if not impossible, to fulfill even now, so soon after publication.
I don't believe xingo's assertion, or yours, is correct.
You stated (in response to xingo's comment #8):
"I agree. Being able to articulate the algorithm is more important than being able to code it. Further, different languages implement an algorithm differently. There is little value inherent in the inevitable criticisms and rebuttals of source code validity."
The point of this article is that the entire set of logic applied to scientific analyses is not available through published sources, making it impossible to reproduce the results of scientific research reports from published information.
Providing any pseudo-code representation of the detailed logic applied to any analysis would in effect provide the 'source code' availability suggested by this article to be necessary for a complete replication of research results. It would be far simpler to publish the source code rather than render a complete and accurate representation of its logic in some other form...
"... Providing any pseudo-code representation of the detailed logic applied to any analysis would in effect provide the 'source code' availability suggested by this article to be necessary for a complete replication of research results ..."
Correct. The mathematical expression of the algorithm is more concise and generally comprehensible than that in C, C++, Lisp, Algol, or VBA.
That's why it, the mathematical expression, is preferable to any of those.
"... Fine - I'd be most interested in a demonstration."
Sure. I'll write the code if you pay me to do so. That's what I do for a living. You'll need to get the TechDoc written, etc. although I suspect you will find even that to be a daunting assignment.
If you're interested in seeing a commercial example of astronomical simulation, visit http://www.starrynight.com/
You may be a fine programmer, but I'm afraid you're missing the point.
Researchers do not provide potential future reviewers with any formal "TechDoc" that might be necessary to reproduce programs (and/or manual procedures, spreadsheet definitions, etc.) used in the analyses that produced their conclusions. The only 'specifications' provided are those described in the published research report (or any that they might be persuaded to provide later).
I've already provided the only documentation available to most reviewers - the link to the published research report. Can you reproduce the programs and other processes necessary to reproduce the conclusions of complex research analyses from published research reports? I think not!
"... Researchers do not provide potential future reviewers with any formal "TechDoc" ..."
You are, of course, speaking from experience, correct? As someone we both know might say, "I think not!"
Actually, that's what this wild-goose chase will end up producing, as researchers will, to refer to the article above "... I probably would have done things like comment it better, kept better track of it ..." In other words, spend research time documenting their programs for publication, rather than doing Science, unless documenting code is "Science", which it may be, at least to some.
To continue, open the document at the link you provided, go to the Data Analysis section (Page 5).
"... Often, statistical tests are applied to the time series in order to appraise the significance of trends or variability Then, the time series are searched for periodicities and, if a significant periodicity is found, the corresponding period is used as a starting point for a Keplerian fit. Again, statistical tests are applied to decide whether a sinusoidal or a Keplerian model is a good description for the time series. In this section, we follow the same strategy ..."
This type of analysis is amenable to COTS software. Here is a link to an Excel Stat package that fits the bill. http://tinyurl.com/7xghpg9
Neither you nor I need write any code at all to perform the statistical analysis.
In conclusion, for those who need help understanding the published results, a formal algorithm of the transformations performed upon the measurements is all the "pseudo-code" needed. Further it obviates the need to annotate in a specific language, another benefit.
If you actually believe that better documentation is the pathway to "better Science", perhaps you could advocate more formalism there.
In your example 'specification':
"Then, the time series are searched for periodicities and, if a significant periodicity is found, the corresponding period is used as a starting point for a Keplerian fit."
What specific periodicity did the researchers consider to be 'significant'? Without specific information regarding a detailed analysis, your results would vary.
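A small Python sketch of why that unstated detail matters. Everything here is hypothetical: the peak list is invented for illustration and is not taken from the Bonfils et al. paper, and `significant_periods` is just a name chosen for this example. The only point is that the same candidate peaks yield different "significant" periods under different cutoffs, so the cutoff has to be stated somewhere.

```python
# Hypothetical periodogram peaks: (period in days, false-alarm probability).
# These numbers are invented for illustration only.
candidate_peaks = [
    (7.4, 0.002),
    (35.7, 0.02),
    (120.0, 0.04),
    (410.0, 0.30),
]


def significant_periods(peaks, max_false_alarm_probability):
    """Return the periods whose false-alarm probability is at or below the cutoff."""
    return [
        period
        for period, fap in peaks
        if fap <= max_false_alarm_probability
    ]


# Two plausible cutoffs, two different sets of periods passed on to the fit.
for cutoff in (0.05, 0.01):
    print(cutoff, significant_periods(candidate_peaks, cutoff))
```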
If I recall correctly, this discussion arose in response to your agreement with xingo's assertion that:
"Knowing the algorithm and having access to the data should be sufficient for replication, having the original code is not necessary."
I assert that much more detailed information would be necessary to reproduce research results, such as the "TechDoc" you requested. Beyond the suggestions of this article, research reports are missing far more information about transformational information processes employed during analysis than just program source code.
"... What specific periodicity did the researchers consider to be 'significant'? ..."
Within the domain of the statistical analysis being performed, that would be what is commonly referred to as 'a statistically significant' periodicity.
http://en.wikipedia.org/wiki/Statistical_significance
"... Without specific information regarding a detailed analysis, your results would vary ..."
That's what email is for, and yes, researchers do communicate via email.
To flesh out the COTS proposal, here is a link to various and sundry Trend Analysis tools for Excel.
http://processtrends.com/toc_trend_analysis_with_excel.htm
As an alternative to the ubiquitous Excel for dynamic visualization to predictive modeling, there’s SAS
http://www.sas.com/technologies/analytics/index.html
"... I assert that much more detailed information would be necessary to reproduce research results, such as the "TechDoc" you requested ..."
Feel free to assert whatever you please. Researchers routinely communicate via email. I refer you to the "Climategate" fracas, not to challenge AGW, as I find that theory very persuasive, but rather to show the ubiquity of academic communication in one of the Sciences.
http://en.wikipedia.org/wiki/Climatic_Research_Unit_email_controversy
"... Beyond the suggestions of this article, research reports are missing far more information about transformational information processes employed during analysis than just program source code ..."
Then criticize the Journals' editorial staff, and the article referees, all of whom accepted that the document was sufficient as presented. There are many ways to bring forth more information - in contrast to data - if that is what you feel would produce better Science. Documenting source code is one of the less efficient ways of doing so.
jtdwyer,
More Science, and more rigorous Science will be produced by publishing the data and articulating the methodology applied to that data (the algorithm) to produce the published interpretations in a coherent and lucid form than will be by explaining through commented source code why a While ... Wend programming construct was employed in lieu of a For ... Next loop. If the algorithm can be articulated mathematically, so much the better, but if declarative prose is also required to clarify intent, then the combination - when combined lucidly - will remain preferable to some specific implementation that may or may not be practical at another's facility.
I have presented a persuasive (if, alas, not a convincing) case as to why the publication of source code will not be any type of Scientific Silver Bullet as Jeremy Hsu would have us believe. Still, if you feel otherwise, then by all means assert whatever you feel would be ideal.
um, climate models use open-source code.
Reply | Report Abuse | Link to thisSo long as there is a lack of essential integrity and honor about respect for intellectual property and appropriate attribution, the kind of cooperation considered essential to the complaint in this article will remain IMPOSSIBLE. CHINA engages the most brazen looting/theft of intellectual property in the history of mankind...and with total impunity and CONTEMPT for the claims of anyone to ownership of the stolen material. This kind of theft is WORSE than a nuclear exchange because it provides the means for CHINA to STEAL economic dominance across the globe at the expense of every nation on Earth. This theft MUST be stopped...even by use of nuclear weapons, if necessary. NOTHING so far has compelled obedience of CHINA to the reasonable demands of the rest of civilization, so PISS on CHINA. They have pissed on the entire world until now. Go nuke them ALL. They fear no consequence because so far THERE HAS BEEN NONE>
What a shock it would be if the intelligence of humanity were found to be little more than a malignant kind of adaptive response with toxic effects on all life forms including mankind? How ironic, indeed.
Oh, so you don't agree with xingo that:
"Knowing the algorithm and having access to the data should be sufficient for replication, having the original code is not necessary."
So sorry - I must have been mistaken.
"... So sorry - I must have been mistaken."
Insofar as you assert "better Science" through "better documented source code", you are, indeed, mistaken.
True. I'd prefer a fully detailed algorithm and all the data to a - possibly - imperfect implementation. Let's face it, suppose I had someone's code, then what? I'd first have to find out just what he's saying, then I'd have to prove that he's (in)correct etc. Too much work. Just give me the algorithm with a formal proof of correctness, and I'm sure I'd be able to come up with a decent implementation of my own. Much better.
You stated:
"Insofar as you assert "better Science" through "better documented source code", you are, indeed, mistaken."
You've gone too far - I have not used the words you attribute to me nor even expressed their sentiments. Nonsense!
Some for-profit companies like Brighter Planet (disclosure: I work there) are leading the charge for transparent scientific code. For example, 2 API calls that actually show calculation methodology alongside the final answer:
(readable by humans)
http://impact.brighterplanet.com/flights?airline=aa&segments_per_trip=1&trips=1&origin_airport=msn&destination_airport=ord
(readable by machines)
http://impact.brighterplanet.com/flights.json?airline=aa&segments_per_trip=1&trips=1&origin_airport=msn&destination_airport=ord
Also, the source code is open-source:
https://github.com/brighterplanet/flight
"... You've gone too far - I have not used the words you attribute to me nor even expressed their sentiments. Nonsense!"
Excellent! Glad to learn that you disavow the sentiment " 'better Science' through 'better documented source code' ".
There are many ways to skin the cat of better comprehension in publication, ref here seamusabshere's posting above as yet another alternative. Let the machine do the drone work, and let the humans be creative!
As a general rule, there is no "the code" to make available. We're often talking about multiple workflows, multiple programs, utilities, DBMSs, data from spreadsheets, R programs, Mathematica, SAS, BMDP, supporting infrastructure (hardware and software).
The code itself, even if simple and available, is surrounded by maintenance code, access code, special code for interacting with the local infrastructure. There is likely nothing of scientific interest in this code. If code is important, it will be some specialized and key algorithms that are of interest.
A demand for full access to all source code involved is quite silly, it seems to me.
After glancing through the comments, I've come to two conclusions: 1. Few of the writers actually write or read software for a living. If you did you wouldn't be making some of the absurd comments. 2. Even fewer of you know anything about computer arithmetic. If you did, you wouldn't even think of the absurd comments you've made.
Choice of the language, tools, processing order, libraries, operating system, and underlying processor ALL affect the results. Most importantly, the way a mathematical function is implemented greatly affects the results.
I've seen badly written, poorly documented, and improperly implemented (from a numerical perspective) code in science. As a computer scientist, I have come to doubt any conclusion based on junk science software.
Every grad student who thinks they might write software should take a CS class in software project management and numerical analysis.
See the documentation around the "Correctly Rounded LIBM" http://lipforge.ens-lyon.fr/www/crlibm/ (or any book on numerical analysis or computer arithmetic).
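To make the processing-order point concrete, here is a small Python sketch. It is an illustration added under stated assumptions, not code from any study discussed here. The helper name `left_to_right_sum` is hypothetical; `math.fsum` is the standard-library function that tracks exact partial sums. An explicit loop is used instead of the built-in `sum` because the accuracy of `sum` on floats differs between Python versions, which is itself an instance of the problem.

```python
import math


def left_to_right_sum(values):
    """Add floats strictly in the order given, rounding after every step."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same three numbers in two different orders.
big_first = [1e16, 1.0, -1e16]
cancel_first = [1e16, -1e16, 1.0]

print(left_to_right_sum(big_first))     # the 1.0 is absorbed by 1e16 and lost
print(left_to_right_sum(cancel_first))  # the large terms cancel first, 1.0 survives
print(math.fsum(big_first))             # exact partial sums recover the 1.0

# Floating-point addition is not associative even for small values.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))
```

On IEEE 754 double precision, the first ordering sums to 0.0 and the second to 1.0. Two correct-looking implementations of "add up the measurements" can therefore disagree, and only the source code shows which order was used.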
I did not have the source code to Python when I did see Herb Schildt's example for a small basic interpreter 20 years ago. I stopped using spreadsheets when I learned to program the parser into an executable DOS command line algebraic expression reader. In many ways this utility remained superior to many high level languages because of its sheer simplicity and capacity to do multiple operations across columns in one line. Ironic, that to compile something like this for Linux requires a large gcc package in a distribution. That's where Microsoft's .NET framework is trying to wean people away from thinking for themselves and getting their feet wet. With 64 bit Windows os's I have to go back to the source code anyway and to recompile it for win32 to keep those DOS utils compatible. Too bad that many universities demand computer hardware from their undergraduates simply to run software that is often protected by cryptic nonsense, often just to do some calculations and plot a graph.
I strongly support the push for more open code in science. It's why I have released the most significant program I have used in my own science under the GPL - you can get it at burrow-owl.sourceforge.net. If you don't know what "GPL" stands for, you're part of the problem. It's not good enough to say, "just publish the algorithm." Half the time people don't even do that. And even if they do, that's not good enough. It's not some unusual, rare thing for a program to fail to behave correctly in all cases. It's the norm, unless that program is widely used and tested, like MySQL or R. (And "me and five other people in the lab used it, and it seemed to work" does not constitute "widely used and tested.")
A (new, small) journal with a requirement for a source code. Moreover, the source code is reviewed (otherwise, it could be anything, and unusable), and must be documented, portable and free.
-> Image Processing On Line (IPOL) http://www.ipol.im/
The authors seem satisfied by the experience:
-> http://www.ipol.im/news/20111219_satisfaction/
(disclaimer: I am co-founder of the journal)
If your scientific work is derived from Federal Grant Money, the work, work process, and results should be PUBLIC PROPERTY, not your own personal profit center.
No sharing of the work, no sharing of the Grant Money.
I'd argue that it's not so much that things "can be incorrectly coded". You have to assume that they are if the code has only been seen by a handful of amateur programmers.