Twisting the bytes » 2011 » May


Archive for May, 2011

Open data sets in science

May 18th, 2011

I have a question to challenge all my colleagues working with research data in Computer Science: When was the last time you could replicate a previous study by other author(s)?

For various reasons, over the past few months I have found myself diving into the rich collection of previous research in several areas: Wikipedia studies, libre software engineering, social media and social network analysis, to name a few. Many of you probably already know my inborn bias towards quantitative research (and also towards multidisciplinary research methods). So it should come as no surprise that most of the publications I was reviewing included empirical experiments on different datasets gathered from a wide variety of sources, target systems and virtual communities. As I was scrolling through the pages, I realized, once again, what a huge proportion of research work cannot be replicated in an easy way. This is a sad lesson still to be learned, considering that today most of us researchers work with digital data, and bits can be duplicated or sent to the other side of the world at negligible cost.

In my first post I already mentioned the curious study conducted by my colleague Gregorio Robles on the replicability of research works published in MSR. For those of you unfamiliar with the MSR series, this is a working conference (formerly a workshop) devoted to the art of “Mining Software Repositories”. It is also co-located with ICSE, the preeminent conference on software engineering, so it attracts the top-notch specialists in this area. One would expect that a scientific conference focused on such an empirical, hands-on activity would encourage (and even demand) the ability to access all datasets and tools used in previous experiments, in order to i) better learn the insights of different methods and practical solutions to problems in this area and ii) make life easier for other researchers willing to build on top of existing methods, tools and results.

Far from this, the conclusions from the replicability study were quite disappointing. Of the 171 papers published in the 6 previous editions of MSR, the most frequent case (64 papers) is a study that uses publicly available data sources but offers access neither to the processed dataset (the results) nor to the tools/scripts used to perform the study. Even more worrisome is a trend discovered in these publications: as time goes by, the number of papers with publicly available processed datasets decreases! So the situation is getting worse.

Categories: Conferences, Open Movements