Sunday, 17 August 2014

The problem of data discovery for invasive species

At a recent conference on invasive species someone in the audience asked for there to be fewer databases, to which there was a general muttering of agreement. Can you imagine if someone had said the same thing about books? Surely only an anti-intellectual would want fewer books. However, I perfectly understand the frustrations that lead to this request. New databases and websites on alien species are launched all the time. They never have precisely the same remit, but their interests always overlap with those of other databases. For a time-pressed user, who is scouring the internet for a simple answer to a simple question, there is a bewildering array of data sources. Some are highly visible, others are hidden; some are nice looking but superficial, while others are mines of information but hard to navigate. The same thing could be said of books, but people expect more of the internet, where they were promised a bright future of connectedness and interoperability.

Yet we are not going to get fewer databases any time soon, because current funding models and the missions of providers restrict us. Funders want to see clear results from their investment and providers want to create something new for their effort. It is currently hard to achieve those aims if they are tied to a single product. There are other problems, too. Single products don't handle differences of opinion well, and local issues related to culture and language are not well suited to a monolithic approach. Furthermore, data is best managed by the people with the most interest in it. They have the most incentive to gather new data and they are the experts in their specialisms.

We all want discoverable, accurate, up-to-date and sustainably managed information on alien species, but for all the reasons I've mentioned we can't yet have fewer databases. Nevertheless, there are many things we can do to improve the situation, which include federation, standardisation and openness.


There is duplication of effort in our invasive species databases. Taxonomic names, common names, references, observation data and specimen data are repeatedly entered and curated by each database independently. This does not have to be the case. We can federate out some of our work to providers who specialise in those data. For example, most modern scientific publications have a Digital Object Identifier (DOI). This simple but unique identifier is the key to all the bibliographic information on a publication. DOIs are maintained by the publishing industry, which will do a better job of looking after this domain of data than we will. In our databases it is only necessary to store the DOI and derive other information from the DOI resolvers. ORCIDs are another example; they are an open, self-administered system for uniquely identifying a scientist. The records are maintained by the scientists themselves, who are best placed to do the job, so we potentially only need to store the ORCID in our database.
Federation has the potential to reduce costs, while at the same time improving standards, improving sustainability and helping us to concentrate on our core interest of invasion biology.
Nevertheless, if we federate some services we need to trust those services to provide the information we need, reliably, for the long term and at a price we can afford. DOIs and ORCIDs are supported by the publishing industry and by large academic institutions. Other infrastructures to which we could federate responsibilities might be the Global Names Architecture and GBIF. These infrastructures need communities such as ours to justify their existence, but similarly we might benefit considerably from their domain expertise and investment.
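
As a rough illustration of storing only the DOI and deriving everything else from a resolver, here is a minimal Python sketch. It assumes the doi.org resolver's content negotiation, where a request with the CSL JSON media type returns bibliographic metadata. The function names, the placeholder DOI and the sample record are all made up for illustration, and a real database would also want caching and error handling.

```python
import json
import urllib.parse
import urllib.request

DOI_RESOLVER = "https://doi.org/"
# Media type that asks the resolver for bibliographic metadata as CSL JSON.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def build_doi_request(doi):
    """Build (but do not send) a metadata request for a stored DOI."""
    url = DOI_RESOLVER + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def fetch_metadata(doi, timeout=10):
    """Send the request and parse the JSON reply (needs network access)."""
    with urllib.request.urlopen(build_doi_request(doi), timeout=timeout) as resp:
        return json.load(resp)


def summarise(record):
    """Pick out the few fields a species database might want to display."""
    authors = [a.get("family", "") for a in record.get("author", [])]
    date_parts = record.get("issued", {}).get("date-parts", [[None]])
    year = date_parts[0][0] if date_parts and date_parts[0] else None
    return {"title": record.get("title"), "year": year, "authors": authors}


# Hypothetical record in the shape of CSL JSON, used here instead of a live call.
sample = {
    "title": "An example paper",
    "author": [{"family": "Smith", "given": "A."}, {"family": "Jones", "given": "B."}],
    "issued": {"date-parts": [[2014, 8, 17]]},
}
print(summarise(sample))
```

With this approach the database row holds nothing but the DOI string. Titles, authors and dates are looked up when needed, so corrections made by the publisher reach us without any extra curation on our side.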


Working within a framework of standards for data quality can be frustrating, particularly in an emerging discipline where standards often seem to be unnecessarily constraining. There is a temptation for everyone to invent their own ‘standard’, yet the advantages of standardisation are numerous. We should always be looking to other disciplines to reuse and build upon their standards. Standards that are extensible can provide a flexible approach.
The ability to combine digital resources is a fine goal of standardisation, but to do it we need to understand each other’s data. Wherever possible, we need to explain and annotate our data. Using common standards is a good first step, but we also need to ensure that the metadata is kept accurate and up to date. Many people will have noticed the problem with data aggregators, where the meaning of data can subtly change as it is transferred from one database to another. The creation of domain ontologies can clarify the meaning of terms without necessarily constraining the development of new data sources. This is a comparatively new field within computer science, but one that should be explored for invasion biology.


Even if we don't want it, we get copyright automatically and our work is stuck with it for many years after our death, unless we ensure that each of our works is openly licensed. You can't deny the usefulness of open resources such as Wikipedia, yet it is perhaps most remarkable for its success in mobilising data providers. Scientists, however, are often afraid of openness, thinking that others will ‘steal’ their work and not give them sufficient credit. In practice the reverse is often true. Open licensing promotes data discovery, and experts can use it to promote themselves through their expertise rather than through the data they hold. Work still needs to be done on providing traceable citations for data, but scientists already have mechanisms for doing this, such as so-called data publications. Scientists also need to become more educated about copyright, as data per se can’t be copyrighted.

To conclude, making data more accessible and discoverable is not an easy task, yet the tools and practices to do it are available to us. What is needed is a change in culture, not necessarily towards having monolithic databases, but towards sharing, openness and connectivity. It will take some investment of resources and progress might seem slow at first, but eventually we can build a global infrastructure for invasive species that satisfies our needs, and we don’t necessarily need to have fewer databases to do it.

