Bellis Chains and Cabbage Leeks: August 2013

Saturday, 24 August 2013

A new site for a rare species

Yesterday, the family and I took a boat trip to the tiny island of Burhou, off the coast of Alderney in the Channel Islands. Burhou is a reserve for puffins (Fratercula arctica) and storm petrels (Hydrobates pelagicus). However, there weren't many birds to see as the puffins have already finished breeding and the storm petrels only return to their nests at night. Nevertheless, this gave me plenty of opportunity to look for plants, without disturbing nesting birds.

Fewer than 50 species of plant have ever been recorded from Burhou and nothing rare, so it came as a surprise to find not only a new species for the island, but one of the rarest British species, Rumex rupestris (shore dock). Britain is the world stronghold for Rumex rupestris, but even here the total population is estimated to be fewer than 650 plants (see a full description of its status at the JNCC website). The last record of this species nearby was in 1958 on Alderney.

Below is a picture of Rumex rupestris showing its large smooth tubercles and Rumex crispus-like leaves.

This work by Quentin Groom is licensed under a Creative Commons Attribution 3.0 Unported License.

Friday, 23 August 2013

My (Lack of) Scientific Impact

With initiatives such as the San Francisco Declaration on Research Assessment (DORA) I look forward to the decline in use of the journal impact factor as a measure of my scientific worth, but it is not going to do me any good. Distributions of wealth in society invariably follow a Lorenz curve, such that a small proportion of the population holds the vast majority of the wealth. Scientific impact is no different: Darwin’s shopping list would probably have had more scientific impact than most of my publications.
To see what my scientific impact looks like visit my profile at ImpactStory. Here ImpactStory have taken the publications linked to my ORCID and searched numerous social networks and bibliographic databases to discover the scientific impact those publications have had. It is an elegant assessment of my scientific career, though starkly honest. Reading it you get a good impression of the things I've apparently done wrong in my career...

I shouldn't have kept changing my research field.

Molecular Biology, Biochemistry, Plant Physiology, Ecology etc.

I shouldn't have taken breaks in my scientific career to follow different careers.
I shouldn't have written papers that no scientist would be interested in reading.

e.g. Observations on the occurrence of Cirsium ×hybridum in Belgium

I shouldn't have written papers for obscure journals.

e.g. Schriften des Naturwissenschaftlichen Vereins für Schleswig-Holstein

I shouldn't write for a non-scientific audience.

e.g. Rare and Scarce Plants of South Northumberland

I should write blogs on social networking sites mentioning all my publications.

Nevertheless, citations and mentions on social networks still miss much of my impact on science, due to the work I've done on the internet. I get no credit for digitising and mobilising the Flore d'Afrique Centrale, the Maps Scheme Database of the BSBI, the Vice County Census Catalogue, the Flora of North-East England, Find Wild Flowers etc. Evidently, scientific impact, even with internet-based metrics, is still geared to peer-reviewed scientific publications. Of course, for each of these projects I could have written a citable paper describing the resource. One often sees such papers written for databases and software, but they rarely contain more information than the website itself.

So I can feel a bit better about my scientific impact in ImpactStory. It doesn't truly reflect all of my impact on science and it is a reminder to assess people's scientific worth broadly. I do not regret the more obscure, non-impactful things I've worked on, because I'm interested in them. However, if I want people to fund my research I need to work on things other people are interested in.
Now that our impact can be judged more directly on the internet, each of us has to work to ensure our publications attract interest. Yet we should not lose sight of the fact that, at the end of the day, it is the quality of the science that is really important, not your number of retweets.

Wednesday, 21 August 2013

GBIF should store gridded data and stop converting it to point-radius

Why are modellers still coming to me for data, rather than taking it straight from the Global Biodiversity Information Facility (GBIF)? I believe GBIF and the creators of the Darwin Core standard (Wieczorek et al., 2012) made an error of judgment by adopting the point-radius method of location description above all other forms.

Biodiversity locality data is collected in four forms...

Point data, sometimes with an indication of the accuracy.
Gridded data, where the observation is located to within a predefined grid of a geographic coordinate system.
Area data, where the occurrence is located to a defined area such as a country, province, state, county, nature reserve etc.
Site description data. Ill-defined locality descriptions (e.g. 5 km west of Newcastle), which are often supported by one or more of the former forms of geographic locator.
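The second form in the list above, gridded data, amounts to recording which cell of a predefined grid an observation falls in. As a minimal sketch, assuming a square grid aligned to the origin of a projected coordinate system (the function name and example coordinates are hypothetical), a point can be assigned to its cell like this:

```python
import math

def grid_cell(x, y, size):
    """Return the south-west corner of the grid cell of side `size`
    that contains the point (x, y)."""
    return math.floor(x / size) * size, math.floor(y / size) * size

# A made-up point in a metre-based grid, assigned to its 1 km cell.
print(grid_cell(425310.5, 565870.2, 1000))  # (425000, 565000)
```

Note that the conversion only goes one way: once a record is held as a cell, the original point cannot be recovered.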

On the face of it, point data is superior to all other forms of data as it accurately locates the individual and can seemingly be converted to any of the other three forms. This is perhaps why, when you download data from GBIF all the data is in this format. It is stored in the Darwin Core standard in the fields decimalLatitude, decimalLongitude, verbatimLatitude, verbatimLongitude and coordinatePrecision.
The problem is that the vast majority of biodiversity observations are collected and analysed using gridded data. You couldn’t collect point data for every single individual. For the vast majority of organisms, if there is one individual, it's fairly certain another one will be nearby¹. Apart from a few rare species and individual specimens, there is little point, and certainly no time, to collect point data. Furthermore, almost all the environmental data for modelling, such as climate, pollution, soil and land cover, are stored as gridded data. Indeed, even when they are not, they are interpolated and converted to gridded data for species distribution modelling.
Unfortunately, GBIF takes gridded observation data and converts them to point data using the centre of the grid square and an error radius that encloses the grid square (i.e. a 57% larger area)². So if you want to convert these data back to gridded data, you either have to choose a larger grid than the original to ensure that the point and the error radius are contained within the grid square³, or recalculate the edges of the square and the error, taking the radius of the circle as half the diagonal of the grid square. You can only do this, however, if you know that the data were gridded in the first place; otherwise you end up with squares that overlap each other.
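The geometry of this round trip can be sketched in a few lines of Python. This is only an illustration, assuming a square cell in a projected coordinate system and an error radius equal to half the diagonal of the cell; the function names and coordinates are hypothetical and are not taken from GBIF or Darwin Core.

```python
import math

def square_to_point_radius(sw_x, sw_y, size):
    """Convert a grid square (south-west corner and side length) to its
    centre point and the radius of the circle that encloses it."""
    centre_x = sw_x + size / 2
    centre_y = sw_y + size / 2
    radius = size * math.sqrt(2) / 2  # half the diagonal of the square
    return centre_x, centre_y, radius

def point_radius_to_square(centre_x, centre_y, radius):
    """Recover the grid square, assuming the radius is half its diagonal."""
    size = radius * math.sqrt(2)
    return centre_x - size / 2, centre_y - size / 2, size

cx, cy, r = square_to_point_radius(400000, 500000, 1000)
# Area of the enclosing circle relative to the square: pi / 2.
print(round(math.pi * r ** 2 / 1000 ** 2, 3))  # 1.571
print(point_radius_to_square(cx, cy, r))
```

The second function only gives a sensible answer when the record really did start life as a grid square, which is exactly the information that is lost in the conversion.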
This approach stems from the requirements of one community, the museums and herbaria, who geo-reference their collections using the point-radius method (Chapman & Wieczorek, 2006). It ignores the vast majority of data collectors and data users: the ecologists, conservationists and modellers. Indeed, the Darwin Core standard can handle gridded data, but rather clumsily, using the footprintWKT field⁴. Most databases containing gridded data hold the position of the south-west corner of the grid square, the size of the grid square and a description of the spatial reference system.
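To show how a typical grid record maps onto a footprint, here is a minimal sketch that builds a WKT polygon from a south-west corner and a cell size. The function name and the example coordinates are hypothetical, and the coordinates are assumed to be in the record's own spatial reference system, which would still have to be described separately.

```python
def grid_square_to_wkt(sw_x, sw_y, size):
    """Return a WKT POLYGON for a grid square given its south-west
    corner and side length (same units as the coordinates)."""
    ne_x, ne_y = sw_x + size, sw_y + size
    # The ring runs counter-clockwise and closes on the starting corner.
    corners = [(sw_x, sw_y), (ne_x, sw_y), (ne_x, ne_y),
               (sw_x, ne_y), (sw_x, sw_y)]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON (({ring}))"

print(grid_square_to_wkt(425000, 565000, 1000))
# POLYGON ((425000 565000, 426000 565000, 426000 566000, 425000 566000, 425000 565000))
```

Five coordinate pairs to express three numbers is part of what makes this route clumsy.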
GBIF surely wants to be a one-stop shop for biodiversity modelling data, so it should stop converting most of the data to a format that can’t be used without converting it back to the original format, assuming you are lucky enough to know which one it was in the first place.

Footnotes

1. This is Tobler's first law of geography: "Everything is related to everything else, but near things are more related than distant things." Tobler, W. (1970) "A computer movie simulating urban growth in the Detroit region". Economic Geography, 46(2): 234-240.
2. coordinatePrecision for gridded data in GBIF seems to have been interpreted in various ways and you have to study it from each data provider to understand what it means.
3. If you use a larger grid for modelling you are not looking at data at a different scale as is sometimes suggested, you are just losing definition. If your computer monitor had bigger pixels you wouldn’t say you’re looking at the image at a different scale.
4. The footprintWKT field is not available in GBIF downloads.

References

Chapman, A.D. and J. Wieczorek (eds). 2006. Guide to Best Practices for Georeferencing. Copenhagen: Global Biodiversity

Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, et al. (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715

Sunday, 18 August 2013

False positives are the worst kind of error

If in doubt leave it out

Misidentifications and unnoticed species are quite common in botanical surveys. Both lead to errors and misinformation. Failing to observe a species when it is in fact present is a false negative. False negatives are fairly easy to counteract by repeat surveying and using multiple observers. False negatives can even be used to quantify the abundance of a species, since the single largest determinant of observability is abundance (Royle & Nichols, 2003; Kéry, Royle & Schmid, 2005; Chen et al., 2009; Groom, 2013).
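Why repeat surveying works can be shown with a little arithmetic. Assuming, purely for illustration, that every visit detects a species that is present with the same probability and that visits are independent, the chance of missing it on all n visits is (1 - p)^n. The function name and the value p = 0.5 are hypothetical.

```python
def false_negative_rate(p_detect, n_visits):
    """Probability of missing a species that is present on every one of
    n independent visits, each with detection probability p_detect."""
    return (1 - p_detect) ** n_visits

# With a 50% chance of detection per visit, misses fall off quickly.
for n in (1, 2, 4):
    print(n, false_negative_rate(0.5, n))
```

In practice detection probability varies with abundance, observer and season, so this is a best case rather than a prediction.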

On the other hand, false positives are a menace as they are much more difficult to resolve. They pollute our datasets and are impossible to refute with certainty. Michael Shermer suggests that the number of false positives is greater when there is a cost associated with a false negative. For example, if the cost of a false negative were being eaten by a tiger, it would be more advantageous to make a few false positive identifications.

In the case of botanical surveying the costs are probably fairly well balanced between false negatives and false positives. However, under certain conditions one can imagine the cost swinging in either direction. For example, if a keen amateur wants to prove their botanical prowess, there is a cost to their ego associated with a false negative. On the other hand, a poorly paid professional ecologist may want to complete as many plots as possible in a short period, in which case the cost of a false negative reduces in comparison to that of a false positive.

False positives arise from several different behaviors.

An over-reliance on jizz, leading to dismissive identification without due consideration.
Inexperience of recorders, unaware of all the possible taxa that might occur.
Inadequate reference material, not including all the possible taxa.
Poor navigation, so that surveyors are outside of the survey area.

So what can be done to reduce the number of false positives?

Insist on a specimen for previously unobserved taxa. This can be done at a national and county level.
Training, not just in the identification of taxa, but also in navigation and in the consequences of misidentification.
When analyzing, grade observations by their sources and level of evidence (Molinari-Jobin et al., 2012).
Survey in a group: groups create fewer false positives (Wolf et al., 2013).
Don’t foster a culture where a long list is a good list. Foster a supportive, open, nonjudgmental culture where peer review is welcomed.
Use computer software that alerts the user if a taxon is new to the area. Never store records on paper or in spreadsheets.
Use a survey protocol whereby each new taxon has to be checked against at least one key character.
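The software check suggested in the list above need not be elaborate. A minimal sketch, assuming records are simple (taxon, area) pairs and using made-up placeholder names rather than real data, might look like this:

```python
def new_taxa(known_records, new_records):
    """Return the new records whose taxon has not been recorded before
    in that area, so they can be flagged for verification."""
    known = set(known_records)
    return [rec for rec in new_records if rec not in known]

# Hypothetical example data: (taxon, area) pairs.
known = [("Taxon A", "Area 1"), ("Taxon B", "Area 2")]
incoming = [("Taxon A", "Area 1"), ("Taxon B", "Area 1")]
print(new_taxa(known, incoming))  # [('Taxon B', 'Area 1')]
```

A flagged record is not necessarily wrong; it is simply one that deserves a specimen or a second opinion before it enters the database.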

False positives litter our databases. They are made by everyone and, in fact, experienced botanists, particularly the most confident, can be the worst offenders. Once these errors are made they contaminate the data, misleading researchers and leading to many wasted hours of confusion. If in doubt, leave it out.

References

Chen, G., Kéry, M., Zhang, J., & Ma, K. (2009). Factors affecting detection probability in plant distribution studies. Journal of Ecology, 97(6), 1383–1389. doi:10.1111/j.1365-2745.2009.01560.x

Groom, Q. J. (2013). Estimation of vascular plant occupancy and its change using kriging. New Journal of Botany, 3(1), 33–46. doi:10.1179/2042349712Y.0000000014

Kéry, M., Royle, J. A., & Schmid, H. (2005). Modeling Avian Abundance From Replicated Counts Using Binomial Mixture Models. Ecological Applications, 15(4), 1450–1461. doi:10.1890/04-1120

Molinari-Jobin, A., Kéry, M., Marboutin, E., Molinari, P., Koren, I., Fuxjäger, C., … Breitenmoser, U. (2012). Monitoring in the presence of species misidentification: the case of the Eurasian lynx in the Alps. Animal Conservation, 15(3), 266–273. doi:10.1111/j.1469-1795.2011.00511.x

Royle, J. A., & Nichols, J. D. (2003). Estimating Abundance From Repeated Presence–Absence Data or Point Counts. Ecology, 84(3), 777–790. doi:10.1890/0012-9658(2003)084[0777:EAFRPA]2.0.CO;2

Wolf, M., Kurvers, R. H. J. M., Ward, A. J. W., Krause, S., & Krause, J. (2013). Accurate decisions in an uncertain world: collective cognition increases true positives while decreasing false positives. Proceedings. Biological sciences / The Royal Society, 280(1756), 20122777. doi:10.1098/rspb.2012.2777

Addendum
A presentation given to the Botanical Society of the British Isles on this subject is available here.
