The following is a draft content mining declaration developed by the Open Knowledge Foundation’s Working Group on Open Access.

In brief: The Right to Read Is the Right to Mine

##Introduction

Researchers can find and read papers online, rather than having to track down print copies manually. Machines (computers) can index the papers and extract details (titles, keywords, etc.) in order to alert scientists to relevant material. In addition, computers can extract factual data and meaning by “mining” the content, opening up the possibility that machines could be used to make connections (and even scientific discoveries) that might otherwise remain invisible to researchers.

However, it is generally not possible today for computers to mine the content of papers, because of constraints imposed by publishers. While Open Access (OA) is improving researchers’ ability to read papers (by removing access barriers), only around 20% of scholarly papers are OA. The remainder are locked behind paywalls. Under the vast majority of subscription contracts, subscribers may read paywalled papers, but they may not mine them.

Content mining is the way that modern technology locates digital information. Because digitized scientific information comes from hundreds of thousands of different sources in today’s globally connected scientific community [2] and because current data sets can be measured in terabytes,[1] it is often no longer possible to make scientifically significant use of such information simply by reading a scholarly summary.[3] A researcher must be able to copy information, recombine it with other data and otherwise “re-use” it so as to produce truly helpful results. Content mining is not only a deductive tool for analyzing research data; it is also how search engines operate to allow discovery of content. To prevent mining is therefore to force scientists into blind alleys and silos where only limited knowledge is accessible. Science does not progress if it cannot incorporate the most recent findings and move forward from there.

##Definition

‘Open Content Mining’ means the unrestricted right of subscribers to extract, process and republish content manually or by machine in whatever form (text, diagrams, images, data, audio, video, etc.) without prior specific permissions and subject only to community norms of responsible behaviour in the electronic age.

Content in this sense includes:

  • Text
  • Numbers
  • Tables: numerical representations of a fact
  • Diagrams (line drawings, graphs, spectra, networks, etc.): graphical representations of relationships between variables. These are images and therefore, when considered as a collective entity, may not be data. However, the individual data points underlying a graph, like those in tables, should be.
  • Images and video (mainly photographic): where they are the means of expressing a fact?
  • Audio: same as images, where it expresses the factual representation of the research?
  • XML: “Extensible Markup Language (XML) defines rules for encoding documents in a format that is both human-readable and machine-readable.”
  • Core  bibliographic data: described as “data which is necessary to identify  and / or discover a publication” and defined under the Open Bibliography  Principles.
  • Resource  Description Framework (RDF): information about content, such as  authors, licensing information and the unique identifier for the article

##Principles

###Principle 1: Right of Legitimate Accessors to Mine

We assert that there is no legal, ethical or moral reason to refuse to  allow legitimate accessors of research content (OA or otherwise) to use  machines to analyse the published output of the research community.   Researchers expect to access and process the full content of the research literature with their computer programs and should be able to use their machines as they use their eyes.

The right to read is the right to mine.

###Principle 2: Lightweight Processing Terms and Conditions

Mining by legitimate subscribers should not be prohibited by contractual or other legal barriers. Publishers should add language to subscription agreements clarifying that content is available for information mining, whether by download or by remote access. Where access is through researcher-provided tools, no further cost should be required.

Users and providers should encourage machine processing.

###Principle 3: Use

Researchers can and will publish facts and excerpts which they discover by reading and processing documents. They expect to disseminate and aggregate statistical results as facts, and contextual text as fair-use excerpts, openly and with no restrictions other than attribution. Publisher efforts to claim rights in the results of mining further retard the advancement of science by making those results less available to the research community; such claims should be prohibited.

Facts don’t belong to anyone.

##Strategies

We plan to assert the above rights by:

  • Educating researchers and librarians about the potential of content mining and the current impediments to it, including alerting librarians to the need not to cede any of the above rights when signing contracts with publishers
  • Compiling  a list of publishers and indicating what rights they currently permit,  in order to highlight the gap between the rights here being asserted and  what is currently possible
  • Urging governments and funders to promote and aid the enjoyment of the above rights

[1]  Panzer-Steindel, Bernd, Sizing and Costing of the CERN T0 center, CERN-LCG-PEB-2004-21, 09 June 2004, at http://lcg.web.cern.ch/lcg/planning/phase2_resources/SizingandcostingoftheCERNT0center.pdf.

[2] The Value and Benefits of Text Mining, JISC, Report Doc #811, March 2012, Section 3.3.8 at http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx, citing P.J.Herron, “Text Mining Adoption for Pharmacogenomics-based Drug Discovery in a Large Pharmaceutical Company: a Case Study,” Library, 2006, claiming that text mining tools evaluated 50,000 patents in 18 months, a task that would have taken 50 person-years to complete manually.

[3] See MEDLINE® Citation Counts by Year of Publication, at http://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html and National Science Foundation, Science and Engineering Indicators: 2010, Chapter 5 at http://www.nsf.gov/statistics/seind10/c5/c5h.htm asserting that the annual growth in the volume of scientific journal articles published is on the order of 2.5%.
