The following is a draft content mining declaration developed by the Open Knowledge Foundation’s Working Group on Open Access.
In brief: The Right to Read Is the Right to Mine
##Introduction
Researchers can find and read papers online, rather than having to manually track down print copies. Machines (computers) can index the papers and extract the details (titles, keywords etc.) in order to alert scientists to relevant material. In addition, computers can extract factual data and meaning by “mining” the content, opening up the possibility that machines could be used to make connections (and even scientific discoveries) that might otherwise remain invisible to researchers.
However, it is generally not possible today for computers to mine the content of papers, owing to constraints imposed by publishers. While Open Access (OA) is improving researchers’ ability to read papers (by removing access barriers), still only around 20% of scholarly papers are OA. The remainder are locked behind paywalls. Under the vast majority of subscription contracts, subscribers may read paywalled papers, but they may not mine them.
Content mining is the way that modern technology locates digital information. Because digitized scientific information comes from hundreds of thousands of different sources in today’s globally connected scientific community[2] and because current data sets can be measured in terabytes,[1] it is often no longer possible to simply read a scholarly summary in order to make scientifically significant use of such information.[3] A researcher must be able to copy information, recombine it with other data and otherwise “re-use” it so as to produce truly helpful results. Content mining is not only a deductive tool for analyzing research data; it is also how search engines operate to allow discovery of content. To prevent mining is therefore to force scientists into blind alleys and silos where only limited knowledge is accessible. Science does not progress if it cannot incorporate the most recent findings and move forward from there.
##Definition
‘Open Content Mining’ means the unrestricted right of subscribers to extract, process and republish content manually or by machine in whatever form (text, diagrams, images, data, audio, video, etc.) without prior specific permissions and subject only to community norms of responsible behaviour in the electronic age.
- Text
- Numbers
- Tables: numerical representations of a fact
- Diagrams (line drawings, graphs, spectra, networks, etc.): graphical representations of relationships between variables. These are images and therefore may not, when considered as a collective entity, be data. However, the individual data points underlying a graph, like those in a table, should be.
- Images and video (mainly photographic): where they are the means of expressing a fact?
- Audio: same as images, where it expresses the factual representation of the research?
- XML: Extensible Markup Language (XML) defines rules for encoding documents in a format that is both human-readable and machine-readable.
- Core bibliographic data: described as “data which is necessary to identify and / or discover a publication” and defined under the Open Bibliography Principles.
- Resource Description Framework (RDF): information about content, such as authors, licensing information and the unique identifier for the article
##Principles
###Principle 1: Right of Legitimate Accessors to Mine
We assert that there is no legal, ethical or moral reason to refuse to allow legitimate accessors of research content (OA or otherwise) to use machines to analyse the published output of the research community. Researchers expect to access and process the full content of the research literature with their computer programs and should be able to use their machines as they use their eyes.
The right to read is the right to mine
###Principle 2: Lightweight Processing Terms and Conditions
Mining by legitimate subscribers should not be prohibited by contractual or other legal barriers. Publishers should add clarifying language in subscription agreements that content is available for information mining by download or by remote access. Where access is through researcher-provided tools, no further cost should be required.
Users and providers should encourage machine processing
###Principle 3: Use
Researchers can and will publish facts and excerpts which they discover by reading and processing documents. They expect to disseminate and aggregate statistical results as facts and context text as fair use excerpts, openly and with no restrictions other than attribution. Publisher efforts to claim rights in the results of mining further retard the advancement of science by making those results less available to the research community; such claims should be prohibited.
Facts don’t belong to anyone.
##Strategies
We plan to assert the above rights by:
- Educating researchers and librarians about the potential of content mining and the current impediments to doing so, including alerting librarians to the need not to cede any of the above rights when signing contracts with publishers
- Compiling a list of publishers and indicating what rights they currently permit, in order to highlight the gap between the rights here being asserted and what is currently possible
- Urging governments and funders to promote and aid the enjoyment of the above rights
[1] Panzer-Steindel, Bernd, Sizing and Costing of the CERN T0 center, CERN-LCG-PEB-2004-21, 09 June 2004, at http://lcg.web.cern.ch/lcg/planning/phase2_resources/SizingandcostingoftheCERNT0center.pdf.
[2] The Value and Benefits of Text Mining, JISC, Report Doc #811, March 2012, Section 3.3.8 at http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx, citing P.J. Herron, “Text Mining Adoption for Pharmacogenomics-based Drug Discovery in a Large Pharmaceutical Company: a Case Study,” Library, 2006, claiming that text mining tools evaluated 50,000 patents in 18 months, a task that would have taken 50 person-years to do manually.
[3] See MEDLINE® Citation Counts by Year of Publication, at http://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html and National Science Foundation, Science and Engineering Indicators: 2010, Chapter 5 at http://www.nsf.gov/statistics/seind10/c5/c5h.htm, asserting that the annual growth in the volume of scientific journal articles published is on the order of 2.5%.