Beauty behind the scenes

Tryggvi Björgvinsson - August 5, 2015 in CKAN, OKF Sweden, Open Data, open knowledge

Good things can often go unnoticed, especially if they’re not immediately visible. Last month the government of Sweden, through Vinnova, released a revamped version of their open data portal, Öppnadata.se. The portal still runs on CKAN, the open data management system. It even has much the same look and feel, but the principles behind the portal are completely different. The main idea behind the new version of Öppnadata.se is automation. Open Knowledge teamed up with the Swedish company Metasolutions to build and deliver an automated open data portal.

Responsive design

In modern web development, one aspect of website automation called responsive design has become very popular. With this technique the website automatically adjusts the presentation depending on the screen size. That is, it knows how best to present the content given different screen sizes. Öppnadata.se got a slight facelift in terms of tweaks to its appearance, but the big news on that front is that it now has a responsive design. The portal looks different if you access it on mobile phones or if you visit it on desktops, but the content is still the same.

These changes were contributed to CKAN. They are now a part of the CKAN core web application as of version 2.3. This means everyone can now have responsive data portals as long as they use a recent version of CKAN.

New Öppnadata.se

Old Öppnadata.se

Data catalogs

Perhaps the biggest innovation of Öppnadata.se is how the automation process works for adding new datasets to the catalog. Normally with CKAN, data publishers log in and create or update their datasets on the CKAN site. CKAN has for a long time also supported something called harvesting, where an instance of CKAN goes out and fetches new datasets and makes them available. That’s a form of automation, but it’s dependent on specific software being used or special harvesters for each source. So harvesting from one CKAN instance to another is simple. Harvesting from a specific geospatial data source is simple. Automatically harvesting from something that you don’t know about and that doesn’t exist yet is hard.

That’s the reality which Öppnadata.se faces. Only a minority of public organisations and municipalities in Sweden publish open data at the moment. So most public entities have not yet decided which software or solution they will use to publish open data.

To tackle this problem, Öppnadata.se relies on an open standard from the World Wide Web Consortium called DCAT (Data Catalog Vocabulary). The open standard describes how to publish a list of datasets and it allows Swedish public bodies to pick whatever solution they like to publish datasets, as long as one of its outputs conforms with DCAT.

Öppnadata.se actually uses a DCAT application profile which was specially created for Sweden by Metasolutions and defines in more detail what to expect, for example that Öppnadata.se expects to find dataset classifications according to the Eurovoc classification system.

Thanks to this effort, significant improvements have been made to CKAN’s support for RDF and DCAT. They include application profiles (like the Swedish one) for harvesting and exposing DCAT metadata in different formats. So a CKAN instance can now automatically harvest datasets from a range of DCAT sources, which is exactly what Öppnadata.se does. For Öppnadata.se, the CKAN support also makes it easy for Swedish public bodies who use CKAN to automatically expose their datasets correctly so that they can be automatically harvested by Öppnadata.se. For more information have a look at the CKAN DCAT extension documentation.
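To give a feel for what a harvester reads from a DCAT source, here is a minimal sketch in Python. It is not the CKAN DCAT extension's code. It assumes the catalog is served as JSON-LD in one particular compacted layout, using the `dcat:` and `dct:` prefixes; a real harvester would use a proper RDF library and handle other serialisations. The sample catalog, its URLs and the theme URI are made up for illustration.

```python
import json

# A made-up catalog for illustration. The property names (dcat:dataset,
# dct:title, dcat:theme, dcat:distribution, dcat:accessURL) come from the
# DCAT vocabulary; the values and the compacted layout are assumptions.
SAMPLE_CATALOG = """
{
  "@type": "dcat:Catalog",
  "dct:title": "Example municipality catalog",
  "dcat:dataset": [
    {
      "@type": "dcat:Dataset",
      "dct:title": "Playgrounds",
      "dcat:theme": "http://example.org/theme/placeholder",
      "dcat:distribution": [
        {"dcat:accessURL": "https://data.example.org/playgrounds.csv"}
      ]
    }
  ]
}
"""


def as_list(value):
    """JSON-LD may compact a single item to a bare object; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def datasets_from_catalog(text):
    """Return title, theme and access URLs for each dataset in the catalog."""
    catalog = json.loads(text)
    found = []
    for entry in as_list(catalog.get("dcat:dataset")):
        distributions = as_list(entry.get("dcat:distribution"))
        found.append({
            "title": entry.get("dct:title"),
            "theme": entry.get("dcat:theme"),
            "urls": [d["dcat:accessURL"] for d in distributions
                     if "dcat:accessURL" in d],
        })
    return found


if __name__ == "__main__":
    for dataset in datasets_from_catalog(SAMPLE_CATALOG):
        print(dataset["title"], dataset["urls"])
```

The point of the standard is visible even in this toy: the harvester only needs to know the vocabulary, not which software produced the catalog.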

Dead or alive

The Web is decentralised and always changing. A link to a webpage that worked yesterday might not work today because the page was moved. When automatically adding external links, for example, links to resources for a dataset, you run into the risk of adding links to resources that no longer exist.

To counter that, Öppnadata.se uses a CKAN extension called Dead or alive. It may not be the best name, but that’s what it does. It checks if a link is dead or alive. The checking itself is performed by an external service called deadoralive. The extension just serves a set of links, and the external service decides which of them to check to see whether they are alive. In this way dead links are automatically marked as broken and system administrators of Öppnadata.se can find problematic public bodies and notify them that they need to update their DCAT catalog (this is not automatic because nobody likes spam).
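The core idea of such a link check can be sketched in a few lines of Python. This is not the deadoralive service's implementation, just an illustration under simple assumptions: a link counts as alive if a HEAD request returns a 2xx or 3xx status, and as dead otherwise. The function names and the injectable `get_status` parameter are hypothetical, chosen so the logic can be tried without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def http_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if unreachable."""
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # The server answered, but with an error status such as 404.
        return error.code
    except (URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout or malformed URL.
        return None


def check_links(urls, get_status=http_status):
    """Map each URL to 'alive' or 'dead' (assumption: 2xx/3xx means alive)."""
    results = {}
    for url in urls:
        status = get_status(url)
        alive = status is not None and 200 <= status < 400
        results[url] = "alive" if alive else "dead"
    return results


if __name__ == "__main__":
    # A fake status lookup stands in for the network in this demo.
    fake = {"https://example.org/ok.csv": 200,
            "https://example.org/gone.csv": 404}
    print(check_links(fake, get_status=fake.get))
```

A production checker would likely retry a few times before declaring a link broken, since a single failed request may only mean a temporary outage.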

These are only the automation highlights of the new Öppnadata.se. Other changes were made that have little to do with automation but are still not immediately visible, so a lot of Öppnadata.se’s beauty happens behind the scenes. That’s also the case for other open data portals. You might just visit your open data portal to get some open data, but you might not realise the amount of effort and coordination it takes to get that data to you.

Image of Swedish flag by Allie_Caulfield on Flickr (cc-by)

This post has been republished from the CKAN blog.

Presenting public finance just got easier

Tryggvi Björgvinsson - March 20, 2015 in CKAN, Open Spending

This blog post is cross-posted from the CKAN blog.

CKAN 2.3 is out! The world-famous data handling software suite which powers data.gov, data.gov.uk and numerous other open data portals across the world has been significantly upgraded. How can this version open up new opportunities for existing and coming deployments? Read on.

One of the new features of this release is the ability to create extensions that get called before and after a new file is uploaded, updated, or deleted on a CKAN instance.

This may not sound like a major improvement but it creates a lot of new opportunities. Now it’s possible to analyse the files (which are called resources in CKAN) and take them to new uses based on that analysis. To showcase how this works, Open Knowledge in collaboration with the Mexican government, the World Bank (via Partnership for Open Data), and the OpenSpending project have created a new CKAN extension which uses this new feature.

It’s actually two extensions. One, called ckanext-budgets, listens for creation and updates of resources (i.e. files) in CKAN, and when that happens the extension analyses the resource to see if it conforms to the data file part of the Budget Data Package specification. The Budget Data Package specification is a relatively new specification for budget publications, designed for comparability, flexibility, and simplicity. It’s similar to data packages in that it provides metadata around simple tabular files, like a CSV file. If the CSV file (a resource in CKAN) conforms to the specification (i.e. the columns have the correct titles), then the extension automatically creates the Budget Data Package metadata based on the CKAN resource data and makes the complete Budget Data Package available.

It might sound very technical, but it really is very simple. You add or update a CSV file resource in CKAN and it automatically checks if it contains budget data in order to publish it in a standardised form. In other words, CKAN can now automatically produce standardised budget resources which make integration with other systems a lot easier.
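The column check at the heart of this can be sketched as follows. This is not the ckanext-budgets code. The column names in `REQUIRED_COLUMNS` are hypothetical placeholders; the real required fields are defined by the Budget Data Package specification.

```python
import csv
import io

# Hypothetical column names, for illustration only. The real required
# fields are defined by the Budget Data Package specification.
REQUIRED_COLUMNS = {"id", "amount", "code"}


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column titles absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file has no header at all
    present = {name.strip().lower() for name in header}
    return set(required) - present


def conforms(csv_text, required=REQUIRED_COLUMNS):
    """True if every required column title appears in the header."""
    return not missing_columns(csv_text, required)


if __name__ == "__main__":
    good = "id,amount,code\n1,1000,EDU\n"
    bad = "id,amount\n1,1000\n"
    print(conforms(good), missing_columns(bad))
```

In an extension, a check like this would run whenever a resource is created or updated, and only conforming files would go on to have metadata generated for them.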

The second extension, called ckanext-openspending, shows how easy such an integration around standardised data is. The extension takes the published Budget Data Packages and automatically sends them to OpenSpending. From there OpenSpending does its own thing, analyses the data, aggregates it and makes it very easy to use for those who use OpenSpending’s visualisation library.

So thanks to a perhaps seemingly insignificant extension feature in CKAN 2.3, getting beautiful and understandable visualisations of budget spreadsheets is now only an upload to a CKAN instance away (and can only get easier as the two extensions improve).

To learn even more, see this report about the CKAN and OpenSpending integration efforts.

Pakistan Data Portal

jobarratt - February 9, 2015 in CKAN, open knowledge

December 2014 saw the Sustainable Development Policy Institute and Alif Ailaan launch the Pakistan Data Portal at the 30th Annual Sustainable Development Conference. The portal, built using CKAN by Open Knowledge, provides an access point for viewing and sharing data relating to all aspects of education in Pakistan.

A particular focus of this project was to design an open data portal that could be used to support advocacy efforts by Alif Ailaan, an organisation dedicated to improving education outcomes in Pakistan.

The Pakistan Data Portal (PDP) is the definitive collection of information on education in Pakistan and collates datasets from private and public research organisations on topics including infrastructure, finance, enrollment, and performance, to name a few. The PDP is a single point of access against which change in Pakistani education can be tracked and analysed. Users, who include teachers, parents, politicians and policy makers, are able to browse historical data and compare and contrast it across regions and years to reveal a clear, customizable picture of the state of education in Pakistan. From this clear overview, the drivers and constraints of reform can be identified, which allows Alif Ailaan and others pushing for change in the country to focus their reform efforts.

Pakistan is facing an education emergency. It is a country with 25 million children out of education, and 50% of girls of school age do not attend classes. A census has not been completed since 1998 and there are problems with the data that is available. It is outdated, incomplete, error-ridden and only a select few have access to much of it. An example that highlights this is a recent report from ASER, which estimates the number of children out of school at 16 million fewer than the number computed by Alif Ailaan in another report. NGOs and other advocacy groups have tended to only be interested in data when it can be used to confirm that the funds they are utilising are working. Whilst there is agreement on the overall problem, if people cannot agree on its scale, how can a consensus solution be hoped for?

Alif Ailaan believe that if you can’t measure the state of education in the country, you can’t hope to fix it. This forms the focus of their campaigning efforts. So whilst the quality of the data is a problem, some data is better than no data, and the PDP forms a focus for gathering quality information together and for building a platform from which to drive change and promote policy change, so that policy makers can make accurate decisions which are backed up by data.

The data accessible through the portal is supported by regular updates from the PDP team, who draw attention to timely key issues and analyse the data. A particular subject or dataset will be explored from time to time, and these general blog posts are supported by “The Week in Education”, which summarises the latest education news, data releases and publications.

CKAN was chosen as the portal best placed to meet the needs of the PDP. Open Knowledge were tasked with customising the portal and providing training and support to the team maintaining it. A custom dashboard system was developed for the platform in order to present data in an engaging visual format.

As explained by Asif Mermon, Associate Research Fellow at SDPI, the genius of the portal is the shell. As institutions start collecting data, or old data is uncovered, it can be added to the portal to continually improve the overall picture.

The PDP is in constant development to further promote the analysis of information in new ways and build on the improvement of the visualizations on offer. There are also plans to expand the scope of the portal, so that areas beyond education can also reap its benefits. A further benefit is that the shell can then be exported around the world so other countries will be able to benefit from the development.

The PDP initiative is part of the multi-year DFID-funded Transforming Education Pakistan (TEP) campaign aiming to increase political will to deliver education reform in Pakistan. Accadian, on behalf of HTSPE, appointed the Open Knowledge Foundation to build the data observatory platform and provide support in managing the upload of data including onsite visits to provide training in Pakistan.

 

Take a CKAN Tour

Heather Leson - May 1, 2014 in CKAN, Events, OKF Projects

From baby name datasets and apps via the South Australian government to the new City of Surrey, B.C. (Canada) site, there are many instances of CKAN around the world. CKAN is the data management system that makes data accessible – by providing tools to streamline publishing, sharing, finding and using data. It is used by various levels of government, civil society groups and organizations to make their data transparent and available.

In this 1-hour video hangout, Irina Bolychevsky, Services Director, gives us an overview of CKAN with live demos of several CKAN sites including data.gov.uk, publicdata.eu and data.glasgow.gov.uk. She also answers community questions.

Get Involved

CKAN has a wide community of contributors working to remix and extend the software. Two examples of code that folks have contributed are ckanext-spatial and ckanext-realtime (both on GitHub).

The CKAN core committers host regular online developer meetings. These take place every Tuesday and Thursday, 13:00 – 14:00 EDT, and are spent reviewing pull requests and discussing architecture. We meet on the ckan developer mailing list, on the #ckan IRC channel on freenode (where you can get the Google Hangout link for meetings!) and in comments on GitHub tickets. All welcome.

Community questions tend to be asked on Stack Overflow using the CKAN tag. You can also file issues or contribute code on GitHub.

Contact us

If you want to talk about CKAN development, please come and say hi on the ckan-dev mailing list or the #ckan IRC channel on irc.freenode.org. If you have service inquiries, you can reach out to the team: services at ckan dot org

Upcoming Community Sessions: CKAN, Community Feedback

Heather Leson - April 28, 2014 in CKAN, Events, Network, Open Knowledge Foundation Local Groups, Our Work, Working Groups

Happy week! We are hosting two Community Sessions this week. You have expressed an interest in learning more about CKAN. We are also continuing our regular Community Feedback sessions.

Take a CKAN Tour:

This week we will give an overview and tour of CKAN – the leading open source open data platform used by the national governments of the US, UK, Brazil, Canada, Australia, France, Germany, Austria and many more. This session will cover why data portals are useful, what they provide and showcase examples and best practices from CKAN’s varied user base! Bring your questions on how to get started and best practices.

Guest: Irina Bolychevsky, Services Director (Open Knowledge) Questions are welcome via G+ or Twitter.

  • Date: Wednesday, April 30, 2014
  • Time: 7:30 PT / 10:30 ET / 14:30 UTC / 15:30 BST / 16:30 CEST
  • Duration: 1 hour
  • Register and Join via G+ (The Hangout will be recorded.)

Community Feedback Session

We promised to schedule another Community Feedback Session. It is hard to find a common time for folks. We will work on timeshifting these for next sessions. This is a chance to ask questions, give input and help shape Open Knowledge.

Please join Laura, Naomi and me for the next Community Feedback Session. Bring your ideas and questions.

  • Date: Wednesday, April 30, 2014
  • Time: 9:00 PT / 12:00 EDT / 16:00 UTC / 17:00 BST / 18:00 CEST
  • Duration: 1 hour
  • Join via Meeting Burner

We will use Meeting Burner and IRC. (Note: We will record both of these.)

How to join Meeting Burner (audio instructions):

  • Option 1: Dial in to the following conference line: 1-(949) 229-4400 #
  • Option 2: Join the conference bridge with your computer’s microphone/speakers or headset

How to join IRC: http://wiki.okfn.org/How_to_use_IRC/_Clients_and_Tips

More about the new Open Knowledge Brand

Host a Community Session in May

We are booking Community Sessions for May. These Open Knowledge online events can be in a number of forms: a scheduled IRC chat, a community google hangout, a technical sprint or an editathon. The goal is to connect the community to learn and share their stories and skills. If you would like to suggest a session or host one, please contact heather dot leson at okfn dot org.

More details about Community Sessions

(Photo: Heather Leson (San Francisco))

Building an archaeological project repository II: Where are the research data repositories?

Guest - April 17, 2014 in CKAN, Open Science, WG Archaeology

This is a guest post by Anthony Beck, Honorary fellow, and Dave Harrison, Research fellow, at the University of Leeds School of Computing

Data repository as research tool

In a previous post, we examined why Open Science is necessary to take advantage of the huge corpus of data generated by modern science. In our project Detection of Archaeological residues using Remote sensing Techniques, or DART, we adopted Open Science principles and made all the project’s extensive data available through a purpose-built data repository built on the open-source CKAN platform. But with so many academic repositories, why did we need to roll our own? A final post will look at how the portal was implemented.

DART: data-driven archaeology

DART’s overall aim is to develop analytical methods to differentiate archaeological sediments from non-archaeological strata, on the basis of remotely detected phenomena (e.g. resistivity, apparent dielectric permittivity, crop growth, thermal properties etc). DART is a data rich project: over a 14 month period, in-situ soil moisture, soil temperature and weather data were collected at least once an hour; ground based geophysical surveys and spectro-radiometry transects were conducted at least monthly; aerial surveys collecting hyperspectral, LiDAR and traditional oblique and vertical photographs were taken throughout the year, and laboratory analyses and tests were conducted on both soil and plant samples. The data archive itself is in the order of terabytes.

Analysis of this archive is ongoing; meanwhile, this data and other resources are made available through open access mechanisms under liberal licences and are thus accessible to a wide audience. To achieve this we used the open-source CKAN platform to build a data repository, DARTPortal, which includes a publicly queryable spatio-temporal database (on the same host), and can support access to individual data as well as mining or analysis of integrated data.

This means we can share the data analysis and transformation processes and demonstrate how we transform data into information and synthesise this information into knowledge (see, for example, this IPython notebook which dynamically exploits the database connection). This is the essence of Open Science: exposing the data and processes that allow others to replicate and more effectively build on our science.

Lack of existing infrastructure

Pleased though we are with our data repository, it would have been nice not to have to build it! Individual research projects should not bear the burden of implementing their own data repository framework. This is much better suited to local or national institutions where the economies of scale come into their own. Yet in 2010 the provision of research data infrastructure that supported what DART did was either non-existent or poorly advertised. Where individual universities provided institutional repositories, these were focused on publications (the currency of prestige and career advancement) and not on data. Irrespective of other environments, none of the DART collaborating partners provided such a data infrastructure.

Data sharing sites like Figshare did not yet exist, and once Figshare did exist the size of our hyperspectral data, in particular, was quite rightly a worry. This situation is slowly changing, but it is still far from ideal. The positions taken by Research Councils UK and the Engineering and Physical Sciences Research Council (EPSRC) on improving access to data are key catalysts for change. The EPSRC statement is particularly succinct:

Two of the principles are of particular importance: firstly, that publicly funded research data should generally be made as widely and freely available as possible in a timely and responsible manner; and, secondly, that the research process should not be damaged by the inappropriate release of such data.

This has produced a simple economic issue – if research institutions can not demonstrate that they can manage research data in the manner required by the funding councils then they will become ineligible to receive grant funding from that council. The impact is that the majority of universities are now developing their own, or collaborating on communal, data repositories.

But what about formal data deposition environments?

DART was generously funded through the Science and Heritage Programme supported by the UK Arts and Humanities Research Council (AHRC) and the EPSRC. This means that these research councils will pay for data archiving in the appropriate domain repository, in this case the Archaeology Data Service (ADS). So why produce our own repository?

Deposition to the ADS would only have occurred after the project had finished. With DART, the emphasis has been on re-use and collaboration rather than primarily on archiving. These goals are not mutually exclusive: the methods adopted by DART mean that we produced data that is directly suitable for archiving (well documented ASCII formats, rich supporting description and discovery metadata, etc) whilst also allowing more rapid exposure and access to the ‘full’ archive. This resulted in DART generating much richer resource discovery and description metadata than would have been the case if the data was simply deposited into the ADS.

The point of the DART repository was to produce an environment which would facilitate good data management practice and collaboration during the lifetime of the project. This is representative of a crucial shift in thinking, where projects and data collectors consider re-use, discovery, licences and metadata at a much earlier stage in the project life cycle: in effect, to create dynamic and accessible repositories that have impact across the broad stakeholder community rather than focussing solely on the academic community. The same underpinning philosophy of encouraging re-use is seen at both Figshare and DataHub. Whilst formal archiving of data is to be encouraged, if it is not re-useable, or more importantly easily re-useable, within orchestrated scientific workflow frameworks then what is the point?

In addition, it is unlikely that the ADS will take the full DART archive. It has been said that archaeological archives can produce lots of extraneous or redundant ‘stuff’. This can be exacerbated by the unfettered use of digital technologies – how many digital images are really required for the same trench? Whilst we have sympathy with this argument, there is a difference between ‘data’ and ‘pretty pictures’: as data analysts, we consider that a digital photograph is normally a data resource and rarely a pretty picture. Hence, every image has value.

This is compounded when advances in technology mean that new data can be extracted from ‘redundant’ resources. For example, Structure from Motion (SfM) is a Computer Vision technique that extracts 3D information from 2D images. From a series of overlapping photographs, SfM techniques can be used to extract 3D point clouds and generate orthophotographs from which accurate measurements can be taken. In the case of SfM there is no such thing as redundancy, as each image becomes part of a ‘bundle’ and the statistical characteristics of the bundle determine the accuracy of the resultant model. However, one does need to be pragmatic, and it is currently impractical for organisations like the ADS to accept unconstrained archives. That said, it is an area that needs review: if a research object is important enough to have detailed metadata created about it, then it should be important enough to be archived.

For DART, this means that the ADS is hosting a subset of the archive in long-term re-use formats, which will be available in perpetuity (which formally equates to a maximum of 25 years), while the DART repository will hold the full archive in long-term re-use formats until we run out of server money. We are in discussion with Leeds University to migrate all the data objects over to the new institutional repository with sparkling new DOIs, and we can transfer the metadata held in CKAN over to Open Knowledge’s public repository, the DataHub. In theory nothing should be lost.

How long is forever?

The point on perpetuity is interesting. Collins Dictionary defines perpetuity as ‘eternity’. However, the ADS defines ‘digital’ perpetuity as 25 years. This raises the question: is it more effective in the long term to deposit in ‘formal’ environments (with an intrinsic focus on preservation format over re-use), or in ‘informal’ environments (with a focus on re-use and engagement over preservation), such as Flickr, Wikimedia Commons or the DART repository based on CKAN? Both Flickr and Wikimedia Commons have been around for over a decade. Distributed peer to peer sharing, as used in Git, produces more robust and resilient environments which are equally suited to longer term preservation. Whilst the authors appreciate that the situation is much more nuanced, particularly with the introduction of platforms that facilitate collaborative workflow development, this does have an impact on long-term deployment.

Choosing our licences

Licences are fundamental to the successful re-use of content. Licences describe who can use a resource, what they can do with this resource and how they should reference any resource (if at all).

Two lead organisations have developed legal frameworks for content licensing, Creative Commons (CC) and Open Data Commons (ODC). Until the release of CC version 4, published in November 2013, the CC licence did not cover data. Between them, CC and ODC licences can cover all forms of digital work.

At the top level the licences are permissive public domain licences (CC0 and PDDL respectively) that impose no restrictions on the licensee’s use of the resource. ‘Anything goes’ in a public domain licence: the licensee can take the resource and adapt it, translate it, transform it, improve upon it (or not!), package it, market it, sell it, etc. Constraints can be added to the top level licence by employing the following clauses:

  • BY – By attribution: the licensee must attribute the source.
  • SA – Share-alike: if the licensee adapts the resource, they must release the adapted resource under the same licence.
  • NC – Non commercial: the licensee must not use the work within a commercial activity without prior approval. Interestingly, in many areas of the world, the use of material in university lectures may be considered a commercial activity. The non-commercial restriction is about the nature of the activity, not the legal status of the institution doing the work.
  • ND – No derivatives: the licensee can not derive new content from the resource.

Each of these clauses decreases the ‘open-ness’ of the resource. In fact, the NC and ND clauses are not intrinsically open (they restrict both who can use and what you can do with the resource). These restrictive clauses have the potential to produce licence incompatibilities which may introduce profound problems in the medium to long term. This is particularly relevant to the SA clause. Share-alike means that any derived output must be licensed under the same conditions as the source content. If content is combined (or mashed up) – which is essential when one is building up a corpus of heritage resources – then content created under a SA clause cannot be combined with content that includes a restrictive clause (BY, NC or ND) that is not in the source licence. This licence incompatibility has a significant impact on the nature of the data commons. It has the potential to fragment the data landscape, creating pockets of knowledge which are rarely used in mainstream analysis, research or policy making. This will be further exacerbated when automated data aggregation and analysis systems become the norm. A permissive licence without clauses like Non-commercial, Share-alike or No-derivatives removes such licence and downstream re-user fragmentation issues.
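The share-alike incompatibility described above can be made concrete with a small Python sketch. It encodes only the rule as stated in this post, treating a licence as a set of clause codes, and is a deliberate simplification; real licence compatibility depends on the exact licence texts and versions. The function names are hypothetical.

```python
# Clauses the post describes as restrictive when combining content.
RESTRICTIVE = {"BY", "NC", "ND"}


def extra_restrictions(source_clauses, other_clauses):
    """Restrictive clauses on the other content that the source lacks.

    Only meaningful when the source carries SA; without share-alike the
    rule described in the post does not apply, so nothing is reported.
    """
    source = set(source_clauses)
    if "SA" not in source:
        return set()
    return (set(other_clauses) & RESTRICTIVE) - source


def can_combine(source_clauses, other_clauses):
    """True if the post's share-alike rule does not block the mash-up."""
    return not extra_restrictions(source_clauses, other_clauses)


if __name__ == "__main__":
    by_sa = {"BY", "SA"}
    print(can_combine(by_sa, {"BY"}))          # same restrictions: allowed
    print(can_combine(by_sa, {"BY", "NC"}))    # NC is not in the source
    print(extra_restrictions(by_sa, {"BY", "NC"}))
```

An automated aggregator applying even this crude rule would have to leave out every dataset that fails the check, which is exactly the fragmentation the paragraph warns about.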

For completeness, specific licences have been created for Open Government Data. The UK Government Data Licence for public sector information is essentially an open licence with a BY attribution clause.

At DART we have followed the guidelines of The Open Data Institute and separated out creative content (illustrations, text, etc.) from data content. Hence, the DART content is either CC-BY or ODC-BY respectively. In the future we believe it would be useful to drop the BY (attribution) clause. This would stop attribution stacking: if the resource you are using is a derivative of a derivative of a derivative of a … (you get the picture), at what stage do you stop attributing? Moreover, anything which requires bureaucracy, such as attributing an image in a PowerPoint presentation, inhibits re-use (one should always assume that people are intrinsically lazy). There is a post advocating ccZero+ by Dan Cohen. However, impact tracking may mean that the BY clause becomes a default for academic deposition.

The ADS uses a more restrictive bespoke default licence which does not map to national or international licence schemes (they also don’t recognise non CC licences). Resources under this licence can only be used for teaching, learning, and research purposes. Of particular concern is their use of the NC clause and possible use of the ND clause (depending on how you interpret the licence). Interestingly, policy changes mean that the use of data under the bespoke ADS licence becomes problematic if university teaching activities are determined to be commercial. It is arguable that the payment of tuition fees represents a commercial activity. If this is true then resources released under the ADS licence can not be used within university teaching which is part of a commercial activity. Hence, the policy change in student tuition and university funding has an impact on the commercial nature of university teaching which has a subsequent impact on what data or resources universities are licensed to use. Whilst it may never have been the intention of the ADS to produce a licence with this potential paradox, it is a problem when bespoke licences are developed, even if they were originally perceived to be relatively permissive licences. To remove this ambiguity it is recommended that submissions to the ADS are provided under a CC licence which renders the bespoke ADS licence void.

In the case of DART, these licence variations with the ADS should not be a problem. Our licences are permissive (by attribution is the only clause we have included). This means the ADS can do anything they want with our resources as long as they cite the source. In our case this would be the individual resource objects or collections on the DART portal. This is a good thing, as the metadata on the DART portal is much richer than the metadata held by the ADS.

Concerns about opening up data, and responses which have proved effective

Christopher Gutteridge (University of Southampton) and Alexander Dutton (University of Oxford) have collated a Google doc entitled ‘Concerns about opening up data, and responses which have proved effective‘. This document describes a number of concerns commonly raised by academic colleagues about increasing access to data. For DART two issues became problematic that were not covered by this document:

  • The relationship between open data and research novelty and the impact this may have on a PhD submission.
  • Journal publication – specifically that a journal won’t publish a research paper if the underlying data is open.

The former point is interesting – does the process of undertaking open science, or at least providing open data, undermine the novelty of the resultant scientific process? With open science it could be difficult to directly attribute the contribution, or novelty, of a single PhD student to an openly collaborative research process. However, that said, if online versioning tools like Git are used, then it is clear who has contributed what to a piece of code or a workflow (the benefits of the BY clause). This argument is less solid when we are talking solely about open data. Whilst it is true that other researchers (or anybody else for that matter) have access to the data, it is highly unlikely that multiple researchers will use the same data to answer exactly the same question. If they do ask the same question (and making the optimistic assumption that they reach the same conclusion), it is still highly unlikely that they will have done so by the same methods; and even if they do, their implementations will be different. If multiple methods using the same source data reach the same conclusion then there is an increased likelihood that the conclusion is correct and that the science is even more certain. The underlying point here is that 21st-century scientific practice will substantially benefit from people showing their working. Exposure of the actual process of scientific enquiry (the algorithms, code, etc.) will make the steps between data collection and publication more transparent, reproducible and peer-reviewable – or, quite simply, more scientific. Hence, we would argue that open data and research novelty is only a problem if plagiarism is a problem.

The journal publication point is equally interesting. Publications are the primary metric for academic career progression and kudos. In this instance it was the policy of the ‘leading journal in this field’ that they would not publish a paper from a dataset that was already published. No credible reasons were provided for this clause – which seems draconian in the extreme. It does indicate that no one-size-fits-all approach will work in the academic landscape. It will also be interesting to see how this journal, which publishes work which is mainly funded by EPSRC, responds to the EPSRC guidelines on open data.

This is also a clear demonstration that the academic community needs to develop new metrics that are more suited to 21st-century research and scholarship by directly linking academic career progression to other sources of impact that go beyond publications. Furthermore, academia needs some high-profile exemplars that demonstrate clearly how to deal with such change. The policy shift and ongoing debate concerning ‘Open access’ publications in the UK is changing the relationship between funders, universities, researchers, journals and the public – a similar debate needs to occur about open data and open science.

The altmetrics community is developing new metrics for “analyzing, and informing scholarship” and have described their ethos in their manifesto. The Research Councils and Governments have taken a much greater interest in the impact of publicly funded research. Importantly, public, social and industry impact are as important as academic impact. It is incumbent on universities to respond to this by directly linking academic career progression to impact and by encouraging improved access to the underlying data and processing outputs of the research process through data repositories and workflow environments.

Skillshares and Stories: Upcoming Community Sessions

Heather Leson - April 3, 2014 in CKAN, Events, Network, OKF Brazil, OKF Projects, Open Access, Open Knowledge Foundation Local Groups, School of Data

We’re excited to share with you a few upcoming Community Sessions from the School of Data, CKAN, Open Knowledge Brazil, and Open Access. As we mentioned earlier this week, we aim to connect you to each other. Join us for the following events!

What is a Community Session: These online events can take a number of forms: a scheduled IRC chat, a community Google Hangout, a technical sprint or a hackpad editathon. The goal is to connect the community to learn and share their stories and skills.

We held our first Community Session yesterday (see our Wiki Community Session notes). The remaining April events will be online via G+. These sessions will be public Hangouts on Air. The video will be available on the Open Knowledge YouTube Channel after the event. Questions are welcome via Twitter and G+.

All these sessions are Wednesdays at 10:30 – 11:30 am ET/ 14:30 – 15:30 UTC.

Mapping with Ketty and Ali: a School of Data Skillshare (April 9, 2014)

Making a basic map from spreadsheet data: We’ll explore tools like QGIS (a free and open-source Geographic Information System) and TileMill (a tool to design beautiful interactive web maps). Our guest trainers are Ketty Adoch and Ali Rebaie.

To join the Mapping with Ketty and Ali Session on April 9, 2014

Q & A with Open Knowledge Brazil Chapter featuring Everton (Tom) Zanella Alvarenga (April 16, 2014)

Around the world, local groups, Chapters, projects, working groups and individuals connect to Open Knowledge. We want to share your stories.

In this Community Session, we will feature Everton (Tom) Zanella Alvarenga, Executive Director.

Open Knowledge Foundation Brazil is a newish Chapter. Tom will share his experiences growing a chapter and community in Brazil. We aim to connect you to community members around the world. We will also open up the conversation to all things Community, so come and share your best practices.

Join us on April 16, 2014 via G+

Take a CKAN Tour (April 23, 2014)

This week we will give an overview and tour of CKAN – the leading open source open data platform used by the national governments of the US, UK, Brazil, Canada, Australia, France, Germany, Austria and many more. This session will cover why data portals are useful, what they provide and showcase examples and best practices from CKAN’s varied user base! Our special guest is Irina Bolychevsky, Services Director (Open Knowledge Foundation).

Learn and share your CKAN stories on April 23, 2014

(Note: We will share more details about the April 30th Open Access session soon!)

Building an archaeological project repository I: Open Science means Open Data

Guest - February 24, 2014 in CKAN, Open Science, WG Archaeology

This is a guest post by Anthony Beck, Honorary fellow, and Dave Harrison, Research fellow, at the University of Leeds School of Computing.

In 2010 we authored a series of blog posts for the Open Knowledge Foundation subtitled ‘How open approaches can empower archaeologists’. These discussed the DART project, which is on the cusp of concluding.

The DART project collected large amounts of data, and as part of the project, we created a purpose-built data repository to catalogue this and make it available, using CKAN, the Open Knowledge Foundation’s open-source data catalogue and repository. Here we revisit the need for Open Science in the light of the DART project. In a subsequent post we’ll look at why, with so many repositories of different kinds, we felt that to do Open Science successfully we needed to roll our own.

Open data can change science

Open inquiry is at the heart of the scientific enterprise. Publication of scientific theories – and of the experimental and observational data on which they are based – permits others to identify errors, to support, reject or refine theories and to reuse data for further understanding and knowledge. Science’s powerful capacity for self-correction comes from this openness to scrutiny and challenge. (The Royal Society, Science as an open enterprise, 2012)

The Royal Society’s report Science as an open enterprise identifies how 21st century communication technologies are changing the ways in which scientists conduct, and society engages with, science. The report recognises that ‘open’ enquiry is pivotal for the success of science, both in research and in society. This goes beyond open access to publications (Open Access), to include access to data and other research outputs (Open Data), and the process by which data is turned into knowledge (Open Science).

The underlying rationale of Open Data is this: unfettered access to large amounts of ‘raw’ data enables patterns of re-use and knowledge creation that were previously impossible. The creation of a rich, openly accessible corpus of data introduces a range of data-mining and visualisation challenges, which require multi-disciplinary collaboration across domains (within and outside academia) if their potential is to be realised. An important step towards this is creating frameworks which allow data to be effectively accessed and re-used. The prize for succeeding is improved knowledge-led policy and practice that transforms communities, practitioners, science and society.

The need for such frameworks will be most acute in disciplines with large amounts of data, a range of approaches to analysing the data, and broad cross-disciplinary links – so it was inevitable that they would prove important for our project, Detection of Archaeological residues using Remote sensing Techniques (DART).

DART: data-driven archaeology

DART aims to develop analytical methods to differentiate archaeological sediments from non-archaeological strata, on the basis of remotely detected phenomena (e.g. resistivity, apparent dielectric permittivity, crop growth, thermal properties etc). The data collected by DART is of relevance to a broad range of different communities. Open Science was adopted with two aims:

  • to maximise the research impact by placing the project data and the processing algorithms into the public sphere;
  • to build a community of researchers and other end-users around the data so that collaboration, and by extension research value, can be enhanced.

‘Contrast dynamics’, the type of data provided by DART, is critical for policy makers and curatorial managers to assess both the state and the rate of change in heritage landscapes, a need that is wrapped up in national commitments to the European Landscape Convention (ELC). Making the best use of the data, however, depends on openly accessible dynamic monitoring, along similar lines to that proposed by the European Space Agency for the Global Monitoring for Environment and Security (GMES) satellite constellations. What is required is an accessible framework which allows all this data to be integrated, processed and modelled in a timely manner. The approaches developed in DART to improve the understanding and enhance the modelling of heritage contrast detection dynamics feed directly into this long-term agenda.

Cross-disciplinary research and Open Science

Such approaches cannot be undertaken within a single domain of expertise. This vision can only be built by openly collaborating with other scientists and building on shared data, tools and techniques. Important developments will come from the GMES community, particularly from precision agriculture, soil science, and well documented data processing frameworks and services. At the same time, the information collected by projects like DART can be re-used easily by others. For example, DART data has been exploited by the Royal Agricultural University (RAU) for use in such applications as carbon sequestration in hedges, soil management, soil compaction and community mapping. Such openness also promotes collaboration: DART partners have been involved in a number of international grant proposals and have developed a longer term partnership with the RAU.

Open Science advocates opening access to data, and other scientific objects, at a much earlier stage in the research life-cycle than traditional approaches. Open Scientists argue that research synergy and serendipity occur through openly collaborating with other researchers (more eyes/minds looking at the problem). Of great importance is the fact that the scientific process itself is transparent and can be peer reviewed: as a result of exposing data and the processes by which these data are transformed into information, other researchers can replicate and validate the techniques. As a consequence, we believe that collaboration is enhanced and the boundaries between public, professional and amateur are blurred.

Challenges ahead for Open Science

Whilst DART has not achieved all its aims, it has made significant progress and has identified some barriers to achieving such open approaches. Key to this is the articulation of issues surrounding data-access (accreditation), licensing and ethics. Who gets access to data, when, and under what conditions, is a serious ethical issue for the heritage sector. These are obviously issues that need co-ordination through organisations like Research Councils UK with cross-cutting input from domain groups. The Arts and Humanities community produce data and outputs with pervasive social and ethical impact, and it is clearly important that they have a voice in these debates.

“Share, improve and reuse public sector data” – French Government unveils new CKAN-based data.gouv.fr

Guest - December 26, 2013 in CKAN, OKF France, Open Data, Open Government Data

This is a guest post from Rayna Stamboliyska and Pierre Chrzanowski of the Open Knowledge Foundation France

Etalab, the Prime Minister’s task force for Open Government Data, unveiled on December 18 the new version of the data.gouv.fr platform (1). OKF France salutes the work the Etalab team has accomplished, and welcomes the new features and the spirit of the new portal, rightly summed up in the website’s baseline, “share, improve and reuse public sector data”.

OKF France was represented at the data.gouv.fr launch event by Samuel Goëta in the presence of Jean-Marc Ayrault, Prime Minister of France, Fleur Pellerin, Minister Delegate for Small and Medium Enterprises, Innovation, and the Digital Economy and Marylise Lebranchu, Minister of the Reform of the State. Photo credit: Yves Malenfer/Matignon

Etalab has indeed chosen to offer a platform resolutely turned towards collaboration between data producers and re-users. The website now enables everyone not only to improve and enhance the data published by the government, but also to share their own data; to our knowledge, a world first for a governmental open data portal. In addition to “certified” data (i.e., released by departments and public authorities), data.gouv.fr also hosts data published by local authorities, delegated public services and NGOs. Last but not least, the platform also identifies and highlights other, pre-existing, Open Data portals such as nosdonnees.fr (2). A range of content publishing features, a wiki and the possibility of associating reuses such as visualizations should also allow for a better understanding of the available data and facilitate outreach efforts to the general public.

We at OKF France also welcome the technological choices Etalab made. The new data.gouv.fr is built around CKAN, the open source software whose development is coordinated by the Open Knowledge Foundation. All features developed by the Etalab team will be available for other CKAN-based portals (e.g., data.gov or data.gov.uk). In turn, Etalab may more easily master innovations implemented by others.

The new version of the platform clearly highlights the quality rather than quantity of datasets. This paradigm shift was expected by re-users. On one hand, datasets with local coverage have been pooled thus providing nation-wide coverage. On the other hand, the rating system values datasets with the widest geographical and temporal coverage as well as the highest granularity.

Screenshot from data.gouv.fr home page

The platform will continue to evolve and we hope that other features will soon complete this new version, for example:

  • the ability to browse data by facets (data producers, geographical coverage or license, etc.);
  • a management system for “certified” (clearly labelled institutional producer) and “non-certified” (data modified, produced, added by citizens) versions of a dataset;
  • a tool for previewing data, as natively proposed by CKAN;
  • the ability to comment on the datasets;
  • a tool that would allow users to enquire about a dataset directly with the relevant public administration.

Given this new version of data.gouv.fr, it is now up to the producers and re-users of public sector data to demonstrate the potential of Open Data. This potential can only be fully met with the release of fundamental public sector data as a founding principle for our society. Thus, we are still waiting for the opening of business registers, detailed expenditures as well as non-personal data on prescriptions issued by healthcare providers.

Lastly, through the new data.gouv.fr, administrations are no longer solely responsible for the common good that is public sector data. Now this responsibility is shared with all stakeholders. It is thus up to all of us to demonstrate that this is the right choice.


(1) This new version of data.gouv.fr is the result of codesign efforts that the Open Knowledge Foundation France participated in.

(2) Nosdonnees.fr is co-managed by Regards Citoyens and OKF France.

Read Etalab’s press release online here

2013 – A great year for CKAN

Darwin Peltan - December 24, 2013 in CKAN

2013 has seen CKAN and the CKAN community go from strength to strength. Here are some of the highlights.

Screenshot from CKAN demo site

February

May

June

July

August

  • CKAN 2.1 released with new capabilities for managing bulk datasets amongst many other improvements

September

October

  • Substantial new version of CKAN’s geospatial extension, including pycsw and MapBox integration and revised and expanded docs.

November

  • Future City Glasgow launch open.glasgow.gov.uk prototype as part of their TSB funded Future Cities Demonstrator programme

December

Looking forward

The CKAN community is growing incredibly quickly so we’re looking forward to seeing what people do with CKAN in 2014.

So if your city, region or state hasn’t already done so, why not make 2014 the year that you launch your own CKAN-powered open data portal?

Download CKAN or contact us if you need help getting started.

This post was cross posted from the CKAN blog
