International Open Data Day

This blog post was written by the members of the Sinar Project in Malaysia.

In Malaysia, Sinar Project, with the support of Open Knowledge International, organised a one-day data expedition to search for data related to government provision of health and education services. The expedition followed the guide from School of Data and brought together a group of people with diverse skills to formulate questions of public interest. The data sourced would then be analysed and visualised to provide answers.

Data Expedition

School of Data D&D Character Sheet

A data expedition is a quest to explore uncharted areas of data and report on the findings. Participants with different skill sets gathered at the Sinar Project office for the day. Together they explored data relating to schools and clinics to see what data and analysis methods were available to gain insight into public service provision for education and health.

We used the guides and outlines for the data expedition from the School of Data website. The role-playing guides worked as a great icebreaker. There was healthy competition over who could draw the best giraffe among those wanting to prove their mettle as the team's designer.

 

 


Deciding what to explore, education or health?

The storyteller in the team, who was a professional journalist, started out with a few questions to explore.

  • Are there villages or towns which are far away from schools?
  • Are there villages or towns which are far away from clinics and hospitals?
  • What is the population density and provision of clinics and schools?

The scouts then went on a preliminary exploration to check whether this data existed.

Looking for the Lost City of Open Data

The Scouts, with the aid of the rest of the team, looked for data that could answer the questions. They found a lot of usable data from the Malaysian government open data portal data.gov.my. This data included lists of all public schools and clinics with addresses, as well as numbers of teachers for each district.

Given the limited time, the team decided to focus on answering the questions on education data. Another priority was to find data relating to class sizes to see whether schools are overcrowded. Below you can see the data that the team found.

School of Data D&D Character Sheet 2

Education

Open Data

Data in Reports

 

Definitions

Not all schools are created equal. There are different types, and some are considered high achieving schools, or Sekolah Berprestasi Tinggi.

Health

Open Data

GIS

 

Other Data

CIDB Construction Projects contains relevant information, such as the construction of schools and clinics.

Script to import into Elasticsearch

Budgets

Sinar Project had some state and federal budgets available as open data that could be used as an additional reference point. These were created as part of the Open Spending project.

Selangor State Government

http://data.sinarproject.org/dataset/selangor-state-government-2015-budget

Federal Government

Higher education
Education

Participants

Methodology

The team opted to focus on the available datasets to answer questions about education provision, first by geocoding all school addresses, and then by joining up data to examine the relationship between enrollment, schools and teacher ratios.

Joining up data

To join up the different datasets, such as teacher numbers and schools, we used the VLOOKUP function in Excel, matching rows by school code.
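For readers who prefer a script to a spreadsheet, the same kind of lookup join can be sketched in Python. This is only an illustration, not the method used on the day. The column names `KodSekolah`, `NamaSekolah` and `Guru`, and the sample rows, are hypothetical and will differ from the real datasets.

```python
import csv
import io


def join_by_code(schools, teachers, key="KodSekolah"):
    """Attach teacher columns to each school row, matching on the school code."""
    lookup = {}
    for row in teachers:
        # Like VLOOKUP, keep the first row found for each code.
        lookup.setdefault(row[key].strip(), row)
    extra_columns = [c for c in (teachers[0].keys() if teachers else []) if c != key]

    joined = []
    for school in schools:
        match = lookup.get(school[key].strip())
        merged = dict(school)
        for column in extra_columns:
            # Leave the cell empty when no matching school code exists.
            merged[column] = match[column] if match else ""
        joined.append(merged)
    return joined


# Hypothetical sample data standing in for the real CSV files.
schools_csv = "KodSekolah,NamaSekolah\nABC0001,Sekolah Contoh A\nABC0002,Sekolah Contoh B\n"
teachers_csv = "KodSekolah,Guru\nABC0001,25\n"

schools = list(csv.DictReader(io.StringIO(schools_csv)))
teachers = list(csv.DictReader(io.StringIO(teachers_csv)))
joined = join_by_code(schools, teachers)
```

Schools without a match keep an empty cell instead of an error value, which makes the gaps in the source data easy to count afterwards.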

Converting Address to geolocation (latlong)

To convert street addresses to latitude and longitude coordinates, we ran the dataset of cleansed addresses through the geocoding tool csvgeocode:

./node_modules/.bin/csvgeocode ./input.csv ./output.csv --url "https://maps.googleapis.com/maps/api/geocode/json?address={{Alamat}}&key=" --verbose

Convert the completed CSV to GeoJSON points

Use csv2geojson:

csv2geojson --lat "Lat" --lon "Lng" Selangor_Joined_Up_Moe.csv
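To show roughly what this conversion produces, here is a minimal Python sketch that turns rows with `Lat` and `Lng` columns into a GeoJSON FeatureCollection of points. It is not the csv2geojson implementation. The `Nama` column and the sample rows are hypothetical, and dropping the coordinate columns from the properties is a choice made for this sketch.

```python
import csv
import io
import json


def rows_to_geojson(rows, lat_field="Lat", lon_field="Lng"):
    """Build a GeoJSON FeatureCollection of points from dictionary rows."""
    features = []
    for row in rows:
        try:
            lat = float(row[lat_field])
            lon = float(row[lon_field])
        except (KeyError, TypeError, ValueError):
            # Skip rows that the geocoder could not resolve.
            continue
        properties = {k: v for k, v in row.items() if k not in (lat_field, lon_field)}
        features.append({
            "type": "Feature",
            # GeoJSON stores coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": properties,
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample rows; the second one has no coordinates.
sample_csv = "Nama,Lat,Lng\nSekolah Contoh,3.07,101.52\nTiada Koordinat,,\n"
collection = rows_to_geojson(csv.DictReader(io.StringIO(sample_csv)))
geojson_text = json.dumps(collection, indent=2)
```

The longitude-first coordinate order is the detail most likely to trip people up when checking the output on a map.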

To get population by PBT

Use the socio-economic data from the state economic planning unit's site, specifically the section Jadual 8.

To get all the schools separated by individual PBT (District)

Load the GeoJSON of the schools data and the PBT boundaries into QGIS, then use Vector > Geo-processing > Intersect.

A post on Stack Exchange suggests it might be better to use the Vector > Spatial Query > Spatial Query option.
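The idea behind this step is a point-in-polygon test: each school point is checked against each boundary polygon. The sketch below uses the classic ray casting technique in plain Python rather than QGIS. It assumes simple polygons without holes and treats coordinates as planar, and the district names and square boundaries are hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: count how often a ray going east from the point crosses the ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_districts(points, districts):
    """Map each point name to the first district whose boundary contains it."""
    result = {}
    for name, (lon, lat) in points.items():
        result[name] = next(
            (d for d, ring in districts.items() if point_in_polygon(lon, lat, ring)),
            None,  # the point falls outside every boundary
        )
    return result


# Hypothetical square boundaries and school locations, as (lon, lat) pairs.
districts = {
    "PBT A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "PBT B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = {"School 1": (0.5, 0.5), "School 2": (1.5, 0.5), "School 3": (3, 3)}
assigned = assign_districts(points, districts)
```

For real boundary files with holes or multi-part polygons, a GIS tool such as QGIS remains the safer choice.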

Open Datasets Generated

The cleansed and joined-up datasets created during this expedition are available on GitHub. While the focus was on education, the available clinic data was similar enough that the same methods were applied to clinics as well. See our repository – https://github.com/Sinar/SinarODD2016

Visualizations

All Primary and Secondary Schools on a Map with Google Fusion Tables

https://www.google.com/fusiontables/DataSource?docid=1lVyjIIEm_McqmiSEfQY5vecrhqRjmaJ1wzdiEo1q#map:id=7

Teacher to Students per school ratios

https://www.google.com/fusiontables/DataSource?docid=18ieB8OqzpK3Ch9KcD4BiiADdmk8SXnS0x_IINxHc#map:id=3

 

Discovery

  • Teachers vs enrollment did not provide data relating to class size or overcrowding
  • Demographic datasets to measure schools to eligible population
  • More school datasets required for teachers, specifically by subject and class ratios
  • Methods used for location of schools can also be applied to clinics & hospital data

We found that additional data was needed to say anything useful about the quality of education. There was not enough demographic data to check against the number of schools in a particular district. The teacher-to-student ratio was also not a good indicator of the problems reported in the news: the teacher-to-enrollment ratios were generally very low, with a mean of 13 and a median of 14. What was needed for better insight was the ratio by subject teacher, by class size, or against the population of eligible children in each area.
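As an illustration of how such summary figures can be computed, the sketch below derives a per-school ratio and its mean and median. It assumes the ratio means students per teacher, which the datasets would need to confirm. The school codes and numbers are hypothetical and are not the expedition's data.

```python
from statistics import mean, median


def students_per_teacher(rows):
    """Return enrollment divided by teacher count for each school code."""
    ratios = {}
    for code, enrollment, teachers in rows:
        # Skip schools with no teachers recorded to avoid dividing by zero.
        if teachers > 0:
            ratios[code] = enrollment / teachers
    return ratios


# Hypothetical rows: (school code, enrollment, number of teachers).
rows = [("ABC0001", 300, 25), ("ABC0002", 450, 30), ("ABC0003", 120, 0)]
ratios = students_per_teacher(rows)
ratio_mean = mean(ratios.values())
ratio_median = median(ratios.values())
```

A school-wide average like this hides variation between subjects and classes, which is why it said little about overcrowding.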

We also considered automatically calculating the distance from a given point to the nearest school, and matching this up with whether school bus operators serve the area. This came up because distance from schools may not be a meaningful measure in rural areas, where there are not enough children to warrant a school within the distance set by policy. A tool to check the distance from a point to the nearest school could be built with the data now available. Civil society could use it as evidence that the distance is too far, or that transport is not provided, for some communities.
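A minimal sketch of such a distance check, using the haversine formula for great-circle distance, might look like the following. No such tool was built during the expedition. The school names and coordinates are hypothetical, and straight-line distance is only a rough proxy for actual travel distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_school(lat, lon, schools):
    """Return (name, distance_km) for the closest school; schools must not be empty."""
    return min(
        ((name, haversine_km(lat, lon, s_lat, s_lon)) for name, s_lat, s_lon in schools),
        key=lambda pair: pair[1],
    )


# Hypothetical schools as (name, latitude, longitude).
schools = [("Sekolah Contoh A", 3.01, 101.5), ("Sekolah Contoh B", 3.5, 101.5)]
name, distance_km = nearest_school(3.0, 101.5, schools)
```

Comparing the returned distance against a policy threshold would flag the communities worth a closer look.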

Demographic data was found for local councils. Together with local council boundary data, researchers could use it to check whether there are enough schools for the population of each council. Interestingly, education in Malaysia is under the Federal government and, despite there being state and local education departments, their administrative boundaries do not match up with local council boundaries or electoral boundaries. This is a planning coordination challenge for policy makers. Administrative local council boundary data was made available as open data thanks to the efforts of another civil society group, Tindak Malaysia, which scanned and digitized the electoral and administrative boundaries manually.

Running future expeditions

This was a one-day expedition, so time was limited. From running this brief expedition, we learned the following:

  • Focus and narrow down expedition to specific issue
  • Be better prepared, scout for available datasets beforehand and determine topic
  • Focus on central repository or wiki of available data

Thank you to all of the wonderful contributors to the data expedition:

  • Lim Hui Ying (Storyteller)
  • Haris Subandie (Engineer)
  • Jack Khor (Designer)
  • Chow Chee Leong (Analyst)
  • Donaldson Tan (Engineer)
  • Michael Leow (Engineer)
  • Sze Ming (Designer)
  • Swee Meng (Engineer)
  • Hazwany (Nany) Jamaluddin (Analyst)
  • Loo (Scout)

360Giving Data Lab and Learning Manager, ex OKF International Community Coordinator