This blog post was written by the members of the Sinar Project in Malaysia.
In Malaysia, Sinar Project, with the support of Open Knowledge International, organised a one-day data expedition to search for data related to government provision of health and education services. The expedition followed the guide from School of Data and brought together a group of people with diverse skills to formulate questions of public interest. The data sourced would then be analysed and visualised to provide answers.
Data Expedition
A data expedition is a quest to explore uncharted areas of data and report on those findings. The participants with different skillsets gathered throughout the day at the Sinar Project office. Together they explored data relating to schools and clinics to see what data and analysis methods are available to gain insights on the public service provision for education and health.
We used the guides and outlines for the data expedition from the School of Data website. The role-playing guides worked as a great icebreaker. There was healthy competition over who could draw the best giraffes among those wanting to prove their mettle as a designer for the team.
Deciding what to explore, education or health?
The storyteller in the team, who was a professional journalist, started out with a few questions to explore.
- Are there villages or towns which are far away from schools?
- Are there villages or towns which are far away from clinics and hospitals?
- What is the population density and provision of clinics and schools?
The scouts then went on a preliminary exploration to check whether this data exists.
Looking for the Lost City of Open Data
The Scouts, with the aid of the rest of the team, looked for data that could answer the questions. They found a lot of usable data from the Malaysian government open data portal data.gov.my. This data included lists of all public schools and clinics with addresses, as well as numbers of teachers for each district.
Given the time limitation, the team decided to focus on answering the questions using education data. Another priority was to find data on class sizes to see whether schools are overcrowded. Below you can see the data that the team found.
Education
Open Data
- List of schools in Malaysia with addresses Source: data.gov.my
- Number of Teachers for each education district (Selangor) Source: data.gov.my
- Bilangan PraSekolah or Pre-Schools in Terengganu Source: data.gov.my
- Direktori Pengendali Bas, Directory of School Bus Operators by registration numbers and by State Source: data.gov.my
- Keciciran Orang Asli (Indigenous school dropout statistics) Source: data.gov.my
Data in Reports
- Planning Report of Ministry of Education Document 2013 has some additional detailed stats, but not as open data
- UNESCO Country Profile
Definitions
Not all schools are created equal. There are different types, and some are designated as high-achieving schools, or Sekolah Berprestasi Tinggi.
Health
Open Data
- List of Klinik Kesihatan with full address Source: data.gov.my
- List of 1Malaysia Klinik with full address Source: data.gov.my
- Klinik Desa (Rural Clinics) with full address Source: data.gov.my
- Hospitals, with code, geolocation and address Source: data.gov.my
- List of Malaysian government dental clinics (incomplete, Johor only) Source: data.gov.my
- WHO Malaysian statistics
GIS
- 1MalaysiaMaps
- Selangor PBT (Local Council) Admin Boundary Source: Tindak Malaysia
- Selangor PAR (Parliament) Electoral Boundary Source: Tindak Malaysia
- Selangor DUN (State Assembly) Electoral Boundary Source: Tindak Malaysia
- POI – http://data.gov.my/view.php?view=189
Other Data
- CIDB Construction Projects: contains relevant information such as construction of schools and clinics
- Script to import into Elasticsearch
Budgets
Sinar Project had some state and federal budgets available as open data that could be used as an additional reference point. These were created as part of the Open Spending project.
Selangor State Government
http://data.sinarproject.org/dataset/selangor-state-government-2015-budget
Federal Government
Higher education
- https://docs.google.com/spreadsheets/d/1deOUIxWKWFeqPK51ioeHEE_rCiiNJ9UyZxCBZxw5Ab0/edit?usp=sharing
- http://www.treasury.gov.my/pdf/bajet/maklumat_bajet_kerajaan/2015/b64.pdf
- http://data.sinarproject.org/dataset/ministry-of-education-higher-education-budget-2015
Education
- https://docs.google.com/spreadsheets/d/1mVl0IEbOtwZHjSTg6OLRHjmtn_5sVrRzyTMztUQ6eDA/edit?usp=sharing
- http://www.treasury.gov.my/pdf/bajet/maklumat_bajet_kerajaan/2015/b63.pdf
- http://data.sinarproject.org/dataset/ministry-of-education-education-budget-2015
Methodology
The team opted to focus on the available datasets to answer questions about education provision, first by geocoding all school addresses and then by joining up the data to examine the relationship between enrollment, schools and teacher ratios.
Joining up data
To join up the different datasets, such as teacher numbers and schools, we used the VLOOKUP function in Excel, matching rows by school code.
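For readers who prefer a script to a spreadsheet, the same kind of join can be sketched in Python using only the standard library. This is an illustration, not the workflow the team used: the column names `KodSekolah`, `NamaSekolah` and `BilanganGuru` and the sample rows are made up, and the real datasets may use different headers.

```python
import csv
import io


def join_by_code(schools, teachers, key="KodSekolah"):
    """Left-join teacher rows onto school rows by a shared school code,
    roughly what VLOOKUP does one column at a time in Excel."""
    # Index the lookup table by key; the first match wins, as with VLOOKUP.
    lookup = {}
    for row in teachers:
        lookup.setdefault(row[key].strip(), row)

    joined = []
    for school in schools:
        match = lookup.get(school[key].strip(), {})
        merged = dict(school)
        for col, value in match.items():
            if col != key:
                merged[col] = value
        joined.append(merged)
    return joined


# Made-up sample data with hypothetical column names, for illustration only.
SCHOOLS_CSV = """KodSekolah,NamaSekolah
AAA0001,Sekolah Contoh Satu
AAA0002,Sekolah Contoh Dua
"""
TEACHERS_CSV = """KodSekolah,BilanganGuru
AAA0001,40
"""

schools = list(csv.DictReader(io.StringIO(SCHOOLS_CSV)))
teachers = list(csv.DictReader(io.StringIO(TEACHERS_CSV)))

for row in join_by_code(schools, teachers):
    print(row)
```

Schools without a matching code keep their own columns and simply lack the teacher columns, which makes gaps in the source data easy to spot.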
Converting Address to geolocation (latlong)
To convert street addresses to latitude and longitude coordinates, we ran the dataset of cleansed addresses through the geocoding tool csvgeocode:
./node_modules/.bin/csvgeocode ./input.csv ./output.csv --url "https://maps.googleapis.com/maps/api/geocode/json?address={{Alamat}}&key=" --verbose
Convert the completed CSV to GeoJSON points
Use the csv2geojson tool:
csv2geojson --lat "Lat" --lon "Lng" Selangor_Joined_Up_Moe.csv
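To show what this conversion produces, here is a minimal Python sketch of the same idea: each CSV row with coordinates becomes a GeoJSON Point feature. It is not how csv2geojson is implemented. The `Lat` and `Lng` column names come from the command above, while the sample rows and the `Nama` column are made up.

```python
import csv
import io
import json


def rows_to_geojson(rows, lat_col="Lat", lon_col="Lng"):
    """Build a GeoJSON FeatureCollection of points from CSV-style rows."""
    features = []
    for row in rows:
        try:
            lat = float(row[lat_col])
            lon = float(row[lon_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows that were not geocoded successfully
        # Everything except the coordinate columns is kept as a property.
        props = {k: v for k, v in row.items() if k not in (lat_col, lon_col)}
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up rows for illustration; the second one has no coordinates.
SAMPLE_CSV = """Nama,Alamat,Lat,Lng
Sekolah Contoh Satu,Alamat contoh 1,3.0,101.5
Sekolah Contoh Dua,Alamat contoh 2,,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(json.dumps(rows_to_geojson(rows), indent=2))
```

Skipping rows without coordinates, rather than failing, makes it easier to count how many addresses the geocoder could not resolve.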
To get population by PBT
Use the socio-economic data from the state economic planning unit's website, specifically the section Jadual 8 (Table 8).
To get all the schools separated by individual PBT (local council)
Load the GeoJSON of schools data and the PBT boundary into QGIS, then use Vector > Geo-processing > Intersect.
A post from Stack Exchange suggests it might be better to use Vector > Spatial Query > Spatial Query option.
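Conceptually, both options test which boundary polygon each school point falls inside. The sketch below shows that test with a simple ray-casting check in plain Python. It is only an illustration: it handles a single outer ring per area, ignores holes and multi-part polygons, and the area names and coordinates are made up. QGIS remains the better tool for real boundary data.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the ring of [lon, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Does the edge from vertex j to vertex i straddle the point's latitude?
        if (yi > lat) != (yj > lat):
            # Longitude at which the edge crosses that latitude.
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def group_by_area(points, areas):
    """Assign each point to the first area whose ring contains it."""
    grouped = {name: [] for name in areas}
    grouped[None] = []  # points that fall outside every area
    for point in points:
        for name, ring in areas.items():
            if point_in_ring(point["lon"], point["lat"], ring):
                grouped[name].append(point["name"])
                break
        else:
            grouped[None].append(point["name"])
    return grouped


# Made-up square "councils" and school points, for illustration only.
areas = {
    "Council A": [[0, 0], [2, 0], [2, 2], [0, 2], [0, 0]],
    "Council B": [[2, 0], [4, 0], [4, 2], [2, 2], [2, 0]],
}
points = [
    {"name": "School 1", "lon": 1.0, "lat": 1.0},
    {"name": "School 2", "lon": 3.0, "lat": 1.0},
    {"name": "School 3", "lon": 5.0, "lat": 1.0},
]
print(group_by_area(points, areas))
```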
Open Datasets Generated
The cleansed and joined-up datasets created during this expedition are available on GitHub. While the focus was on education, the available clinic data was similar enough that the same methods were applied to clinics as well. See it on our repository – https://github.com/Sinar/SinarODD2016
Visualizations
All Primary and Secondary Schools on a Map with Google Fusion Tables
https://www.google.com/fusiontables/DataSource?docid=1lVyjIIEm_McqmiSEfQY5vecrhqRjmaJ1wzdiEo1q#map:id=7
Teacher-to-student ratios per school
https://www.google.com/fusiontables/DataSource?docid=18ieB8OqzpK3Ch9KcD4BiiADdmk8SXnS0x_IINxHc#map:id=3
Discovery
- Teachers vs enrollment did not provide data relating to class size or overcrowding
- Demographic datasets are needed to compare the number of schools with the eligible population
- More school datasets are required on teachers, specifically by subject and class ratios
- Methods used for location of schools can also be applied to clinics & hospital data
We discovered that additional data was needed to provide useful information on the quality of education. There was not enough demographic data to check against the number of schools in a particular district. The teacher-to-student ratio was also not a good indicator of the problems reported in the news. The teacher-to-enrollment ratios were generally very low, with a mean of 13 and a median of 14. What was needed for better insight was the ratio of teachers by subject, by class size, or against the population of eligible children in each area.
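A summary of this kind can be reproduced from a joined dataset with a few lines of Python. The sketch below assumes one row per school with hypothetical column names `Enrolmen` and `BilanganGuru`, and the numbers are made up. It is not the calculation behind the figures above.

```python
from statistics import mean, median


def students_per_teacher(rows, enrol_col="Enrolmen", teacher_col="BilanganGuru"):
    """Return the enrollment-to-teacher ratio for each school row."""
    ratios = []
    for row in rows:
        teachers = int(row[teacher_col])
        if teachers <= 0:
            continue  # avoid dividing by zero on incomplete records
        ratios.append(int(row[enrol_col]) / teachers)
    return ratios


# Made-up sample rows with hypothetical column names.
sample = [
    {"Enrolmen": "300", "BilanganGuru": "20"},
    {"Enrolmen": "120", "BilanganGuru": "10"},
    {"Enrolmen": "450", "BilanganGuru": "50"},
]

ratios = students_per_teacher(sample)
print("mean:", mean(ratios))
print("median:", median(ratios))
```

A whole-school ratio like this counts every teacher regardless of subject or timetable, which is one reason it says little about actual class sizes.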
We also considered automatically calculating the distance from a given point to the nearest school and matching that against whether school bus operators serve the area. This was discussed because distance from schools may not be a meaningful measure in rural areas, where there are not enough children to warrant a school within the distance set by policy. A tool to check the distance from a point to the nearest school could be built with the data made available. Civil society could use this as evidence that the distance is too far, or that transport is not provided, for some communities.
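As a rough illustration of such a tool, the sketch below finds the nearest school to a point using the haversine great-circle formula. It is an assumption about how the tool might work, not something built during the expedition. The coordinates are made up, and straight-line distance understates real travel distance by road.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_school(lat, lon, schools):
    """Return (distance_km, school) for the closest school in the list.

    Raises ValueError if the list of schools is empty.
    """
    return min(
        ((haversine_km(lat, lon, s["lat"], s["lon"]), s) for s in schools),
        key=lambda pair: pair[0],
    )


# Made-up school locations, for illustration only.
schools = [
    {"name": "Sekolah Contoh Satu", "lat": 3.00, "lon": 101.50},
    {"name": "Sekolah Contoh Dua", "lat": 3.20, "lon": 101.70},
]

distance, school = nearest_school(3.05, 101.52, schools)
print(f"{school['name']} is about {distance:.1f} km away")
```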
Demographic data was found for local councils. Combined with local council boundary data, researchers could use it to assess whether there are enough schools for the population of each local council. Interestingly, in Malaysia education is under the Federal government, and despite there being state and local education departments, their administrative boundaries do not match up with local council boundaries or electoral boundaries. This is a planning coordination challenge for policy makers. Administrative local council boundary data was made available as open data thanks to the efforts of another civil society group, Tindak Malaysia, which scanned and digitized the electoral and administrative boundaries manually.
Running future expeditions
This was a one-day expedition, so time was limited. From running these brief expeditions we learned the following:
- Focus and narrow the expedition down to a specific issue
- Be better prepared: scout for available datasets beforehand and determine the topic
- Focus on a central repository or wiki of available data
Thank you to all of the wonderful contributors to the data expedition:
- Lim Hui Ying (Storyteller)
- Haris Subandie (Engineer)
- Jack Khor (Designer)
- Chow Chee Leong (Analyst)
- Donaldson Tan (Engineer)
- Michael Leow (Engineer)
- Sze Ming (Designer)
- Swee Meng (Engineer)
- Hazwany (Nany) Jamaluddin (Analyst)
- Loo (Scout)