Humans have an innate instinct to connect with nature and other living beings (1). One such connection is through watching birds in their habitat. Indeed, birdwatching is a popular hobby around the world (2). Some people, including the author, have a stronger need to connect with nature than others. This need can influence their buying decisions, such as choosing a place of residence. For birdwatching enthusiasts, factors that enhance the birdwatching experience in an area, such as bird abundance and diversity, can affect whether they decide to live there.
Singapore is a country in Southeast Asia. It is a well-developed island city-state with numerous venues and facilities, such as restaurants, medical centers, malls, schools, parks, and transit systems, spread across the country. It was chosen as the subject area of this study for four reasons: its bird observation data is extensive, the area is neither too large nor too small, the author has visited the city several times, and the Foursquare API data for the area (which this study used) appeared sufficiently up to date.
The goal of this study is to use machine learning to help birdwatching enthusiasts choose an area of Singapore to live in that maximizes the birdwatching experience while also offering important venues nearby.
Two datasets were used in this study. The first is Singapore bird observation data obtained from the Global Biodiversity Information Facility (GBIF) (3). This dataset contains 423211 rows of Singapore bird observations from 1800 to 2020. The second dataset is the venue data from the Foursquare API.
A query was made in GBIF for all available bird records from Singapore. The resulting dataset consists of 423211 rows and 249 columns.
Data preprocessing
Unused columns and rows with missing data were then removed from the acquired dataset using Microsoft Excel. The resulting dataset contained 401607 rows and 4 columns: latitude, longitude, species name, and species ID.
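The study performed this cleaning in Microsoft Excel. For readers who prefer a script, the same two steps (keep four columns, drop incomplete rows) could be sketched in Python as below. The column names `decimalLatitude`, `decimalLongitude`, `species`, and `speciesKey` and the tab delimiter are assumptions based on GBIF-style downloads; the sample rows and IDs are made up for illustration.

```python
import csv
import io

# Assumed GBIF-style column names; the study only names the four retained fields.
KEEP = ["decimalLatitude", "decimalLongitude", "species", "speciesKey"]

def clean_observations(handle, delimiter="\t"):
    """Keep only the KEEP columns and drop rows where any of them is empty."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    cleaned = []
    for row in reader:
        values = [(row.get(col) or "").strip() for col in KEEP]
        if all(values):
            lat, lon, name, key = values
            cleaned.append({
                "latitude": float(lat),
                "longitude": float(lon),
                "species": name,
                "species_id": key,
            })
    return cleaned

# Made-up sample: row 2 lacks coordinates, row 3 lacks species information.
sample = (
    "gbifID\tspecies\tdecimalLatitude\tdecimalLongitude\tspeciesKey\n"
    "1\tPycnonotus goiavier\t1.3521\t103.8198\t1001\n"
    "2\tCorvus splendens\t\t\t1002\n"
    "3\t\t1.2900\t103.8500\t\n"
)
rows = clean_observations(io.StringIO(sample))
print(len(rows), rows[0]["species"])
```

Only the first sample row survives, because the other two are missing at least one of the four required fields.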
To convert the data points into a more manageable form, clustering was performed on the latitude and longitude features of the data. The result was 40 clusters, which we called sectors, numbered from sector 0 to sector 39. The number 40 was chosen because the resulting clusters seemed to be the right size for exploring on foot or with a short bus trip. For each sector, a center was determined.
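To make the sector step concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on latitude/longitude pairs, written with the standard library only. It is an illustration, not the study's script; a library implementation such as scikit-learn's `KMeans` would be the usual choice. The function name `kmeans`, the seed, and the sample coordinates are assumptions, and the toy data uses k = 2 rather than the study's 40. Plain Euclidean distance on degrees is a reasonable approximation here because Singapore lies close to the equator.

```python
import random

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm on (lat, lon) pairs; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random distinct points as starting centers
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels

# Made-up observation coordinates forming two well-separated groups.
points = [(1.30, 103.80), (1.31, 103.81), (1.30, 103.82),
          (1.40, 103.95), (1.41, 103.96), (1.40, 103.97)]
centers, labels = kmeans(points, k=2)
print(labels)
```

Each returned center plays the role of a sector center, and each label is the sector an observation belongs to.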
For each of the 40 sector centers, 3 calls were made to the Foursquare API. Each call requested venues of certain categories within a 1 km radius of the center. The first call requested venues in the category ‘Food’, which contains restaurants, cafes, bars, etc., and asked for 10 results. The second and third calls requested venues in categories that were considered important and desirable for a person choosing to live in the area: residence area, transit system, medical center, convenience store, mall, park, school, university, and nature reserve. The second call requested 50 results, while the third requested 40. So for each of the 40 sector centers, the three calls returned at most 100 venues within 1 km of the center.
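The three-call plan can be sketched as request parameters without actually contacting the API. The sketch assumes the legacy Foursquare v2 `venues/search` endpoint and its `ll`, `radius`, `categoryId`, `limit`, and `v` parameters; the study does not state which endpoint was used or how the categories were split between the second and third calls. The category ID strings and credentials below are placeholders, and `build_requests` is a hypothetical helper.

```python
# Assumed endpoint (legacy Foursquare v2); not confirmed by the study.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"

# (label, category IDs, result limit). The ID strings are placeholders, not real IDs.
CALL_PLAN = [
    ("food", "FOOD_CATEGORY_ID", 10),
    ("amenities_1", "AMENITY_CATEGORY_IDS_1", 50),
    ("amenities_2", "AMENITY_CATEGORY_IDS_2", 40),
]

def build_requests(lat, lon, client_id, client_secret, version="20210101", radius_m=1000):
    """Return one request description per planned call for a single sector center."""
    planned = []
    for label, category_ids, limit in CALL_PLAN:
        planned.append({
            "label": label,
            "url": SEARCH_URL,
            "params": {
                "client_id": client_id,
                "client_secret": client_secret,
                "v": version,            # API version date, YYYYMMDD
                "ll": f"{lat},{lon}",    # sector center
                "radius": radius_m,      # 1 km search radius, in meters
                "categoryId": category_ids,
                "limit": limit,
            },
        })
    return planned

# Sector 23's center, as reported in the study.
plan = build_requests(1.377628, 103.949707, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
total_limit = sum(call["params"]["limit"] for call in plan)
print(len(plan), total_limit)
```

Each entry could then be sent with an HTTP client of choice; the limits of 10, 50, and 40 add up to the 100-venue ceiling per sector.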
The top 10 most popular venue types were calculated for each sector. The species data were aggregated to obtain the number of unique species in each sector. To prepare for cluster analysis, the venue types were converted into numeric features with one-hot encoding, and the encoded values were averaged per sector.
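Averaging one-hot columns per sector is the same as computing each venue type's share of the sector's venues, which the sketch below does directly with the standard library. In pandas, the equivalent is typically `pd.get_dummies` followed by a `groupby(...).mean()`. All function names and sample values here are illustrative assumptions, not taken from the study's scripts.

```python
from collections import Counter, defaultdict

def venue_frequencies(venues):
    """venues: iterable of (sector, venue_type). Returns {sector: {type: share}}.
    The share equals the per-sector mean of that type's one-hot column."""
    counts = defaultdict(Counter)
    for sector, venue_type in venues:
        counts[sector][venue_type] += 1
    return {
        sector: {t: n / sum(c.values()) for t, n in c.items()}
        for sector, c in counts.items()
    }

def top_types(freqs, n):
    """Most common venue types per sector; ties are broken alphabetically."""
    return {
        sector: [t for t, _ in sorted(f.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for sector, f in freqs.items()
    }

def species_counts(observations):
    """observations: iterable of (sector, species). Returns unique species per sector."""
    seen = defaultdict(set)
    for sector, species in observations:
        seen[sector].add(species)
    return {sector: len(names) for sector, names in seen.items()}

# Made-up sample data for two sectors.
venues = [(0, "Park"), (0, "Park"), (0, "Cafe"), (0, "Bus Stop"),
          (1, "Mall"), (1, "Cafe")]
observations = [(0, "Species A"), (0, "Species A"), (0, "Species B"), (1, "Species A")]

freqs = venue_frequencies(venues)
top = top_types(freqs, 2)
richness = species_counts(observations)
print(freqs[0]["Park"], top[0], richness)
```

The frequency vectors are the features used for clustering, while the top-type lists and species counts are kept for describing each sector afterwards.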
Cluster analysis was performed on the prepared data using the k-means clustering algorithm. To determine the optimum value of k, the Yellowbrick library was used, which produces the elbow method visualization quickly. The cluster labels obtained with the optimum k were then merged with the species counts for each sector and with the sector data containing the top 5 most popular venue types for each sector.
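The elbow method plots the within-cluster sum of squares (inertia) against k and looks for the point where further increases in k stop paying off. The study used Yellowbrick for this; the sketch below instead computes the inertia values directly with the standard library, so it shows the idea rather than the study's code. It swaps in a deterministic farthest-first seeding so that the result is reproducible; the function names and the toy points are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_inertia(points, k, iterations=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        updated = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if updated == centers:
            break  # converged
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Toy data with three obvious groups; real feature vectors would have more dimensions.
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (0, 10), (0, 11), (1, 10)]
inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the inertia falls steeply up to k = 3 and only slightly afterwards, which is the elbow. Real data often gives a smoother curve, where the choice of k is more of a judgment call.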
The final clustering result was then mapped using Tableau and Folium.
The Singapore bird observation data used in this study consists of 401067 rows and is plotted as follows:
The k-means clustering algorithm was applied to the above Singapore bird observation data with 40 centroids. The result is plotted as follows:
The centroids were then assigned as sector centers. The sectors are plotted on a map with Folium as follows:
After collecting the venue and species count data for each sector, we ran the k-means clustering algorithm again, this time on the sector data. Before that, 3 sectors were omitted because they had no venues nearby.
The elbow method suggested 13 as the optimum value of k. However, because the curve is smooth and lacks a sharp elbow, some of the resulting clusters are expected to overlap considerably with one another.
The k-means algorithm was then applied again to the sectors, producing the clusters that were mapped as follows:
Figure 5. Map of the sector centers grouped into clusters by the k-means algorithm.
The table for the resulting clustering of sectors is as follows:
Clustering the Singapore area based on bird observation data and nearby venues produced 13 clusters. Based on the features of each cluster, the clusters can be named for their characteristics as follows.
Cluster 0 : Zoo
Cluster 1 : Food
Cluster 2 : Residence 1
Cluster 3 : Mall
Cluster 4 : Food & Nature Reserve
Cluster 5 : Hospital
Cluster 6 : Bus Transit
Cluster 7 : Coast 1
Cluster 8 : Park
Cluster 9 : Theme Park
Cluster 10 : Residence 2
Cluster 11 : College
Cluster 12 : Coast 2
Figure 7. Maps of bird species diversity / venue types with cluster names.
The cluster that contains the most sectors is Residence 1 (10 sectors). Clusters Residence 2, College, Food & Nature Reserve, and Food have the second most sectors with 4 sectors each.
There are multiple approaches to using the resulting information to decide which area to choose. One way is simply to sort all sectors by species count and then pick, among the top sectors, those with the most favorable venues. In this discussion, we instead first choose the clusters we would like to live in, then sort their sectors by bird species count, and finally select a few sectors for closer consideration.
First, we choose the Residence 1 and Residence 2 clusters. These clusters are dominated by residential areas. The assumption is that the more extensive the residential areas are, the lower house prices will be because of competition. We therefore choose the Residence clusters because they may offer more alternatives in terms of price range, and overall prices may be lower than in less dense areas.
The final selection of an area to live in depends on the individual user. For this discussion, we use the author’s preferences. Based on the selected clusters, the sorted bird species counts, and the features of each sector, we choose sector 23 and sector 16. The considerations are:
• These sectors have relatively high bird species diversity
• Residential areas are dominant in these sectors, so housing prices may be lower
• These sectors have desirable venues, namely mall, convenience store, clinic, bus transit, food, park
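The selection approach described above (filter by preferred clusters, require certain venue types, then rank by species count) can be sketched as a small function. The sector IDs, species counts, and venue sets below are made up purely for illustration and are not the study's results; `shortlist` is a hypothetical helper name.

```python
def shortlist(sectors, wanted_clusters, required_venues, top_n=2):
    """Keep sectors in the wanted clusters that offer all required venue types,
    ranked by unique species count (highest first)."""
    candidates = [
        s for s in sectors
        if s["cluster"] in wanted_clusters and required_venues <= s["venues"]
    ]
    return sorted(candidates, key=lambda s: s["species"], reverse=True)[:top_n]

# Made-up rows; real values would come from the merged sector table.
sectors = [
    {"sector": "A", "cluster": "Residence 1", "species": 120, "venues": {"Mall", "Park", "Bus Stop"}},
    {"sector": "B", "cluster": "Mall",        "species": 200, "venues": {"Mall", "Food"}},
    {"sector": "C", "cluster": "Residence 2", "species": 150, "venues": {"Park"}},
    {"sector": "D", "cluster": "Residence 1", "species": 90,  "venues": {"Mall", "Park"}},
    {"sector": "E", "cluster": "Residence 2", "species": 110, "venues": {"Mall", "Park", "Clinic"}},
]
picked = shortlist(sectors, {"Residence 1", "Residence 2"}, {"Mall", "Park"})
print([s["sector"] for s in picked])  # ['A', 'E']
```

Sector B is excluded by cluster and sector C by its missing mall, which mirrors how personal preferences narrow the candidates before species diversity decides the ranking.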
A further check in Google Maps shows that sector 23 (lat/lon: 1.377628/103.949707) is in the Pasir Ris area on the northeast coast of Singapore island, near Changi Airport. Sector 16 (lat/lon: 1.275609/103.811885) is in the Bukit Merah area in the southern part of Singapore. So, based on the analysis in this study, the author’s choices for a place of residence that maximizes the birdwatching experience and has desirable venues nearby are:
• Primary: Pasir Ris area
• Secondary: Bukit Merah area
Pasir Ris is also the author’s preferred choice of residence because it offers quicker access to Changi Airport whenever he wants to travel home to Jakarta, Indonesia.
In this study we analyzed Singapore bird observation data and Foursquare API venue data to produce insights that may help birdwatchers find the best area to live in Singapore. The insights may point birdwatchers to areas that combine a rich birdwatching experience with desirable venues. The results of the study are maps of 13 clusters comprising 37 sectors and a table listing the most common venues for each sector. The final decision will depend on each user’s needs, the features they find desirable in an area, and other information that is not captured in the datasets. However, the analysis, maps, and table may serve as useful baseline information for the real-life use case of birdwatchers choosing an area of residence in Singapore.
All datasets, Python scripts, and resulting images from this study can be found here.
(1) Vidovich, E. Bringing the Outdoors In: The Benefits of Biophilia. NRDC: Expert Blog. Last accessed on 21/01/2021. Link.
(2) White, J. How popular is birdwatching? Chirpbirding.com. Last accessed on 21/01/2021. Link.
(3) GBIF.org (13 January 2021). GBIF Occurrence. Download.