How to make sense of public data with spatial analysis and mapping techniques
Introduction
Living in the rather large city of Hamburg, I have always considered public transport widely available. However, even though I live in a central location, I have always been a little frustrated by the long distances to public bike stations.
Today, I want to show you how we can use public data to generate insights with spatial analysis and data visualization. Specifically, we will investigate the following things:
1. The availability of public bikes over the course of the day
2. The walking distances to bike stations
3. Whether I am really out of luck or just lazy
TL;DR
Turns out I’m just too lazy! You can find an overview of the results in this Streamlit application:
Walking distances to Infrastructure in Hamburg
Availability of Bikes
The data
To answer any kind of question, we first need to get our hands on some data. In this case, we can make use of the public REST API provided by the city of Hamburg. The data is heavily nested and the extraction process is not particularly interesting (in case you are still interested, you can find the link to the script I wrote here). Essentially, I created a JSON file for each bike station in Hamburg containing the changes in the number of available bikes over the last two years and uploaded it to AWS S3. After cleaning, each file had the following structure:
{
  "thingID": 10,
  "description": "StadtRad-Station Mundsburger Brücke / Papenhuder Straße",
  "coordinatesX": 10.019142,
  "coordinatesY": 53.565952,
  "obs": [
    {
      "result": 1,
      "resultTime": "2020-12-23T08:22:02.456Z",
      "observationID": 45812737
    },
    {
      "result": 18,
      "resultTime": "2020-08-24T10:41:24.098Z",
      "observationID": 34813796
    },
    ...
  ]
}
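As a sketch of the cleaning step, such a nested file can be flattened into one row per observation with nothing but the standard library. The sample below is a shortened, hypothetical stand-in for a real station file:

```python
import json
from datetime import datetime

# Shortened, made-up sample mirroring the structure shown above.
station_json = """{
  "thingID": 10,
  "description": "StadtRad-Station Mundsburger Bruecke / Papenhuder Strasse",
  "coordinatesX": 10.019142,
  "coordinatesY": 53.565952,
  "obs": [
    {"result": 1,  "resultTime": "2020-12-23T08:22:02.456Z", "observationID": 45812737},
    {"result": 18, "resultTime": "2020-08-24T10:41:24.098Z", "observationID": 34813796}
  ]
}"""

def flatten_station(raw):
    """Turn one nested station file into flat per-observation rows."""
    station = json.loads(raw)
    rows = []
    for obs in station["obs"]:
        rows.append({
            "thingID": station["thingID"],
            "lon": station["coordinatesX"],
            "lat": station["coordinatesY"],
            "bikes": obs["result"],
            # Strip the trailing "Z" so fromisoformat accepts the timestamp.
            "time": datetime.fromisoformat(obs["resultTime"].rstrip("Z")),
        })
    return rows

rows = flatten_station(station_json)
print(len(rows))  # 2
```

The real script does the same thing for every station file on S3 before the aggregation step.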
Data Preparation
In order to work with the data, we need to aggregate it in a useful way. After creating one entry per hour per station for the last two years, the dataset already amounted to 5.5 million observations. This called for some PySpark, a framework that allows big data operations to be run on a cluster efficiently.
I’ll skip the details again, as this is not the focus of the article. The script loads all the files from S3, flattens the structure and creates one entry per hour for the last two years. It then aggregates each station by hour and weekday and exports everything to a CSV file. The file is now much smaller: only 168 data points per station (24 hours × 7 weekdays), leaving us with around 48k rows, which is a much more manageable number to work with.
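The core of that aggregation is a groupby over weekday and hour. For illustration, here is the same step in pandas rather than PySpark (the PySpark version differs mostly in API, not logic); the four observations are invented:

```python
import pandas as pd

# Hypothetical flat observations, one row per station-hour.
df = pd.DataFrame({
    "thingID": [10, 10, 10, 10],
    "time": pd.to_datetime([
        "2020-12-21 08:00", "2020-12-28 08:00",  # two Mondays at 8:00
        "2020-12-23 10:00", "2020-12-23 11:00",  # one Wednesday
    ]),
    "bikes": [4, 6, 10, 12],
})

# Average available bikes per station, weekday (Monday = 0) and hour.
agg = (
    df.assign(weekday=df["time"].dt.dayofweek, hour=df["time"].dt.hour)
      .groupby(["thingID", "weekday", "hour"], as_index=False)["bikes"]
      .mean()
)
print(agg)  # three rows: the two Monday-8:00 values collapse into one mean
```

Run over the full dataset, this collapses 5.5 million rows into at most 168 rows per station.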
Visualization
Using the aggregated data and two Python libraries, folium and geopandas, we can create this animated plot (the day of the week is encoded by the leading integer, starting with Monday). There is an obvious pattern here: on weekday mornings, a lot of bikes move from the outskirts towards the city center and return in the evenings. Saturday starts later, but then even more bikes move to the center, whereas Sunday seems to be a lazy day overall, with generally less movement.
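One way to build such an animation with folium is its TimestampedGeoJson plugin, which takes one GeoJSON point per station per timestamp and steps through them frame by frame. The sketch below constructs those features from two made-up stations; the rendering calls are left commented since they produce an interactive map rather than console output:

```python
# Build the features that folium.plugins.TimestampedGeoJson expects:
# one point per station per timestamp, marker radius scaled by bike count.
# Coordinates and counts are invented for illustration.
stations = [
    # (lon, lat, {iso-timestamp: average bikes})
    (10.019, 53.566, {"2021-01-04T08:00:00": 3, "2021-01-04T18:00:00": 14}),
    (9.990, 53.553, {"2021-01-04T08:00:00": 12, "2021-01-04T18:00:00": 5}),
]

features = []
for lon, lat, counts in stations:
    for ts, bikes in counts.items():
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {
                "time": ts,
                "icon": "circle",
                "iconstyle": {"radius": max(2, bikes)},  # bigger dot = more bikes
            },
        })

# Rendering (requires folium; not executed here):
# import folium
# from folium.plugins import TimestampedGeoJson
# m = folium.Map(location=[53.55, 10.0], zoom_start=12)
# TimestampedGeoJson(
#     {"type": "FeatureCollection", "features": features},
#     period="PT1H",  # advance one hour per animation frame
# ).add_to(m)
print(len(features))  # 4
```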
We can also use a radial plot to analyze a single station; here I picked one in the city center and one in the outer districts. The same pattern as above can be seen.
If we aggregate all stations by city district and compare the available bikes with the number of inhabitants of that district, we can see a huge imbalance between districts. The next graph shows the districts with the most and the fewest bikes per inhabitant. The explanation for the high numbers seems rather straightforward: the districts at the top are mostly in the city center, where lots of people commute in the morning, raising the number of available bikes, even though comparatively few people actually live there.
The districts at the bottom are mostly vast, highly populated districts that seem to be somewhat underserved by public bike stations.
How about the variation in the number of bikes over the day between districts? Again, we can see huge differences. Central and industrial districts tend to have a large daily deviation in bikes, whereas very remote districts tend to keep a consistent number of bikes throughout the day. The next two plots give an example of such a comparison.
Geospatial imbalances are always a concern for bike-sharing operators. If this turns out to be too much of a problem, a dynamic pricing system could flatten out some of the huge spikes and valleys, for example by making it cheaper to rent a bike from an under-demanded station and vice versa.
Walking Distance to the nearest Bike Station
Another important aspect of a public bike system is the walking distance to the nearest station. For this analysis, we can build a graph network of the city and calculate the shortest walking distance to the nearest station for each point (node) on our map. For a more technical explanation, you can check out the recent article I published on the topic.
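The trick that makes this tractable is a multi-source shortest-path search: instead of running one search per node, we seed a single Dijkstra search with every station at distance zero, which yields the distance to the nearest station for all nodes in one pass. A minimal sketch on a toy graph (the real analysis runs on a street network with many thousands of nodes):

```python
import heapq

def walking_minutes(graph, stations):
    """Multi-source Dijkstra: minutes from every node to its nearest station.

    graph: {node: [(neighbour, minutes), ...]} -- an undirected street network.
    stations: iterable of nodes that host a bike station.
    """
    dist = {node: float("inf") for node in graph}
    heap = [(0.0, s) for s in stations]  # seed every station at distance 0
    for _, s in heap:
        dist[s] = 0.0
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry, already found a shorter path
        for neighbour, minutes in graph[node]:
            nd = d + minutes
            if nd < dist[neighbour]:
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, neighbour))
    return dist

# Toy street network: a chain of intersections A--B--C--D, two minutes per edge.
toy = {
    "A": [("B", 2)],
    "B": [("A", 2), ("C", 2)],
    "C": [("B", 2), ("D", 2)],
    "D": [("C", 2)],
}
print(walking_minutes(toy, ["A"]))  # {'A': 0.0, 'B': 2.0, 'C': 4.0, 'D': 6.0}
```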
In the graph below (for the interactive version, follow this link) you can see the average walking distance from most points in Hamburg to the nearest bike station. The brighter the color, the fewer minutes you need to walk: from 0 minutes in bright yellow (if you're standing next to a station) to over 50 minutes in purple.
Not only can you identify the bike stations without any explicit markers, you can also spot areas that seem to lack access to bike stations. In a perfect world, we could ask the city to place a station on every purple node and thereby reduce the walking distance for most people. But to do this more economically, we also need to consider population and population density.
I would make the case that as a city planner you want to help as many people as possible with the fewest resources. For that reason, you wouldn't place new stations at the points that are currently furthest from an existing station, but rather at places that are far away for many people.
As an approximation, we can use the population of each district and assume it is evenly distributed over the district's area (which of course it is not, but we have to start somewhere). If we assign each node the average population of the district it lies in and multiply that by the node's walking distance, we get a metric that might be suitable to optimize for:
Weighted Walking Distance = Population * Walking Distance
With the population weights, we can also calculate the average walking distance to the nearest station: 18.42 minutes. Turns out, with my 10 minutes, I'm not as badly off as I thought.
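Spelled out as code, the metric and the population-weighted average look like this. All districts, node counts and population figures below are invented for illustration:

```python
# Weighted walking distance per node: district population share times walking
# minutes. All numbers are made up for illustration.
nodes = [
    # (node, district, walking minutes to nearest station)
    ("n1", "Altona", 4.0),
    ("n2", "Altona", 12.0),
    ("n3", "Bergedorf", 30.0),
]
district_population = {"Altona": 270000, "Bergedorf": 130000}
nodes_per_district = {"Altona": 2, "Bergedorf": 1}

weighted = {}
for node, district, minutes in nodes:
    # Evenly spread the district's population over its nodes.
    pop_per_node = district_population[district] / nodes_per_district[district]
    weighted[node] = pop_per_node * minutes  # Population * Walking Distance

# Population-weighted average walking distance over all nodes:
total_pop = sum(district_population[d] / nodes_per_district[d] for _, d, _ in nodes)
avg = sum(weighted.values()) / total_pop
print(round(avg, 2))  # 15.15 for these toy numbers
```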
Plotting the nodes again for the districts within Hamburg, this time using the weighted walking distance, we get the following map:
This map gives a more realistic picture. Areas that appeared to have decent walking distances are now worse off due to their high population density, whereas more remote districts now perform a little better. We can also see that the bright yellow city center indicates a high availability of bike stations for a comparatively small population.
Starting from this map, we can now place new hypothetical stations and try to optimize the network for the weighted-distance KPI we created. I am using a very straightforward greedy algorithm:
1. Pick the node with the highest weighted walking distance
2. Place a new station there
3. Recalculate all the weighted distances
4. Repeat steps 1 to 3 for each new station
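The greedy loop above can be sketched in a few lines. To keep it self-contained, the example uses a toy one-dimensional "street" where nodes sit one minute apart and the nearest-station distance is recomputed by brute force after every placement; population weights are invented:

```python
# Greedy station placement on a toy 1-D street: nodes 0..9, one minute apart,
# each with a hypothetical population weight. Same steps 1-4 as in the text.
positions = list(range(10))
population = [5, 1, 1, 1, 8, 1, 1, 1, 1, 9]  # made-up people per node
stations = [0]                               # one existing station at node 0

def weighted_distances(stations):
    # Population * walking minutes to the nearest station, per node.
    return [
        population[p] * min(abs(p - s) for s in stations)
        for p in positions
    ]

for _ in range(2):                      # place two new stations
    w = weighted_distances(stations)    # step 3: recalculate all weights
    stations.append(w.index(max(w)))    # steps 1+2: worst node gets a station

print(sorted(stations))  # [0, 4, 9]
```

The algorithm first serves the heavily weighted far end (node 9), then the populous middle (node 4), which is exactly the "far away for many people" behaviour we wanted.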
Now we let the algorithm run 100 times. You can see the performance of the algorithm in the next plot:
As you can see, the algorithm managed to reduce the average walking distance by over 4 minutes, a reduction of more than 20%. But we can also see that each newly placed station has a smaller effect than the last. Let's look at the results on our map: on the left is the original distance map, on the right the new version with the hundred additional stations. We can already tell that some of the more remote nodes are now served by the new stations.
It gets more interesting when we compare the weighted walking distance maps. Now we can really tell that the algorithm did its job: most of the problematic zones have been successfully eliminated. Even if some of those stations might not be economically worthwhile, I would argue that in general this approach worked out quite well and could be a starting point for future city planning.
This brings us to the end of this small analysis. There is still much more one could dig into, but I hope you learned something new today and got a sense of how public data can yield potentially useful insights.
If you have some questions or want more information, feel free to reach out to me: Felix Ude on LinkedIn