Over the past six weeks the UN Global Platform has been collaborating with the UK’s Data Science Campus to implement the Urban Forest project on the cloud-based platform. The Urban Forest project aims to create an index of vegetation by taking Google Street View images around the road network and classifying the amount of vegetation in each image. High-level information on the project is available here, and a full report can be read here.
Implementing this on the platform has largely involved developing a series of algorithms in the methods service, which is supplied by Algorithmia. The methods service allows methods to be written in a range of languages (R, Python, Java) and then called from a wide range of languages. This makes it easy to combine languages in a single pipeline. Each part of the pipeline is contained within its own algorithm, allowing the parts to be called independently of one another. This post will walk through the creation of a number of these algorithms, how they have been combined to form a pipeline, and how the pipeline could be reused.
Sampling the road network
The first key component of the pipeline is to sample a requested geographical area. This might be a street, a parish or borough, or an entire city. The original Urban Forest project downloaded the OpenStreetMap data for a requested number of roads. In the implementation on the platform, we have created a number of algorithms that allow a query to generate evenly spaced points along all highways in the queried geographical area.
In R, we’ve used the osmdata package to geocode queries [@osmdata]. This package queries OpenStreetMap and can return results in a number of formats; for our use case we have used a SpatialLinesDataFrame object. The algorithm we have created wraps the querying function, allowing either a bounding box, a bounding box plus street name, or an area/city name to be used as the query. We have set the osmdata query to return all named highways, with the osmlines returned in sp format. The benefit of converting into sp format is that we then have access to a range of the spatial analysis packages in R, a couple of which are used in the next sections.
At this point it’s important to note that osmdata uses the Overpass API to return the OpenStreetMap data. While this works relatively well, it may not scale well if many users begin to use this algorithm. We are currently investigating novel methods for accessing OSM data.
At this stage we want to place points at regularly spaced intervals along the highway(s). Intervals can be specified by the user. We first split each SpatialLinesDataFrame into individual SpatialLines objects, each of which has a unique way_id. Using the sp package, we then measure the length of each road and use this to calculate the number of points that we need along that line. We then use rgeos, which can take a SpatialLines object and place evenly spaced points along a line by setting normalised = TRUE. Using this setting, a value of 0.5 will place a point halfway along the line; therefore creating a sequence between 0 and 1, with the length set to the number of points we require, will return the desired coordinates. As the lines object is stored as a list, we can use lapply to quickly apply this function to all SpatialLines in the SpatialLinesDataFrame.
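The normalised interpolation described above can be sketched in a few lines. This is an illustrative Python version, not the R code used on the platform: the function names `interpolate_normalised` and `sample_line` are hypothetical, and coordinates are treated as planar (x, y) pairs, which is only a reasonable approximation over short distances.

```python
import math


def interpolate_normalised(line, fraction):
    """Return the point a given fraction (0 to 1) of the way along a polyline.

    `line` is a list of (x, y) tuples. Planar distances are an assumption
    made for this sketch.
    """
    segments = list(zip(line, line[1:]))
    lengths = [math.dist(a, b) for a, b in segments]
    total = sum(lengths)
    if total == 0:
        return line[0]
    target = fraction * total
    travelled = 0.0
    for (a, b), seg in zip(segments, lengths):
        if seg > 0 and travelled + seg >= target:
            t = (target - travelled) / seg  # position within this segment
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        travelled += seg
    return line[-1]


def sample_line(line, spacing):
    """Place evenly spaced points along a polyline, including both ends."""
    total = sum(math.dist(a, b) for a, b in zip(line, line[1:]))
    n_points = max(2, int(total // spacing) + 1)
    # A sequence between 0 and 1 whose length is the number of points needed.
    fractions = [i / (n_points - 1) for i in range(n_points)]
    return [interpolate_normalised(line, f) for f in fractions]
```

Applying `sample_line` to every line in a list (for example with a list comprehension) plays the same role as the lapply call in the R implementation.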
At this point, each way_id is a collection of longitude and latitude coordinates, representing the evenly spaced points along the highway. The final stage of the sampling algorithm is to calculate the direction of the road at each point. This was achieved using the earth.bear function. This function requires two coordinates and returns a heading, with 0 degrees being north, 180 degrees being south, and so on. For each point we use the points before and after it to calculate that coordinate’s bearing. While there are geometries where this will not be perfect, it gives a fairly good estimate of direction.
Clearly this isn’t possible for the first or final points in a way_id, so we give these points the direction of their closest point.
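As a rough illustration of this step, the sketch below computes a heading for each point from its neighbours. It is written in Python and uses the standard initial great-circle bearing formula; it is not a port of earth.bear, and the function names `bearing` and `headings` are hypothetical.

```python
import math


def bearing(p1, p2):
    """Initial bearing in degrees from p1 to p2; 0 is north, 180 is south.

    Points are (longitude, latitude) pairs in decimal degrees.
    """
    lon1, lat1 = map(math.radians, p1)
    lon2, lat2 = map(math.radians, p2)
    dlon = lon2 - lon1
    x = math.sin(dlon) * math.cos(lat2)
    y = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(dlon))
    return math.degrees(math.atan2(x, y)) % 360


def headings(points):
    """Estimate a heading for every point along one way_id.

    Interior points use the bearing from the previous point to the next one.
    The first and last points copy the heading of their closest neighbour.
    """
    if len(points) < 2:
        raise ValueError("at least two points are needed to define a direction")
    if len(points) == 2:
        b = bearing(points[0], points[1])
        return [b, b]
    interior = [bearing(points[i - 1], points[i + 1])
                for i in range(1, len(points) - 1)]
    return [interior[0]] + interior + [interior[-1]]
```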
Images are downloaded using the Google Street View API. For this reason, anyone wanting to use this pipeline will require their own Street View API key. The input for this algorithm is an object with values for: latitude list and heading. The algorithm will then create a new folder called way_id_<way_id_num> that all the images will be downloaded to. It then loops over each point in the way_id and captures two images: one on the left and one on the right of the road. It does this by adding and subtracting 90 degrees to the heading. Each image is saved in its respective way_id folder, named in the format
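The left and right camera headings can be sketched as follows. This Python example only builds request URLs and makes no network calls. It assumes the Street View Static API endpoint and its `size`, `location`, `heading` and `key` parameters; the function names and the default image size are hypothetical, and file naming is deliberately left out.

```python
from urllib.parse import urlencode

STREET_VIEW_URL = "https://maps.googleapis.com/maps/api/streetview"


def side_headings(heading):
    """Return (left, right) camera headings, wrapped into 0-359 degrees."""
    return (heading - 90) % 360, (heading + 90) % 360


def image_requests(lat, lon, heading, api_key, size="640x640"):
    """Build one request URL per side of the road for a single sample point."""
    left, right = side_headings(heading)
    urls = {}
    for side, side_heading in (("left", left), ("right", right)):
        params = {
            "size": size,                # assumed image size for this sketch
            "location": f"{lat},{lon}",
            "heading": side_heading,
            "key": api_key,              # the user's own API key
        }
        urls[side] = f"{STREET_VIEW_URL}?{urlencode(params)}"
    return urls
```

The modulo keeps headings valid when the road direction is close to north, for example a heading of 350 degrees gives 260 and 80.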
The image segmentation algorithm is the same algorithm as used in the Data Science Campus’ Urban Forest project. It implements PSPNet [@PSPnet], a convolutional neural network that achieves high levels of accuracy by combining scene parsing with pixel-level prediction. The implementation in the methods service takes an input folder and an output folder as arguments. All images in the input folder are processed and the outputs saved to the output folder. As these will not necessarily be processed in order, the naming convention that we have used becomes very useful here. The segmented grey output images are saved to the output folder, named
A composite image combines the downloaded image with the pixel predictions as a coloured mask. These can be used to visualise what the network is doing and how well it is performing.
The vegetation index simply returns the percentage of pixels classified as ‘vegetation’ in an image. This can then be averaged for each point (as there are two images at each point) or over a larger geographical area such as a street or parish.
The vegetation index algorithm in its current implementation can only return an average value for each way_id, or for a larger geography composed of multiple way_ids. This is a consequence of the implementation of the segmentation algorithm above, which will not necessarily process each image in order.
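The calculation itself is small enough to sketch. The Python example below assumes a segmentation mask is available as a nested list of integer class labels and that vegetation has label 1; both the label value and the function names are assumptions for illustration, not details of the platform algorithm.

```python
from statistics import mean

VEGETATION_LABEL = 1  # hypothetical class id for 'vegetation'


def vegetation_percentage(mask, vegetation_label=VEGETATION_LABEL):
    """Percentage of pixels in one segmented image labelled as vegetation."""
    pixels = [label for row in mask for label in row]
    if not pixels:
        raise ValueError("mask contains no pixels")
    matches = sum(1 for label in pixels if label == vegetation_label)
    return 100.0 * matches / len(pixels)


def average_index(masks):
    """Average vegetation percentage over all images for one way_id."""
    return mean(vegetation_percentage(mask) for mask in masks)
```

Averaging over every mask in a way_id folder does not depend on the order in which the images were segmented, which matches the limitation described above.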
Asynchronous Processing Pipeline
One key benefit of the platform is the ability to make multiple API calls to the same algorithm, allowing for parallel processing. This has the potential to significantly reduce processing times, particularly when you’re processing a large geographical area. We have achieved this parallel processing by using asyncio, a library that now comes with Python for concurrent processing. With concurrent processing, you need to ensure that you avoid race conditions, where two processes access and process the same data at once. To avoid this, we have essentially created a pipeline to download, segment and create composite images. We split each query into a set of inputs, each input being a way_id with associated coordinates. We then create a set of queues and evenly split these inputs between the queues. Each input in a queue is then processed in turn, while all queues are processed simultaneously. This ensures that no way_id is processed more than once.
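The queueing pattern described above can be sketched with asyncio as follows. This is a minimal illustration rather than the platform code: `process_way` is a hypothetical placeholder for the download, segmentation and composite calls, and the number of queues is an arbitrary choice.

```python
import asyncio


async def process_way(way_id, coords):
    """Placeholder for download, segment and composite steps for one way_id."""
    await asyncio.sleep(0)  # stands in for awaiting the real API calls
    return way_id, len(coords)


async def worker(queue, results):
    """Process the inputs in a single queue one after another."""
    while not queue.empty():
        way_id, coords = queue.get_nowait()
        results.append(await process_way(way_id, coords))


async def run_pipeline(inputs, n_queues=3):
    """Split inputs evenly between queues and process all queues concurrently.

    `inputs` maps each way_id to its list of coordinates. Each way_id is put
    on exactly one queue, and each queue has exactly one worker, so no way_id
    is processed twice.
    """
    queues = [asyncio.Queue() for _ in range(n_queues)]
    for i, item in enumerate(inputs.items()):
        queues[i % n_queues].put_nowait(item)  # round-robin split
    results = []
    await asyncio.gather(*(worker(q, results) for q in queues))
    return results
```

A call such as `asyncio.run(run_pipeline(inputs))` runs the whole set. Results arrive in completion order rather than input order, so they carry the way_id with them.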
The pipeline can be called in two main ways. Firstly, there is an interactive notebook. This gives a clear description of how the algorithms work and how the results of one algorithm are used as the input for the next. Alternatively, we have wrapped the entire pipeline into a single algorithm call. The user simply inputs the query, their Street View API key and the spacing between points they want to use, and the pipeline will run until completion.
All these components are open on the methods service for reuse. For urban analytics, the road sampling method may be very useful; the code is open source and can therefore be adapted to return a GeoJSON file rather than a list of coordinates. The sampler, in combination with the image downloader, can be used to create new data sets which can be labelled and then used for novel machine learning projects. If you have a use case for these methods, please get in touch at [email protected]