Sometimes it may be necessary to simplify a set of points into a network. It may be for qualitative reasons such as determining intersection points and conduits of a network. It may also be for quantitative reasons like shrinking a dataset and speeding up computations. In this post I will go over a case study of GPS data and how to turn this data into a smaller and more usable dataset.

Most people know this idea from simple linear approximation. In innumerable applications it is useful to fit a linear approximation to data. But when is that too simple? Consider the following picture, which shows three possible paths given the same data points. Imagine that the red dots are measured GPS points. The black lines are three possible paths that a car may have taken along these measured points. They are all reasonable. Maybe the car was in a parking lot or in a field, not on a nicely paved road.

If you look at the leftmost route you may notice that we could just as easily describe the route with **only four points**, the points where the slope changes. The red dots all lie on the lines that connect those four points. We have substantially reduced the number of points that describe the same path.

There is a fun algorithm called the Ramer–Douglas–Peucker algorithm that can help us with this problem. The basic idea is to take a curve defined by a set of points and reduce it based on a supplied tolerance. First, we connect the starting and ending points with a line. Then we find the point that is furthest from that line. If its distance is less than the tolerance we are done and keep only the two endpoints; otherwise, that furthest point becomes a new ending (and starting) point, and we repeat on each half. Wikipedia has a great animated gif that visually explains the process.
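The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the exact code used later in the post; the function names `point_line_distance` and `rdp` and the toy path are my own for demonstration.

```python
import math

def point_line_distance(start, end, point):
    """Perpendicular distance from `point` to the line through start and end."""
    (x1, y1), (x2, y2), (x0, y0) = start, end, point
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:  # degenerate segment: fall back to point distance
        return math.hypot(x0 - x1, y0 - y1)
    return abs(dy * x0 - dx * y0 + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def rdp(points, tol):
    """Reduce `points` to a subset deviating from the original polyline by at most `tol`."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the line joining the two endpoints.
    d_max, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_line_distance(points[0], points[-1], points[i])
        if d > d_max:
            d_max, index = d, i
    if d_max < tol:
        return [points[0], points[-1]]  # everything is within tolerance
    # Otherwise split at the farthest point and recurse on both halves,
    # dropping the duplicated split point when joining the results.
    return rdp(points[:index + 1], tol)[:-1] + rdp(points[index:], tol)

# An L-shaped path with slight noise: only the three corners survive.
path = [(0, 0), (1, 0.02), (2, -0.01), (3, 0), (3, 1), (3, 2)]
print(rdp(path, 0.1))  # -> [(0, 0), (3, 0), (3, 2)]
```

Six points collapse to three: exactly the "keep only the points where the slope changes" intuition from the picture above.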

Let's see how this works in a real-world situation.

### Applying to GPS data

Consider the situation where we are given a set of GPS data. GPS data can be highly inaccurate: if there are trees or buildings in the way, it can be off by as much as 40 meters. The image below shows some GPS data I took from my phone while walking home one day. It looks inaccurate, but one could reasonably argue my path is pretty clear: I walked down the two roads that the points outline. But what if I told you I took a back alley and then crossed the road? Does it matter? Maybe, maybe not. I have applied the Ramer–Douglas–Peucker (RDP) algorithm to this data for various tolerances. The top image is for a small tolerance, the lower for a more aggressive tolerance.
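One practical wrinkle when picking a tolerance: the GPS points are in degrees of latitude/longitude, while the accuracy figure above is in meters. A rough conversion, assuming a spherical Earth (the function name and constants here are my own, not from the original post):

```python
import math

# One degree of latitude is roughly constant everywhere on a spherical Earth.
METERS_PER_DEG_LAT = 111_320.0

def meters_to_degrees(meters, latitude_deg):
    """Return (lat, lon) tolerances in degrees equivalent to `meters`.
    Degrees of longitude shrink by cos(latitude) away from the equator."""
    lat_tol = meters / METERS_PER_DEG_LAT
    lon_tol = meters / (METERS_PER_DEG_LAT * math.cos(math.radians(latitude_deg)))
    return lat_tol, lon_tol

# A 40 m GPS error at roughly the latitude of central Israel (~31.8° N):
lat_tol, lon_tol = meters_to_degrees(40, 31.8)
print(lat_tol, lon_tol)
```

So a tolerance of a few ten-thousandths of a degree corresponds to the worst-case GPS error; anything much smaller mostly preserves noise.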

The lower map shows what our intuition expects; the route is three nodes connected by two paths.

The last example shows how powerful this algorithm can be. I had a dataset consisting of around 2500 data points. The data comes from me logging a run in Gedera, Israel. The RDP algorithm reduced it to 26 points and still captured both my route AND the intersections I encountered.

The Python code for the RDP algorithm is below (I found it on the web and forgot to annotate where :()

[code language="python"]
import math

def dist_point_line(start, end, point):
    # Perpendicular distance from `point` to the line through `start` and `end`.
    # (The original post did not include this helper; this is a standard formula.)
    (x1, y1), (x2, y2), (x0, y0) = start, end, point
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x0 - x1, y0 - y1)
    return abs(dy * x0 - dx * y0 + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def DPbasic(lat_lon, tol):
    N = len(lat_lon)
    if N < 2:
        return list(lat_lon)  # not enough points to simplify
    # Find the point farthest from the line joining the endpoints.
    d_max = 0.0
    index = 0
    for i in range(1, N - 1):
        dist = dist_point_line(lat_lon[0], lat_lon[-1], lat_lon[i])
        if dist > d_max:
            index = i
            d_max = dist
    if d_max >= tol:
        # Split at the farthest point and recurse on each half,
        # dropping the duplicated split point when joining.
        network = DPbasic(lat_lon[:index + 1], tol)[:-1] + DPbasic(lat_lon[index:], tol)
    else:
        network = [lat_lon[0], lat_lon[-1]]
    return network
[/code]

### What's it good for, anyways?

Maybe this algorithm is interesting, but how could we use it in a clever manner? Imagine you have a dataset with hundreds or thousands of runs or walks, each containing timestamped latitude and longitude points. Moreover, what if I told you there were actually TWO people who made this data, not one? And what if your task was to identify each runner's routes? How could you do it?

This is a common machine learning problem: identifying users by their "fingerprint" in some sort of dataset. The major advantage to using RDP in this case is that your computation time drops from hours or days to seconds.

To identify which path belonged to which runner, you could now apply whatever machine learning algorithm you want to a substantially smaller dataset. Instead of computing velocity and acceleration on 2500 points, you are working with around 26.
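As a sketch of the kind of feature this enables, here is one way to compute speed between consecutive timestamped GPS points using the haversine great-circle distance. The `(lat, lon, t_seconds)` point format and the function names are my own illustration, not from the original post:

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, assuming a sphere

def haversine_m(p1, p2):
    """Great-circle distance in meters between two (lat, lon, ...) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p1[0], p1[1], p2[0], p2[1]))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def speeds(track):
    """Speed in m/s between consecutive (lat, lon, t_seconds) samples."""
    return [haversine_m(a, b) / (b[2] - a[2]) for a, b in zip(track, track[1:])]

# Two points ~111 m apart (0.001 degrees of latitude), 30 seconds apart:
track = [(31.800, 34.770, 0.0), (31.801, 34.770, 30.0)]
print(speeds(track))  # roughly [3.7] m/s, a plausible running pace
```

On the simplified route this runs over tens of segments instead of thousands, which is what makes the "seconds instead of hours" claim above plausible.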

The RDP algorithm is a cute method for simplifying your dataset and turning a dense network into a sparser and less curvy network. It can simplify computations and, in an application like GPS, allow you to **identify structures such as intersections**. Good fun!