Some form of events are common in most software systems. If those events are logged with a timestamp, you get an ordered number sequence (if the timestamp is interpreted as a number). Sometimes it is desirable to compare two logs to each other and evaluate how similar they are. E.g. two users want to determine how similar they performed when completing a certain task, or comparing the user performance to the optimal event sequence. In a current project, I use the comparison to evaluate how well a user clapped a pre-defined rhythm. In the following article I am proposing an algorithm to solve this task where the sequences are sorted and may contain some errors. Although the algorithm has a quite high time complexity (), it yields convincing results and is well-applicable for small sequences.
In the following, I will handle one sequence as the
input and one sequence as the
target. The target sequence will stay fixed, whereas the element in the input sequence might be shifted.
The given sequences may contain errors that might be caused by the user or systematic delays. The following algorithm can handle the following errors:
- Both sequences may be shifted against each other. This might be caused by the recording method. The amount of shift is unknown.
- Elements in the input sequence may be slightly off, i.e. they can contain noise.
- There may be elements in either list that do not have a matching partner in the other list.
Evaluating a match’s quality
There plenty of possible quality functions. However, two functions are especially useful: The sum of absolute differences and the sum of squared differences. For all matched pairs
target_i / input_i:
The squared difference is widely used. However, it tends to prioritize optimizing pairs that have a big difference, even if it means that pairs with small differences have to be torn farther apart. That’s because a pair with difference 4 is not only twice as bad as a pair with difference 2, but even four times as bad (=2^2).
The absolute difference does not have this problem. It handles each pair with the same amount of attention. However, this function is not continuously differentiable which makes the optimization a bit harder.
The proposed algorithm can handle both objective functions, but I will focus on the absolute difference.
The algorithm is similar to the Iterative Closest Point algorithm that is used to align point clouds to each other. However, since we are in 1D, the algorithm’s parts can be designed more powerful.
The basic idea is to split the problem into two parts: The estimation of the overall shift and the matching of elements. Both parts will be executed iteratively. We will start with a rough matching. Then the overall shift will be optimized for this matching. This may yield a new better matching. This is done until we arrive at an optimal solution.
Let’s have a closer look at both parts:
Estimating the overall shift
Assume that we already have a valid matching of input points to target points. We now want to optimize the offset (the overall shift) with respect to minimizing the objective function. The objective functions above did not contain the shift, so we have to introduce it.
Let’s start with the squared difference function:
In order to find the minimum, we can derive it with respect to and solve the derivative for 0:
Solving for 0 yields
This is simply the mean of differences or the difference of means.
The second objective function is a bit trickier. Its derivative is not continuous and there is no single point where the derivative is zero. There is either none or infinitely many. Anyway, we have to think of another way to minimize this function.
Without going too much into detail, the minimizer of the absolute difference is the median of .
We can now optimize the overall shift, given a valid matching. Let’s take a look at how to calculate the best matching, given a valid offset.
Calculating the optimal matching
Finding the optimal matching of input elements to target elements is another minimization problem, though a discrete one. Our chosen objective function (squared or absolute differences) provides the basis for this problem’s objective. We can assign each input/target pair a penalty which is equal to their distance. Let’s visualize this as a table:
We could just find the minimum for each target, but that would not account for all properties of our model. E.g. we could not prevent the matching to go back in time (i.e. map input 1 to target 3 and input 2 to target 2). In the example above, target 2 does not really have a good input, but input1 matches it best. input1 is already matched to target1. So if two targets are matched by the same input, the second one will just be skipped.
In order to make our model more powerful, we introduce edges between the nodes. Each edge can bear an additional penalty. This penalty is intuitively the number of skipped elements (either inputs or targets) multiplied with a constant factor (the skipPenalty). We can now leave edges away that go back in time. Going horizontally would mean to skip a target (i.e. penalty of 1); going diagonally would mean to skip nothing (i.e. penalty of 0); and going further down would mean to skip inputs (with according penalty). Here are the outgoing edges for two representative nodes:
The task is then to find a path from target0 to target4 with a minimum overall penalty. However, there is one tiny specialty: If we go along a horizontal edge, the node at the end should not add to the overall penalty, because this target will be skipped anyway. The nodes of the last column need to include an additional penalty for skipped elements, because there is no outgoing edge that could be used to model this.
Finding an algorithm to solve the matching problem
Now that we have stated the matching problem, it is time to find an algorithm to do that. It turns out that Dynamic Programming is a suitable approach. Here is the idea:
Start at the first target and initialize the penalties for all inputs (based on their distance and the skipPenalty).
Now continue to the next target and analyze every input. Every node in this column has an optimal predecessor (which minimizes its penalty + the edge’s penalty). We can find this optimal predecessor by iterating all inputs of the previous column and taking the minimum element. So if our path goes through the current node, it must also go through the optimal predecessor. We store the predecessor and the sum of the path’s penalty up to the current position at the node (= predecessor’s penalty + edge’s penalty + own penalty).
This procedure continues. Thus we propagate the path’s overall penalty to the last target. Now we can find the minimum node of the last column. This is the element where the path with the minimum overall penalty finished. So that’s the mapping for the last target. We have stored its optimal predecessor, so all we have to do is follow this trail down to the first target and we have our minimum path along with the optimal mapping.
Putting it together
So far we have the two main parts of the algorithm. Given an offset, we can find the optimal mapping and, given a mapping, we can find the optimal offset. But how to start? If we don’t know anything about the sequences, we need an initial guess for the offset. This can be determined with the difference of both sequences’ means or medians. However, I chose a different approach.
I assume that we know the maximum offset of both sequences. Then we can tweak the
skipPenalty in order to force the matching algorithm to not skip an element unless it is necessary.
If all pairs have the maximum offset and no element is skipped, we have a total penalty of . Breaking up a pair introduces a penalty of (one for the input, one for the target). Now consider the following sequences with a maximum offset of 5:
This mapping has a penalty of 15. We could easily reduce the penalty to by skipping the first target, mapping input0 to target1, input1 to target2 and skipping the last input. But we don’t want that. So we need the skipPenalty to be high enough to prevent this. Therefore the penalty for breaking up a pair must be higher than the reduction of all other penalties from maxOffset to 0:
The number of potential pairs can be determined from the minimum of input elements and target elements. For squared differences, maxOffset hast to be squared.
That’s how to start. But how to end? Assuming that both sequences are aligned perfectly, we can define a threshold for the difference above which pairs should not be considered to be pairs. This threshold (divided by 2) is the last skipPenalty that we optimize for.
The skipPenalty is reduced from step to step from its start value to the target value. The reduction is performed exponentially, because very high skipPenalties should reduce more quickly than rather small ones. This leads to a refinement of the matching in every step. We define a constant number of steps for the refinement. After executing all steps, we have arrived at an optimal solution. We can directly use the optimal path’s penalty as a quality measure or we could introduce another one. However, other quality measures should use the assumptions made during the optimization (absolute or squared differences, penalty does not increase if a certain threshold is exceeded).
A Visual Studio 2012 C# solution with some test cases can be found here. The program is a console application that outputs the steps as images in the startup directory.
Here are some sample matchings that the algorithm produces:
Perfect shifted input
Shifting the input is the easiest scenario. The algorithm finds the optimal solution in the first step:
Noisy shifted input
Noisy input is handled equally well:
Skipped and added elements
The hardest test case is where some target points are skipped and the same amount of elements is added to the input sequence. Of course, uniform noise is applied to the input points: