The incredible real life adventures of “Shortest-Job-First”

Ted Spence
tedspence.com
Apr 28, 2024 · 6 min read

In which our author recounts a risky experiment with scheduling algorithms

In my 1980s-era college operating system programming class, one of the topics discussed was this: How do you handle a slow CPU when you have lots of tasks waiting to be executed?

Despite my lack of interest at the time in programming operating systems, this was a really fun problem. The (now-ancient) textbook ended with a discussion of an algorithm called OPT, or “The Optimal Scheduling Algorithm”, which was also known as shortest-job-first.

Using shortest-job-first, the scheduler would take the quick tasks and run them first. All the quick tasks would finish soon, so the users with quick jobs would only wait a short time. The longer, slower jobs would run later, but the theory suggests that since those jobs take a long time anyway, their users wouldn’t notice the extra delay.
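
To see why, take a quick back-of-the-envelope example: three jobs that need 1, 2, and 10 minutes of CPU time. Run them longest-first and the average wait before a job starts is (0 + 10 + 12) / 3 ≈ 7.3 minutes; run them shortest-first and it drops to (0 + 1 + 3) / 3 ≈ 1.3 minutes. The 10-minute job waits a little longer, but the quick jobs start almost immediately.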

The book then concluded wistfully by saying that “In the real world, we don’t know how long a job will take, so we can’t use OPT.”

Almost thirty years later, I finally saw an opportunity to try out this fascinating algorithm. Let me tell you my story of optimizing schedules for a tax data business, how it almost worked — and how the ideas within the algorithm helped find a better solution.

The shortest lighthouse on the Oregon coast (Kirt Edblom via Flickr)

Inheriting a batch processing system

I began working for Avalara in 2012, and I was given ownership of the batch processing system that calculated tax returns. It was slow and it often took seven days to prepare tax returns — which was a problem, because legal deadlines meant we had to have these forms ready within ten days. That was cutting it dangerously close.

Obviously, the correct fix was to replace the batch processing system with one that didn’t use batches. I knew that was the right approach because the calculations didn’t really depend on one another. A better system wouldn’t make company B’s calculation wait in the queue until company A’s was complete.

After looking at the code, I realized a full replacement would take months or years to build. What could I do in the meantime to prevent the system from running out of time?

Monitoring the system

The previous owner of this software had built a little system to measure the number of jobs waiting in the queue. It was a single query run once per minute to measure the size of the queue:

SELECT COUNT(1) AS jobs_remaining FROM filings_in_queue

Using the statistics from this system, you could watch as the queue gradually made progress over the course of seven days. At times the queue would shrink rapidly, and at other times it would stall because it was processing a really huge tax return for a big customer.

What if we monitored the total size of tax returns instead? A small change produced this updated dashboard:

-- Each joined row is one transaction still waiting to be processed
SELECT COUNT(1) AS transactions_remaining
FROM filings_in_queue fiq
INNER JOIN transactions t ON t.filing_id = fiq.filing_id;

Now it was easy to notice that whenever the queue progress would stall out, it was working on a large tax computation. It was time to start planning some improvements.

The art of the possible

So I had a rough idea of the long-term solution. My next step was to find a way to safely tinker with the system. Because we were using Microsoft SQL Server at the time, I took the bit of code that selected the next batch job to run and converted it into a stored procedure. Here’s roughly what it looked like:

CREATE PROCEDURE sp_GetNextBatchToCompute
AS
    -- Pick the queued filing with the fewest transactions: shortest-job-first
    SELECT TOP 1 fiq.filing_id
         , COUNT(1) AS transactions_in_filing
    FROM filings_in_queue fiq
    INNER JOIN transactions t ON t.filing_id = fiq.filing_id
    GROUP BY fiq.filing_id
    ORDER BY transactions_in_filing ASC;

Although stored procedures are frowned upon nowadays since they introduce unnecessary fragility in complex systems, this had specific benefits for me. With a stored procedure, I could submit a request to the database administrator team to update the procedure without waiting days for a new build of the code to be approved. At the time, Avalara was a company that needed manager approval to change the running system, so this approach optimized my ability to tinker with scheduling.

This new stored procedure would select the smallest filing first. Since the number of transactions in a filing was directly correlated to the length of the processing time for that filing, this should be the optimal algorithm, shortest-job-first, right? I let it run and watched the dashboard.

A totally foreseeable tragedy strikes

Those of you with experience in complex systems know that our first guess never works out as intended — or at least, if my first guess had worked out, it probably wouldn’t have been interesting enough to write about.

After my change to pick shortest-job-first, progress on the queue completely stalled.

So what had gone wrong? I had overlooked a few factors:

  • Each filing computation had some amount of overhead.
  • Some companies modified their filings during the ten-day preparation period, and those filings had to be recomputed.
  • The companies that restated their data multiple times during that period were often smaller ones, with fewer transactions and smaller accounting departments that needed time to correct errors in their data.

So what naturally happened was task starvation: small, frequently restated filings kept jumping to the front of the queue, and the big companies whose large filings were ready to compute at the beginning of the ten-day period were never selected.

Thankfully, the dashboard made it pretty obvious how to fix this: let’s go for longest-job-first! I submitted a request to the database administrator team to update the stored procedure to sort descending instead of ascending. The system now completed in three days instead of seven.
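
The updated procedure was essentially the same query with the sort direction flipped; roughly:

ALTER PROCEDURE sp_GetNextBatchToCompute
AS
    -- Pick the queued filing with the MOST transactions: longest-job-first
    SELECT TOP 1 fiq.filing_id
         , COUNT(1) AS transactions_in_filing
    FROM filings_in_queue fiq
    INNER JOIN transactions t ON t.filing_id = fiq.filing_id
    GROUP BY fiq.filing_id
    ORDER BY transactions_in_filing DESC;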

Of course it worked — but what does this tell us?

It may seem trite, but the lesson I had to learn here was that no single algorithm works optimally in all use cases. Some customers were going to restate their data multiple times, while others were going to be perfectly ready on day one and never need additional help.

Because of this:

  • The original queuing order of “date submitted” was not optimal.
  • Selecting batch items by smallest number of transactions was not optimal.
  • It is likely that there was a more advanced algorithm that would have been better: perhaps one that combined the size of the filing, the number of times its data had been restated, and the age of the request to recompute… (a rough sketch follows this list).
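
Purely as an illustration, a blended priority score might have looked something like the sketch below; the restatement_count and queued_at columns, and the weights, are hypothetical:

SELECT TOP 1 fiq.filing_id
     , COUNT(1) AS transactions_in_filing
FROM filings_in_queue fiq
INNER JOIN transactions t ON t.filing_id = fiq.filing_id
GROUP BY fiq.filing_id, fiq.restatement_count, fiq.queued_at
ORDER BY COUNT(1)                                        -- bigger filings first
       + DATEDIFF(HOUR, fiq.queued_at, GETDATE()) * 10   -- boost filings that have waited longest
       - fiq.restatement_count * 100                     -- push back filings that keep restating
       DESC;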

I ran some additional experiments to optimize the batch selection algorithm further, including ways to delay the computations for companies that continually restated their data, but nothing was a clear win over “longest-job-first”.

After another few days I stopped trying to tinker with this and just decided to spend my time working on the replacement system. When completed, Project Blackbird was a massive performance and accuracy improvement that took just about 11 months from inception to completion — but I’ll save that for another article.

What is it you’re optimizing?

The biggest takeaway from this experience was that academic correctness often has to take second place to the reality of a system’s behavior. You can do exactly the correct thing, but when reality throws you curveballs you need to be able to handle them.

  • Stored procedures aren’t generally recommended because they make it harder to enforce type safety in queries and data objects. In this situation, however, organizational constraints meant that the most effective way to tune the system was to put the critical code into a stored procedure that could be updated with far less hassle.
  • I totally encourage everyone to try shortest-job-first if they have the ability to do so, but you also need to factor in how long a job has been waiting, or consider round-robin fairness in scheduling, so that every company gets a turn.
  • If Project Blackbird had been a two or three month project, this whole experiment would have been a waste of time — in that world, it would have made more sense to rush to get it done sooner rather than spending time optimizing a slow batch system.

Of course, Avalara went on to be a huge success and is still a mainstay of the business world today. I’m grateful for the chance to have built and led development for its tax return computation system, automated filing systems, the AvaTax REST API, and more. But I always thought tinkering with these calculation queues was a fun project and I hope you find similar joy in your work.

Ted Spence heads engineering at ProjectManager.com and teaches at Bellevue College. If you’re interested in software engineering and business analysis, I’d love to hear from you on Mastodon, Threads, or LinkedIn.
