The transport sector has a torrid relationship with data. I’ve spent my career at the intersection of physical and digital infrastructure, but I’ve always struggled to explain to myself why everything is so damn hard! Can there be a simple explanation as to why the sector so often fails to capitalise on digital technology?
How about a simple formulation:
- Infrastructure is inherently chunky (e.g. you can’t build part of a bridge).
- Data solutions are inherently incremental (e.g. build an MVP, gauge users’ reactions, improve and iterate).
- Transport and Data struggle to live in harmony because it’s hard to reconcile a chunky industry with an incremental industry.
I will admit that this is a pretty crass insight, but doesn’t it feel intuitively right? I mean, doesn’t it? Transport operators are, rightly, run predominantly by engineers (and similar but inferior professions like Quantity Surveyors), often with immense pressure to demonstrate value for money and 'efficiencies’. Given the choice, they will generally invest in better physical infrastructure over data infrastructure. And even when they do invest in digital projects, too often they treat those projects just like construction projects, expecting them to deliver a solution as a single long-lasting ‘big bang’ rather than iteratively adapting and improving over time.
Perhaps what we need are some case studies to illustrate the point. Let’s start with one that is near and dear to my heart…
Transport vs. Data… Round 1: we need less TRUST
Total Operations Processing System (TOPS), the system that tracks train movements on the UK railway network, was first developed in 1968 by Southern Pacific Railroad, and adopted by British Rail on a purpose-built IBM mainframe computer in 1975. Britain’s passenger trains are, on average, just under 20 years old. TOPS, by contrast, is well into its 50s, and I’d put good money on it reaching pensionable age. If it were a bit of rolling stock it would have been scrapped decades ago.
This is absurd. At the risk of stating the obvious, trains are (generally) far more expensive and longer lasting than IT systems. Trains have changed a lot over the last 50 years, but it’s fair to say that IT has changed a lot more (the two-year-old phone that I am writing this on has roughly 150 times more processing power and 8,000 times more memory than the IBM System/360 that British Rail initially purchased to run TOPS). I would love to see a train 150 times faster (or 8,000 times safer) than 1970s rolling stock. The Thameslink rolling stock alone cost £1.6bn; I reckon that, conservatively, you could build all of Network Rail’s IT infrastructure from scratch for half of that (NR, feel free to DM me if you want to take me up on that offer).
So how is it that we’ve managed to keep our physical assets (mostly) up-to-date, but let our digital assets become so hopelessly outdated? And just as importantly why is it that when we do get round to replacing IT systems that they often arrive years late, over budget, and with a fraction of the promised functionality? These are questions that have bothered me for some time. Arguably I have even built my career in part by profiting off of doomed IT projects, or from creating unsustainable 'tactical' solutions. Let he who is without sin write the first LinkedIn comment.
Anyway, back on topic… the truly fascinating aspect of TOPS is how, rather than being replaced, it has grown roots, becoming the foundation of a vast, business-critical ecosystem. In the 70s the core “tell me where the trains are” functionality of TOPS was augmented by the “tell me how late the trains are and why” functionality of ‘Train Running Under System TOPS’ (TRUST) — gotta love an acronym within an acronym. TRUST and TOPS were then in turn integrated into a wide variety of sub-systems and bolt-ons, all of which either run on top of the TOPS code-base or rely on connections to it, including, but not limited to:
- Train Service Information (TSI) — “tell me what the train schedule is”;
- Passenger Operations Information System (POIS) — “tell me which bits the trains are made of”;
- Control Centre of the Future (CCF) — “show me a pretty picture of where the trains are”;
- PALADIN — “tell me where the trains were more than a week ago”;
- Rolling Stock Library (RSL) — “who even owns this train?”
- Integrated Train Planning System (ITPS) — “where shall we send the trains tomorrow?”
- Train Describer (TD) — “apparently other services want to know where the trains are too?”
None of these systems comes with anything resembling a human-friendly user interface (except perhaps CCF, pictured below, which is kind of cool). And they all rely on some combination of containerisation, virtual machines, and a diminishing pool of legacy coders to keep them running in the present day.
TRUST is the acronym most commonly used within Network Rail and the Train Operators. Typically, when staff refer to TRUST they are really referring to all or part of that wider agglomeration of systems, accessed through the notorious green Courier-font screen that originated with TOPS.
The 90s brought privatisation, and suddenly train delays (or rather the performance regime built to assign blame for train delays) became a multi-million pound business. This made TRUST/TOPS even more pivotal to the financial well-being of the railways. There were now substantial flows of money between Railtrack and Train Operators due to delays, which in turn had a direct impact on profitability and, crucially, bonuses. Entire franchises have arguably stayed in the black due to the flow of taxpayers’ money from NR to the TOCs on account of delays, a bizarre form of indirect subsidy from the public purse which inevitably creates some profoundly perverse incentives. Most delays accrue to NR, which then funds the TOCs, who in turn have little incentive to help reduce delays despite arguably having more opportunities for efficiency. For example, hiring more train dispatchers at busy stations will often reduce delays, but it costs TOCs not merely in additional payroll, but also in the compensation forgone for the delays that those staff prevent. Who loses out here? The travelling public, and the taxpayer.
Back to TOPS. To better 'explain’ delays (many articles could be written on the accuracy of that word in this context), TRUST/TOPS gained a Visual Basic GUI imaginatively titled TRUST DA (for Delay Attribution). So we now have an acronym [TOPS] within an acronym [TRUST] within an acronym [TRUST DA]. TRUST DA, or TDA to those in the know, meant that for the first time ever delay attributors (yes, that’s a real job, more on that later) could interact with some records and functions in TRUST/TOPS through a GUI (e.g. using a mouse rather than the traditional TOPS function keys, and with a palette of more than one colour). However, this new interface didn’t mean that the system was up to date; instead TRUST DA only added the functionality strictly needed to accommodate the staff required by the new performance regime, and didn’t re-architect the underlying TRUST/TOPS system at all. Moreover, as passenger numbers (and thus the absolute number of delays) rose relentlessly throughout the 1990s and 2000s, TRUST DA was shown to be lacking. Crucially, despite Delay Attribution being an exhaustively governed and standardised process, the new interface did not include much in the way of the data-validation features common to other Microsoft forms of the era, like drop-down lists or check-boxes. Instead, TRUST DA relies on staff to input carefully structured free text for every single delay, errors in which can result in hours of work being disputed (by the TOCs or NR staff unhappy with ‘getting minutes’) and rendered void, or can require lengthy corrections back in the TOPS green screen. This is why Network Rail employs whole teams to correct and defend challenges to the delays attributed by other Network Rail teams. What’s more, the 90s VB interface doesn’t support any time-saving features: no shortcuts, no automation.
In desperation many attributors run parallel instances of TDA to save time, but this can result in terrifying bugs where entire days’ worth of delays suddenly disappear off the screen.
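To see how little it would take to close this gap, here is a minimal sketch of the kind of structured-input validation TRUST DA never got. The entry format and the delay codes below are invented for illustration (the real codes and rules are governed by the industry’s delay attribution guidance), but the principle is exactly what a drop-down or a pre-submit check would enforce:

```python
import re

# Hypothetical set of valid delay codes (illustrative only; the real
# codes are defined in the industry's delay attribution guidance).
VALID_CODES = {"TG", "IB", "OC", "XZ"}

# Assumed entry format: "<code> <minutes> <free-text reason>",
# e.g. "TG 4 points failure at Lewisham".
ENTRY_PATTERN = re.compile(r"^([A-Z]{2}) (\d{1,3}) (.+)$")

def validate_entry(entry: str) -> list:
    """Return a list of problems; an empty list means the entry passes."""
    match = ENTRY_PATTERN.match(entry)
    if not match:
        return ["entry does not match '<code> <minutes> <reason>' format"]
    code, minutes, reason = match.groups()
    problems = []
    if code not in VALID_CODES:
        problems.append(f"unknown delay code '{code}'")
    if int(minutes) == 0:
        problems.append("delay of zero minutes cannot be attributed")
    if len(reason.strip()) < 5:
        problems.append("reason text too short to survive a dispute")
    return problems

print(validate_entry("TG 4 points failure at Lewisham"))  # []
print(validate_entry("QQ 4 points failure at Lewisham"))  # ["unknown delay code 'QQ'"]
```

A few dozen lines like this, run before an entry is committed, would catch malformed records at the point of entry rather than hours later in a dispute.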
The consequences of this under-designed, rarely-updated software were substantial for the dozens of people using it every single day. By the time of my brief tenure as Head of Performance for the South East route in 2017/18 (a role which, despite the name, focused almost entirely on delay attribution), the situation was punishing. Attribution staff numbered around 60 in the South East alone, man-marked by sizeable Train Operator teams, with a four-level dispute and escalation process ending in the semi-autonomous Delay Attribution Board (which adjudicates thorny questions like who is responsible for points that sit between the rail network proper and a train depot).
Delays piled up during the morning and evening peaks, with even the best attributors (who could churn through 100+ delays an hour) struggling to keep pace, and often failing to meet the exacting quality standards as they worked at speed. Attributors work to a strict rota, with each shift picking up whatever backlog the previous shift left. This means that the night shift often works on delays that occurred long before their shift started; with a bit of luck the quieter night shifts can catch up, picking up the pieces of the chaotic peak times that preceded them. Even so, in part due to the capacity limits of the TRUST DA system, any unexplained delays expire by default in the early hours of the morning as the system automatically wipes the slate clean before the next day. When this happens, Network Rail (and thus, ultimately, the taxpayer) accepts responsibility for all those unexplained delays and pays out a small fortune of taxpayers’ money as a result. Further sums are lost to errors and disputes discovered after the fact. The financial impacts were pushing attribution out of obscurity and into 'bonus-affecting territory’ for senior management. In the autumn, when lots of little delays increase the attributors’ workload still further, I was begging our team of attributors to take as much overtime as they could manage in a vain attempt to stem the tide.
The introduction of delay attribution (and its corollary, delay disputes) to the system marked the point where the human consequences of reliance on the life-expired TOPS/TRUST megatron started to add up. With millions riding on whether delays were assigned to Network Rail or to the Train Operators (and every delay needed to be blamed on someone), sizeable teams began to emerge that were paid to use the clunky, inefficient interfaces of TRUST and TRUST DA as a full-time job. This is the human price of the transport sector’s inability to make peace with IT: people spending 50+ hours a week sat in the uncomfortable corners of offices, depots, control centres, and signal boxes, day and night, tapping away repetitively on antiquated systems, entering, re-entering, and correcting data, filling the void evacuated by IT with endless hours of thankless work.
Broken TRUST: Lessons learned
Anyone who’s worked in the public sector has likely seen this phenomenon of seemingly immortal, irreplaceable, but hopelessly defunct zombie systems before. Budgets are tight, and it’s usually a bit cheaper to append a new bit of functionality onto an existing system than it is to replace that system with one that meets all of the old and new requirements. This means that core systems like TOPS/TRUST tend to grow arms and legs. In a self-reinforcing pattern, every new appendage makes it harder and more expensive to replace the core system, and more likely that the next piece of functionality will be 'bolted on’. The new appendages will inevitably be built by different people from the core system, and will use different languages and different architectures, increasing complexity and maintenance costs, as well as the likelihood of bugs, and making it yet harder and more expensive to actually replace the core system.
At the top of the article I posed a three-part theory, so how did it stack up in the case of TOPS/TRUST?
- Good infrastructure is inherently chunky — check! The core questions of “Where are the trains, why are they late, and who is to blame?” are all pretty chunky.
- Good data solutions are inherently incremental (e.g. build an MVP, gauge users’ reactions, improve and iterate) — At first blush the story of TOPS/TRUST seems to disprove this theory; after all, what could be more incremental than building an ecosystem of software over half a century? But we need to avoid mistaking inaction and quick fixes for incremental development. There’s a big difference between evolving technology and technological inertia. For example, the design of modern data warehouses would not be completely unfamiliar to the 1960s and 70s developers of TOPS and TRUST. While the core logic of relational databases has persisted, everything else has evolved to be far easier to use, and far more powerful. If British Rail/Railtrack/Network Rail had consistently invested in periodic updates to the code, user interface, and platform of TOPS then it could still be an effective and viable solution, because the underlying logic is sound. Instead, investment was deferred year after year. Even when circumstances dictated that money absolutely had to be spent (for example, with the introduction of the performance regime), the decision was always made to take the cheapest, most short-term option: to bolt on a new tactical appendage rather than fix the underlying mess.
- Transport and Data struggle to live in harmony because it’s hard to reconcile a chunky industry with an incremental industry — People who work on the railways are massively dedicated to their work, and passionate about delivering a better service to the public. They work tirelessly to squeeze every last drop of efficiency out of Britain’s ageing rail infrastructure. But this relentless focus on making the infrastructure work better often blinds the organisation to the important contribution made by systems like TOPS or TRUST. IT and data can’t, on their own, make the trains run on time. And yet by relying on these aged systems Network Rail is denied the timely information that it needs to understand what is causing delays, and it loses millions of pounds every year in performance penalties, inefficient working practices, and unexplained delays. The great irony here is that one of the arguments you hear about why TOPS/TRUST has never been replaced is that “it’s too big of a job.” The possibility of incremental, agile improvement and replacement of these creaking systems is never even considered (no strangler patterns here), and yet the very piecemeal nature of the ecosystem that has grown up around TOPS/TRUST is exactly what makes their incremental replacement possible.
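The strangler pattern is worth a brief sketch, because it is suited to precisely this situation: an old core surrounded by sub-systems that already talk to it over well-defined interfaces. A facade sits in front of the legacy system and routes each query either to a modern replacement service (where one has been built) or to the legacy core; functionality migrates one topic at a time until the old system can be switched off. All the service names below are hypothetical, not real railway systems:

```python
# A minimal sketch of the strangler pattern: a facade routes each
# query to a modern replacement where one exists, and falls back to
# the legacy core for everything not yet migrated.
# All names here are hypothetical, not real Network Rail systems.

def legacy_tops_query(query: str) -> str:
    return f"legacy answer to '{query}'"

def modern_location_service(query: str) -> str:
    return f"modern answer to '{query}'"

class StranglerFacade:
    def __init__(self):
        # Start with everything on legacy; migrate one topic at a time.
        self.routes = {}

    def migrate(self, topic: str, handler):
        """Point one topic at its modern replacement."""
        self.routes[topic] = handler

    def query(self, topic: str, query: str) -> str:
        # Anything not yet migrated still goes to the legacy core.
        handler = self.routes.get(topic, legacy_tops_query)
        return handler(query)

facade = StranglerFacade()
print(facade.query("location", "where is 1A23?"))  # answered by legacy
facade.migrate("location", modern_location_service)
print(facade.query("location", "where is 1A23?"))  # answered by the new service
```

Nothing downstream of the facade has to change when a topic migrates, which is exactly why a piecemeal ecosystem like the one around TOPS/TRUST is a good candidate for this approach rather than a barrier to it.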
I struggle to quantify how unique transport is in this respect relative to either the broader public sector, or the private sector. It’s easy to glibly say “if a private firm’s profits counted on these systems then they would have been fixed long ago,” but that strikes me as an over-simplification for two reasons. Firstly, as we’ve established, Network Rail has had a sizeable financial incentive to more efficiently record and explain train delays for many years. Secondly, I think the combination of diffuse accountability, short-termism, and under-informed clienting fits a wider pattern in IT investment that is not limited to the public sector or transport (though perhaps it is most pronounced in our sector).
What I do know is that if we want our infrastructure to deliver value to the public, to be ready for a future of mobility-as-a-service and connected vehicles, and to adjust to crises like COVID, then we need to learn to build better IT, to leverage our data better, and to do so incrementally in small ways every day. Every day at the intersection of transport, IT, and data is an opportunity to slowly but surely work through our technical debt and a legacy of under-investment… or to perpetuate and repeat the sins of the past. We need less TRUST.