I worked for a software consulting firm that got its start in the early '90s in a city on the East Coast. After a rocky start the firm found a unique selling message and delivery approach. In the years that followed it grew faster than 50% per year. Offices were opened across the U.S. and around the world. The company went public and lots of people got rich. And this was just as the Internet boom was ramping up.
As the firm grew, it sold larger and more complex projects: multi-million-dollar contracts with multi-year durations. The big projects helped keep the revenue machine running and looked like they would help grow margins.
After a while the larger projects started to hurt. They ran over. The customers were unhappy. The teams were crushed trying to do the impossible. Sometimes there were lawyers. The firm's way of doing business was not working as well for large projects.
I was part of a team that tried to figure out what was going wrong and how we would have to change if the firm wanted to stay in the large-project business. It would be great if there were a clean breakthrough that clearly solved the problem. Unfortunately, that is not how this story ends. While some regions were able to adjust, the firm foundered and went down for a number of reasons. It sank before we could know whether the things we thought would work could really be proven widely in the field.
Here is an overview of what we learned. Even if the best that can be claimed is a partial success, some of this may be useful to firms or project teams that are wandering into the large project jungle for the first time.
There seemed to be four big buckets for things that went wrong:
- The amount of detail exceeded what traditional practices could handle.
- The amount and shape of project uncertainties were unmanageable.
- Creating and maintaining a sense of partnership with so many client people and organizations for so long fell apart.
- Traditional project risks: bad things that happened that were outside of the project team's control.
Traditional project risks were simply 'scaled up' from smaller projects, and traditional risk management techniques could be scaled up to handle them. But in the worst cases the first three simply broke the way the firm did business:
The firm started in Boston. Many of its early hires came straight out of MIT. The firm attracted them by concentrating on leading-edge technology projects. It was part of the culture that not only were we the best at the newest things, but we were generally smarter than any competing firm. And this was not entirely wrong. So, the firm's way of doing business was to hire the best and brightest and provide them with a minimal framework to work together in. Projects were small, people were very bright generalists, and because everyone was so good, there was no need for a lot of process. It 'got in the way.'
Projects managed their details 'in their heads.' That is, individual memory or the long-running conversation that was the project's culture was enough to keep track of anything important. While the traditional process did include document deliverables, these were for all practical purposes "write only." There was no expectation that they would be kept up to date enough to help the team remember anything important.
On large projects the team simply could not keep track of the details, no matter how skilled the individuals were. Important things got agreed to and forgotten. Details that everyone on the team should know about no longer reached everyone when they needed to know them, partly because there were now so many more people. The sheer mass of detail overwhelmed the team's memory capacity and internal bandwidth.
Large projects were also more likely to be replacements for existing systems. The scope of the project included everything the old system did as a minimum. But there was no agreed-upon way to inventory the details of how the old system worked so that it could be truly understood as part of the agreement about the new project.
The delivery approach was rapid delivery/fixed price/fixed time/floating scope. A project would be sold in multiple contracts. A small one at the beginning set the general scope. A medium one next to verify the scope and explore the design. Finally, the big one at the end to do the actual delivery. This way the client could get an idea of what they were in for before they committed. And the firm could set and manage expectations before it committed to a price for the big contract at the end.
The advertising was that one of the main deliverables of the scoping contract was the price of the design contract. And one of the main deliverables of the design contract was the price of the development contract. But clients needed some preliminary indications of the price for the final contract as early as possible. For smaller projects it was possible to give a relatively wide range estimate as early as the scoping contract. If the price had to be radically adjusted after the design contract, well, we learned new things and the price was generally small enough that even a big swing did not break the client's budget.
On a small project the uncertainty in the end price could be a factor of 2 to 4 times at the end of the scope contract. While that was a wide range, it could be managed by adjusting scope, compressing margins, contingency planning and good ole overtime. Basically the same process applied to a larger project resulted in uncertainties closer to 10 times. This was more than the firm could compensate for with its traditional measures.
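As a rough illustration of why that difference mattered, the two uncertainty bands can be sketched numerically. The roughly 4x and 10x swing factors are from the experience above; the point estimates in this sketch are invented:

```python
# Sketch of how end-price uncertainty bands scale with project size.
# The swing factors (roughly 4x for small projects, 10x for large ones)
# are from the text; the point estimates are hypothetical.

def price_band(point_estimate, swing_factor):
    """Treat the swing factor as the ratio between the worst and best
    case, centered multiplicatively on the point estimate."""
    half = swing_factor ** 0.5
    return point_estimate / half, point_estimate * half

small_low, small_high = price_band(500_000, 4)      # $250k .. $1M
large_low, large_high = price_band(10_000_000, 10)  # ~$3.2M .. ~$31.6M
```

A 4x band on a half-million-dollar project could be absorbed with scope adjustment, margin compression and overtime; a 10x band on a ten-million-dollar project could not.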
Further, the larger projects with their larger prices caused clients to want more firm numbers earlier in the process. The numbers would be used to set client budgets for the next year. Budget numbers that had gotten approved 'up the chain' were harder to adjust.
Working with, Working for, Working against
The paradigmatic smaller project was a high-return, cutting-edge departmental solution. The best sponsors were visionary CTOs or technically oriented executives. They wanted as much return on their investment as possible. Time to delivery was often crucial. They wanted a technology partner. In exchange for speed and guarantees of price and time of delivery they were willing and able to negotiate large swings in the project's definition as needed to meet the true business goals.
A technology partnership with a key executive worked much better on a small to medium project than on a large one. The audience for larger projects was more diverse. More people at more levels in the organization had to sign off on decisions. The scope and design documents that were intended to indicate a starting point for negotiation during delivery became more and more contractual. More people and more departments made it hard to get real agreement on the intent of the system. Important changes could not be made on a timely basis, if at all.
Small to medium projects could be 'working with' the client. Larger projects degenerated into 'working for.' When the project needed to change faster than the relationship allowed, it got worse. It turned into 'working against.'
These things touched every aspect of large projects. This section ties symptoms to variations on the root causes. It's more fun to read about the solutions, so you might want to jump forward to them. For completeness' sake, here is an overview of the root cause analysis:
- Project Mechanics
- Delivery projects were bid at such a low price or for so short a schedule that the project team was doomed from the first day.
- Early price and schedule indications would need to be radically increased, surprising the client, leading to significant push-back from the client and damaging credibility. This could use up much of the project's goodwill before things even really got started.
- The project team responded to the pressure with commitment and large doses of overtime. Unfortunately, when the project was not salvageable, this mostly led to rework, poor-quality deliverables, and staff burnout.
- Tie To Root Causes
- Insufficient detail was collected and understood to create an accurate estimate.
- Large project estimation was done ad hoc by the project team. Deliverables across teams were inconsistent and could not be compared to historical actuals to anchor estimates in reality.
- Premature indications were given to the client, setting unrealistic expectations before uncertainties could be wrung out of the scope definition. Even if more accurate estimates were created after more details were collected, the price could be hard to 'bargain up' from the initial expectation.
- Scope Control
- Clients would assume that we had agreed to more scope than we understood. The language used in the scope and design meant different things to them than it meant to the project team.
- Clients would assume that we had agreed to greater depth of functionality than we understood. This could be a simple thing like whether the title for a person would be a simple text entry field or a context-sensitive drop-down list. It could be a lot worse than that, and often was. 'Compliance to corporate user interface standards' was a classic.
- The exact definition of the scope depended on who in the client organization was in the discussion.
- The legacy system that the new project replaced kept generating surprising new capabilities that were never anticipated in the original scope
- Tie To Root Causes
- The project team could not acquire and understand enough of the details of what the client was trying to tell them during the early discussions.
- There was no regular agreed-on mechanism to identify the depth required to deliver a function.
- Because we did not have a true partnership with all of the involved parties, trade-offs between needs could not be done effectively
- There was no way to examine an existing system to be replaced and fold its details into a managed project scope.
- Project Planning and Tracking
- For scope and design contracts, even though the work was not complete, the contract would be closed out so as to remain 'on time.' The assumption was that any leftover work could be 'caught up' in the next phase.
- Work not on the original project plan was not tracked
- Skills analysis was not done against the project plan. Needed skills were not identified in the plan, nor could the plan be adjusted based on different skill levels available when the team was assembled.
- Accurate progress toward completion was not tracked.
- Tie To Root Causes
- The inability of the team to remember and communicate project details meant that the project plan often was assembled missing key work items.
- The plan often had too little detail, but sometimes it had too much detail, and that detail was false.
- The tasks that did exist lacked completion criteria.
- The inability to manage the changing details meant that the project plan simply became irrelevant: it was ignored, progress against it was not tracked, it was not updated as conditions changed, or its task details were wildly out of sync with what the individual performing the task was actually doing based on the skills they brought to the table.
- Quality Assurance
- Defect counts for subsystems entering system test were very high
- It was difficult to complete system test
- Code had poor design qualities: it was poorly structured, hard to test, hard to change, and inconsistent in things like error handling, use of the architecture, opportunities for reuse, and GUI style.
- Tie To Root Causes
- Project teams varied widely on their definition of quality assurance.
- QA expectations on developers were inconsistent within and across project teams
- The client's quality goals and how they would know they were satisfied often went uncaptured.
- Teams often lacked the skills to identify and use QA tools
- Missing details in the project plan meant that the entire project 'squeeze' ended up in the QA phase.
- Testing knowledge was not widely disseminated
- Design for maintainability and testability were not widely understood
- Subcontractor / client deliverable management
- Clients would fail to provide deliverables in a timely way, delaying the project schedule
- Client employees who were to be part of the project, and then stay with it after the firm's team left, would not be provided on time or at all.
- Client deliverables, when they arrived, would be at the wrong level, factually incorrect, of low quality, or incomplete.
- Tie To Root Causes
- Details of what was expected of client deliverables were often missing
- Client task definitions lacked detail to identify what the pieces were and how they tied into the other project deliverables.
- Client employees were often working on the project part time. Lacking a clear partnership with the client, the priority of project deliverables versus other client deliverables was not clear.
- Maintaining Momentum
- Symptoms / Problems
- Long projects with long phases hurt momentum because the participants were drained of energy, the time between deliverables obscured the rate of progress, and the lack of clear progress between phase boundaries gave client employees who were opposed to the project time to work against it without counter-evidence of progress. All of these worked to erode the sense of partnership between the project and the client.
- The large and spread-out audiences were harder to manage. It was hard to keep everyone involved up to date and positive on the project. It was also harder to create a cross-enterprise consensus on the project goals and changes that had to be made.
QA problems arose not so much from details that the team did not acquire from the client as from a lack of detailed QA knowledge on the team in the first place.
There were three things that needed attention: the methodology, the project roles, and skills and training.
The methodology needed to be fixed in each of the three large project problem areas.
- Detail Handling Needed ways to identify important details as opposed to those that could be deferred. There had to be a way to convert those details into useful project actions. In some cases this would mean skills upgrades, in other cases entirely new roles.
- Center on Risk and Uncertainty The risk plan needed to become at least as important as the project plan. Information gathering needed to be justified based on how it decreased risk, rather than completing details for their own sake.
- Improve Partnering Risk and project planning needed to be done with the client. This meant sharing some details with the client that we would otherwise be more comfortable keeping to ourselves. It also meant establishing an 'escalation hierarchy' within the client organization for client decisions that were damaging to the project. This would often mean that the escalations would happen early in the project.
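The "center on risk and uncertainty" idea can be made concrete with a classic risk-exposure calculation. This is only a sketch under invented numbers, not the firm's actual method: an information-gathering task is justified by the exposure it removes, not by the detail it adds.

```python
# Hypothetical sketch: justify an information-gathering task by the
# risk exposure it removes rather than the detail it completes.
# All probabilities, impacts and costs here are invented.

def exposure(probability, impact):
    """Classic risk exposure: chance of the bad outcome times its cost."""
    return probability * impact

def value_of_investigation(prob_before, prob_after, impact, cost):
    """Net value of an investigation: exposure removed minus its cost."""
    return exposure(prob_before, impact) - exposure(prob_after, impact) - cost

# Spending 20k on a legacy-system inventory that halves a 40% chance
# of a 1M overrun removes roughly 180k of net exposure -- worth doing.
net = value_of_investigation(0.4, 0.2, 1_000_000, 20_000)
```

An investigation that does not move the probability at all comes out negative by exactly its own cost, which is the point: gathering detail that reduces no risk is pure overhead.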
In these days of Agile methods the specific phase recommendations seem more than a bit prehistoric. Of course we did not know about Agile then. And we were trying to maintain a fixed price/fixed time model. That also looks kind of silly now. But, for the record, the original methodology of scope, design and develop was augmented with new phases and changed in these ways:
A pre-scope phase covered any pre-sales or other pre-scope technical work. Lifecycle selection was begun -- is it a large project or not? Is it a project or a program?
For large projects, this was where the escalation path was laid out. Senior people from the client and firm side met, if only briefly, so that their first meeting wouldn't be when something had gone wrong.
Program Scope
When it looked more like a program than a project, the first step was to spend a few weeks figuring out what the projects would be and how they would fit together. Driving dates for each project were identified. Other scheduling constraints across the program were identified. Bids for project-specific scopes were constructed. Any legacy systems to be replaced were identified. The core program team would be brought on board for the program scope, and would stay on the program through the first project delivery.
Project Scope
The traditional two-week project scope was upgraded for large projects. The functionality list got a function point budget. The business case was used to create a feature filter to help figure out how functionality details would get decided. Open points were clearly enumerated, along with their impact on estimation accuracy. The process for driving the estimate from the functionality was agreed to with the client, including different functionality expansion assumptions: usually 20%, 40% and 60% expansions. An explicit internal overtime budget was set to prevent problems from getting buried in unrecorded overtime. A preliminary domain model was sketched out. Finally, the bid for the next phase was prepared, with the addition of quantifiable client actions, impacts and remedies.
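The estimation mechanics can be sketched roughly as follows. The 20/40/60% expansion assumptions are the ones described above; the function point budget and the cost-per-point rate are invented for the example:

```python
# Sketch of driving an estimate range from a function point budget
# under the agreed expansion assumptions. The cost-per-point rate and
# the 800-point budget are hypothetical.

COST_PER_FUNCTION_POINT = 1_500  # invented blended rate, in dollars

def estimates(fp_budget, expansions=(0.20, 0.40, 0.60)):
    """Price the scoped functionality under each expansion assumption,
    returning a mapping of expansion -> estimated price."""
    return {e: round(fp_budget * (1 + e) * COST_PER_FUNCTION_POINT)
            for e in expansions}

bids = estimates(800)  # an 800-point scope priced at +20%, +40%, +60%
```

Sharing a formula like this with the client turns the later price negotiation into a discussion about the measured inputs rather than the final number itself.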
Legacy Scope
If a project meant replacing an old system, then a legacy scope was about finding out what the old system did. All the documentation describing the old system was gathered: user manuals, help systems, code, table definitions. A conversion approach was defined. Can we do any automatic translation of the old system code? How much is a rewrite as-is in a new technology? How much really needs to be re-specified and re-designed? The legacy scope continued until the old system was known at the same level of detail as the project scope described the new system.
Architecture Workshop
When there were new technologies involved, or the system architecture was uncertain, an architecture workshop detailed out a candidate architecture. The goal was to show how the scoped functionality would map onto the architecture. An architecture workshop would often be followed by an architecture proof of concept, where pieces of the candidate architecture would be put together and measurements taken to be sure that it could do the job.
Architecture Build
Stealing freely from the Rational Unified Process, this phase built the core architecture and demonstrated a thin slice of functionality that touched as much of the architecture as possible. This proceeded in parallel with the phases that were detailing out the rest of the user functions. While the architecture would be elaborated, completed and tuned in later phases, the basic plan was in place after the architecture build.
Legacy Requirements Recovery
Legacy requirements recovery built on the legacy scope by reversing out the details of the legacy system. Each bit of functionality was mapped to the architecture defined in the workshop. These detailed functions were added to the design of the new project to produce a single design, rather than one design for the new system and another for what the old system used to do. Not only was this a new kind of phase, it was an entirely new skill set for the firm.
The old design phase got a whole bunch of new stuff. Three subphases were added.
The first was a GUI prototype. This let the users get a feel for how the system would work and confirmed the user metaphor and experience. The second nailed down GUI details in a style guide. This covered common screen behavior, controls placement, use of bitmaps and any 'fancy' controls, and finalized the navigation model. Finally, the functional specification itself got ground out: an 'outside only' GUI implementation, a user manual describing what the system did behind the scenes, and a technical document with the rest of the details such as the domain model, business rules, build strategies, etc., for the project.
The functional specification was function point counted. This count was compared against the count from the project scope. If the counts were very different it made it clear that the project at the end of the functional specification was something different from what it was at the end of the project scope. That, combined with the shared model of how functionality was turned into costs, allowed the client and the firm to negotiate pricing and scope changes before moving on to implementation.
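A count comparison like the one described can be reduced to a single drift check. The 15% tolerance here is an assumed negotiating threshold, not a number from the methodology:

```python
# Sketch of comparing the functional-spec function point count against
# the project-scope count. The tolerance is a hypothetical threshold
# beyond which price and scope would be renegotiated.

def scope_drift(scope_count, spec_count, tolerance=0.15):
    """Return (fractional drift, whether it exceeds the tolerance)."""
    drift = (spec_count - scope_count) / scope_count
    return drift, abs(drift) > tolerance

drift, renegotiate = scope_drift(scope_count=800, spec_count=1_040)
# 30% growth in counted functionality -> renegotiate before building
```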
The original "big bang" implementation was converted into a coarse grained incremental delivery, again heavily influenced by the Rational Unified Process. Each increment had its own sub-phases:
- Detailed Design created detailed implementation models, the test approach, and generated test data.
- Code produced working code and tested it against the test data using the test approach from the earlier subphase. Objective exit criteria let the team know if the code was finished or not.
- Increment test was basically a User Acceptance Test for the increment. Defined time-to-failure or other quality metrics were used to determine when testing should end.
- Increment delivery: a cut of the increment was delivered to the user. Where a 'model office' was established, a subset of users would use the new functionality to do day-to-day work.
The deployment phase started with the User Acceptance Test for the final increment. It ran through any system test, installation, optimization and first-run fixes needed to get the project into production and handed off to the client.
Project management got upgraded too. The budget for the unplanned had to be managed, including client deliverable slips. Weekly reviews comparing where the project was expected to be against where it actually was were held as Statement of Work (SOW) reviews. Both functional and technical learning curves had to be explicitly managed.
New Roles and Role Changes
There were new and improved roles to go with the new and improved methodology.
Business Analyst
The project manager and technical lead roles were well established before getting into large projects. But the idea of a business analyst needed to be formally articulated. There had to be a single first go-to person when the question was "How should this work?" from the user's standpoint.
Technical Architect
The technical architect was the person or people defining the largest-grain patterns of the system and software: someone ready and able to buy system and software parts that worked together and reduce the project's cost. This person was also charged with identifying and negotiating functionality trade-offs that would have substantial time and cost savings.
Domain Modeler
The domain modeler created a domain model of the business problem, without tying the team down to an implementation class model or database table model. The domain modeler worked with the business analyst to codify system functions against the shared model of the business, and with the technical architect to identify internal opportunities for reuse.
QA Manager
The QA manager was in charge of getting the testing resources together and making sure that testing was happening. They managed test preparation and execution, and tracked repair status. While they worked with the project manager, they did not report to anyone else in the project. Instead they had a direct reporting relationship to regional delivery management, so that troublesome quality issues that might otherwise get hidden inside the project could be made visible.
Testers
Testing had to be broken out as a separate project role. Testers could be cross-trained developers. They were trained to do unit tests, system tests of various kinds, and code reviews. They were brought onto the team early enough to get the domain knowledge to make them effective during testing.
Developers
Developers needed to have their testing skills upgraded. It had to be made clear what testing was expected before code could be considered finished. Tools and training to support that had to be provided.
Parts Scavenger / Toolsmith
The parts scavenger was in charge of knowing what reusable pieces were lying around the firm. They were also good at going out and finding a component to buy when that was the right thing to do. They had lots of input to the GUI design so that the project could get the best advantage of available pieces. They also bought or built high-leverage productivity tools on an as-needed basis.
New Responsibilities Above Project Management
A new client-side role was needed above the project manager. They were responsible for:
- Keeping client consensus in place.
- Maintaining the business case and the project's alignment with it.
- Executing risk mitigation plans, especially with the client-side executive sponsor.
- Reflecting the trade-offs of client behavior back to the client.
Skills and Training
The changes to the methodology and roles implied significant new skills and training would have to be made available. They included:
- Client Facing Skills
- Enterprise Consensus Building and Sponsorship Leverage
- Business case traceability / impact analysis
- Client deliverable management
- Risk identification and management
- Project Leadership Skills
- Basic project management skills brought up to a common level
- Project Estimation
- Change Management
- Quality Assurance Management
- Conflict Management
- Technical Specialities
- Domain Modeling
- Function point counting
- Legacy system requirements recovery
- Usability design / GUI standards
- Multi-model "balancing", especially for the technical team lead
- Technical Architecture
- Basic skills - things every consultant should know
- Detailed Design
- Testing practices and tools
- Individual estimation abilities -- how to know what it will take you to get it done
- Improved common tool use.
- Improved office software usage to prevent document screw-ups
- Meeting facilitation (every senior and above)
Other things would happen to the firm as we did more large projects, and not all of them were good. The large number of skills to be acquired and maintained implied specialization. This was not popular in a firm that believed a small team of generalists could solve any problem worth solving.
There would be a lot more detail floating around in projects, and the oral tradition was going to have to take a back seat in some cases. All of this meant more "formality" -- not popular.
It also meant longer project timeframes. It followed that advancement into senior technical roles and management would not come as fast.
So, is this the way to do large projects today? Clearly not. Agile has demonstrated that formality and rigor are not the only or the most effective ways to make software happen as projects get bigger. Also, one of the goals was to resist changes in cost and schedule. Agile changes the focus to embracing change in the most cost-effective way possible. That adds a whole new spin to things.
So, is this all a total waste then? Take away the things that might be better done by being a bit more cutting edge and there are still things to think about:
- How much information is available when to set the client's expectations of scope and budget? How will those expectations be managed as uncertainties are resolved over the life of the project?
- Agile structures the project's discussions, making the project able to handle radically more detail than a non-Agile project. What are the upper limits of this?
- In the absence of a single, totally available and empowered customer, how does the team resolve the different client views and opinions within the project?
- Is there any need for a Domain Modeler, separate from or as part of the skill set of a Business Analyst? Do Domain Models help a team handle even bigger projects, or do they just get in the way?
- How much trust is required between the client and project team? How is that trust created and maintained?
- Who manages the client deliverables into the project, and how do they do that?
- When a new project replaces an existing system, how is the old system functionality discovered, scheduled and budgeted for?
- How is project momentum maintained with the client and the project team if there are long delays between production installs?
- Is risk management a separate activity worth doing? If so, who does it?
- When replacing an existing system, how do we decide which functions should be prioritized above which others, and which have more business value? After all, 'everything' has to be replaced before the end of the project.
- Is Scrum everything we need for a program of projects?
- Where does the architecture come from?
- How will we know if we have tested enough?
- How do we manage skills needed on a project and those that individuals want to acquire during a project?
- Does the project leadership and the rest of the team need explicit training in conflict management and resolution?
- Are there other specialist roles that need to be defined?