Data cathedrals versus information bazaars

Posted on April 24, 2017 (updated April 23, 2019) by peterlawrence

Enterprises create data cathedrals with an enforced dogma to control data purity, leaving much information outside their walls, where informal information bazaars thrive. These information bazaars have suspect quality and uncertain provenance, yet are responsive to users’ needs. Metcalfe’s law suggests that the benefit gained from integrated information grows geometrically¹ with the number of data communities that are integrated. How can we balance the dogma of the data cathedrals and the spontaneity of the information bazaar?

Enterprise database cathedrals reflect corporate dogma. Nothing gets changed without approval from on high. Change is very slow. New databases get integrated only after a considerable time, and only on the assumption that the new data is 100% squeaky clean. So there are a lot of databases that are entirely outside the database cathedrals’ walls. Badly behaved sources of data might even be excommunicated.

Where does the other data go? It is not as though this other data does not exist, although many would like to pretend so. Instead it is all in the information bazaar. Anyone with any information can set up their own information stall, and store their own data in Excel, Access, anywhere they want. They specialize only in their own data for their own use. This data is pretty good because that is all they need for their business. They share well with others, but on a barter basis. In fact the information bazaar is chaotic, but lively, always changing to users’ demands, and a fun place to be.

Why do we have the conflict between the database cathedral and the information bazaars?

The data cathedral offers security, quality, and good provenance. It provides the system of record for users who then should have complete confidence in their decision making. It does this using accurate relational models capturing enterprise information. But a relational model is designed by the cathedral hierarchy based on the closed model: only pure data can be entered into the database; impure data can lead to excommunication.

The information bazaar has few rules of entry. As demonstrated by the web, it allows anyone to say anything about anything (AAA). Even with this deficiency we regularly search the web to help us with our decision making, avoiding sources that are suspect and filtering out information that we feel lacks accuracy, until we end up with information to support our decision.

Can we resolve these conflicting objectives?

Can we expect the cathedral hierarchy to relax its admittance criteria to let in as much of the information bazaar as possible? Somewhat, but we cannot expect miracles.

Can we expect the information bazaar to become more sober and responsible so that it can securely provide information with guaranteed quality and provenance? Somewhat, but we cannot expect an evangelical conversion.

Really this is not optimal, because the benefit of having data integrated grows geometrically with the number of interconnected sources, yet the database cathedral cannot grow because the information bazaar does not meet its purity dogma.

So how can these conflicting objectives be redeemed?

One path to redemption is to unite the information bazaar through a common semantic model. This allows all information to be available within a universal graph (model). Of course some riff-raff will get in, but again that is an advantage for the semantic model as you can also declare rules that will verify the accuracy of the data even though it is already stored.

At the same time the data cathedral can continue to expand, hopefully at faster pace, by integrating those graphs that meet their criteria.

However, we allow users to access both the data cathedral, from where they can obtain the system of record, and the information bazaar. We could even report results federated from the two data sources, annotating the information from the information bazaar with its provenance and hence its less certain data quality. Doing this in a standards-compliant way turns existing enterprise information resources into connectable, responsive and interoperable semantic assets.

Harmony

Using this approach we don’t need to force the data cathedral to relax its dogma, nor do we ask the information bazaar to shut down. Yet we can offer users access to 99% of the enterprise information, providing users the ‘Metcalfe’¹ benefits of full integration. As semantic assets grow and connect, they enable a resilient semantic ecosystem of meaningful interactions between people, applications and data irrespective of the differences in structures, data schemas, governance and technologies. The dividing boundaries between the cathedral and the bazaar no longer need to be obstacles to information users. A semantic ecosystem seamlessly embraces and provides integrated access to data cathedrals and information bazaars alike.

¹ If I have 10 database systems running my business that are entirely disconnected, then the benefits are 10 * K, where K is some constant. If I integrate these databases in pairs (operations + accounting, accounting + payroll, etc), then the benefits increase to 10 * K * 2. If I integrate in threes (operations + accounting + maintenance, accounting + payroll + receiving, etc), then the benefits increase four-fold (a corollary of Metcalfe’s law) to 10 * K * 4. For quad-wise integration my benefits would be 10 * K * 8, and so on. Now it might not be 8-fold, but the point is that there is a geometric, not linear, growth in benefits as I integrate all of my information across my organization.
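The arithmetic in this footnote can be sketched in a few lines of Python. The function name and the doubling-per-level rule are just a restatement of the 2, 4, 8 progression above, not a formal statement of Metcalfe's law, and K is left as a parameter.

```python
def integration_benefit(systems: int, group_size: int, k: float = 1.0) -> float:
    """Benefit of integrating `systems` databases in groups of `group_size`.

    Restates the footnote's progression: 1x when disconnected, then 2x, 4x
    and 8x as integration goes pair-wise, in threes, and quad-wise.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return systems * k * 2 ** (group_size - 1)


# Print the multiplier for each level of integration of 10 systems.
for size in (1, 2, 3, 4):
    print(size, integration_benefit(10, size))
```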

R&D$=Code? A simple model of software development productivity

Posted on April 24, 2017 (updated November 15, 2017) by peterlawrence

If you do not plan where you are going you will not get there, but you will probably get what you deserve. This software development model, which accurately predicts resources and schedule given scope, can greatly help in planning when you will get there and what will be needed on the way. This may not be a perfect model, but perfection is the enemy of the good. What software development model experience can you share?

In a previous posting, Innovation != R&D$, I reasoned that the impact of R&D expenditure is not well correlated with gross operating margins because a significant proportion of R&D monies is absorbed by the less revenue-generating activities of maintenance and support rather than innovation, by which I mean in this context the creation of new software applications. However, it is still important to ensure that whatever monies are invested in innovation are wisely invested. Unfortunately our industry is beset with project delivery problems. We have earned the IT Rule-of-5: 5 times over budget, 5 times over schedule, 1/5th of the functionality. Perhaps an exaggeration, but there is a problem that few would deny.

One problem is that of creating unrealistic expectations. The development team’s answer to a feature request of ‘it is just a small matter of programming’ (one unit of SMOP) gets heard as ‘it will be on my desk tomorrow morning, tested and with a quality and quantity of documentation that would shame Charles Dickens’.

The answer is a good-looking model. I love models. Not the type you are thinking of, but simple mathematical formulae that allow me to estimate what will happen. Having been long involved with automation, MES, and software development, I always want to know how long a software development will take. ‘As long as a piece of string’ is not the most useful answer when customer expectations or development budgets need to be met. So over the years I have developed a model that estimates total development effort and tracks development progress over the life-cycle of the project with surprising accuracy. Before you say that most models have so many ‘tuning factors’ that you can of course always make them fit, I want to point out that this model has just one factor. Also I want to point out that most of this model originates with Capers Jones’s seminal work ‘Applied Software Measurement’.

Estimating Model

The objective of the estimating model is to use a measure of the size of the development and come up with estimates for project duration, the number of project resources, total development effort and average productivity. From these estimates the project cost can be derived.

Tuning factor (J)

This is the only factor you need for this model. Fortunately Capers Jones also provides a range of suggested values, as shown below. I would suggest 0.4 as a starting point.

Kind of software	Best in class	Average	Worst in class
Systems	0.43	0.45	0.48
Business	0.41	0.43	0.46
Shrink-wrap	0.39	0.42	0.45

Estimated Scope (LOC)

Estimated size for the project (lines-of-code). OK, it can be difficult to come up with a really accurate estimate, but there is much written on the subject. Quick ways are to simply say that this application is approximately the same size as a similar one done in the past. For greater accuracy, Function Point or Story Point estimating can be used and then converted to lines-of-code.

Estimated Function Points (FP)

= Lines-of-Code (LOC) / LOC-per-FP, where LOC-per-FP is taken as 54 for languages such as C#

Estimated Project Duration (D)

= FP^J (months)

The ideal project duration given the ideal number of resources for the project.

Development Resources (R)

= FP^(2*J)/27 (persons)

The ideal number of full-time project-persons to complete the project.

Total development effort (Effort)

= D * R (person-months)

Average Productivity (P)

= LOC/Effort (lines-of-code per person-month)

Below is an example of applying this model to an example project:

Development Estimate Model
Factor	Units	Formula	Example Project
Tuning Factor	dim	J	0.4
Lines of Code	LOC	LOC	75000
LOC/FP	LOC/FP	LOC-per-FP	54
Function Points	FP	FP	1389
Estimated project duration	months	FP^J	18
Development resources	persons	FP^(2J)/27	12
Total development effort	person-months	FP^(3J)/27	219
Average productivity	LOC/person-month	LOC-per-FP*27*FP^(1-3J)	343
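The estimating formulas above can be sketched in Python as a cross-check of the example column. The class and function names are hypothetical, and the exponent is read as FP^(2*J), which is the reading that gives the table's 12 persons.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    function_points: float
    duration_months: float
    resources_persons: float
    effort_person_months: float
    productivity_loc_per_pm: float


def estimate(loc: float, j: float = 0.4, loc_per_fp: float = 54.0) -> Estimate:
    """Apply the estimating model to a project of `loc` lines of code."""
    fp = loc / loc_per_fp             # FP = LOC / LOC-per-FP
    duration = fp ** j                # D = FP^J (months)
    resources = fp ** (2 * j) / 27    # R = FP^(2*J)/27 (persons)
    effort = duration * resources     # Effort = D * R (person-months)
    productivity = loc / effort       # P = LOC / Effort
    return Estimate(fp, duration, resources, effort, productivity)


# The example project: 75,000 lines of code with J = 0.4.
example = estimate(75_000)
print(example)
```

Rounded, the result agrees with the example column: 1389 function points, 18 months, 12 persons, 219 person-months and 343 lines-of-code per person-month.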

The graphs below show the project effort and productivity plotted for various sizes of project. As would be expected, we can see the productivity falling off as project size increases (see The Mythical Man-Month).

Tracking Model

The Tracking Model recognizes that the assumptions in the original estimate might not apply in practice. For example the number of resources assigned to the project might change, or the scope decreases or, more likely, increases. Thus the tracking model uses known measurements of the project such as code produced to date, resources actually assigned, and current total project size to calculate what the progress to date should be and to predict into the future the revised project completion.

Estimated Scope (LOC)

Estimated size for the project (lines-of-code). This is the estimate at the beginning of the project because scope creep and scope additions will inevitably occur.

Scope Creep (Creep)

Percent change month-on-month of the project scope (%). This requires careful measurement because, as has been shown elsewhere, projects can only tolerate a small amount of scope creep before they become ‘runaways’.

Estimated Scope including creep (ELOC)

= LOC * (1+Creep) + Any additional scope

Estimated Scope (EFP)

= ELOC / LOC-per-FP

Project Duration (D)

Duration of project, measured from the original project start, assuming optimal resource allocation, no scope creep, and sustained productivity

= EFP^J

Required Project Resources (R)

Resources that should be allocated to the project for the duration

= EFP^(2*J)/27

Actual Project Resources (AR)

Actual person-months assigned to the project for the period.

Estimated productivity (P)

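Estimated lines-of-code per assigned person-month. Note that productivity reduces as the assigned resources increase (see The Mythical Man-Month). The formula follows from inverting R = EFP^(2*J)/27 to give EFP = (27*AR)^(1/(2*J)) and substituting into P = LOC/Effort = LOC-per-FP*27*EFP^(1-3*J).

= LOC-per-FP*27*(27*AR)^((1-3*J)/(2*J))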

Estimated Production (PR)

Estimated lines-of-code produced in period based on assigned resources and estimated productivity.

= P * AR

Accumulated lines-of-code (ALOC)

Accumulated lines of code to-date based on assigned resources and estimated productivity.

Remaining Scope

Estimated scope less estimated accumulated code

= ELOC – ALOC
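The tracking quantities defined above can be sketched as a month-by-month loop. This is a minimal sketch under several assumptions that are not in the model description: the team size is held constant, scope creep compounds monthly, and productivity is scaled by LOC-per-FP so that it reduces to the estimating model's average productivity when the assigned resources equal the ideal R. The function names are hypothetical.

```python
from typing import Optional


def productivity(assigned: float, j: float = 0.4, loc_per_fp: float = 54.0) -> float:
    """Estimated lines-of-code per person-month for `assigned` people (AR)."""
    return loc_per_fp * 27 * (27 * assigned) ** ((1 - 3 * j) / (2 * j))


def months_to_complete(loc: float, assigned: float, creep: float = 0.0,
                       j: float = 0.4, loc_per_fp: float = 54.0,
                       max_months: int = 120) -> Optional[int]:
    """First month in which accumulated code covers the scope.

    Returns None if the project has not completed within `max_months`,
    which is how a runaway shows up in this sketch.
    """
    eloc = float(loc)   # estimated scope including creep (ELOC)
    aloc = 0.0          # accumulated lines-of-code (ALOC)
    for month in range(1, max_months + 1):
        eloc *= 1 + creep                                          # assumption: creep compounds monthly
        aloc += productivity(assigned, j, loc_per_fp) * assigned   # PR = P * AR
        if eloc - aloc <= 0:                                       # remaining scope exhausted
            return month
    return None


# Compare no creep, mild creep and heavy creep for a steady team of 12.
for rate in (0.0, 0.01, 0.06):
    print(rate, months_to_complete(75_000, assigned=12, creep=rate))
```

Because the loop only adds whole months, its completion month is the next whole month after the point at which the remaining scope reaches zero.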

Accumulated Cost (AC)

Cost of actual resources assigned based on annual rate.

Cost per Line-of-code

A metric indicating how much each line-of-code is costing

= Accumulated Cost / Accumulated lines-of-code

Agile Productivity Ratio

The promise of Agile/SCRUM is that productivity will increase as large teams are split into Scrum teams. As indicated before, productivity reduces as team size increases. This is the ratio of the productivity of a single team versus multiple Scrum teams with the same total resources.
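One possible reading of this ratio can be sketched in Python. It assumes that each Scrum team's productivity per person depends on its own headcount through the same (27*AR)^((1-3*J)/(2*J)) term used for estimated productivity, that teams are equal in size, and that there is no coordination overhead between teams. The function name and the direction of the ratio (split teams relative to one large team) are choices made for this illustration.

```python
def agile_productivity_ratio(total_people: float, teams: int, j: float = 0.4) -> float:
    """Per-person productivity of `teams` equal Scrum teams relative to one team.

    A value above 1 means the split teams are estimated to be more productive
    per person than a single team with the same total headcount.
    """
    if teams < 1:
        raise ValueError("teams must be at least 1")
    exponent = (1 - 3 * j) / (2 * j)
    single_team = (27 * total_people) ** exponent
    split_team = (27 * total_people / teams) ** exponent
    return split_team / single_team


# Twelve people as one team versus two teams of six.
print(agile_productivity_ratio(12, 2))
```

Under these assumptions the ratio simplifies to teams^((3*J-1)/(2*J)), so it does not depend on the total headcount.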

Allowable scope change

Scope creep is a project killer. If each month the project scope is allowed to increase, then the project size will increase. If the project size increases, then duration and number of project resources follow. As project resources increase, productivity will reduce, further increasing project duration. As project duration increases, more scope changes can accumulate, further increasing project duration, and so on. The tipping point when the project becomes a runaway has been determined and is calculated as follows:

= 1/EFP^J
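As a worked example, the tipping point can be evaluated for the example project from the estimating model (75,000 lines of code, J = 0.4, 54 lines-of-code per function point). The function name is hypothetical.

```python
def allowable_scope_change(eloc: float, j: float = 0.4, loc_per_fp: float = 54.0) -> float:
    """Monthly scope change, as a fraction, at which the project tips into a runaway."""
    efp = eloc / loc_per_fp
    return 1 / efp ** j   # 1 / EFP^J, the reciprocal of the project duration in months


# Tipping point for the example project, printed as a percentage per month.
print(f"{allowable_scope_change(75_000):.1%}")
```

For the example project this is roughly 5.5% per month. One way to read it, offered here as an interpretation rather than the original derivation: an ideally staffed team delivers about 1/D of the scope each month, so creep at or above that rate adds scope as fast as it is delivered.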

The graphs below show the results from a typical project. We see the project scope increasing (scope-creep), resulting in more resources being pulled in to tackle the retreating deadline. Note that without scope creep the project would have completed in month 23 even with the same assigned resource profile. This signals the hidden dangers of even mild scope-creep.

The above graph is based on the actual resources assigned. The graph below compares the estimated with the actual production of code over the same period, confirming the accuracy of the model.

Observations

Although there is only one factor in this model, estimates are quite sensitive to the value chosen. Therefore it is best to track the estimates with actual measurements to ensure the accuracy of the model and hence the tuning factor.
Include test code or not? If the development teams are using test-driven development (TDD) or any form of Agile/SCRUM then I think it is important to include test code along with production code in the estimates, at the same time including the test resources with the overall project resources. Generally I expect test code to be approximately 50% of the production code, measured in lines-of-code.
Counting lines-of-code should use the same tool, such as Visual Studio, for consistency. There is much debate about what should be included and excluded. However I think it is more important to simply be consistent because you will end up deriving your own tuning factor based on your assumptions.
There is a difference between development productivity and meeting customer expectations. Just because the lines-of-code have been efficiently produced, even if error free and with great documentation, it does not mean that customer expectations have been met, because they might have wanted an entirely different solution.
Scaling lines-of-code to function points. Despite the superiority of Function Points (or Story Points) over lines-of-code as a measure of software size, lines-of-code seems to be more tangible to management. Thus although the model is expressed in function points, I expect most will estimate lines-of-code and convert (‘back-cast’, to quote Capers Jones) from lines-of-code to function points using a factor based on the type of programming language being used. For example there are approximately 54 C# lines-of-code per function point.

Associated Spreadsheet

To those who have got this far, I am sharing a spreadsheet version of this model that you are free to download and use for your own estimating. I hope it works as well for you as it has for me.

References

Applied Software Measurement: Global Analysis of Productivity and Quality, Third Edition; Capers Jones; 2008
The Mythical Man-Month: Essays on Software Engineering; Frederick Brooks; 1995

Innovation != R&D$?

Posted on April 24, 2017 (updated November 15, 2017) by peterlawrence

It is known that software R&D expenditure positively impacts the gross operating margins or the market-to-book values of a company. However, the correlation is not strong. Is it because maintenance, sustainment and innovation are lumped together as R&D, yet the return is not equal throughout the software product life-cycle? This article shows that there may be far less of the high-returning innovation development than management believes. In fact innovation is being starved out by the need to maintain and sustain the existing software product portfolio.

R&D = Innovation + Maintenance + Sustainment

It is generally agreed that R&D expenditure positively impacts gross operating margins or the market-to-book values of a company[1], but how strong a correlation is hotly debated[2]. What is more difficult is the measure of R&D expenditure productivity, or return on investment. For example, if the majority of R&D expenditure is going into existing product maintenance then it is unlikely to offer the same ROI as investing into new innovations.

In its study, Booz Allen Hamilton presents results exploring the smart spenders of R&D. It questions the degree of correlation between R&D and company financial performance: “There are no significant statistical relationships between R&D spending and the primary measures of financial or corporate success: sales and earnings growth, gross and operating profitability, market capitalization growth, and total shareholder returns. Gross profits as a percentage of sales is the single performance variable with a statistical relationship to R&D spending.”

However Booz Allen Hamilton assumes that R&D is limited to ideation, project selection, product development and commercialization, as illustrated below.

In practice we know that the R&D expense continues throughout the innovated product life-cycle to include sustainment and maintenance, where

Sustainment is the addition of new features to an existing product to maintain or gain market share
Maintenance is to ensure quality and hence customer loyalty.

As important as sustainment and maintenance are, R&D investment into these later phases of a product’s lifecycle is never likely to offer the same returns as investment in innovation, the more disruptive the better[3].

Maintenance and Sustainment are inevitable consequences of Innovation

To truly evaluate the return-on-investment of R&D expenditure we need to distinguish between the R&D investments at different phases of the product life cycle as they surely offer different ROI.

So what is the typical distribution of expenditure for software products? A simple model, verified by actual observations, reveals some surprises.

Sustainment (adding new features) cost is 8-15% per annum of the original development cost and accumulated sustainment investment to date in any supported code. This 8-15% creates additional code that needs to be sustained and maintained in the future. For example if the original development was $100,000, budget $8,000-$15,000 per annum to add new features requested by customers in order to sustain a competitive product. Note that this would lead to a doubling of the code base over the typical 7-year life-cycle of a product.
Maintenance (providing bug fixes) cost is 8-15% per annum of development investment to date in any maintained product code. For example if the original development was $100,000, budget $8,000-$15,000 per annum for maintenance in the first year but expect that to grow as sustainment investment increases the code base to be maintained.

So let us apply this model to a start-up company that has decided to invest $150,000 per annum to create its new product. For the first few years this works well, as the sustainment and maintenance costs are relatively minor. However, after a few years the sustainment and maintenance costs start to starve out continued innovation, until by Year 7 there is scarcely any innovative development at all. Does this look familiar to you?
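The sustainment and maintenance rules above, applied to a fixed annual budget, can be sketched as a small simulation. Several things here are assumptions made for illustration: a single 10% rate for both sustainment and maintenance (picked from inside the 8-15% range), the code base being counted at the start of each year, and innovation receiving whatever budget is left over. The numbers it prints are therefore illustrative and are not a reproduction of the original chart.

```python
from typing import List


def innovation_share_by_year(budget: float = 150_000.0, rate: float = 0.10,
                             years: int = 7) -> List[float]:
    """Fraction of a fixed annual R&D budget left for innovation in each year."""
    code_base = 0.0   # innovation plus sustainment investment to date
    shares = []
    for _ in range(years):
        sustainment = rate * code_base   # new features on supported code
        maintenance = rate * code_base   # bug fixes on maintained code
        upkeep = sustainment + maintenance
        if upkeep > budget:              # upkeep cannot exceed the budget
            scale = budget / upkeep
            sustainment *= scale
            maintenance *= scale
        innovation = max(0.0, budget - sustainment - maintenance)
        shares.append(innovation / budget)
        # Innovation and sustainment both add code that needs future upkeep.
        code_base += innovation + sustainment
    return shares


# Print the innovation share of the budget for Years 1 to 7.
for year, share in enumerate(innovation_share_by_year(), start=1):
    print(f"Year {year}: {share:.0%} of budget left for innovation")
```

Trying rates at either end of the 8-15% range shows how sensitive the starvation point is to the assumed rate.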

No wonder the ROI of R&D does not correlate well; it depends where a company is within this cycle. Code created for the initial release (innovation code) allows one to capture new markets or market share. That is surely more valuable than code added to supply additional features (sustainment code), which keeps customers satisfied and the product competitive. Sustainment code is in turn more valuable than code added to fix problems (maintenance code), which ensures quality and hence customer loyalty. Unfortunately the high-performing innovation investment reduces to less than 12% of the R&D total. Over the life-cycle of a product, accumulated sustainment and maintenance can be 180-600% of the original innovation investment.

One could argue that a company which had a successful innovation would be growing, so its R&D budget would be growing proportionately. However, even if we modify our investment strategy and decide to maintain the innovation investment at a constant level, innovation as a percent of R&D would be reduced to 35% by Year 7 as shown below:

How to solve the Innovation Dilemma

Starvation of innovation is caused by the need to sustain and maintain existing code. Therefore, aside from making the innovation investment more productive, the following strategies are suggested.

Deprecate old products as soon as possible

Old products do not sit on the shelf for free. They are like volcanoes that have not erupted for some time. Are they extinct? It could be less profitable for that old product to be sold, since that will then perpetuate the sustainment and maintenance. However, many careers may be wedded to these old products, so they become very difficult to kill.

Minimize the code investment in the original product (Lean!)

If you own a larger house, then sustainment and maintenance inevitably cost more. Code is no different. Thus any opportunity to downsize that code while still meeting the functional requirements will reduce the long-term sustainment and maintenance costs, releasing more R&D spend for future innovations.

Code that is created quickly and efficiently allows the product to be released to the market earlier, ensuring an increased internal rate of return or net present value of the investment.
Code that adds functionality to solve customer problems is likely to be more valuable than core component code that could be purchased from OEMs.

Capitalize code investment

This will probably make accountants pay attention, but expensing code as it is created disguises the fact that it really behaves like a capital investment; it will need sustainment and maintenance investment over the years to retain its value.

[1] Hall, Bronwyn, Jacques Mairesse & Pierre Mohnen, 2010, Measuring the Returns to R&D, in: Hall, B. and Rosenberg, N, Handbook of the Economics of Innovation, Elsevier, Amsterdam, pp. 1034-1076

[2] Booz-Allen-Hamilton. (2006), Smart Spenders: The Global Innovation 1000

[3] Christensen, Clayton M., 1997, The Innovator’s Dilemma, Harvard Business School Press, Cambridge, Massachusetts