The Setup
Microsoft’s Visual Studio is a widely used software development environment. Its Code Analysis feature calculates a set of numeric measures (exportable to Excel) that describe various aspects of the program code making up a given software application.
The four variables in this Code Analysis output that are included in the data set for this project are:
- LOC — Lines of Code — Simple count of non-blank lines in each code function
- CY — Cyclomatic Complexity — Measures the structural complexity of each code function. It is calculated from the number of different code paths in the flow of the function. A function with complex control flow requires more tests to cover and is therefore less maintainable.
- The formula is given as:
M = E - N + 2P, where:
- E = Count of graph edges
- N = Count of graph nodes
- P = Count of connected components in the control-flow graph (1 for a single function)
- CC — Class Coupling — Count of the code files to which a given code file is “coupled” or linked through parameters, local variables, return types, function calls, inherited code files, interface implementations, and other attributes.
- A high coupling value indicates a design that is difficult to reuse and maintain because of its many dependencies on other code files.
- MI — Maintainability Index — Calculates an index value between 0 and 100 that represents the relative ease of maintaining a given code function.
- A high value indicates better maintainability
- Because MI is computed from both CY and LOC, its relationship to them holds by construction, so MI is excluded from this analysis
- The formula is given as:
MI = MAX(0, (171 - 5.2 * ln(Halstead Volume) - 0.23 * CY - 16.2 * ln(LOC)) * 100 / 171)
- Note: Unfortunately, the data for Halstead Volume is not available.
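As a rough illustration of the two formulas above, here is a minimal Python sketch. The function names are hypothetical (they are not part of Visual Studio), and because Halstead Volume is not in the data set, the value of 1000 used in the example is made up purely to show the arithmetic.

```python
import math


def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """M = E - N + 2P for a control-flow graph."""
    return edges - nodes + 2 * components


def maintainability_index(halstead_volume: float, cy: int, loc: int) -> float:
    """Index rescaled to 0..100, following the formula quoted above."""
    raw = (171
           - 5.2 * math.log(halstead_volume)
           - 0.23 * cy
           - 16.2 * math.log(loc))
    return max(0.0, raw * 100 / 171)


# A function containing a single if/else has 4 nodes
# (condition, then-branch, else-branch, exit) and 4 edges.
print(cyclomatic_complexity(edges=4, nodes=4))  # 2

# Hypothetical Halstead Volume of 1000, with the median CY (5) and LOC (13).
print(round(maintainability_index(1000.0, cy=5, loc=13), 1))
```

Note how both CY and LOC enter the MI formula with negative signs, so MI can only fall as either one grows.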
Variables & Descriptive Statistics
It is “conventional wisdom” among programmers that the more lines a code function has, the greater its inherent complexity (i.e., its CY).
This view is backed up by the suggested programming practice of limiting a function’s LOC to a particular maximum length, as stated in the popular book Code Complete: https://books.google.com/CodeComplete.
The crux of this study is to explore and test this intuitive relationship.
Selected variables for analysis:
Independent | LOC |
Dependent | CY |
Note: Complete data is included for MI and CC as well, for further analysis.
Descriptive statistics for LOC and CY for our example solution:
Statistic | LOC | CY |
Mean | 33.87346135 | 9.498276711 |
Median | 13 | 5 |
Mode | 6 | 3 |
Minimum | 6 | 1 |
Maximum | 1452 | 152 |
Range | 1446 | 151 |
Variance | 4367.7756 | 193.4767 |
Standard Deviation | 66.0891 | 13.9096 |
Coeff. of Variation | 195.11% | 146.44% |
Skewness | 9.3752 | 3.7931 |
Kurtosis | 150.1559 | 20.2346 |
Count | 2031 | 2031 |
Standard Error | 1.4665 | 0.3086 |
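Several rows in this table are derived from the others: the range is maximum minus minimum, the standard deviation is the square root of the variance, the coefficient of variation is the standard deviation divided by the mean, and the standard error is the standard deviation divided by the square root of the count. The short Python sketch below recomputes those rows from the reported values as a sanity check. The dictionary layout and the `derived` helper are just illustrative names.

```python
import math

# Values copied from the descriptive statistics table above.
stats = {
    "LOC": {"mean": 33.87346135, "variance": 4367.7756,
            "min": 6, "max": 1452, "n": 2031},
    "CY": {"mean": 9.498276711, "variance": 193.4767,
           "min": 1, "max": 152, "n": 2031},
}


def derived(s):
    """Recompute the rows that follow from mean, variance, min, max and n."""
    sd = math.sqrt(s["variance"])
    return {
        "range": s["max"] - s["min"],
        "std_dev": sd,
        "coeff_of_variation_pct": 100 * sd / s["mean"],
        "std_error": sd / math.sqrt(s["n"]),
    }


for name, s in stats.items():
    print(name, {k: round(v, 4) for k, v in derived(s).items()})
```

The recomputed figures should agree with the table to within rounding.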
Based on these statistics, together with the correlation and covariance between LOC and CY, some general conclusions can be made:
- The coefficient of correlation is positive at 0.587919083, which indicates a moderate positive linear relationship between LOC and CY
- The covariance is 540.4581562
- There is a high amount of variation in both variables (both coefficients of variation exceed 100%)
- LOC’s coefficient of variation is higher than CY’s
- Both distributions are extremely right-skewed, with means well above their medians
- Standard errors for both variables are low relative to their means
- Sample size is large (2031 observations)
- Ranges for both variables are large
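The correlation coefficient quoted above is the covariance divided by the product of the two standard deviations. The Python sketch below shows that calculation two ways: a small `pearson_r` helper (a hypothetical name, written out by hand for clarity) that works on raw paired values, and a cross-check that plugs in the covariance and standard deviations reported in this post.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation: cov(x, y) / (sd_x * sd_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


# Cross-check using the figures reported in this post.
reported_cov = 540.4581562
sd_loc = 66.0891
sd_cy = 13.9096

r = reported_cov / (sd_loc * sd_cy)
print(round(r, 4))      # 0.5879
print(round(r * r, 3))  # r squared
```

Squaring the coefficient gives roughly 0.35, which suggests that a straight-line fit on LOC would account for about a third of the variation in CY. That is a noticeable relationship, but it leaves most of the variation unexplained.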