
The Setup

Microsoft’s Visual Studio is a widely used software development tool. It includes a feature known as Code Analysis, which calculates a set of numeric measures (exportable to Excel) describing various aspects of the program code that makes up a given software application.

The four variables in this Code Analysis output that are included in the data set for this project are:

LOC — Lines of Code — Simple count of non-blank lines in each code function
CY — Cyclomatic Complexity — Measures the structural complexity of each code function. Created by calculating the number of different code paths in the flow of the function. A program that has complex control flow will require more tests and thereby be less maintainable.
- The formula is given as:
  ```
  M = E - N + 2P
  ```
  where:
  - E = Count of graph edges
  - N = Count of graph nodes
  - P = Count of connected components in the control-flow graph (1 for a single function)
CC — Class Coupling — Count of the number of code files to which a given code file is “coupled” or linked, either through parameters, local variables, return types, related function calls, inherited code files, interface implementations, and other attributes.
- A high coupling value indicates a design that is difficult to reuse and maintain because of its many dependencies on other code files.
MI — Maintainability Index — Calculates an index value between 0 and 100 that represents the relative ease of maintaining a given code function.
- A high value indicates better maintainability
- Because MI is computed from both CY and LOC, its relationship to those two variables holds by construction, so MI is excluded from the analysis
- The formula is given as:
  ```
  MAX(0, (171 - 5.2 * ln(Halstead Volume) - 0.23 * CY - 16.2 * ln(LOC)) * 100 / 171)
  ```
- Note: Unfortunately, the data for Halstead Volume is not available.
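
As a rough illustration of the two formulas above, here is a minimal Python sketch. The function names and the sample inputs are hypothetical and do not come from Visual Studio or from the data set. In particular, the Halstead Volume of 1000 is made up purely so the index can be evaluated.

```python
import math


def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """McCabe's M = E - N + 2P for a control-flow graph."""
    return edges - nodes + 2 * components


def maintainability_index(halstead_volume: float, cy: int, loc: int) -> float:
    """Rescaled 0-100 index from the formula above, clamped below at 0."""
    raw = 171 - 5.2 * math.log(halstead_volume) - 0.23 * cy - 16.2 * math.log(loc)
    return max(0.0, raw * 100 / 171)


# A function containing one if/else has 4 nodes (condition, then-branch,
# else-branch, exit) and 4 edges, so M = 4 - 4 + 2 = 2.
print(cyclomatic_complexity(edges=4, nodes=4))  # 2

# Hypothetical inputs: the Halstead Volume here is an assumed value.
print(maintainability_index(halstead_volume=1000, cy=5, loc=13))
```

Because CY and LOC both enter the index with negative coefficients, raising either one while holding the other inputs fixed can only lower MI.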

Variables & Descriptive Statistics

It seems to be “conventional wisdom” among programmers that the more lines a code function has, the greater its inherent complexity (i.e., its CY).

This statement is backed up by the suggested programming practice of limiting a function’s LOC to a particular maximum length, as stated in the popular book Code Complete: https://books.google.com/CodeComplete.

The crux of this study is to explore and test this intuitive relationship.

Selected variables for analysis:

Independent	LOC
Dependent	CY

Note: Complete data is included for MI and CC as well, for further analysis.

Descriptive statistics for LOC and CY for our example solution:

Statistic	LOC	CY
Mean	33.87346135	9.498276711
Median	13	5
Mode	6	3
Minimum	6	1
Maximum	1452	152
Range	1446	151
Variance	4367.7756	193.4767
Standard Deviation	66.0891	13.9096
Coeff. of Variation	195.11%	146.44%
Skewness	9.3752	3.7931
Kurtosis	150.1559	20.2346
Count	2031	2031
Standard Error	1.4665	0.3086
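
For readers who want to reproduce figures like these outside Excel, the following is a minimal Python sketch using only the standard library. The `describe` helper is hypothetical, and it assumes the sample (n - 1) formulas for variance and standard deviation, which the post does not state explicitly. Skewness and kurtosis are left out because their exact definitions vary between tools.

```python
import math
import statistics


def describe(values):
    """Summary statistics for a list of numbers (sample formulas, n - 1)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {
        "mean": mean,
        "median": statistics.median(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),
        "std_dev": sd,
        "coeff_of_variation": sd / mean,   # standard deviation relative to the mean
        "std_error": sd / math.sqrt(n),    # standard error of the mean
        "count": n,
    }


# Cross-check two derived LOC figures from the table's own mean, SD, and count.
loc_mean, loc_sd, n = 33.87346135, 66.0891, 2031
print(round(100 * loc_sd / loc_mean, 2))  # 195.11
print(round(loc_sd / math.sqrt(n), 4))    # 1.4665
```

The cross-check shows that the coefficient of variation and the standard error follow directly from the mean, standard deviation, and count reported in the table.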

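Based on the descriptive statistics, together with the correlation and covariance between LOC and CY, some general conclusions can be drawn:

The coefficient of correlation is positive at 0.587919083, which indicates a moderate positive linear relationship between LOC and CY
The covariance is 540.4581562
Both variables show a high amount of variation relative to their means
LOC’s coefficient of variation (195.11%) is higher than CY’s (146.44%)
Both distributions are extremely right-skewed (skewness of 9.3752 and 3.7931)
Standard errors for both means are low relative to the means themselves (1.4665 and 0.3086)
The sample size is large (2031 observations)
Ranges for both variables are large (1446 and 151)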
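
The correlation coefficient is the covariance divided by the product of the two standard deviations. The sketch below is a hypothetical helper, not the method used to produce the figures in this post. It computes the coefficient from raw data, then checks that the reported covariance and standard deviations are consistent with the reported correlation. It assumes the sample (n - 1) formulas. With 2031 observations, the difference from the population formulas is negligible.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation: cov(x, y) / (sd_x * sd_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


# Consistency check using the reported covariance and standard deviations.
r = 540.4581562 / (66.0891 * 13.9096)
print(round(r, 4))      # 0.5879
print(round(r * r, 4))  # 0.3456
```

The squared coefficient of about 0.35 suggests that a straight-line relationship with LOC accounts for roughly a third of the variance in CY, which leaves much of the variation unexplained.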

Alex T. Silverstein @ Isenberg School of Management

One Student's Experiences at UMass Online

Statistical Review of Code Analysis in Visual Studio – Part 1 of 2