Statistical Review of Code Analysis in Visual Studio – Part 1 of 2


The Setup

Microsoft’s Visual Studio is a widely used software development environment. Its Code Analysis feature calculates a set of numeric measures (exportable to Excel) that describe various aspects of the program code making up a given software application.

The four variables in this Code Analysis output that are included in the data set for this project are:

  • LOC — Lines of Code — Simple count of non-blank lines in each code function
  • CY — Cyclomatic Complexity — Measures the structural complexity of each code function. Calculated by counting the number of different code paths in the flow of the function. A function with complex control flow requires more tests and tends to be less maintainable.
    • The formula is given as:
      • M = E - N + 2P, where:
        • E = Count of edges in the control-flow graph
        • N = Count of nodes in the control-flow graph
        • P = Count of connected components in the graph (1 for a single function)
  • CC — Class Coupling — Number of code files to which a given code file is “coupled” or linked through parameters, local variables, return types, related function calls, inherited code files, interface implementations, and other attributes.
    • A high coupling value indicates a design that is difficult to reuse and maintain because of its many dependencies on other code files.
  • MI — Maintainability Index — Calculates an index value between 0 and 100 that represents the relative ease of maintaining a given code function.
    • A high value indicates better maintainability
    • Because the MI formula includes both CY and LOC, its relationship to those two variables holds by construction, so MI is excluded from this analysis
    • The formula is given as:
      • MI = MAX(0, (171 - 5.2 * ln(Halstead Volume) - 0.23 * CY - 16.2 * ln(LOC)) * 100 / 171)
    • Note: Unfortunately, the data for Halstead Volume is not available.
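The two formulas above can be sketched in Python as follows. This is only an illustration of the arithmetic, not how Visual Studio computes the metrics internally. The function names and the Halstead Volume of 1000 are made up for the example, since Halstead Volume is not part of the data set.

```python
import math


def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
    """M = E - N + 2P for a control-flow graph (hypothetical helper)."""
    return edges - nodes + 2 * components


def maintainability_index(halstead_volume: float, cy: int, loc: int) -> float:
    """Index rescaled to 0..100, following the formula quoted above (hypothetical helper)."""
    raw = 171 - 5.2 * math.log(halstead_volume) - 0.23 * cy - 16.2 * math.log(loc)
    return max(0.0, raw * 100 / 171)


# A function containing a single if/else has a control-flow graph with
# 4 nodes (condition, two branches, exit) and 4 edges, in one component.
print(cyclomatic_complexity(edges=4, nodes=4))  # 2

# CY = 5 and LOC = 13 are the medians reported later in this article;
# the Halstead Volume of 1000 is an assumed value for illustration only.
print(round(maintainability_index(1000.0, cy=5, loc=13), 1))  # 54.0
```

Because LOC enters through a logarithm with the largest coefficient, long functions pull the index down quickly, while CY contributes linearly.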

Variables & Descriptive Statistics

It seems to be “conventional wisdom” among programmers that the more lines a code function has, the greater its inherent complexity (i.e., its CY).

This statement is backed up by the suggested programming practice of limiting a function’s LOC to a particular maximum length, as stated in the popular book Code Complete: https://books.google.com/CodeComplete.

The crux of this study is to explore and test this intuitive relationship.

Selected variables for analysis:

Independent variable: LOC
Dependent variable: CY

Note: Complete data is included for MI and CC as well, for further analysis.

Descriptive statistics for LOC and CY for our example solution:

Statistic             LOC            CY
Mean                  33.87346135    9.498276711
Median                13             5
Mode                  6              3
Minimum               6              1
Maximum               1452           152
Range                 1446           151
Variance              4367.7756      193.4767
Standard Deviation    66.0891        13.9096
Coeff. of Variation   195.11%        146.44%
Skewness              9.3752         3.7931
Kurtosis              150.1559       20.2346
Count                 2031           2031
Standard Error        1.4665         0.3086
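For readers who want to reproduce this kind of summary outside of Excel, here is a minimal sketch using only the Python standard library. The `describe` function and the seven-value `loc_sample` list are hypothetical and are not taken from the project data. The sketch assumes the sample (n - 1) forms of variance and standard deviation, and it uses a simple moment-based skewness, which may differ slightly from the estimator behind the figures above. Kurtosis is omitted for the same reason.

```python
import math
import statistics as st


def describe(values):
    """Return a dictionary of basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = st.mean(values)
    sd = st.stdev(values)  # sample standard deviation (divides by n - 1)
    # Moment-based skewness: third central moment over second moment ** 1.5.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return {
        "mean": mean,
        "median": st.median(values),
        "mode": st.mode(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "variance": st.variance(values),
        "std_dev": sd,
        "coeff_of_variation": sd / mean,
        "skewness": m3 / m2 ** 1.5,
        "count": n,
        "std_error": sd / math.sqrt(n),
    }


# Made-up LOC values with one very long function, to mimic a right-skewed sample.
loc_sample = [6, 6, 8, 13, 20, 45, 150]
stats = describe(loc_sample)
for name, value in stats.items():
    print(f"{name:20s} {value}")
```

Even in this tiny example the mean sits well above the median and the skewness is positive, which is the same pattern the table shows for both LOC and CY.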

Based on the above, together with the correlation and covariance between LOC and CY, some general conclusions can be made:

  • The coefficient of correlation between LOC and CY is positive at 0.587919083, which suggests a moderate positive linear relationship
  • The covariance between LOC and CY is 540.4581562
  • There is a high amount of variation in both variables (both coefficients of variation exceed 100%)
  • LOC’s coefficient of variation is higher than CY’s (195.11% versus 146.44%)
  • Both data sets are extremely right-skewed (skewness of 9.3752 and 3.7931)
  • Standard errors for both data sets are low relative to their means
  • Sample size is large (2031 samples)
  • Ranges for both data sets are large
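As a quick consistency check, the coefficient of correlation can be recovered from the covariance and the two standard deviations, since r = cov(LOC, CY) / (sd(LOC) * sd(CY)). The sketch below plugs in the figures reported in this article. The variable and function names are made up for the example.

```python
def correlation_from_covariance(cov: float, sd_x: float, sd_y: float) -> float:
    """Pearson r computed from a covariance and two standard deviations."""
    return cov / (sd_x * sd_y)


# Figures reported in this article.
cov_loc_cy = 540.4581562
sd_loc = 66.0891
sd_cy = 13.9096

r = correlation_from_covariance(cov_loc_cy, sd_loc, sd_cy)
r_squared = r ** 2

print(round(r, 4))          # 0.5879
print(round(r_squared, 3))  # 0.346
```

The squared value gives a rough sense of scale: in a simple linear fit, LOC would account for roughly a third of the variation in CY, leaving most of it unexplained.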