Finally, a class with a familiar subject: analyzing data. The methods were kept fairly high level, as this was a class for business students, not engineering students. I certainly appreciated that, as I only passed engineering statistics with the help of my now-late ex-husband. The tools were kept to the ubiquitous, accessible Excel, with the addition of the Data Analysis and Solver add-ins. Perhaps a more traditional semester course — not these 7-week condensed versions — would touch more on Tableau and/or PowerBI.
As anyone working with Big Data can say, business people love data. The more, the better. The trick is to sort the signal from the noise and to display it in an appealing visual manner.
We can analyze data for different reasons: summarizing columns of numbers into pretty charts, detecting patterns and relationships to predict the future, or optimizing far too many input options into a suitable solution. This usually leads to a straightforward data model, wherein we have the data, a set of semi-random inputs, and a set of decision point variables. The closer the model is to reality, the better the results of the model.
Did you know that risk is associated with the consequences and likelihood of what might happen?
Shocking news, I know.
Worth quoting is Peter Drucker, 1974:
“To try to eliminate risk in business enterprise is futile. Risk is inherent in the commitment of present resources to future expectations. Indeed, economic progress can be defined as the ability to take greater risks. The attempt to eliminate risks, even the attempt to minimize them, can only make them irrational and unbearable. It can only result in the greatest risk of all: rigidity.”
No wonder business leaders pat us security folks on the head and send us on our way when we go in and yell about this or that risk. They accept risk every day. What’s one more?
To them, the act of not patching a high-risk vulnerability or not enabling MFA runs the same level of risk as committing resources to a new product line or betting that consumers want more AI in more products. This is also why, sometimes, security won’t win. Assuming security concerns are added to the same list as business risk, the risks still have to be prioritized. It’s our job to convey the security risk in a way that relates to the rest of the business risks in an apples-to-apples kind of way.
It is well worth the time to become familiar with Excel functions. In no particular order: VLOOKUP, HLOOKUP, INDEX, MATCH, and CHOOSE. Match the function to the question at hand, and the answer will pop out. Nested functions replicate basic programming logic — no macros required!
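The course stuck to Excel, but the same lookup-and-nest logic maps onto pandas if you would rather script it. A minimal sketch, with hypothetical asset and findings tables invented purely for illustration:

```python
import pandas as pd

# Hypothetical asset inventory and vulnerability findings (illustrative data only).
assets = pd.DataFrame({
    "hostname": ["web01", "db01", "hr-laptop"],
    "owner":    ["ops",   "dba",  "hr"],
    "tier":     ["external", "internal", "internal"],
})
findings = pd.DataFrame({
    "hostname": ["web01", "hr-laptop"],
    "cve":      ["CVE-2024-0001", "CVE-2024-0002"],
})

# The moral equivalent of VLOOKUP: pull the owner and tier for each finding.
enriched = findings.merge(assets, on="hostname", how="left")

# Nested-IF logic, Excel-style, as a simple condition per row.
enriched["priority"] = enriched["tier"].map(
    lambda t: "patch now" if t == "external" else "patch this cycle"
)
print(enriched)
```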
We also worked with the dreaded PivotTables. They’re not that bad, I promise.
Armed with tallies and averages over a range of time, one can make a pretty-pretty chart. If you’ve seen an organization’s cyber maturity levels in different security areas on a spiderweb, that’s a radar chart. Excel can also produce sparklines: tiny line (for data over time), column (for categorical comparisons), or win/loss (for data that swings positive and negative) charts that fit in a single cell.
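If you would rather script the spiderweb than click through Excel’s chart options, here is a minimal matplotlib sketch of a radar chart; the domain names and maturity scores below are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up maturity scores (0-5) across security domains, purely illustrative.
domains = ["Identify", "Protect", "Detect", "Respond", "Recover"]
scores = [3.0, 2.5, 3.5, 2.0, 1.5]

# Close the loop so the polygon joins back to its first point.
angles = np.linspace(0, 2 * np.pi, len(domains), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(domains)
ax.set_ylim(0, 5)
ax.set_title("Cyber maturity by domain (illustrative)")
plt.show()
```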
It’s worth checking out the options Excel can provide before requesting licenses for Tableau or PowerBI. Get the most value out of the tools you have before adding more costs.
Much has been said about security measures and metrics. We certainly have the data to analyze. Identifying the bottom quartile of phishing-test performers can help drill into why those people failed those tests. Given a histogram of vulnerability dwell times (how long does it take your organization to patch a critical vulnerability?), accounting for variation, skewness, and kurtosis, we can determine how effective our patch management program is and how at-risk the organization is from any given critical vulnerability. Throw in some confidence levels and the Known Exploited Vulnerabilities (KEV) catalog, and we can say things like: We assess with 99% confidence that our patch management processes offer sufficient protection against newly discovered vulnerabilities. Or: Given our patch management processes and threat intelligence, we are 99% confident that the organization is at a low risk of attack by Volt Typhoon.
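As a sketch of what that looks like outside a spreadsheet, here is a hedged Python example computing the spread, skewness, kurtosis, and a 99% confidence interval for mean dwell time; the dwell-time numbers are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical dwell times (days) for critical vulnerabilities, illustrative only.
dwell_days = np.array([3, 5, 7, 8, 10, 12, 14, 21, 30, 45, 60, 90])

print("mean:    ", dwell_days.mean())
print("std dev: ", dwell_days.std(ddof=1))
print("skewness:", stats.skew(dwell_days))      # long right tail -> positive skew
print("kurtosis:", stats.kurtosis(dwell_days))  # heaviness of the tails

# 99% confidence interval for the mean dwell time (t distribution, small sample).
ci = stats.t.interval(
    0.99,
    df=len(dwell_days) - 1,
    loc=dwell_days.mean(),
    scale=stats.sem(dwell_days),
)
print("99% CI for mean dwell time:", ci)
```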
That’s something the business can get behind.
Hilariously, considering the text rarely shied away from calling statistical functions what they are (Bernoulli distributions, chi-squared goodness of fit, etc.), the name “Bayes” is never mentioned in the section on probability. Oh, the formula was there, along with a discussion of conditional probability … but not the name. Makes me wonder what Bayesian statistics did to the authors.
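For the record, the unnamed formula is Bayes’ theorem: P(A|B) = P(B|A) · P(A) / P(B). Here is a small, illustrative application to alert triage; every number below is made up to show the mechanics, not to describe any real detection stack.

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Illustrative (made-up) numbers: how likely is a host actually compromised,
# given that an alert fired on it?
p_compromised = 0.01              # prior: 1% of hosts are compromised
p_alert_given_compromised = 0.90  # detection (true positive) rate
p_alert_given_clean = 0.05        # false positive rate

# Total probability of seeing an alert at all.
p_alert = (p_alert_given_compromised * p_compromised
           + p_alert_given_clean * (1 - p_compromised))

p_compromised_given_alert = p_alert_given_compromised * p_compromised / p_alert
print(f"P(compromised | alert) = {p_compromised_given_alert:.2%}")  # ~15%
```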
For security types, probability distributions play right into the notion of “likelihood”. Given the industry, the maturity of the organization’s security program, and the whims of threat groups, how likely is it that the organization will be a target of ransomware? It’s fun to say 100% (everyone is a target for ransomware), but remember: the business thinks of risk differently. Perhaps we need some more quantifiable data to back our claims of falling skies.
Naturally, it’s tough to get data on the entire population of the organization’s assets. Interpreting vulnerability scanner dashboards is as much art as it is science. Which machines have checked in recently? How much shadow IT exists? Is that vulnerability a false positive? There are methods of choosing a sample size that will generate good analysis, which leads to good decisions. Sampling errors are bound to happen, the same way shadow IT will never disappear.
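One such method is the standard sample-size formula for estimating a proportion. A minimal sketch, assuming the usual normal approximation; the confidence level and margin of error below are just example values.

```python
import math
from scipy import stats

def sample_size_for_proportion(confidence=0.95, margin_of_error=0.05, p=0.5):
    """Minimum sample size to estimate a proportion within the given margin of
    error, using the normal approximation. p=0.5 is the conservative choice."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)  # two-tailed critical value
    return math.ceil((z ** 2) * p * (1 - p) / margin_of_error ** 2)

# Illustrative: how many hosts to sample to estimate patch compliance to +/- 5%?
print(sample_size_for_proportion(0.95, 0.05))  # 385
```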
Let’s say our working hypothesis is that expanding the external attack surface increases overall organizational risk. Is that actually true for our organization? One way to find out is to act like the scientists we are and test the theory. Will this new project have a statistically significant effect on the organizational risk level?
From our data lakes, we can pick a particular parameter we feel best represents the organization’s risk level. We can set up our experiment and test the theory, accounting for the possibilities of false positives and/or false negatives. Statistical inference allows us to reach a statistically significant conclusion to our question. Repeat for various parameters, and suddenly we know which actions have statistically significant effects on our risk.
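As one hedged example of what such a test might look like: compare a risk metric before and after the attack surface grows, using Welch’s two-sample t-test. The metric and every number below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical weekly counts of externally exploitable findings, before and
# after a new internet-facing service went live (illustrative data only).
before = np.array([4, 5, 3, 6, 4, 5, 4, 3])
after  = np.array([7, 6, 8, 9, 6, 7, 8, 7])

# Null hypothesis: expanding the attack surface did not change the mean count.
# Welch's t-test avoids assuming equal variances between the two periods.
t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)

alpha = 0.05  # acceptable false-positive (Type I error) rate
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("Statistically significant change in the risk metric.")
else:
    print("No statistically significant change detected.")
```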
Hey, if it were easy, we’d have these problems solved by now.
We can also look at historical data for trend lines and regression analysis. The Verizon DBIR does a great job of tracking trend lines across attacks and industries. Now, regression analysis comes with a pretty big warning label: thou shalt not make predictions over ranges not in the underlying data. Again, Verizon does a great job at data collection, but think before you grab any old number to stand in for the industry at large in your hypothesis. Check out their VERIS Community Database to help estimate risks.
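Fitting a trend line is a one-liner once the data is in hand. A minimal sketch with invented yearly incident counts, and the warning label applies: interpolate, don’t extrapolate.

```python
import numpy as np
from scipy import stats

# Hypothetical incident counts per year (illustrative only).
years = np.array([2019, 2020, 2021, 2022, 2023])
incidents = np.array([12, 15, 19, 22, 27])

fit = stats.linregress(years, incidents)
print(f"trend: {fit.slope:.1f} additional incidents per year (r^2 = {fit.rvalue**2:.2f})")

# Interpolating within the observed range is fair game...
print("2021 estimate:", fit.intercept + fit.slope * 2021)
# ...but resist the urge to forecast 2035 from five data points.
```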
Historical data gives rise to predicting future data, or forecasting.
So how do we separate the signal from the noise? Data mining techniques! We can cluster similar data together, classify new data according to historical data, find natural associations, or look for natural cause-and-effect relationships. The distance between two points is measured with Euclidean distance, aka the hypotenuse of a right triangle.
Apologies to Ms. Eby for thinking I’d never need to use geometry in my career.
There are various methods of placing data on a common plane for apples-to-apples comparisons. They include single, average, average group, and complete linkage clustering. There are also Ward’s hierarchical clustering and k-Nearest Neighbor algorithms. Again, match the problem to the algorithm.
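As a hedged sketch of that algorithm-matching exercise, here is Ward’s hierarchical clustering in scipy over some made-up per-host risk features; swapping method="single", "complete", or "average" gives the other linkage flavors, and plain Euclidean distance is the default metric throughout.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical per-host features: [open ports, critical vulns, days since patch]
# (illustrative numbers only).
hosts = np.array([
    [2,  0,  7],
    [3,  1, 10],
    [25, 6, 90],
    [22, 5, 75],
    [5,  1, 14],
])

# Ward's method, using Euclidean distance between feature vectors.
tree = linkage(hosts, method="ward")

# Cut the tree into two clusters: roughly "well-kept" vs "needs attention".
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2 1]
```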
Once we know the parameters of a New Thing, we can make reasonable predictions about how that New Thing will behave. One given example is predicting whether a new applicant will get a bank loan, based on income level, debt level, and whether people with similar profiles historically paid back their loans.
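That loan example maps straight onto k-Nearest Neighbors: score the new applicant by the majority vote of the most similar historical applicants. A minimal hand-rolled sketch with made-up numbers:

```python
import numpy as np

# Historical applicants: [income (k$), debt (k$)] and whether they repaid (1/0).
# Numbers are made up for illustration.
history = np.array([
    [85, 10], [60, 35], [40, 30], [95, 5], [30, 25], [70, 15],
])
repaid = np.array([1, 0, 0, 1, 0, 1])

def knn_predict(new_point, X, y, k=3):
    """Classify by majority vote of the k nearest historical points,
    using plain Euclidean distance."""
    distances = np.linalg.norm(X - new_point, axis=1)
    nearest = np.argsort(distances)[:k]
    return int(round(y[nearest].mean()))

applicant = np.array([75, 12])
print("Likely to repay?", bool(knn_predict(applicant, history, repaid)))
```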
Finally, risk analysis, business edition.
I will summarize Hertz and Thomas’s “Risk Analysis and Its Applications” from 1983:
Execs at a food company are thinking about launching a new cereal. Their primary input variables are advertising costs, the size of the cereal market, the expected market share for the new product, operating costs, and capital investment for the project. Analysts use the techniques above to make a “best guess” for each variable and can predict a 30% return on investment. The catch is that every one of these guesses has to come true. Math says that if every guess has a 60% chance of being right, there’s only an 8% chance all five will be right (0.6^5 ≈ 0.08). Yikes!
Now imagine our variables are: user phish click rates, patching levels, vulnerability dwell time, and group policy coverage. Instead of return on investment, we’re interested in ransomware probability. That 8% looks pretty good, right?
From the Verizon DBIR, we know ransomware likes to take advantage of web-application zero-days. It’s most likely going to be a system-level intrusion of a server. Another scenario is old-fashioned phishing: some user, somewhere in the organization, gets click-happy and ransomware is the result. Of course, a successful attack has to navigate any controls in place: awareness training, employee morale, group policies, network segmentation, vulnerability dwell time, and more.
Enter the Monte Carlo simulation. We have historical data, common probability distributions, and a dose of randomness, for giggles. For dynamic systems, we can add queueing theory for random arrival times and unpredictable service times.
To run our simulation, we need to first make the spreadsheet model and determine a good probability distribution for any uncertain input. Then, identify the output variable and the number of trials we want. Make a table that summarizes everything, run the simulation, and analyze the results.
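The same recipe ports out of the spreadsheet easily enough. Below is a minimal Monte Carlo sketch for the ransomware scenario; every distribution, coefficient, and the toy output model are assumptions made up for illustration, not a validated risk formula.

```python
import numpy as np

rng = np.random.default_rng(42)
trials = 100_000

# Uncertain inputs, each with a guessed distribution (all numbers illustrative):
click_rate  = rng.beta(2, 18, trials)           # phish click rate, mean ~10%
patch_cover = rng.uniform(0.80, 0.98, trials)   # share of hosts patched
dwell_weeks = rng.lognormal(1.0, 0.5, trials)   # weeks a critical vuln lingers

# Toy output model: chance that at least one foothold leads to ransomware in a
# year. This is NOT a validated risk formula, just the shape of a simulation.
p_foothold = click_rate * (1 - patch_cover) + 0.001 * dwell_weeks
p_ransomware = 1 - (1 - np.clip(p_foothold, 0, 1)) ** 52  # 52 weekly "draws"

print(f"median annual ransomware probability: {np.median(p_ransomware):.1%}")
print(f"95th percentile (bad year):           {np.percentile(p_ransomware, 95):.1%}")
```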
It’s almost like we’re analysts and engineers.