Do you know your Problem Management facts from fiction? Now that we’ve busted the top myths on Change Management and Incident Management, we’re ready to set the record straight on Problem Management with our favorite ITIL Mythbuster, Vawns Murphy. I assure you we’re all in good hands here. As a 15+ year veteran and certified ITIL/ITSM expert, Vawns is overflowing with best practices, stories, and insider information that can optimize any IT organization. Jump in to put your Problem Management knowledge to the test while picking up new proven tips to help you in your day-to-day service desk management.
Short answer: In an ideal situation, you would have a tool. But the reality is, not every organization has the budget for a Problem Management tool. However, that doesn’t mean you can’t get started with the basics. Use Word documents for Problem records and a spreadsheet to keep track of them while you build up your process and demonstrate the value.
Oftentimes I’ll meet with an organization that’s looking to get to that next level. They have Incident Management but it’s not under control and the same things keep breaking over and over again. The Service Desk and support teams are getting overloaded and user complaints are on the rise. I ask my client if they have looked into doing Problem Management to do some proper root cause analysis and fix things permanently. And the response I often get is: “Well, we don’t have a tool and we don’t have the money for it. We can’t write a business case because we can’t show the value without doing Problem Management.” It becomes a vicious circle where the more it’s needed, the less likely it is to happen.
In a perfect world, you’d absolutely have a Problem Management tool because Problem Management is a core ITIL process and it absolutely deserves a tool. But while you’re getting started, and trying to show the value, basic MS Office programs can help. I’ve created Problem records just by setting up a template in Word and having a separate document for each record. That way, if you have an Incident or Major Incident, at least you can say it relates to Problem Record 1, 2, 3, and 4 on SharePoint and link it to the SharePoint site. Now it’s clearly not ideal or perfect, and it’s not going to work long term because it will get more and more painful but it is a way to demonstrate value so you can make a business case for getting your hands on that Problem Management tool.
Tips for Tracking Problems Using MS Office
This is a simple, two-step process. First, you template Problem records in Word. Then you use an Excel spreadsheet to keep track of everything for about 3-6 months. That’ll give you enough time to demonstrate the value of Problem Management because you’re able to look at your big hitters. Typically, the two areas I focus on first are:
1) Major Incidents – the big, painful things that we talked about in our previous post. The goal here is to actually get to the root cause of these issues and fix them. It’s important to look at things like: What went wrong? What caused it? How do we fix it? How do we stop it from happening again? Are there any opportunities for improvement? (Whether that’s lessons learned or CSI).
2) “Boring” Incidents – You know the ones. The recurring Incidents that keep happening because no one has had the time to establish the root cause and fix it permanently.
By doing some trend analysis in your Incident Management tool, you can identify and fix both Major Incidents and Repeat Incidents, resulting in significant wins and cost-savings. In fact, you’re removing the potential for multiple Incidents so perhaps you’re saving money on fines or legal settlements, service credits, etc. And keep in mind that success breeds success. The more successful it is, the more people will evangelize about it and jump on the bandwagon. And suddenly you’re in a position to justify a new toolset or the licensing for an additional Problem Management module for your existing tool. But if you don’t make a start, you’re never going to get there.
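To make the trend analysis idea a bit more concrete, here is a minimal Python sketch. It assumes you can export Incidents from your tool as simple records; the field names, the sample data, and the threshold of three repeats are all made up for illustration, not taken from any particular product.

```python
from collections import Counter

# Hypothetical export from an Incident Management tool.
# The field names ("id", "category", "major") are assumptions for illustration.
incidents = [
    {"id": "INC001", "category": "Email", "major": False},
    {"id": "INC002", "category": "Email", "major": False},
    {"id": "INC003", "category": "Payroll batch", "major": True},
    {"id": "INC004", "category": "Email", "major": False},
    {"id": "INC005", "category": "VPN", "major": False},
]


def problem_candidates(incidents, repeat_threshold=3):
    """Return categories worth a Problem record: any category with a
    Major Incident, plus any category seen at least repeat_threshold times."""
    counts = Counter(i["category"] for i in incidents)
    majors = {i["category"] for i in incidents if i["major"]}
    repeats = {c for c, n in counts.items() if n >= repeat_threshold}
    return sorted(majors | repeats)


print(problem_candidates(incidents))
```

With the sample data, "Email" qualifies as a repeat offender and "Payroll batch" as a Major Incident, while the one-off "VPN" Incident does not. The same list could just as easily live in the tracking spreadsheet; the point is only that the first pass at picking your big hitters can be this simple.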
Short answer: Incidents and Problems are two completely different things. Incident Management deals with coordinating the Incident, managing communications with both technical support teams and business customers, and ensuring that the issue is fixed ASAP. Problem Management focuses on root cause investigation, trending, finding a fix, and ensuring that any lessons learned are documented and acted on.
I was at an organization a few weeks ago and they were really struggling with the difference between Incident Management and Problem Management. During my presentation, I asked, “We’ve all heard of Batman vs. Superman, right? We’ve all seen the film out earlier this year? Well, have you heard of Batman vs. Columbo?” Luckily everyone was old enough to remember Columbo from the crime drama series (God, I loved Columbo!).
Incident Managers are the superheroes of ITIL and their motto is “fix it quick.” The goal is to swoop in, save the day, and get everything running again, just like Batman. Incident Management also deals with coordinating the Incident, managing communications with both tech teams and business customers, to ensure that it’s fixed ASAP.
In contrast, Problem Managers are like Columbo. They are the diligent detectives who come in after the event, ask all the questions, and figure out what happened, why, and how to fix it. With Problem Management we’re looking at the root cause so trending is extremely important. Has this issue popped up before? Is this a recurring thing? What is causing it? Next is finding a fix (both a temporary workaround and a permanent solution), working with Change Management to get that fix in safely, and then making sure that any lessons learned are captured, documented, acted upon, and built into CSI.
Incidents are interruptions to an IT service, a reduction in the quality of that service, or a failure of a Configuration Item (CI) that has not yet affected service. In every case, they are instances where something’s gone wrong. Incidents don’t become Problems. They are blips or downtime, and they are unplanned.
Problems are the why. They are the underlying root causes of one or more Incidents. For example, an Incident could be that five people can’t get their email while the Problem record could be that the server is experiencing performance issues because it’s not patched to the optimal level. The Problem Record is there to look for the root cause and to figure out a fix.
While we’re talking about definitions, we might as well throw “Known Errors” into the mix. While running ITIL Foundation classes, I’ve found that it’s common for people to get confused between Problems and Known Errors. A Known Error is a type of Problem Record where we’ve figured out the root cause and we have a workaround. We don’t have a permanent fix yet but we do know what the root cause is and we have identified a temporary solution. We log these in a Known Error Database in our Problem Management tool and we make sure they are tracked, updated, and worked on. Known Errors can be raised proactively by support teams or by suppliers, vendors, or third parties.
For example, I’ve been flying a lot recently, speaking at the itSMF Ireland conference in Dublin and at IT in the Park in Edinburgh. And on both trips, the flight attendants made safety announcements regarding an issue with the Samsung Galaxy Note 7 where it can overheat and become a fire hazard. In this instance, we know there is an issue, so you either don’t travel with the phone or you take the battery out and keep it turned off. That’s a perfect example of a Known Error: Samsung identified it and had to reach out to all their customers so they could implement a workaround.
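If it helps to see the Problem vs. Known Error distinction as a record, here is a small Python sketch. The field names and the `ProblemRecord` class are hypothetical, not from ITIL or any specific tool, and the workaround text is invented purely for illustration; the root cause reuses the email server example from earlier.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProblemRecord:
    # Field names are hypothetical, not taken from ITIL or any specific tool.
    reference: str
    summary: str
    related_incidents: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None
    workaround: Optional[str] = None
    permanent_fix: Optional[str] = None

    @property
    def is_known_error(self) -> bool:
        """Root cause and workaround are known, but no permanent fix yet."""
        return (
            self.root_cause is not None
            and self.workaround is not None
            and self.permanent_fix is None
        )


email = ProblemRecord(
    reference="PRB001",
    summary="Email server performance issues",
    related_incidents=["INC001", "INC002"],
)
print(email.is_known_error)  # False: still under investigation

email.root_cause = "Server not patched to the optimal level"
email.workaround = "Restart the mail service when performance degrades"  # invented example
print(email.is_known_error)  # True: root cause and workaround known, no permanent fix
```

The same columns (root cause, workaround, permanent fix) work just as well in a Word template or spreadsheet; a Known Error is simply a Problem record where the first two are filled in and the third is still blank.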
Short answer: Of course you do! It’s important to look at both the proactive and reactive sides of your Problem Management process and find the balance between the two. If you focus on reactive activities only, you never fix the root cause or make it better; you’ll just keep putting out the same fires. If you focus on proactive activities only, you will lose track of the live issues causing the most pain and your service quality could spiral out of control.
Listen, I get it. We live in a world of “I want it now.” We want it fixed and then we move on to the next crisis, drama, or sparkly thing because we’re geeks and engineers and we love this stuff. But if you focus on the reactive side only, you’ll never find and fix the root cause or make it better. You’ll just keep putting out the same fires time and time again.
Conversely, if you focus too much on the proactive stuff, then you’ll look at all the “what ifs” and potentials and you’ll get completely sucked into that rabbit hole. Then you might lose track of what’s going on in your production environment—what are the most pressing issues? What’s causing the most pain?—and your service quality can take a turn for the worse.
It’s really important to have a balance. When starting out, your focus is naturally going to be on reactive activities but be sure to build in some time to be proactive as well. That could be 5-10% of your time, or maybe one meeting a month where you speak to people and ask what’s worrying them. You can always extend it over time as your process matures and you get a tool and more resources in place. But have that little opening in the process from the beginning so that you can genuinely say you are looking at both sides. With that, eventually you’ll get that balance.
What proactive actions should you take?
Proactive actions could include working with Availability and Capacity Management to ensure that uptime and performance concerns are addressed in new services, doing trend analysis to identify recurring Incidents, and working with support teams to make sure that business-critical systems have the appropriate maintenance (e.g. regular patches, reboots, agreed release schedules).
I once worked on a client site where I had a meeting with the user community to work on setting up user forums. During this meeting, one of the business leads said to me, “We always have problems at month’s end.” I asked if he told anyone and he responded, “No. We don’t tell the service desk because nothing gets done and we’ve given up logging it.”
I ended up doing some historical trending and I noticed that for the past year, there was a spike in performance issues around month’s end just as described. And no one could figure out what was causing it. I ended up borrowing the Technical Observation concept from ITIL v2, which is basically: If you have a Problem and you have absolutely no idea what’s causing it or how to fix it, you get someone from each area (Networks, Voice, Wintel, UNIX, Linux, IT Ops, Applications Support, etc.) to look at it. It turned out that it was a combination of things including network contention across the LAN, batch jobs that needed to be optimized, a server that wasn’t patched properly, and insufficient maintenance jobs.
It took roughly six weeks to work through the issues, but at the end, we shaved two whole hours off the overnight processing time and it completely nixed the performance issues at month’s end. Had we not done that trending analysis and realized when it started and finished every month, we would have never been able to resolve it.
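For anyone wondering what that kind of historical trending might look like in practice, here is a minimal Python sketch. It is not the analysis from the client site; the dates are invented sample data, and the function names and the three-day window are assumptions chosen for illustration.

```python
import calendar
from datetime import date

# Hypothetical dates of logged performance Incidents (illustrative data only).
incident_dates = [
    date(2016, 1, 29), date(2016, 1, 30), date(2016, 1, 31),
    date(2016, 2, 12),
    date(2016, 2, 27), date(2016, 2, 29),
    date(2016, 3, 30), date(2016, 3, 31),
]


def days_to_month_end(d):
    """Days remaining in the month after date d (0 means the last day)."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return last_day - d.day


def month_end_share(dates, window=3):
    """Fraction of Incidents logged within the last `window` days of a month."""
    if not dates:
        return 0.0
    near_end = sum(1 for d in dates if days_to_month_end(d) < window)
    return near_end / len(dates)


share = month_end_share(incident_dates)
print(f"Share of Incidents in the last 3 days of a month: {share:.1%}")
```

If Incidents were spread evenly through the month, only about a tenth of them would fall in the last three days, so a share far above that is a strong hint that something tied to month-end processing deserves a Problem record.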
Talk to People
This is something that isn’t covered as well as it should be in ITIL. This simply means talking to your tech support teams and getting their thoughts on items that haven’t fallen over yet but where it’s only a matter of time. Chances are that if it hasn’t gone wrong so far, there isn’t going to be much urgency to get it fixed, but at least it’s on the radar. This way we can come up with a plan so when it eventually does fall over, we’ve already done the pre-work and we know the different options of what it would take to fix it.
When speaking with your service delivery managers (SDMs) and relationship managers, ask them what keeps them up at night. For your relationship managers or SDMs, nothing focuses their minds more than if they’ve got a monthly service review meeting with their customers the next day and things are not going well. Relationship managers have that holistic view of the end-to-end service, including all the things that have gone well and all the things that have gone wrong. That is valuable information. Ask them what things worry them, because it’s likely they will be different from what your executive teams will tell you.
It’s also important to speak with your customers to ask what they’re worried about most. Do they have any business critical training times or major events coming up? For example, Alliance Healthcare, a subsidiary of Walgreens Boots Alliance, makes 65% of their annual profits at Christmas time. During this critical time, they have a Change freeze on all transactional and financial systems to make sure nothing destabilizes the business.
The bottom line is just get out there and talk to people. I know it sounds basic and simple but we don’t do it enough. In IT, the temptation can be to focus on all the gadgets and latest and greatest tools but we also need processes and people. Otherwise, the tools will not be nearly as impactful.
Read the entire ITIL Mythbuster Series!
Katie McKenna is the Digital Marketing Manager at EasyVista, managing all aspects of social media and the company’s web presence. She enjoys learning and sharing all things ITSM, IoT, SaaS, and IT Consumerization. Katie is also an avid reader, pizza enthusiast, and horror movie lover. Follow Katie’s latest tweets on EasyVista and industry news at @EasyVista.