Operations Guide

Module 5: Proactive Processes

Problem Management

The steps described in the implementation guide should be applied to all problems and the process should be monitored regularly to ensure that it is effective.

What needs to be done: Record problems

How: Follow the steps described in the FITS Problem Management implementation guide to:

identify problems through incident management, network monitoring and preventative maintenance
create problem records to initiate problem resolution.

When: When problems are identified.

Who: Problem logger.

What needs to be done: Assign problems

How: Follow the steps described in the FITS Problem Management implementation guide to:

assign problems to problem resolvers (technical support staff).

When: As soon as problem records are created.

Who: Problem logger, problem resolver, problem manager or process owner. This will be a local procedure agreed within technical support.

What needs to be done: Resolve problems

How: Follow the steps described in the FITS Problem Management implementation guide to:

investigate, diagnose and resolve problems
complete problem records and return them to problem managers
provide timely problem status information to problem managers.

When: When problems have been assigned. Work should be scheduled through the change management process. For this reason the problem resolver must not also be an incident resolver, otherwise scheduled work may be postponed in favour of incident resolution.

Who: Problem resolver.

What needs to be done: Manage problems

How: Follow the steps described in the FITS Problem Management implementation guide to:

track progress of problems from initial investigation through to resolution
proactively seek updates on progress, if they are not forthcoming
update problem log details.

When: When problem records have been assigned, and throughout their lifecycle to resolution.

Who: Problem manager.

What needs to be done: Monitor the process

How: Gather statistics from the problem log and use the problem management report template to display them graphically. See problem management report example. You will need to make some manual calculations from the data you have. It is worth investing time in this until you are ready to automate your reporting (see continuous improvement). Change the problem fix times in the report to suit your own needs*.

When: Align your problem reporting cycle to your incident reporting cycle. Monthly should be sufficient. You can alter the report to produce weekly figures, but you will probably find the reporting exercise too time consuming to justify preparing it more often than once a month. If you do change the cycle, be consistent, so that the emerging trends are not distorted.

Who: The process owner is responsible for monitoring the process. However, it is acceptable for them to delegate the preparation of reports to someone with suitable access to the data (problem manager/service desk administrator).

What needs to be done: Assess the effectiveness of the process

How: Analyse reports and question variations from one to the next. Never take statistics at face value; always investigate the nature of the problems and the circumstances surrounding their resolution.

When: As soon as reports are published. Identify trends as they appear. Reasons for variations between reports and trends over time will be easier to identify if the data is recent.

Who: The process owner is responsible for interpreting reports.

* Problem fix times are not defined in the FITS service level agreement template. From an end-user perspective they should not be necessary, because effective incident management removes the need for problems to be fixed quickly. Problem fix times are included in this report for the purpose of internal technical support performance monitoring only.

How to operate problem management

How to deal with major incidents or major problems

A major incident or problem is one that causes serious disruption to the school's computer service. Examples include:

a virus outbreak or threat
closure of internet services
file server failure
partial or total network failure
building problems – for example, fire, smoke, flood or frost damage
software problems affecting over 30 per cent of computers.

Major incident process

Service desk to field all calls and reschedule planned incident responses
Technician to be notified of the major incident
Technician to identify the extent of the problem before taking any action
School to allocate an additional person to help the technician
Additional person to be responsible for communication between the technician and users and to provide ad hoc help to enable the technician to deal with the incident
Technician to discuss with the school leader the extent of the problem and a planned response (the school leader needs to know how to reschedule the planned work that involves the affected computers)
School leader to ensure that the technician has the necessary resources to deal with the problem including time
Technician to decide how long to keep trying to fix the problem before calling on the school's 'disaster recovery' option

The seven stages of the Problem Management process

Notification of a problem

Either the single point of contact at the service desk or the technician decides whether an incident is really a problem. The user reports the incident in the usual way using the incident form, and the service desk checks the form when it arrives. With experience, service desk staff will know whether the incident should be passed through Problem Management. Otherwise they pass it to the technician in the usual way.

When service desk staff check the incident sheet, they may notice the following:

The same type of incident was reported on several other computers in the last few days
The same type of incident was reported on this computer in the last few weeks
This is yet another regularly recurring fault on the same computer.

Staff can do these checks by looking at the call log or by using simple searches using a ‘find’ function to spot certain words or phrases – for example 'network card', 'pc24' (if this is a computer's unique reference) or 'printer jam'.
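The 'find'-style search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the call log layout (records with `id`, `computer` and `description` fields) and the function name `find_calls` are hypothetical and not part of FITS. A spreadsheet filter or the search box in your call-logging tool does the same job.

```python
# Hypothetical call log: in practice this would come from your own
# service desk records (spreadsheet export, database and so on).
call_log = [
    {"id": 1, "computer": "pc24", "description": "Network card not detected"},
    {"id": 2, "computer": "pc07", "description": "Printer jam in library"},
    {"id": 3, "computer": "pc24", "description": "No network connection"},
    {"id": 4, "computer": "pc11", "description": "Printer jam again"},
]


def find_calls(log, phrase):
    """Return calls whose computer reference or description contains the phrase.

    The match ignores upper/lower case, like a simple 'find' function.
    """
    needle = phrase.lower()
    return [
        call for call in log
        if needle in call["description"].lower()
        or needle in call["computer"].lower()
    ]


for phrase in ("network card", "pc24", "printer jam"):
    matches = find_calls(call_log, phrase)
    print(phrase, "->", [call["id"] for call in matches])
```

Searching on a computer's unique reference (here 'pc24') picks up every call for that machine, whatever words were used to describe the fault.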

The single point of contact (SPOC) at the service desk may record this information on the incident sheet and inform the technician when placing the call.
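If the call log is held electronically, the first two checks listed above (same type of incident on several other computers in the last few days, or on the same computer in the last few weeks) could be automated along these lines. This is a minimal sketch, not a FITS tool: the record fields, the function name `problem_indicators`, and the window sizes (7 days, 4 weeks, at least 2 other computers) are all assumptions to be replaced by whatever is agreed locally.

```python
from datetime import date, timedelta

# Hypothetical incident records; field names are assumptions for this sketch.
incidents = [
    {"computer": "pc24", "type": "network", "reported": date(2024, 3, 1)},
    {"computer": "pc07", "type": "printer jam", "reported": date(2024, 3, 11)},
    {"computer": "pc11", "type": "printer jam", "reported": date(2024, 3, 12)},
    {"computer": "pc24", "type": "network", "reported": date(2024, 3, 13)},
]


def problem_indicators(new, history, days_window=7, weeks_window=4, min_computers=2):
    """Return reasons why a new incident might need treating as a problem.

    The window sizes and min_computers are assumed values; 'a few days'
    and 'a few weeks' should be set by local agreement.
    """
    reasons = []
    recent_cutoff = new["reported"] - timedelta(days=days_window)
    older_cutoff = new["reported"] - timedelta(weeks=weeks_window)
    same_type = [i for i in history if i["type"] == new["type"]]

    # Same type of incident on several other computers in the last few days.
    others = {
        i["computer"] for i in same_type
        if i["computer"] != new["computer"] and i["reported"] >= recent_cutoff
    }
    if len(others) >= min_computers:
        reasons.append(
            "same type reported on other computers: " + ", ".join(sorted(others))
        )

    # Same type of incident on this computer in the last few weeks.
    repeats = [
        i for i in same_type
        if i["computer"] == new["computer"] and i["reported"] >= older_cutoff
    ]
    if repeats:
        reasons.append(
            f"same type reported on {new['computer']} {len(repeats)} time(s) recently"
        )
    return reasons


new_incident = {"computer": "pc30", "type": "printer jam", "reported": date(2024, 3, 14)}
print(problem_indicators(new_incident, incidents))
```

The list of reasons returned is the kind of note the service desk could add to the incident sheet before passing the call to the technician. An empty list simply means none of the checks fired; the final decision still rests with staff.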

Requesting technical support

If the service desk has decided that this is a problem, it must be passed to the technician and no further diagnostic work is required by the service desk.

The technician is informed in the usual way about the call and the person on the service desk will advise why they think it should be treated as a problem.

If the school has 'swap-out spares' that a competent person in the school can install, the technician may advise doing this before any further work is done to the faulty equipment. The benefits would be:

time saved waiting for the technician to attend to the call
reduction in the time that the equipment is unavailable
an opportunity for the technician to investigate the problem without being under time pressure.

Problem analysis

Problem analysis relies on common sense and on asking lots of questions. The final theory should not be far-fetched.

Technicians should remember that phrases like 'power line glitch', 'infrequent reset phenomenon', 'intermittent random fluctuating memory address' and other such odd-sounding terms do not impress the user. If you are not sure what is happening then say so.

Honesty is the best policy. As long as you have an answer (which may be to replace the equipment, or that a purchase is required), this is fine. (See problem analysis tools in the Problem Management toolkit.)

Produce theory

From the evidence, analysis and experience, produce a theory for what is happening – or what has happened. Then use the theory of 'what went wrong' to produce actions to resolve the problem. Your first theory may not always be correct and you should try to show why this is. Avoid theories you can't explain!

Produce resolution

Pause before taking action. Write down exactly what you do and the outcome. Do this for every action you take, even if it consists of changing just one line of a system set-up file. It should be possible to trace the step-by-step actions to find out what the technician did to resolve the problem, which in turn helps in resolving future problems. Problem management involves a steep learning curve and, although it is time consuming, making notes is very important.

Results of resolution

The results of the resolution may affect many systems in the school. If a plan is to be drawn up to replicate the actions across other systems, this must be done using the process described in Change Management.

Problem closure

Update the incident diagnostics sheet and the incident sheet and pass them to the service desk. The service desk performs the usual call closure operations.

When does problem management occur?

It is worthwhile setting aside time during the week to devote to problem management. It requires careful thought and cannot be hurried. The technician will need to be left undisturbed to work on it and should not be required to attend to incidents during this time. This approach should ensure an effective result from the process, which will benefit the school.

The proactive aspect of problem management is to monitor equipment and analyse incidents. The results of monitoring should be analysed to detect potential problems and provide a solution that can be implemented before failure. An example is monitoring disk space usage so that temporary files can be removed, files archived and disks cleaned up before they become full and create network-wide problems.
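As an illustration of the disk space example, the following Python sketch reports any monitored location whose disk is filling up. It uses the standard library function `shutil.disk_usage`; the function names `disk_usage_percent` and `check_disks`, the 80 per cent warning level and the list of paths are assumptions for this sketch, not recommendations from FITS.

```python
import shutil


def disk_usage_percent(path):
    """Return the percentage of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100


def check_disks(paths, warn_at=80.0):
    """Return (path, percent_used) for each disk at or above the warning level.

    warn_at is an assumed threshold; choose one that leaves enough time
    to archive or clean up before the disk becomes full.
    """
    warnings = []
    for path in paths:
        percent = disk_usage_percent(path)
        if percent >= warn_at:
            warnings.append((path, round(percent, 1)))
    return warnings


# Hypothetical list of locations to watch; replace with your own drives or shares.
for path, percent in check_disks(["."], warn_at=80.0):
    print(f"{path}: {percent}% used, schedule a clean-up")
```

Run on a schedule (for example once a day), a check like this gives early warning while there is still time to plan the clean-up as routine work rather than deal with it as an incident.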

Checking logged incidents can show trends such as printing problems where, for example, one printer often fails to complete printing and will print text but not pictures. This call may be logged only occasionally, but if the pattern were spotted and users told that this printer cannot do this type of print, it would save time for both users and technical support, as there is no fault to report.

Problem analysis may indicate that a small memory upgrade to the printer is all that is required or that it would print the picture if it were a different file type. Ultimately getting the best use out of the printer may avoid the expense of replacing it.

Who carries out problem management?

Problem Management usually starts with an incident or as a result of monitoring:

The service desk will investigate an incident and after following guidelines will decide whether it is a problem for deeper investigation and analysis.
A technician will start working on an incident and decide if it qualifies as a problem.
Network monitoring will highlight areas that need further checking to find out whether there is a potential problem.

Problem management is the ‘black art’ bit of technical support. From the evidence, analysis and experience technicians produce a theory for what is happening or has happened. The theory needs to be believable, so before taking action you should show why your theory might work.

You produce an approach to resolving the problem, check it for soundness and then implement it. This is an iterative process that you may have to go through several times before you find the correct solution.

See the roles and responsibilities section for details of the roles and functions in problem management.

Problem management resources

Problem-analysis tools

A range of tools exists to assist problem analysis. Remember, though, that you need to be familiar with a tool if it is going to produce results, and that time spent on problem solving has a cost, so reserve problem solving for expensive issues!

Root cause analysis

This is the process of finding the real cause of a problem and dealing with it rather than simply continuing to deal with the symptoms. It seeks to identify the reason for the failure by asking lots of questions and determining whether changing an event early on in the chain of events could have prevented the failure. Ways to implement the change are decided and actioned through the Change Management process.

Error code look-up

This is where you find out what a displayed error code means. Often the user manual or technical manual cannot be found or it does not detail the error codes of the software. Using search engines you can look up the error code, the model of the equipment and the operating system to get a filtered response that may guide you towards the reason for the error.

Fishbone diagrams

This diagram, also referred to as a cause-and-effect diagram or tree diagram, displays the factors that affect a particular quality, characteristic, outcome or problem. The end product is typically the result of a brainstorming session in which members of a group offer ideas on how to improve a product, process or service. The trunk of the diagram represents the main goal, and primary factors are represented as branches. Secondary factors are then added as stems, and so on. Creating the diagram stimulates discussion and often leads to increased understanding of a complex problem.

Technician’s forms

The technician forms are designed to aid technicians in doing their job. It is always useful to record events as they occur, as this helps to ensure that you leave nothing out. If your records are not comprehensive – which may happen if you don't complete the form at the time – you may omit a seemingly obscure piece of information that later proves to be the key to resolving the incident or problem.

Problem management checklist:

Do you spot problems before an incident occurs?
Do you record resolutions for future reference?
Do you resolve known errors before they become an incident?
Do you minimise the adverse effect on users when an incident occurs?
Do you analyse incident trends to prevent further incidents?
Do you allocate enough time for problem management and do you review the allocation periodically?

We have provided a technician’s diagnostic sheet in the Toolkit section.

Analysing reports

The measurements you have gathered are similar to those for incident management and, like incident management, they are the starting point for helping you to understand what is happening in problem management in your school. They do not provide any answers, only the questions you must ask.

Look carefully at the statistics and then find out their causes. Identify trends, ask questions, look back through problem records and talk to problem management staff. Dig around for the root cause. It is important to identify and resolve issues with the process to ensure that it continues to operate effectively.

Measurement: Number of problems logged

Purpose: To monitor whether volume is increasing or decreasing over time and understand the cause

Example questions:

Has anything specific happened to cause an increase or decrease?
Is the change management process working effectively?
Have school holidays affected volumes?
Has a problem record been completed for every problem?
Was a problem record completed for every problem in the last period?
Could a decrease indicate waning enthusiasm for the process?
Could an increase indicate a surge of enthusiasm for the process?

Measurement: Number of problems closed

Purpose: To monitor the performance of problem management staff

Example questions:

Has the number of problems closed gone up or down since the last report?
Were more or fewer problems logged in the last period?
Were problems more, or less, complex to resolve than those in the last period?
Were staff levels similar from one period to the next?
Was workload affected by other factors?
Were problem management staff diverted to working on incidents?

Measurement: Number of problems fixed within 1 week

Purpose: To monitor performance against internal targets. With effective incident management, problem resolution within 1 month should be adequate and achievable.

Example questions:

How does this compare with previous months?
What was the complexity of the problems?
Were staff levels consistent?
Was workload affected by other factors?

Measurement: Number of problems fixed in more than 1 month

Purpose: To monitor performance against internal targets. Even though effective incident management should be providing acceptable customer service, you may find that failure to resolve underlying problems in 1 month or less will stretch your resources and increase risks to service provision.

Example questions:

Were staff levels consistent?
Were there supplier delays?
Was resolution held up by the change management process?
Are outstanding problems scheduled but waiting for new releases (for example, a software patch)?
Was workload affected by other factors?
Were problem management staff diverted to working on incidents?
Do you have enough spares left to deal with future incidents?

Problem management reports should identify where isolating problems from incidents has provided benefit.
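Once you are ready to automate the manual calculations, the four measurements above can be counted directly from the problem log. The Python sketch below shows one way this might be done. The log layout (`id`, `logged` and `closed` fields), the function name `monthly_figures` and the day thresholds standing in for '1 week' and '1 month' are assumptions; adjust them to your own records and internal fix-time targets.

```python
from datetime import date

# Hypothetical problem log; the field names are assumptions for this sketch.
problem_log = [
    {"id": "P1", "logged": date(2024, 3, 4), "closed": date(2024, 3, 8)},
    {"id": "P2", "logged": date(2024, 2, 1), "closed": date(2024, 3, 20)},
    {"id": "P3", "logged": date(2024, 3, 15), "closed": None},  # still open
    {"id": "P4", "logged": date(2024, 3, 18), "closed": date(2024, 3, 29)},
]


def monthly_figures(log, year, month, fast_days=7, slow_days=31):
    """Count the report measurements for one calendar month.

    fast_days and slow_days stand in for 'within 1 week' and 'more than
    1 month'; change them to match your own internal targets.
    """
    def in_period(day):
        return day is not None and (day.year, day.month) == (year, month)

    closed = [p for p in log if in_period(p["closed"])]
    fix_days = [(p["closed"] - p["logged"]).days for p in closed]
    return {
        "logged": sum(1 for p in log if in_period(p["logged"])),
        "closed": len(closed),
        "fixed_within_1_week": sum(1 for d in fix_days if d <= fast_days),
        "fixed_in_more_than_1_month": sum(1 for d in fix_days if d > slow_days),
    }


print(monthly_figures(problem_log, 2024, 3))
```

The figures only raise the questions listed above; they still need to be interpreted by the process owner. Keeping the thresholds as parameters makes it easy to stay consistent from one report to the next.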

Some content on this website is provided under the provisions of the Open Government License.