Cheltenham tech expert reveals how to manage when your server goes down

Your server has gone down – how long can you cope without your data? How quickly do you need to return to operation? These are questions every business should ask itself. SoGlos speaks to Justin Richmond of ReformIT in Cheltenham to find out the all-important answers.

By Sarah Kent  |  Published
The key things to ensure in a company outage are good communication and a continuity plan, says ReformIT.
In partnership with ReformIT  |  reformit.co.uk

ReformIT is a Cheltenham-based IT support specialist, providing expert advice to businesses all over the UK. Assisting with everything from cyber security and cloud technologies to improving broadband speed, ReformIT can tailor its services to meet individual businesses’ needs – whether it’s a fully outsourced IT department or third line support.

Every business relies on some sort of technology to keep things running, but if yours is managed through a server, you'll know that an outage or crash can be devastating to operations.

ReformIT is a Cheltenham-based IT services and cyber security specialist. SoGlos speaks to senior IT account manager, Justin Richmond, to ask about the key things to be aware of and action in an IT emergency.

What are the first signs that a server has gone down?

In most cases, when people start shouting that they can’t access any data! 

If you have any monitoring tools, then you should receive an alert. Be it a print server or a file server, we would know straight away if one of our clients' servers had gone down, because our monitoring tools would alert us.
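To give a feel for what a monitoring tool does at its simplest, here is a minimal sketch of a reachability check. It is not ReformIT's tooling; the function names and the idea of checking a TCP port are assumptions made for illustration, and real monitoring products check far more than this.

```python
import socket


def server_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False


def unreachable_servers(servers: dict[str, tuple[str, int]]) -> list[str]:
    """Return the names of the servers that did not answer.

    `servers` maps a friendly name to a (host, port) pair. The names and
    addresses are whatever you choose to monitor; none come from the article.
    """
    return [
        name
        for name, (host, port) in servers.items()
        if not server_is_reachable(host, port)
    ]
```

Run on a schedule against, say, a file server's file-sharing port, a check like this could trigger an email or text alert whenever the returned list is not empty.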

What are the most common causes of server failure you’ve seen?

I would say the age of the server, where components start to fail. Other common causes are software failures, power supply failures, or drives that are too full, which affects performance.

In worst-case scenarios, it is a ransomware attack, where everything is encrypted.
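Of the causes above, a drive filling up is one of the easiest to watch for. The following is a small sketch using Python's standard library; the 90% threshold and the function names are illustrative assumptions, not a recommendation from ReformIT.

```python
import shutil


def used_fraction(path: str) -> float:
    """Return how full the drive holding `path` is, from 0.0 (empty) to 1.0 (full)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some virtual filesystems report no capacity; treat them as empty.
        return 0.0
    return usage.used / usage.total


def drive_needs_attention(path: str, threshold: float = 0.9) -> bool:
    """Return True once the drive is at or above the chosen fullness threshold.

    The default of 0.9 (90% full) is an assumed value for illustration.
    """
    return used_fraction(path) >= threshold
```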

How do you tell the difference between a minor outage and a critical system failure?

A minor outage is typically a low-severity issue that affects non-essential functions or a limited number of users. These are the bugs and hiccups that can wait for the next maintenance window.

A critical system failure, on the other hand, is a high-severity event that disrupts core business operations. It demands immediate attention, often triggering emergency protocols and cross-functional team mobilisation, and possibly invoking your business continuity plan so the business can keep operating.

What are the most important steps to action first?

When a system disruption occurs, whether it's a minor outage or a full-blown critical failure, timing is important. The difference between swift recovery and prolonged downtime often hinges on what you do in the first few minutes. 

Before anything else, confirm that the issue is real. This might come from automated monitoring alerts, user reports or performance anomalies.

Use a predefined 'severity matrix' to classify the incident and figure out whether it's affecting a few users or your entire infrastructure; and whether it's a bug or a core functionality failure.

Once this is established, notify the appropriate stakeholders. For minor issues, this may be a ticket to the help desk. For critical failures, it means activating your incident response team (if you have one), looping in leadership and potentially informing external partners or customers.
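The classification and notification steps above can be sketched as a simple lookup table. This is a hypothetical severity matrix: the two questions (how many users, bug or core failure) come from the interview, as do the help desk and incident response examples, but the four severity labels and the "medium" and "high" notification lists are assumptions added to complete the example.

```python
from enum import Enum


class Scope(Enum):
    FEW_USERS = "a few users"
    ENTIRE_INFRASTRUCTURE = "entire infrastructure"


class Impact(Enum):
    BUG = "bug"
    CORE_FAILURE = "core functionality failure"


# Hypothetical matrix: each business should agree its own labels in advance.
SEVERITY_MATRIX = {
    (Scope.FEW_USERS, Impact.BUG): "low",
    (Scope.ENTIRE_INFRASTRUCTURE, Impact.BUG): "medium",
    (Scope.FEW_USERS, Impact.CORE_FAILURE): "high",
    (Scope.ENTIRE_INFRASTRUCTURE, Impact.CORE_FAILURE): "critical",
}

# Who hears about it at each level. "low" and "critical" follow the interview;
# the middle two levels are assumed for illustration.
NOTIFY = {
    "low": ["help desk"],
    "medium": ["help desk", "IT lead"],
    "high": ["incident response team", "leadership"],
    "critical": [
        "incident response team",
        "leadership",
        "external partners and customers",
    ],
}


def classify(scope: Scope, impact: Impact) -> str:
    """Look up the severity label for an incident."""
    return SEVERITY_MATRIX[(scope, impact)]


def who_to_notify(severity: str) -> list[str]:
    """Return the stakeholders to contact for a given severity label."""
    return NOTIFY[severity]
```

Writing the matrix down before anything goes wrong means nobody has to debate severity in the first few minutes of an outage.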

You then need to prevent the issue from spreading, by isolating affected systems, disabling compromised services or revoking access temporarily. 

Communication is also vital. Inform internal teams and update external customers via status pages or a holding statement on your website.

Once you've figured out what's caused the outage, try to apply a temporary fix to get back online, while you're digging deeper for a permanent fix.

How should a business prioritise which systems to bring back online first?

When a business faces a system outage, the instinct to restore everything at once can be overwhelming. Knowing which systems to bring back online first can dramatically reduce downtime, protect revenue and preserve customer trust.

To do this, first restore the systems that directly support your core operations, making sure those components are stable and working properly before moving on.

Things like email servers and messaging platforms are important to prioritise for internal teams too.

If the outage stems from a security breach, restore firewalls, monitoring tools and access controls before anything else. 
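The ordering described above can be written down as a short sketch. The category names and the `System` type are hypothetical. The order for a security breach follows the interview (security controls, then core operations, then communications); where security tooling sits when there has been no breach is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    category: str  # "security", "core", "communication" or "other"


def restore_order(systems: list[System], security_breach: bool) -> list[System]:
    """Return the systems in the order they should be brought back online."""
    if security_breach:
        # After a breach, firewalls, monitoring and access controls come first.
        ranking = ["security", "core", "communication", "other"]
    else:
        # Assumed order when the outage is not security related.
        ranking = ["core", "communication", "security", "other"]
    position = {category: index for index, category in enumerate(ranking)}
    # Unknown categories are restored last; sorted() keeps ties in input order.
    return sorted(systems, key=lambda s: position.get(s.category, len(ranking)))
```

For example, given a firewall, an order database, an email server and a file archive, a breach would put the firewall first, while a plain hardware failure would put the order database first.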

What advice would you give to a small business with limited IT resources on preparing for any system downtime?

For small businesses, the impact of downtime can be costly, and dealing with it can feel daunting without dedicated IT support.

But there are some important steps you can take to help. First, document everything: create a clear, accessible record of your systems, processes and recovery steps.

Make a return-to-operations plan with an agreed recovery time objective (RTO), meaning how quickly your business needs to be back up and running; and identify an acceptable amount of data you could cope with losing, often called the recovery point objective (RPO).

Test the plan out in real life! And then rethink any holes or failures that occur.
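The two targets above, how quickly you need to be back and how much data you can afford to lose, can be checked against your current setup with some simple arithmetic. This is a rough sketch: the function and parameter names are illustrative, and it assumes the worst case, where a failure happens just before the next backup is due, so everything since the last backup is lost.

```python
from datetime import timedelta


def check_recovery_targets(
    backup_interval: timedelta,
    measured_restore_time: timedelta,
    max_data_loss: timedelta,
    max_downtime: timedelta,
) -> dict[str, bool]:
    """Compare the current setup with the agreed recovery targets.

    backup_interval: time between backups, which is also the worst-case data loss.
    measured_restore_time: how long a restore took when the plan was tested.
    max_data_loss: the amount of data (measured in time) you can cope with losing.
    max_downtime: how quickly the business needs to be running again.
    """
    return {
        "data_loss_ok": backup_interval <= max_data_loss,
        "downtime_ok": measured_restore_time <= max_downtime,
    }


# Example with made-up figures: nightly backups, a three-hour test restore,
# and targets of four hours of data loss and eight hours of downtime.
result = check_recovery_targets(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
    max_data_loss=timedelta(hours=4),
    max_downtime=timedelta(hours=8),
)
print(result)
```

With these made-up figures the restore is fast enough, but nightly backups could lose up to a day of work against a four-hour target, so the backups would need to run more often.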

If you could implement just one system or practice to reduce downtime risk, what would it be?

When it comes to reducing the risk of downtime, there’s no silver bullet. Every business is different, but there’s one practice that stands above the rest – implementing a robust, tried-and-tested business continuity plan.

Downtime, or server outage, whether that's caused by cyberattacks, hardware failures or human error, isn't hypothetical. It happens every day. 

That’s why I always tell my clients that preparing for downtime is like having fully comprehensive car insurance. You may not have had an accident yet, but when it happens, you’ll be glad you’re covered.

A business continuity plan (BCP) is that insurance policy. It’s the framework that ensures your organisation can continue operating, even when critical systems fail.

A strong BCP should include: 

  • Clear recovery objectives that define your RTO and your recovery point objective (RPO, the 'point in time' you can restore back to), so you know how quickly systems must be restored and how much data loss is acceptable.
  • Staff coordination protocols that outline who needs access to what and how they’ll get it during a disruption.
  • A communication strategy that ensures internal and external stakeholders are kept informed without causing panic.
  • And tested procedures, with simulations and 'tabletop exercises' run in advance to try the plan out under pressure.

Without a continuity plan, even minor outages can spiral into major crises. But with one in place, your team knows what to do, your systems recover faster and your business stays resilient.

To talk to the experts at ReformIT on how to manage a server or IT outage and how to set up a BCP, call 01242 236999 or visit reformit.co.uk.
