Monitoring issues and informing about alerts

Every Spryker cloud environment is monitored by monitoring systems and a dedicated 24/7 team that ensures environments run stably. If an issue occurs, this team acts autonomously and informs you about any further actions needed on their side or yours. This document explains how monitoring and alerting work.

What is monitored?

The following alerts are configured by default for all environments.

Endpoint status

| NAME OF METRIC | DESCRIPTION |
| --- | --- |
| External monitor: HTTP check Yves/Zed /health-check (Draft) | Checks if the Yves and Zed health-check endpoints provide an expected response. |
| 200 responses on Yves/Glue in ALB | Checks if Yves and Glue return the 200 OK response to the application load balancer. |
| 200 responses on Zed in ALB | Checks if the Zed endpoint returns the 200 OK response to the application load balancer. |
| AVG Response time in ms | Checks the page response time against a preset threshold. |
| ALB 5XX response codes | Checks if the application load balancer receives 5XX status codes from the load-balanced applications. |
| ALB Healthy Hosts in Target Group | Sends health checks to the registered targets. |
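For illustration, a similar external check can be sketched in a few lines of Python. This is a sketch, not the monitoring team’s actual tooling; the store domains are hypothetical placeholders, and only the `/health-check` path comes from the table above.

```python
import requests

# Hypothetical domains; substitute your environment's Yves and Zed hosts.
ENDPOINTS = [
    "https://www.example-store.com/health-check",          # Yves
    "https://backoffice.example-store.com/health-check",   # Zed
]

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers 200 OK within the timeout."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

for url in ENDPOINTS:
    print(f"{'OK' if probe(url) else 'ALERT'}: {url}")
```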

ElastiCache/Redis

| NAME OF METRIC | DESCRIPTION |
| --- | --- |
| ElastiCache Status | Checks the status of ElastiCache for Redis. |
| Redis is not used by any service | Checks whether ElastiCache is used. |
| Redis available Memory | Checks if Redis free memory on the host is running low. |
| Redis High CPU | Checks if the Redis service is high on CPU usage. |
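A check similar to the Redis available Memory alert could be built on the `INFO memory` command via redis-py; the connection details and the 90% threshold below are assumptions for illustration.

```python
import redis

# Hypothetical connection details; the managed ElastiCache endpoint differs per environment.
client = redis.Redis(host="localhost", port=6379)

info = client.info("memory")          # parsed output of INFO memory
used = info["used_memory"]
limit = info.get("maxmemory", 0)      # 0 means no memory limit is set

if limit and used / limit > 0.9:      # assumed alert threshold
    print(f"ALERT: Redis memory usage at {used / limit:.0%}")
```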

Elasticsearch

| NAME OF METRIC | DESCRIPTION |
| --- | --- |
| ES Cluster Status | Checks the status of the Elasticsearch cluster. |
| ES available storage | Checks the available storage of the Elasticsearch cluster. |
| ES High CPU | Checks if the Elasticsearch service is high on CPU usage. |
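The ES Cluster Status check corresponds to Elasticsearch’s cluster health API, which reports the cluster as green, yellow, or red. A minimal sketch, assuming a locally reachable cluster endpoint:

```python
import requests

# Hypothetical endpoint; the managed Elasticsearch domain differs per environment.
health = requests.get("http://localhost:9200/_cluster/health", timeout=5).json()

if health["status"] != "green":
    print(f"ALERT: cluster status is {health['status']}")
```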

Database

| NAME OF METRIC | DESCRIPTION |
| --- | --- |
| RDS Status | Checks the status of the RDS instance. |
| RDS IO Credits | Checks if the RDS instance is running low on or out of IO credits. |
| RDS available storage | Checks the available storage of the RDS instance. |
| RDS High CPU | Checks if the RDS service is high on CPU usage. |
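These checks map onto standard AWS/RDS CloudWatch metrics such as FreeStorageSpace, BurstBalance (IO credits), and CPUUtilization. A sketch using boto3, assuming configured AWS credentials and a hypothetical instance identifier:

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# FreeStorageSpace is reported in bytes under the AWS/RDS namespace.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="FreeStorageSpace",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db"}],  # hypothetical ID
    StartTime=now - timedelta(minutes=10),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)

for point in stats["Datapoints"]:
    free_gib = point["Average"] / 1024**3
    if free_gib < 5:  # assumed threshold for illustration
        print(f"ALERT: only {free_gib:.1f} GiB of RDS storage left")
```

In practice, thresholds like these are typically evaluated as CloudWatch alarms rather than polled by a script.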

Scheduler

| NAME OF METRIC | DESCRIPTION |
| --- | --- |
| Jenkins Status | Checks the status of Jenkins. |
| Scheduler disk is 90% filled | Alerts the monitoring team once Jenkins disk utilization is at 90% or above. |
| Scheduler inode usage is above 90% | Alerts the monitoring team once Jenkins inode utilization is at 90% or above. |
| Jenkins container can’t be deployed | Checks if there are deployment failures of the Jenkins container. |
| Jenkins High CPU | Checks if the Jenkins service is high on CPU usage. |
| Jenkins High Memory | Checks if the Jenkins service is high on memory usage. |
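The two 90% thresholds can be illustrated with a simple disk and inode probe; the Jenkins data path and the script itself are assumptions for illustration, not the monitoring team’s tooling.

```python
import os
import shutil

PATH = "/var/jenkins_home"  # hypothetical Jenkins data directory

# Disk utilization.
usage = shutil.disk_usage(PATH)
disk_pct = usage.used / usage.total * 100

# Inode utilization (f_files = total inodes, f_ffree = free inodes).
vfs = os.statvfs(PATH)
inode_pct = (vfs.f_files - vfs.f_ffree) / vfs.f_files * 100

for name, pct in (("disk", disk_pct), ("inode", inode_pct)):
    if pct >= 90:
        print(f"ALERT: {name} utilization at {pct:.0f}%")
```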

Message broker

| NAME OF METRIC | DESCRIPTION |
| --- | --- |
| RabbitMQ web page isn’t accessible | Checks if the RabbitMQ web UI is reachable. |
| RMQ: status by host | Checks if the RabbitMQ host is reachable. |
| RMQ: disk alarms | Checks the status of the storage attached to the instance that RabbitMQ runs on. |
| RMQ: memory alarms | Checks the memory utilization on the RabbitMQ instance. |
| RMQ: missing queues | Checks for missing queues on the RabbitMQ instance. |
| RMQ: High CPU | Checks if the RabbitMQ service is high on CPU usage. |
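The disk and memory alarm checks map onto per-node flags in the RabbitMQ management API, and a missing-queue check can compare `/api/queues` against an expected list. A sketch, with hypothetical credentials and queue names:

```python
import requests

BASE = "http://localhost:15672/api"  # hypothetical management URL
AUTH = ("guest", "guest")            # hypothetical credentials

# Each node object carries mem_alarm and disk_free_alarm flags.
for node in requests.get(f"{BASE}/nodes", auth=AUTH, timeout=5).json():
    if node.get("mem_alarm") or node.get("disk_free_alarm"):
        print(f"ALERT: resource alarm on node {node['name']}")

# Compare the queues that exist against the queues that should exist.
expected = {"event", "sync.storage.product"}  # hypothetical queue names
present = {q["name"] for q in requests.get(f"{BASE}/queues", auth=AUTH, timeout=5).json()}
for missing in sorted(expected - present):
    print(f"ALERT: queue {missing} is missing")
```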

Elastic Container Service (ECS)

| NAME OF METRIC | DESCRIPTION |
| --- | --- |
| ECS Service Status | Checks the status of all ECS services. |
| ECS Service High CPU | Checks if the ECS service is high on CPU usage. |
| ECS Service High Memory | Checks if the ECS service is high on memory usage. |
| ECS Service Auto-Scaling | Monitors the activity of the ECS service auto-scaling. |
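A service status check can be approximated by comparing running and desired task counts via boto3; the cluster and service names below are hypothetical.

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical cluster and service names.
response = ecs.describe_services(cluster="spryker-prod", services=["yves", "zed", "glue"])

for service in response["services"]:
    running, desired = service["runningCount"], service["desiredCount"]
    if running < desired:
        print(f"ALERT: {service['serviceName']} runs {running}/{desired} tasks")
```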

Deployment and miscellaneous

| NAME OF METRIC | DESCRIPTION |
| --- | --- |
| ECR Image scan results | Scans the images used for the application build process. If high-severity vulnerabilities are discovered, an alert is triggered. |
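Scan results like these can be read back through the boto3 ECR API; the repository name and image tag below are hypothetical placeholders.

```python
import boto3

ecr = boto3.client("ecr")

findings = ecr.describe_image_scan_findings(
    repositoryName="spryker/app",    # hypothetical repository
    imageId={"imageTag": "latest"},  # hypothetical tag
)

counts = findings["imageScanFindings"].get("findingSeverityCounts", {})
high_or_worse = counts.get("HIGH", 0) + counts.get("CRITICAL", 0)
if high_or_worse:
    print(f"ALERT: {high_or_worse} high/critical vulnerabilities found")
```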

In what cases do we contact you?

Most alerts we receive reflect temporary states and aren’t a real problem. However, some alert patterns require us to take action. We distinguish between critical issues in production environments and issues in staging or test environments, focusing on critical infrastructure problems in production.

If there is an alert, the monitoring team looks for a likely explanation, such as a temporary deployment error. If the team needs to take action, they decide whether it requires consent or cooperation from your side. If it does, they contact the responsible person by opening an alert case. If a Solution Partner is responsible, you can track the status of these alert cases and their tickets in the Partner Portal. We update them daily in the Case Detail view under Ticket Status.

If we can resolve an issue on our own, we do. In case of downtime or severe service degradation, we inform you about the incident via a case and a Root Cause Analysis (RCA) report.