Date of the incident: July 14th, 2022 9:47 CET


Duration: 9 calendar hours


Affected services: Job processing and imposition, which took longer than usual


Issue timeline: 

  • July 14th - 10:00 CET - Job processing microservices can't communicate properly with the system DNS. The malfunction causes microservice replicas to slow down and eventually stop, so the job backlog grows during the event.
  • July 14th - morning - DevOps team starts working on the action plan aligned with AWS. That includes checks across several parts of the cloud configuration.
  • July 14th - afternoon - Several DNS components are updated and modified. DNS communication errors disappear and microservices are able to scale up again.
  • July 15th - A new monitoring system is put in place to follow up on job processing performance over the weekend and beyond.

 

Root cause:

During the weeks before July 14th, the DNS that manages the communication between the many Site Flow cloud services caused sporadic, random issues, which led to some jobs being re-fetched or processing tasks being re-run.

Efforts to fix the issue had not been completed when, on July 14th, DNS errors caused the prepress pipeline to slow down severely.

The DNS errors were fixed on July 14th by updating the configuration of several DNS components.

 

Path forward: 

System observability has been improved to allow a faster reaction to system performance deviations.


Terms/Glossary:

  • “Maintenance Event” means maintenance of the Services that requires their interruption;
  • “Scheduled Maintenance” means a Maintenance Event in respect of which HP has given the Customer at least twenty-four (24) hours' prior written notice;
  • “System degradation” means that the Customer is unable to use Site Flow as usual, so their business is impacted, but the situation is not yet an Outage.
  • “Incident” means any set of circumstances resulting in an Outage;
  • “Outage” means that the Customer is unable to access all parts of the Site Flow Subscription service via both API and web-browser log-in, AND all transmitted orders directed to the Customer’s Site Flow account are not being acknowledged (i.e. the entire Site Flow service is “down”).
  • “Working Hour” means the hours between 09:00 and 17:00 local time, Monday through Friday, excluding national and HP-designated holidays.
  • “Calendar hours” are regular hours counted around the clock (24x7). Correspondence with working hours depends on the Customer's time zone.