Real-time service status from across the ATLAS Media Group portfolio
No incidents reported
No incidents reported
No incidents reported
We have identified major performance issues with the Universeodon database, which are currently causing upstream performance issues for both the website and content processing. A full resolution would require a multi-hour outage of the site and our content processing, so because of the ongoing US election coverage we will delay it for now. In the meantime we will continue to monitor the situation, balancing website performance and access against the needs of our other back-end resources.
The data migration has now completed. We're starting services back up and monitoring to see if this has resolved our issues. Further updates to follow.
We've taken all services offline and are starting the storage migration now.
From 15:30 UTC today (6 November) we will be taking Universeodon offline to migrate the database storage, with the expectation that this should resolve the performance issues we've started to see. This will then allow us to scale up our content processing and web tier to keep up with demand, and should ultimately resolve a lot of the issues folks have been seeing. Apologies for needing to take the site down during such an active period in the world, but we don't currently have a better option that would give the Universeodon community a good experience.
We are currently recovering from an incident with our content processing service. This is resulting in a delay to newly ingested content and intermittent delays to other content. We're continuing to re-balance our content processing capacity across the various types of content we need to process, to speed up delivery of this content to the site.
The database storage migration fully resolved the issue.
Our content processing is once again falling behind, with queues between 30 minutes and over 1 hour behind. Ingress is currently at least 1 hour behind. I'm working to adjust scaling; however, we're seeing significantly higher load than usual at this time which, combined with our known database issues, is resulting in poor performance all around.
All feeds have now caught up. We will continue to observe live for a short time before we sign off for the night, and will check back in tomorrow morning to ensure nothing falls over for too long.
All of our content feeds, with the exception of ingress, have once again caught up fully. Our ingress queue is now running approx 30 mins behind real time and continues to shrink. I'm hoping that in the next 20-30 mins we will have that queue down to real time as well and can get content processing back to its normal state.
We're now down to approx 55 mins of backlog on our ingress pipeline and have cleared a large amount of the backlog. We're currently seeing around a 10 min delay on processing other content, which may mean you have a slightly outdated feed. This will resolve itself shortly: as soon as our ingress pipeline has been processed, the freed compute capacity will work through any other content in the queue, which should keep things running more smoothly tonight.
We are continuing to see a steady reduction in our ingress queue and a roughly parallel increase in our other queues as the posts get processed into timelines and other aspects of the application. We're currently still tracking around 1 hour behind on ingress, and we're seeing a small spike where some timelines and other basic functionality are between 5 and 10 mins behind real time. Often a refresh will fix this; alternatively, wait until the queues catch up. We're continuing to adjust the scaling of the service behind the scenes to account for these fluctuations.
We are monitoring the status of the queues, and current data suggests the balance we have is slowly burning down our ingress queue while keeping our other queues at near real time (anywhere up to 2-3 mins of lag currently). We will continue to monitor and update as appropriate.
I've managed to reset our content processing scale back to a sensible default and cleared all queues except ingress, which is currently around an hour behind based on the oldest item in the queue. I'm now going to do some delicate scaling to try to scale us up without breaking too much.
We're starting to see major disruption to the site, likely a result of an increase in traffic to the website itself further stressing our capacity. We've paused content processing for a moment while we work out how to scale content processing back up more safely without crashing things further.
We have reverted our additional server, as the capacity constraints on our database are currently too great to allow us to scale beyond where we already were. We're looking to restore full service as a matter of critical urgency.
I'm currently trying to increase our database's capacity without needing to take the entire service offline for an extended amount of time. The hope is that this may unlock additional capacity for the second content processing service, which is currently unable to connect because we have saturated the capacity of our database. Updates to follow.
We have confirmed that part of the impact of this issue relates to ongoing performance issues on our database. We are currently building out an additional content processing server to attempt to pick up some of the slack and reduce the backlog. We're still tracking a lag of approx 2 hours on new content being ingested through our queues, with all other queues currently remaining near real time.
No incidents reported
No incidents reported
No incidents reported
It appears that at around mid-day UK time our content processing services crashed due to an ongoing bug in the processing engine. I've got the service back online and we're working through the content backlog.
We are back fully online.
The site is now mostly back online; however, we seem to be seeing huge spikes in traffic, which are significantly disrupting the site. We're working to get things back up and running but have limited capacity to scale up further than we already have.
In attempting to restore content processing, it appears we have caused major disruption to the entire site. We're working to get things back up and running.
No incidents reported
No incidents reported
No incidents reported
No incidents reported
No incidents reported