Real-time service status from across the ATLAS Media Group portfolio
No incidents reported
We're opening this incident to track the maintenance we will be performing throughout the day and to keep you updated. We expect and plan for Universeodon.com to complete its maintenance a fair bit earlier than MastodonAppUK. This is partly to minimise the time both sites are down at once, allowing us to focus on one at a time, and partly because MastodonAppUK has more assets to migrate to the new infrastructure.
You can follow this incident for updates on when we're taking sites down and on our overall progress throughout the day.
The cluster network issues were resolved under a different incident. As the remaining migration work is planned to involve near-zero downtime, we will spin up separate incidents in each case where we may need to take a site offline for a short window.
Yesterday we completed the work to retire one of our legacy cluster servers. This means MastodonAppUK is currently running at reduced resiliency and reduced capacity for the website. Universeodon has all services currently operating.
Unfortunately the MastodonAppUK migration was simply taking too long and we don't have the capacity to support it running into the early hours of the morning. We've aborted the migration and will also scrub down and reset the new server, ready to attempt a different approach at some point tomorrow.
The Universeodon DB Migration has catastrophically failed again, and as a result we've restored the site onto the legacy infrastructure for the time being. We will now look to perform the migration tomorrow afternoon/evening UK time using an alternative approach. We're seeing extensive performance issues on Universeodon due to heavy latency between our legacy cluster and the new cluster.
The MastodonAppUK DB Migration appears to be running fine at this point. Subject to it completing in the near future, we can switch over to the new DB tonight, with further maintenance now also required tomorrow.
I've done some digging into the Universeodon DB issues. It looks like when the migration previously failed it didn't restart where it left off; it is re-migrating from scratch and will purge all of the original records once the new records are written. In this case our statuses table (which is over 150GB in size) is behaving oddly. I'm going to let it run for a bit longer, but if it isn't complete within the next hour we will have to abort the migration, restart everything on the old DB, purge what we tried to copy across, and re-run the migration tomorrow.
I'm back and checking everything now. The Universeodon migration appears to still be running, though it is reporting a larger DB on the new server than on the old, so it's likely something has gone significantly wrong there. The MastodonAppUK migration appears to still be underway and slowly progressing.
Unfortunately the Universeodon database migration is taking considerably longer to complete than we expected and is not going to be finished before I need to go offline for the evening.
With that in mind I'm going to re-start the MastodonApp migration and the work will continue later this evening. I'm hoping everything will be online before the end of today.
We have stopped the MastodonAppUK DB Migration to free up additional network and IO capacity for the Universeodon transfer, as I have a hard stop and plans this evening so won't be around for a few hours. I'm hoping the remaining 50-60GB of Universeodon data will transfer fairly quickly so I can get the site online before I need to leave. MastodonApp will then resume its transfer once Universeodon's is complete, with the goal of bringing that site back online tonight.
The Universeodon transfer is continuing and appears to be around 3/4 of the way there now. I'm hopeful it will complete in the next hour or so and enable us to bring the site back online.
MastodonAppUK's transfer is going slower than we would like and, due to availability issues this afternoon, is now likely to result in the site being offline until later tonight.
The new MastodonAppUK DB Server is now set up and we're going to take the site offline to enable the migration to the new environment.
We're continuing to monitor progress on the Universeodon transfer; as far as we can tell it has picked up where it left off.
Good Morning, I'm re-starting the maintenance now.
The Universeodon transfer appears to have hung around 5:40 this morning (UK Time). I'm hopeful I can re-start it where we left off. I will keep you updated.
Once I'm happy the Universeodon transfer is working I'll switch focus to the MastodonAppUK one.
It looks like the Universeodon DB Migration is due to take another 5-10 hours (it's hard to tell) given the volume sizes I can see and how much has transferred so far. With that in mind we're going to have to leave the site off overnight, and I'll look to get everything operational first thing in the morning once I have woken up, as it's now 1AM here and I do need sleep! Apologies for the extended outage; this very much was not the intention.
MastodonAppUK is now back online and we're going to pause our work on that site.
Universeodon is still mid-migration so we're going to wait for that to complete and finish the switch-over before calling it a night.
With how long the migrations and upgrades are taking (and how far over the maintenance window we already are), once the MastodonAppUK Upgrade completes we will bring the site back online and pause the work there, migrating the DB to the new server tomorrow instead.
Universeodon is still in the process of migrating. Once that's complete we will get configurations updated and the site online tonight, which should conclude the disruptive maintenance to the site for the time being.
The Universeodon transfer is now underway and I'm finishing the configuration on the new DB Server, ready to start switching the site over to it. I'm expecting there to be some hiccups while we switch over, but the priority is going to be to get the web tier online first and the content processing afterwards. There are a reasonable number of configuration files I need to update across the servers once we're moved over.
The MastodonAppUK Upgrade to Postgres 17 is running and once complete I'll bring the site up again to validate all is working before repeating the transfer that's currently underway for Universeodon.
The MastodonAppUK backup has now completed and the upgrade from 16.x to 17.x is now underway; from what we've seen this is likely to take 2-3 hours.
The new DB Server for Universeodon is now ready and we're going to take the site offline to facilitate the data migration. We expect this to take between 1 and 3 hours, after which we will switch all traffic to our new cluster, where we expect performance to be noticeably better.
Universeodon's DB upgrade has now completed and we're getting the site back online now to make sure everything is working. We'll keep the site up for a short period to ensure everything is stable, and then we'll look to take it offline one final time to migrate the DB onto our new cluster.
MastodonAppUK's DB is currently being backed up, after which we will start the 16.x to 17.x upgrade process.
The Universeodon Postgres update continues.
We're taking MastodonAppUK back offline now to perform the 16.x to 17.x upgrade on Postgres before we then move it onto the new cluster.
MastodonApp's back-end database has now been upgraded to PostgreSQL 16.x. All services have been restarted and we're letting everything run for a little while to make sure it is performing as expected.
Universeodon is currently progressing through the 16.x to 17.x upgrade. Once that is complete we will get the site online to test before we perform the migration of the database onto our new server cluster. Unfortunately this means the Universeodon maintenance is running over schedule and we don't currently have a firm estimated completion time.
The MastodonApp database has finished its backup and is now in the process of upgrading from 15.x to 16.x.
We are going to stop the Universeodon services now and get the Universeodon database moved from 16.x to 17.x ready for the final part of the maintenance to move it to the new compute cluster.
Universeodon is currently operational on the upgraded Postgres version. We're going to leave the site online for a short while longer while we perform backups of MastodonAppUK's database; once MastodonAppUK's DB starts its 15.x to 16.x upgrade we will take Universeodon back offline and prepare it for the 16.x to 17.x upgrade.
Universeodon's upgrade of Postgres from 15 to 16 should be complete. We are bringing the site online now and letting things run to test, then we will finish the cleanup of the 15 install before moving from 16 to 17.
We're about to start the same process now on MastodonAppUK which will result in some intermittent outages followed by a somewhat extended outage while we perform the upgrade.
The Universeodon database is currently in the process of upgrading from 15.x to 16.x; it may take a little while due to the size of our database and the volume of historic content we store as a well-federated instance. We will look to bring the site back online, ensure everything is running smoothly, and then we will repeat the process to upgrade from 16 to 17.
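For anyone curious what one of these major-version upgrades actually involves, below is a rough sketch of a typical pg_upgrade flow. It is illustrative only and not our exact runbook: the binary and data paths are assumptions, and a dump-and-restore is an equally valid route.

```python
#!/usr/bin/env python3
"""Illustrative sketch of a PostgreSQL 15 -> 16 major-version upgrade.

Not our exact runbook: the binary/data paths are assumptions, and the same
result can be achieved with a dump-and-restore instead of pg_upgrade.
"""
import subprocess

OLD_BIN = "/usr/lib/postgresql/15/bin"    # assumed install locations
NEW_BIN = "/usr/lib/postgresql/16/bin"
OLD_DATA = "/var/lib/postgresql/15/main"
NEW_DATA = "/var/lib/postgresql/16/main"


def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)


# 1. Take a full backup while the old cluster is still running.
run(f"{OLD_BIN}/pg_dumpall", "-U", "postgres", "-f", "/backups/pre-upgrade.sql")

# 2. With both clusters stopped, dry-run the compatibility check.
run(f"{NEW_BIN}/pg_upgrade", "--check",
    "-b", OLD_BIN, "-B", NEW_BIN, "-d", OLD_DATA, "-D", NEW_DATA)

# 3. The real upgrade; --link hard-links data files rather than copying them,
#    which matters when the statuses table alone is over 150GB.
run(f"{NEW_BIN}/pg_upgrade", "--link",
    "-b", OLD_BIN, "-B", NEW_BIN, "-d", OLD_DATA, "-D", NEW_DATA)
```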
The first stage of the Universeodon maintenance has gone well. We now need to take the site offline again and will be performing an upgrade of our PostgreSQL instance from 15 to 16 before we bring the site back online to verify everything is operational. We will need to repeat this afterwards for 16 to 17.
We are starting some work on Universeodon now. We're taking the content processing and websites offline to perform upgrades to the underlying database. We will then bring the site back online for a short period while we complete some additional work, ready to move the database to the new environment. Once we're happy that we're ready to move, we will update again and take the site offline while the database is moved.
We are performing maintenance to MastodonAppUK now to upgrade to the latest version of Mastodon (4.4). We expect there to be a short outage while we restart servers at around 20:30 UK time (UTC +1) but otherwise maintenance will progress without impact to the community.
Maintenance fully completed.
The upgrade has now mostly completed. We're running through the last of the database migrations in the background, but the site itself should now show as the latest version of Mastodon. We will continue to keep an eye on things for the next couple of hours to make sure everything is behaving as expected.
Due to a major email campaign currently being executed for Universeodon.com we are hitting our per-second limit for sending e-mails across the system. This means there is likely to be a 30-40 minute delay on e-mails being sent while we work through the backlog of e-mails to be delivered to registered users on Universeodon.
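To illustrate why a per-second send cap shows up as a delay rather than as failed mail, here is a tiny throttling sketch. The cap of 10 messages per second and the backlog size are made-up numbers of roughly the right order, not our real limits.

```python
"""Toy illustration of a per-second e-mail send cap producing a backlog delay.
The cap and backlog figures are hypothetical, not our real numbers.
"""
import time

SENDS_PER_SECOND = 10      # hypothetical provider limit
BACKLOG = 25_000           # hypothetical number of queued campaign e-mails

# If mail drains at exactly the cap, the last message waits roughly this long:
print(f"~{BACKLOG / SENDS_PER_SECOND / 60:.0f} minutes to clear the backlog")


def send_throttled(messages, per_second=SENDS_PER_SECOND):
    """Deliver messages no faster than the cap; later messages simply wait."""
    for i, msg in enumerate(messages, start=1):
        deliver(msg)               # stand-in for the real SMTP/API call
        if i % per_second == 0:
            time.sleep(1)          # this second's quota is used up


def deliver(msg):
    pass  # placeholder for the actual mailer
```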
We will be upgrading the Universeodon site to the current 4.4 version of Mastodon this evening starting at 7PM UK Time (GMT +1). During this time the site may be offline for short periods of time while we apply the upgrade steps. Further updates will be posted here during the course of the maintenance.
All maintenance is fully complete and the site is working as expected.
The site itself is now up to date. We're applying the last of the database migrations behind the scenes before we push some updates to search and our content processing queues, as a new queue has been introduced which is not currently configured. We will update you as we progress or if there is any impact to service.
Maintenance has started. It looks like the majority of the work for the upgrade can be done behind the scenes and shouldn't result in an outage. We're running into a small problem at the moment which we weren't expecting, but expect the upgrade to complete shortly.
Our primary router within our Redditch location appears to have gone offline. We are currently investigating the root cause of the issue. This is impacting MastodonAppUK and Universeodon at this time.
Service is fully operational and queues have cleared.
We are continuing to monitor the queue lengths; we have just under 200k jobs currently outstanding, with around a 40-50 minute lag depending on the specific queue. I'm reluctant to push the processing any further than it already is, as it may start to negatively impact the site. I will take another look in the morning and make sure everything has fully recovered as expected.
Our colo provider has been able to restart our networking gear and we've restarted our services to get the site online. We are now clearing through a backlog of content processing jobs that built up while the site was offline, so there might be some delays on content loading; we will continue to monitor while queues are busy.
We are waiting on an update from the remote hands team at the site to confirm the status of our equipment. Once I have an update I will share it with you all.
No incidents reported
No incidents reported
No incidents reported
No incidents reported
Universeodon has gone offline, we are investigating why.
Service has been restored. We have around 150k retry jobs, which we hope represent the majority of the jobs that were queued at the time of the outage, but we know we have lost an unknown number of jobs, mostly ingress jobs we think. We expect the site to be under extra pressure for the next 12-24 hours while other servers on the fediverse retry sending us their content and activity. We will close this incident now and pick up the existing main incident related to the original connectivity issues.
Our database server appears to have stopped routing traffic at around 1AM UK time; we are unsure why this happened and are aware it has happened previously. This took the site down, but also resulted in the loss of an unknown number of jobs, as our content processing continued trying to run and, due to the length of the outage, marked jobs as dead and not to be retried. It is impossible to know which jobs we lost or how many; we are attempting to retry the ones currently in the dead queue and have a huge backlog of retry jobs.
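For context, Mastodon's background jobs run through Sidekiq, which parks killed jobs in a "dead" sorted set in Redis; retrying them means pushing each payload back onto its original queue. The sketch below shows what that looks like at the Redis level. In practice this is normally done from the Sidekiq dashboard or a Rails console, and the Redis URL here is an assumption.

```python
"""Rough sketch of re-queueing Sidekiq 'dead' jobs directly via Redis.

Sidekiq stores dead jobs as JSON payloads in the 'dead' sorted set; retrying
pushes each payload back onto its original 'queue:<name>' list. Normally this
is done from Sidekiq's Web UI or a Rails console; the URL below is an
assumption for illustration.
"""
import json
import redis

r = redis.Redis.from_url("redis://localhost:6379/0")

for raw in r.zrange("dead", 0, -1):
    job = json.loads(raw)
    queue = job.get("queue", "default")
    r.sadd("queues", queue)          # make sure the queue is registered
    r.lpush(f"queue:{queue}", raw)   # put the job back on its queue
    r.zrem("dead", raw)              # and remove it from the dead set
```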
No incidents reported
We are seeing connectivity issues between Universeodon's old compute cluster and our new compute environment with part of our data storage, as a result the site is offline. We are urgently working to resolve the situation.
The planned maintenance tonight has been cancelled and will be re-scheduled, as the site is now working as expected.
We will raise a new incident here and announce via @wild1145@mastodonapp.uk and via server announcements when the new maintenance will take place.
We do still need to re-build the feeds for all users; however, we will pick that up under the existing incident here.
Universeodon queues are now tracking at under 10 minutes. We do still have a lot of retries to execute, so there might still be some delays, but it looks like we're mostly back to where we should be.
We will shortly spin up a new incident to cover the database move to the new infrastructure and we expect the site to be down for a few hours while we do this.
Due to the overnight database connectivity issues we have lost an unknown number of jobs which tried and failed to run due to DB connectivity issues.
Our queues this morning are difficult to fully process as we have around 152k retry jobs and a rapidly growing set of standard queues as well. We expect some fairly slow processing now for the next few hours but will continue to monitor.
Following this database issue, and the fact it has happened multiple times in recent memory including over the last couple of days, we will look to take the site offline for maintenance later today and migrate the database to a new server in the hope we can remedy some of these connectivity issues.
We are now seeing around 50 minutes of lag on the default queue and around 1 hour on our ingress and push queues. As a result we're scaling back our ingress capacity to prioritise the other queues and free up capacity to ensure the WebUI works as expected and content is pushed where it needs to go. We currently have a total of around 285k jobs queued across our queues.
We are seeing some reduction in speed and responsiveness on Universeodon; however, our intention for the time being is to keep the content processing ticking over at its current rate. We are making good progress through our ingress queue: it's now down to around 3 hours of latency from live, with around 184k jobs in the queue. Our default queue is currently lagging at just under 30 minutes from live, with just under 110k jobs in the queue. We will review the situation in a couple of hours and, if needed, re-balance the queues again to re-prioritise the default queue and ensure timelines aren't significantly impacted beyond any existing impact.
Ingress queues on Universeodon are currently tracking at approx 4 hours behind real-time, with the default queue currently slightly delayed at just under 20 minutes. We've adjusted our balance of queues to prioritise our default queue and non-ingress queues, as the ingress queue will often cause other jobs to spawn and can rapidly increase the job count on the other queues.
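For anyone curious where the "hours behind" figures come from: Sidekiq queue latency is essentially the age of the oldest job waiting in each queue, which is the same figure the Sidekiq dashboard reports. The rough sketch below reads it straight from Redis; the connection URL is an assumption, and the queue names are Mastodon's standard Sidekiq queues.

```python
"""Sketch of reading Sidekiq queue backlog and latency straight from Redis.
Latency is the age of the oldest waiting job. The Redis URL is an assumption;
the queue names are Mastodon's standard Sidekiq queues.
"""
import json
import time
import redis

r = redis.Redis.from_url("redis://localhost:6379/0")

for queue in ("default", "ingress", "push", "pull", "mailers"):
    key = f"queue:{queue}"
    size = r.llen(key)
    oldest = r.lrange(key, -1, -1)   # jobs are LPUSHed, so the tail is oldest
    if oldest:
        enqueued_at = json.loads(oldest[0])["enqueued_at"]   # epoch seconds
        lag_min = (time.time() - enqueued_at) / 60
    else:
        lag_min = 0.0
    print(f"{queue:>8}: {size:>7} jobs queued, ~{lag_min:.0f} min behind live")
```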
MastodonAppUK continues to look okay in terms of queue length and delays, with real-time processing continuing.
All queues except the ingress queue are now back to real time processing. Ingress is currently around 6 hours behind live, we're going to try to increase the processing capacity a little further to hopefully clear through the backlog a bit quicker.
We are still going to need to do a feed re-build due to previously lost Redis data; however, this will happen once the database has been migrated to the new infrastructure.
Queues on Universeodon are coming down slowly, we've again increased processing capacity as things appear to be running a lot more reliably now.
Ingress is around 7 hours behind live and the default queue (Timelines and other aspects) is currently tracking approx 20 mins behind live. All other queues are processing around real time and are within our expected ranges.
The queues on MastodonAppUK have now cleared and we're processing as normal in real time.
We've just increased some of our queue processing again on Universeodon in the hopes that it will help work through the queues there now that we are more confident with our original content processing configuration.
We have reverted one of our original tuning changes, which had increased the number of content processor processes running, by reducing the number of threads each one had to match the original CPU configuration of the VMs. This appears to have made a noticeable difference in performance and reduced the issues crossing the boundaries between our old and new infrastructure. MastodonAppUK is now fully operating back on the original infrastructure platform, with Universeodon seeing a 2x increase in processing capacity without any noticeable impact to site or end-user performance at this time. We will continue to monitor.
We are seeing increasing queue sizes on Universeodon and MastodonAppUK, with some intermittent issues on Universeodon's website. We have scaled again to the maximum safe point without doing any other work, so are hoping to keep the queues somewhat manageable. Ingress is currently tracking around 9 hours behind on Universeodon, while all other queues are between 30 and 45 minutes behind at this time.
For MastodonAppUK we are going to activate our legacy content processing nodes on the old infrastructure, in the hope that this will both work down the queue and allow us to temporarily disable the new content processing nodes. As our Redis and database servers are still on the old cluster it should be as if nothing had ever moved, and we hope this will give some much-needed capacity to the Universeodon service. Queues are currently around 40 minutes behind on ingress and 1 hour behind for the feeds and default queues.
The majority of the stability issues have cleared overnight, and with the majority of the MastodonAppUK backlog cleared the demand on connectivity between the two sites has dropped considerably. We're slowly ramping up the content processing on Universeodon to catch up on ingress, which was not running overnight, and to try to keep on top of the MastodonAppUK queues.
We will look to migrate the Universeodon database server this afternoon / evening onto the new compute cluster as we hope that will help the overall performance of the environment and will allow us to route all Universeodon traffic locally within the new DC.
We have managed to get the site online after bringing the database server's connectivity fully online again. We've re-enabled a very small amount of content processing capacity; however, this is already proving to cause Universeodon to struggle. We're not sure why Universeodon is struggling more than MastodonAppUK, but we're monitoring the situation and will look to expedite the migration to the new compute cluster for all other aspects of the sites.
We're now experiencing an issue with the Universeodon database server, which is no longer connecting to the network. This is preventing any sort of access and resulting in the site remaining fully offline. We are working as fast as possible to resolve the matter.
We have confirmed the issue is related to our Sidekiq processing. We've now suspended all content processing and switched over to the new website infrastructure. We will start to bring the Universeodon content processing back online, and then the MastodonApp content processing, to ensure stability.
We have reversed the change to our load balancer configuration and are re-routing the traffic back through our old infrastructure for the web services at this time as we are unable to bring the new server online for reasons which are not clear at this time. We will continue investigating.
We are continuing to have significant issues getting the new web infrastructure to serve public traffic and the root cause is currently unknown. We are working as hard as we can to restore service.
In an effort to stabilise the site itself and to ensure events destined for Universeodon.com are delivered, we have suspended the active processing of content while we re-build the content processing servers on the new compute cluster. Content will then be processed on our new infrastructure, where we should be able to better catch up with the backlog.
We think we've identified the issue as relating to the traffic routing between our old and new clusters. Our MastodonApp.UK content processing still talks to our Redis server on the old environment, while our web and content processing services talk to Redis on our new environment; we think the delays to MastodonApp.UK, and our work to increase processing capacity there, have significantly impacted service here.
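To show why that cross-cluster hop matters so much: every Sidekiq job costs several Redis round trips, so per-job overhead scales directly with network round-trip time. A quick back-of-the-envelope sketch follows; the hostnames and the round-trips-per-job figure are assumptions, not measurements from our environment.

```python
"""Back-of-the-envelope: how Redis round-trip time caps job throughput.
Hostnames are placeholders and '5 round trips per job' is a rough assumption.
"""
import time
import redis


def avg_rtt_ms(url: str, samples: int = 50) -> float:
    """Average PING round-trip time to a Redis server, in milliseconds."""
    conn = redis.Redis.from_url(url)
    start = time.perf_counter()
    for _ in range(samples):
        conn.ping()
    return (time.perf_counter() - start) / samples * 1000


for label, url in (
    ("same rack (old cluster)", "redis://redis-old.internal:6379/0"),
    ("cross-site (new cluster)", "redis://redis-new.internal:6379/0"),
):
    rtt = avg_rtt_ms(url)
    jobs_per_sec = 1000 / (rtt * 5)   # assumed ~5 Redis round trips per job
    print(f"{label}: {rtt:.1f} ms RTT -> at most ~{jobs_per_sec:.0f} jobs/s per worker thread")
```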
We are working now to migrate Universeodon content processing onto the new compute environment which we hope will minimise some of this disruption and will continue to monitor the situation.
We are currently migrating our Elastic instances to new infrastructure and as a result need to re-build our search indexes from scratch. This is unfortunately experiencing some delays and is not running as expected; as a result, full-text search on the site is not currently operational.
Our search migration has completed, and did so some time ago; apologies for the delayed update.
With the fixes applied to our core infrastructure we have restarted the process to populate our new search indexes. The process is now well underway, and search, while not complete, should be functional with the data ingested so far. It's likely to take another day or two to fully populate the remaining statuses that are flagged for indexing.
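For anyone tracking along, reindex progress is easiest to judge by per-index document counts. A small sketch using Elasticsearch's count API is below; the host and index names are assumptions (Mastodon's indexes are typically accounts, tags and statuses, possibly with a configured prefix).

```python
"""Sketch of tracking an Elasticsearch re-index by per-index document counts.
The host and index names are assumptions; Mastodon typically uses 'accounts',
'tags' and 'statuses', possibly with a configured prefix.
"""
import requests

ES = "http://localhost:9200"

for index in ("accounts", "tags", "statuses"):
    resp = requests.get(f"{ES}/{index}/_count", timeout=10)
    if resp.ok:
        print(f"{index:>10}: {resp.json()['count']:>12,} documents indexed")
    else:
        print(f"{index:>10}: not available yet (HTTP {resp.status_code})")
```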
We have had to pause work on the MastodonAppUK search migration. We will look to continue the work in the coming days, once we migrate the MastodonApp Redis and DB to the new infrastructure, which we think should make the search migration smoother.
Due to earlier issues at our colocation data center parts of our in-memory storage may have been lost resulting in delays to content loading and timelines potentially not being complete. We are investigating the issue at this time and will update when we can.
Feeds have been re-built, please contact our support e-mail if you experience any further issues.
We will re-run the feed re-builder once the database server is migrated to the new infrastructure, as we're currently running into latency issues with our existing tooling which are preventing the job from completing.
Feeds do not seem to be fully functional. We are running a re-build in the background now and hope this will resolve issues for all users.
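For background, home feeds in Mastodon are cached in Redis, so after in-memory data loss they have to be regenerated from the database using the bundled tootctl tool. A minimal sketch follows; the install path is an assumption and operators would normally run the command straight from a shell rather than via Python.

```python
"""Minimal sketch of regenerating Mastodon home feeds after Redis data loss.
Mastodon ships `tootctl feeds build` for this. The checkout path is an
assumption, and this is normally run directly from a shell, not Python.
"""
import subprocess

subprocess.run(
    # --all rebuilds every user's feed; check `tootctl feeds build --help`
    # for the exact options available on your Mastodon version.
    ["bin/tootctl", "feeds", "build", "--all"],
    cwd="/home/mastodon/live",   # assumed Mastodon install path
    check=True,
)
```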
Due to earlier issues at our colocation site and interruptions to networking, we are currently seeing performance issues and delays of around 40 minutes to our content processing. We are working to resolve the situation.
Closing this incident; further updates are linked to the Universeodon connectivity issues, which have become the primary incident.
Queues continue to grow due to increasing latency on the Redis connection. Unfortunately there is no obvious path forward at this point, and issues with other aspects of the service are restricting my ability to properly resolve this issue.
All queues are currently backed up by between 2 minutes and 1 hour. We aren't processing events very quickly due to the issues with our Redis configuration. Depending on time we may attempt to migrate the Redis server onto the new infrastructure tonight; however, given the issues Universeodon had when we did this move, it is not ideal. We will keep you updated.
We are continuing to work to minimise the queue lengths. Unfortunately, due to the connectivity constraints between our old environment and the new environment, increasing our queue processing capacity has had a substantial negative impact on Universeodon, and as a result we've had to scale back our setup. We think the latency between our new cluster and old cluster is partially to blame for jobs queueing and struggling to process; we continue to investigate and will do what we can to minimise disruption.
We are seeing queues continue to grow and our content processing is struggling to keep up. We're rolling out some configuration changes which should increase capacity, and we will continue to scale up our content processing until the backlog is back to a manageable state.