after the database had a hissy fit
Yes! We’re back. And we’ve moved to a much faster server. I think it’s happy.
The much faster server has been pretty quick to remove some posts! I just noticed one of mine from a couple of days ago has vanished.
The water is never safe - but it’s often a lot of fun …
I’m very sorry about that. I broke the site: here’s how.
On Tuesday night I got an alert that Discourse (the forum software) had a critical security update. I checked that our sysadmin was on hand in case it broke during the upgrade, made a backup of the database, and ran the upgrade. It seemed to go OK, even though it was a full version update.
However - somehow I stuffed up how Discourse sends email. Maybe something changed between versions and I was supposed to go in and update some settings. But whatever it was, overnight the forum tried to send some email, couldn’t, and got itself into a state.
By the morning it was clear that the forum wasn’t working properly. I hadn’t yet figured out that it was the email configuration, but it was clearly failing. We looked for a quick fix - fortunately we had another version of the system running on another machine. We updated that one, loaded in the backup, and it seemed to be working.
But we hadn’t fixed the underlying configuration issue, so it just broke again.
This morning we realised that it was time to stop making the same mistake. As it happens we’ve recently been trying out a new, much faster web host. Since everything was broken anyway, we figured we should start again from scratch.
We set up a fresh new server on the fast host, gave it double the RAM we’ve been working with, and put in a fresh install of the latest Discourse. We configured everything carefully as we went. Once it was definitely working, we loaded in the backup and pointed the choice.community domain name at the new server.
Domain name changes can take a while to go through (different durations for different people depending on their ISP). But this now seems to have gone through for most people. We’re back.
The upshot, though, is that if you posted anything while the site was sort-of working, after I ran the backup, then it’s lost. My sincere apologies, and next time we do a major update we’ll test more thoroughly before we declare it complete.
No worries, glad it is all sorted now.
I like Discourse, it even shows you when someone else is replying, I can see Syncretic is about to post.
If I understand you correctly you did a hardware and a software upgrade together for a public system with thousands of users that is online 24/7 and didn’t have a migration and testing plan, and you didn’t have a way to roll back.
Out on Fairy Bower on a big day novices might jump on the first wave they saw. If they emerged they were described as having Big Nads. If not you helped get them and their broken board back to the carpark.
Even Microsoft, with their multi-level testing regimes and always reliable software (/sarcasm), $billions of cash flow they do know what to do with, and significant in-house expertise, have taken computers down with automated Windows updates, as has more than one of the AV vendors over the years.
You might re-read @viveka’s post and rethink your oversimplification of what started as an unfortunate ‘oops’ and went worse from there. Choice staff are certainly not the first nor will be the last to be tripped up where the underlying cause of a problem is not so obvious.
Choice have likely tightened up their internal procedures accordingly now, and good on them for letting us know what happened.
Great to have you back… Did you know you can make domain relocations easier and faster by having more control over your DNS records?
Isn’t that when you roll back to the previous state while you work out what to do next?
Do you also ring your bank and advise the CTO/CIO of that when their systems are down for hours, and sometimes days, before they are fully operational again? It is always so easy, isn’t it?
Good to be back.
I thought that there must be upgrades going on but assumed that it may have only been Discourse, as it has been showing critical update required for some time.
Just a thought for the future: if the same happens again, it is easy to create a webpage very quickly which loads when one tries accessing the Choice.Community IP, stating that the website is having problems, under repair or being upgraded, so that would-be (and current) members don’t think that it has disappeared off the planet for ever.
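For what it’s worth, a holding page like that can be very small. Here is a rough sketch in Python using only the standard library. It is an assumption about how one might do it, not what Choice actually runs: the wording of the page, the names `maintenance_response` and `make_server`, port 8080 and the one-hour `Retry-After` value are all made up for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical wording for the holding page; change to suit.
MAINTENANCE_HTML = b"""<!doctype html>
<html><head><meta charset="utf-8"><title>Back soon</title></head>
<body><h1>The forum is under repair</h1>
<p>We are working on it and will be back as soon as we can.</p>
</body></html>"""


def maintenance_response():
    """Return (status, headers, body) for the holding page."""
    headers = {
        "Content-Type": "text/html; charset=utf-8",
        "Content-Length": str(len(MAINTENANCE_HTML)),
        # Assumed value: ask clients to try again in an hour.
        "Retry-After": "3600",
    }
    # 503 signals a temporary outage rather than a site that has gone away.
    return 503, headers, MAINTENANCE_HTML


class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every path gets the same holding page.
        status, headers, body = maintenance_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    # An empty host string listens on all interfaces.
    return HTTPServer(("", port), MaintenanceHandler)

# To run it: make_server().serve_forever()
```

Pointing the domain (or the usual web server) at something like this while the real site is being fixed would at least tell visitors the forum still exists.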
I never said it was easy. Quite the reverse, which is why conversions, upgrades and migrations require careful planning and a fall-back position in case Sod’s Law takes effect. This is the second time you have missed the point, let’s leave it.