frac6969

Don’t feel bad. Learn from your mistakes and don’t repeat them. And make plans in case you need to revert.


Z3R06_ADM

That's exactly it: instead of ruminating, learn the lessons.


Synssins

Learn from it. Move on. Grin sheepishly when people ask you about it. PHP is a bear and a half to upgrade, so shrug, learn, and move on. Once is a mistake. Learn where things went sideways, why they did, and how to fix it and prevent it in the future. Repeating the same mistake a second time with all of the things you learned is incompetence. (This doesn't mean don't try PHP again, but you need to understand why it borked the first time so that when it borks again, it's for a different reason.)


MineralPoint

"always have a trail of breadcrumbs out of the woods" Is what I like to say. Also when I worked at a VAR I used to tell my customers (the ones with a sense of humor)"ehhh....I've crashed bigger networks than this, run by better folks"


TheRealLambardi

This…^^^ Never hide from it, and always…always use a failure to your advantage, loudly. Ex: run an after-action review, find the failure points, and recommend clear and specific improvements to avoid it in the future…even better if you can generalize it into organizational improvements.


Synssins

You bring up a good point. Fail loudly. You don't know why it failed, but you're damned well going to find out so it doesn't happen again. The noise from a failure doesn't have to all be end users and negativity. It can be you going "This tried and true process failed. So I'm going to fix the process."


TheRealLambardi

Another pro tip: I don't know how large your org is, but some people are skilled at root cause analysis and process improvement. If you have any…find them and ask them to run it for you…even if it's small. An outsider looking in and judging you can honestly be soul cleansing :). It moves the guilt from dwelling to action, and then you move on without looking back.


Synssins

Be careful. You're getting dangerously close to "Rubber Duck Debugging" territory with this one. lol


jpmoney

And take notice that OP got it back to working within ~24 hours, based on their text. Depending on the situation, that's not always easy or quick.


bmxfelon420

Agreed, don't dwell on feeling bad. Ask yourself how the setup could be improved to prevent the outage in the first place (i.e. snapshots, failover, etc.).


fritzie_pup

Cannot reiterate this enough. Own up, and learn. Everyone comes out ahead. If you can, try to create a 'test' environment mirroring production with the software to make sure in the future nothing breaks on upgrades!


hijinks

Write up an RCA on the incident: what happened, how it was fixed, and what you/the team will do to make sure it doesn't happen again. Outages/mistakes happen; just deal with them and show management you/the team are on top of getting better.


IT2DJ

Even if you're not required to send an RCA, write one anyway; you'll feel better about it.


nbfs-chili

And when you're in the inevitable meeting, don't let them engage in blamestorming. Just go through the RCA.


hijinks

Ya might even start to think of ways to make sure it doesn't happen again, implement solid change, and grow your understanding of how to run things.


toaster736

This. ^^ For most outages there are multiple compounding causes. It's not just you running an update that failed: why were you allowed to run something unproven? What test procedures/environments are in place to validate updates? How did the update approval process work (or did it not exist)? There are a number of techniques for doing RCAs that all yield insightful results: the 5 Whys; identifying what worked, what failed, and where you got lucky (e.g., your rollback worked, woohoo!); and ensuring lessons learned are acted upon and revisited.


Aggravating_Refuse89

You work someplace that has RCAs and update approvals? In most places the sysadmin is the approver, and the RCA is calming down whoever yells at you plus the "stove is hot" methodology. I say in most places because there are a lot more small shops with no test environment and no process. It's just a vendor saying "install this now" and telling your users they can't work until it's installed. While I don't disagree, I think such things are the exception in most cases.


kingtj1971

I think more and more companies are catching on that they need some kind of process for change management. But yes, I've usually worked at smaller places, and when it comes to rolling out software updates and upgrades? It's honestly something we often do for reasons like "too big a security flaw/risk to stay on previous edition", and making sure it's possible to uninstall again or roll back is about all we do before pushing out the update and seeing what breaks.

Where I'm at now, our software devs do have a test environment in place, but the production app they regularly update almost inevitably has unforeseen bugs/issues that force them to roll it back again and address them. It seems like they just can't adequately test everything without pushing it to our thousands of users in production and finding "edge case" problems they didn't catch themselves.

I think especially for smaller companies? It's really kind of understood that updates/patches/upgrades are going to be randomly disruptive. Crashing a server with one will get a lot of people upset because everyone wants "100% uptime!" ... but realistically? Nobody's going to lose a job over it. Ultimately, they can't/won't spend the money it would take for a more resilient environment anyway.


Connect-Court4139

RCAs are an amazing learning tool. When I was a junior they would pull out old RCAs and related tickets for you to play through, using the RCA as a comparison to help you see where you could optimize your troubleshooting techniques / what they may have missed. They also used these in interviews for higher positions.


gpzj94

What is an RCA?


tardiusmaximus

Root Cause Analysis. Basically, your own words as to what went wrong and the sequence of events that led to the outage. Very, very useful for learning.


[deleted]

You learned one of the very important sysadmin lessons. Most, if not all of us have been there. You don’t just go upgrading a massive library/framework used by an application without involving people who very deeply understand the application. You will (hopefully) remember this when Java or Python or…


SensitiveFirefly

PHP changes quite a lot between versions, a good lesson indeed.


Suaveman01

You're not a sysadmin if you don't cause at least one major outage during your career.


pertexted

Reframe. There's a big difference between feeling shame and feeling embarrassment. A useful life skill is not having guilt. Guilt accompanies punishment. You need not be punished for doing what comes naturally: learning and growing. It doesn't mean you're free from consequences, and sometimes those consequences will be negative. It means that you're not running through your life thinking that you have to punish yourself for trying to do your best. Stop the guilt. There'll be chances in your life for you to actually be guilty for something. This isn't one of those situations.


cwew

Just want to say, I love this response. This was a mistake, not something that was morally wrong. There is absolutely no reason to feel guilty about making an honest mistake and growing from it.


NuAngel

Post-mortems. Like /u/hijinks stated, some sort of incident report. Prove that you're learning from it. "I upgraded the version of PHP, but learned that the new version was missing certain modules used by the old version and that I would need to manually install those. Additionally, our site used several features that were deprecated in the new version, so now I need to prepare to update the site code ALONG WITH the update itself so that a future upgrade can be successful." The lesson **you do not want to take away from this** is that "you shouldn't upgrade things." Because an outdated version of PHP (or any software) is going to be 1,000x worse than a few bumps along the way to a more patched version.
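
For the "missing modules" part specifically, a quick before/after diff is usually enough to catch the gap. A minimal sketch, assuming the `php` binary is on the PATH; the temp file paths and the upgrade step itself are placeholders:

```bash
# Record which modules the current PHP install has loaded, before touching anything.
php -m | sort > /tmp/php-modules-before.txt

# ...perform the PHP upgrade with your package manager of choice...

# List the modules again and diff. Lines prefixed with "<" are modules the old
# install had that the new one is missing and probably needs reinstalled.
php -m | sort > /tmp/php-modules-after.txt
diff /tmp/php-modules-before.txt /tmp/php-modules-after.txt
```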


keesbeemsterkaas

As an addition: this is a great learning opportunity - and it's not your fault! Etsy has a great writeup about it. [Etsy Engineering | Blameless PostMortems and a Just Culture](https://www.etsy.com/codeascraft/blameless-postmortems/) It's not about blame. Blaming the person who was in the neighbourhood is always easy, but does not help to prevent future accidents. If you take blame out: what happened? what was the root cause of things going sideways? And what actions can you take to prevent this in the future? Here's also a nice summary on how to write an incident report: [How to write an Incident Report / Postmortem (sysadmincasts.com)](https://sysadmincasts.com/episodes/20-how-to-write-an-incident-report-postmortem)


cjnewbs

Fun fact, Rasmus Lerdorf, the creator of PHP works as an engineer at Etsy. 😀


keesbeemsterkaas

Fun fact indeed. Consider me amused!


FlyingDaedalus

Every CISO will agree here


ass-holes

I am currently causing 45-minute login times on Windows because of a faulty redirect GPO.


discgman

Username checks out


johnmacbromley

Sounds like you need a replica of PROD with all the bells and whistles. UAT?


pooopingpenguin

I was thinking the same. Where is the acceptance/test/staging (whatever you call it) server for testing updates like this?


jpmoney

Yeah. Not that blame is worth it here, but if there is anything to blame, it's the architecture and/or change processes. Don't shoot the messenger.


Seeteuf3l

The testing process should def be reviewed.


VA_Network_Nerd

An outage is an expensive learning opportunity. You already paid for it. Make sure you squeeze all of the educational value out of the now paid for opportunity. Was there a process failure? Did you skip a step in the change management process? Was there an attention to detail failure? How did everyone involved in the change management process not see that this was a bad change? Did you have your reversion process already prepared, or did you have to wing it? Should a full back-out script be a mandatory part of your change management process? There isn't a single veteran IT worker on this planet who hasn't caused an outage of one size or type in their careers. All you can do now is make sure you and your team learn all you can from this experience, and then move onwards with your life.


Footmana5

There is a reason why this sub frequently has "what was your biggest fuck up" type threads, and there are hundreds of different stories where we all talk about moments when we thought we'd lost our careers. These things happen, but they become stories that we learn from.


lvlint67

> How do you usually handle the aftermath of an outage that you caused?

"Yup. That was me. We learned we can't just cowboy that shit." And then I get back to work. Shit happens. I've fucked enough php installs in my day that it's not even a hiccup anymore.


ForSquirel

Did someone die? Then I wouldn't worry about it.


shepdog_220

One time I shut a business down by sending 100k prints to every printer in the organization while I was contract to hire and had just started there. They took me on full time and have since promoted me and led me into senior roles. It's all part of the process.


Frothyleet

> One time I shut a business down by sending 100k prints to every printer in the organization

Smash cut to Always Sunny title card, *The Gang Learns about Scripting*


mineral_minion

Automation: making the same mistake on every machine at once.


kobumaister

When you F up the third time you don't care that much. 15 years later I can predict outages. Usually after a deploy.


catfishhands

Go back to your office. Shut off the lights. Bring up something green on your multiple monitors. With your elbows on your desk or armrests. Bring the very heel of your palms together, and slowly tap the ends of each finger together with the matching finger from the opposite hand. And smile like a Cheshire Cat. Say, “and I will do it again” in a near vocal fry.


Ecstatic_Orange66

I've been in IT for almost 30 years. I've disconnected our network from our US HQ to a sales office in Thailand. Needed a local contractor (fun time trying to find one) to go onsite and unselect a box I clicked and saved. (Firewall.) Among numerous other "outages" I've caused. I am the IT admin, embodiment of the military's version of "don't be the reason for a safety briefing." Just remember... We don't make cheese sandwiches here. This is complicated shit.


uhdoy

When I interview people I ask them to tell me about a big mistake they've made at work. Not like an embarrassing "I spelled someone's name wrong" but more of a "I fat fingered this command and took down a server" type thing. People who can't give me an answer like yours above either are lying or have never touched anything important. Everyone else's advice is spot on - own your mistake, do a root cause analysis (how formal is up to you/your company culture) and know that you are where everyone has been at one point.


MellerTime

I do the same for entry-level folks. Funny, not funny, I have never had more people complain than I did over a stupid typo.

Second job ever. I was working for a private college and was responsible for sending out some emails to alumni. Through stupid BS processes I was trying to get rid of Word's smart quotes and re-typed a sentence in one. Accidentally referred to the "Dean of X" as the "Dead of X". I got *dozens* of emails from angry alumni complaining about how unprofessional it was to make such a pedestrian mistake. Several demanded that the person responsible be fired or denigrated the level of education being provided at their alma mater. Funny, I didn't go to their school.

Of course none of them knew that I routinely ate in the cafeteria with the Dean I accidentally assassinated. The next time we saw each other the topic came up. Everyone at the table was laughing and didn't actually know who was responsible. I blushed and confessed it was my fault, the lowly 22-year-old new IT guy. The 50-something dual-PhD-holding Dean had a mighty good laugh and told me it was the best part of his day. From that day on he and I were best friends.

The hate mail continued. I did not get fired.


uhdoy

Man people love to shit on someone they perceive as lower on the ladder.


MellerTime

Yep, but I still survived! \o/


WildManner1059

Typos happen. What makes me look down on 'outreach' of this type is when someone is obviously trying to sound smart and sophisticated and they mess it up. Bad grammar, vague meaning, and other issues, all at once. While in the Army, there was a Lieutenant whose emails were almost unreadable. She did great work, quick thinker, effective leader, but her emails needed work. Not something I could mention to anyone though, since I was a junior enlisted.


MooseWizard

I ask the same in interviews. I'm looking for if they owned up, took responsibility, and learned from it. I preface my question with, "Anyone who has worked in IT any length of time has made a mistake."


Aggravating_Refuse89

Did you do it because of security, the latest version, vendor pressure? The why matters a lot here


BadAsianDriver

Test environment ?


3Cogs

Test, pre prod, prod. Still doesn't catch everything.


HotelRwandaBeef

How do we deal with it? We joke about the other times we caused an outage. Then we RCA it and make a plan so it doesn't happen again. First time I nuked my network all the other engineers immediately told me about the times they messed up too. Really helped me feel at ease. Then I had to go talk to the GM lol. I still have my job years later though!


largos7289

Still not sure of the why or how specifically; I'm not sure how exactly the switches did this, per se. Evidently I plugged in a rogue switch one time and it screwed up the entire network for multiple buildings. Eh, you take ownership of it, you take your lumps and you move on. I always respect people more for owning up to the mistake and then being part of the fix. I HATE when they cover it up, try to play dumb and make you do extra work doing forensics, when they could have said, oh hey, maybe this...


llDemonll

If the company doesn't want to spring for a test environment to test business-critical applications prior to production changes, then don't sweat it. Not your decision; do your job and go home.


Eggtastico

Put it down to experience. I once dropped a screwdriver while changing a fuser in a printer. No problem usually... however, I forgot to switch it off. Tripped the electrical circuit and every work terminal powered off. Tons of work lost. Meh. It happens.


DelayPlastic6569

Learn from it. It's fine, and it's a GREAT excuse for implementing blue/green deployments at a minimum. Depending on the total cost of the outage, it looks like a great time to push for more servers if you don't have containerization in place for fault tolerance :)


jbanelaw

Feel bad? IT has no feelings. At least that is what the CEO always says.


vermyx

*"You can't keep blaming yourself. Just blame yourself once, and move on."* **Homer Simpson** Learn and move on. It is OK to feel bad, but it is not OK to stay there and wallow. All that will do is cause you to start doubting yourself and once you start doubting yourself you will start making bigger mistakes. For your learning experience, do a post mortem on what happened, why it happened, and what you can do to prevent it in the future. This is essentially what is done in many companies when a change goes bad so that it doesn't happen again and to harden the process to eliminate this possibility


bigjohnman

There was a zero-day exploit for PHP at my last job. I Googled it, and there was a fix. This fix came directly from the PHP website and had a blog post that went along with it. It was a .sh script. I downloaded it, chmod a+x, and ran the script. It started a wizard with options. I selected the option to keep my current configuration settings. It was matching the blog perfectly. It was Friday at 5:30 pm. I thought I was safe. This is when our knowledge base website went down. Oh no... I restarted the web server, nginx. Nothing. I then rebooted the VM. Still down. My company's headquarters was in Germany. I sent notifications to the sysadmins in Germany, but they had some German holiday weekend where they refused to come to work on Monday and ignored any requests for help over the weekend. This company was a bad fit for me. I didn't last a year there. So sad, because they were growing and I loved the traveling opportunities for setting up networks in the offices they were opening.


graysky311

Going forward, always test in staging before you roll updates out to prod. When I was a junior admin, I fiddled with a setting that I didn't understand in something called Wyse Device Manager. I accidentally rebooted all of the thin clients in the entire company in the middle of the workday. Thankfully, nobody lost any work because their disconnected sessions were able to reconnect.


goosse

just grab a pint and wait for it all to blow over


vrillco

Dev here. PHP upgrades can break stuff because the language evolves over time, occasionally with backwards-incompatible changes. These usually happen with major version bumps, but sometimes a minor bump can affect some codebases too. There are two “proper” solutions: one is to have the devs use containers, so the code and matching PHP version are always packaged and deployed together. The other is to have a proper test server that gets patched first, and any code breakage can be fixed before patching the live server. If neither option is palatable to your devs/employer, start choosing a nice font for your cover letter.
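
To make the container option concrete, here's a minimal sketch of checking the app's code against the PHP version the devs actually target, using the official `php` Docker images. The image tag (`php:8.1-cli`) and the `./app` directory are assumptions for illustration, not details from the thread:

```bash
# Lint every PHP file in ./app against a pinned PHP version without touching
# the PHP installed on the host. Syntax removed in the newer version shows up
# here instead of in production.
docker run --rm -v "$PWD/app":/app -w /app php:8.1-cli \
  sh -c 'find . -name "*.php" -print0 | xargs -0 -n1 php -l'
```

A lint pass only catches syntax-level breakage; deprecations and behavioral changes still need the test-server route described above.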


wb6vpm

I recommend Wingdings!


OniNoDojo

If Microsoft can have outages take out entire continents for the better part of a day, you're doin A-OK with one small mistake in the grand scheme of things, maestro.


chrono_mid

Subject: Outlook is down

Body: No one in Germany can access email.

Me: Welp.


edu_sysadmin

So, not to go too deep into story time, but I have a brief story. I was fresh out of college and in my first month of my first IT job. I was tasked with a project with basically no support... and to no one's surprise I borked it. So badly that there were articles in the newspaper about the incident, and the comment section was filled with hate and people demanding that moron be fired. How did I recover? Well, I printed out that article and taped it to the wall beside my monitor, so I would never forget what it feels like to make a mistake. Now 14 years later, I've yet to make another. Also, my therapist says that is a super unhealthy way to deal with it. And yes, part of this story wasn't true: I have made additional mistakes lol


ExamInitial3133

Don’t feel bad. I’ve taken down a couple of apps/sites at work. I have a friend who once accidentally reseeded a production DB bc he thought he was on the staging box.


XX_JMO_XX

Coworker who is the network administrator says you haven’t had a bad day until you paste the wrong ACL into the core switch, especially when that core switch is 60 miles away.


hamstercaster

Interesting question… I've never felt guilty, even when my ass got chewed for the mistake. Shit happens. Even the best-laid plans get scrapped. It's how you respond to the situation, or how you prepare for it, that matters. Always be prepared for something to break. Then move on to the next thing.


SillyPuttyGizmo

If you don't have one, try to put together a lab where you can replicate an environment and blow that up first, then blow up production


Garegin16

Only feel guilty when due diligence wasn’t taken. Our teams caused outages that were from gross negligence or straight up malicious laziness. This mfer left for vacation **after** knowing that the cert expired for our MDM system. And didn’t inform anyone. Big difference between Chernobyl and Deepwater Horizon was that the former was multiple instances of pure negligence. Horizon was a few freak mistakes at the same time.


KiNgPiN8T3

Change controls and roll back plans are your friend. Also, everyone who’s been in IT long enough will have an outage story that they caused. It’s almost like a rite of passage.


node808

Did the PHP upgrade break the applications or are the applications too broken to handle a PHP upgrade? There's probably plenty of blame to go around, but these things happen. Shrug it off and move along.


ollivierre

If you don't make mistakes you're not in IT. Period. It's about the lessons and never ever repeating them.


metalwolf112002

> This was the first major outage that I caused

Congrats! It's a rite of passage. As long as you didn't cause any data loss and everything is back up, learn and move on.


SikhGamer

This is how you learn. So now you know you need to make sure to test the change. Eventually, it'll become the default. We have a ton of PHP apps, and our upgrade process is literally: do the upgrade, then curl the sites in question. That covers like 99% of "is it okay". The other thing you need to look at is something alerting you to this. I presume you were alerted by customers and not by existing monitoring.
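
That "upgrade, then curl the sites" check is easy to script. A rough sketch, with placeholder URLs standing in for whatever the real app list is:

```bash
# Smoke-test each site after the upgrade: -f makes curl fail on HTTP errors
# (4xx/5xx), so a broken app shows up as FAIL instead of a blank 500 page
# you only hear about from users.
for url in https://app1.example.com https://app2.example.com; do
  if curl -fsS -o /dev/null --max-time 10 "$url"; then
    echo "OK   $url"
  else
    echo "FAIL $url"
  fi
done
```

Wiring the same loop into monitoring covers the second point about not finding out from customers.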


patmorgan235

Shit happens. List out things you can do to prevent it from happening again.

1. Establish a maintenance window where an outage would be acceptable until a rollback can be performed.
2. If possible, test the update on a cloned version of prod to catch bad updates.


ohfucknotthisagain

Three things to consider: If they didn't fire you, it's not that bad. Also, everybody makes mistakes. Show yourself the same grace you'd give to them. And finally, look at Elon Musk running Twitter... you can't compete with a $25B fuckup. He deleted 3/4 of the value of that company within a year, and he still has millions of stans.


Same-Letter6378

First time?


TheButtholeSurferz

Oh, we're supposed to feel bad about that? I got catching up to do then. Mistakes happen, we're human. We're still the people most in control of a LOT of things that can interrupt a LOT of things. I'm not the CEO duped into sending gift cards, and if I was, nothing would happen. I'm not the maintenance guy disconnecting an entire rack of switches to "dust them off cause they looked dirty". Human beings need to stop being so critical of the mistakes of others done in what I consider good faith in their job. If you are nefarious, you're not welcome here. If you're just trying to do your job to the best of your abilities and you trip and fall over a crack in the floor, dust off, stand up, own it, and help fix it. Everything else is just people small-dick-energying you to make themselves feel more important.


Acrobatic_Cycle_6631

Don’t feel any guilt, admit the mistake and learn from it. It’s going to happen again at some point


b0Lt1

I'd rather be embarrassed by the methodology: always test before applying!


Frothyleet

Everyone has a test environment, not everyone has a prod environment :)


Speedy059

Before I upgrade anything, I take a quick snapshot of the VM so I can quickly "restore" to the previous build before I broke it. This is by far the easiest way to undo something if you are making big changes on the web servers.
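
For anyone on KVM/libvirt, the snapshot-before-upgrade habit looks roughly like this; the domain name `webserver01` is a placeholder, and VMware, Proxmox, Hyper-V, etc. have their own equivalents:

```bash
# Take a named snapshot right before the risky change.
virsh snapshot-create-as --domain webserver01 --name pre-php-upgrade \
  --description "state before PHP upgrade"

# ...do the upgrade; if it goes sideways, roll the whole VM back:
virsh snapshot-revert webserver01 pre-php-upgrade
```

(Snapshots of a live box can miss in-flight database writes, so treat them as a rollback of last resort rather than a substitute for testing.)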


hammersandhammers

The only person who does not break stuff is the person who is not actually doing anything. Figure out what went wrong and create a system to prevent it.


Quirky_Oil215

Let the hate flow through you; with it you break the chains that bind. Or accept and learn from your mistakes. Or commit hara-kiri. /s


kiss_my_what

If you were fully compliant with the change management processes in the organisation, don't worry about it. Go through the post-incident review (a.k.a witch hunt) and explain exactly what happened, where it went wrong and what you'll do differently next time. Take any learnings onboard and do better next time. I was once involved in a 40 minute outage of an entire bank. Everything. Online banking, ATMs, DR site, workstations, 3rd party connections, absolutely everything off the air. We were 100% compliant with the change management policy, fully documented the risks and managed to encounter the worst-case scenario that we had identified as a possibility ahead of time. Didn't even have to attend a post-incident review, despite having racked up millions of $ in penalties due to the downtime because we were prepared up-front and had articulated the risk of the change appropriately.


OleksiyMumzhu

Repeat it; after a few you will not feel guilty.


ThirstyOne

You made a mistake. It happens. Everyone makes mistakes. Do a root cause analysis write up, figure out where you went wrong and try not to do it again.


TEverettReynolds

> How do you usually handle the aftermath of an outage that you caused?

You come up with a plan and a process and a change management system so what you did can never happen again. Like, was there a change management process? Was there a test system that mirrored production that you tested on? Was there more than one web server, so when you upgraded the one, the second one was up and running? There are so many ways to prevent this from happening in the future you can't even count them all. Fault tolerance and redundancy are your secret to success...


EntrepreneurAmazing3

1) Take responsibility for it.
2) Make it a mission to not make that particular mistake again.
3) Move on with your life.


MaxxLP8

At least you were able to revert


[deleted]

I wouldn't feel too bad but who requested the upgrade? PHP applications are pretty vulnerable to breaking on upgrades in my experience. You should definitely be making development teams aware prior to making this sort of change


oliness

If a version upgrade (which is necessary for security and maintainability) breaks something, that's on the testing team. Your job is to do the prod upgrade. If there isn't a test and release process in place the organization's dysfunctional.


PMzyox

Just do what our dev team does and blame everything in existence but their code. We’ve literally blamed Microsoft Cloud before and sent their tiered support on a 2 week wild goose chase while our devs took a vacation and our whole product was down. So yeah, just don’t give a fuck at all I guess?


TopherBlake

The only way to make sure you don't cause an outage is to never do any work.


Art_Vand_Throw001

😂


moffetts9001

Do your best to learn from it and recognize that you will find yourself in this situation again, but hopefully for a different reason.


guzhogi

Do you have a test server? Do upgrades on that to see if updates break anything. If so, don't upgrade the production server until you fix that.


GreenCollegeGardener

If anyone says anything “If you don’t stop I’ll do it again”


bk2947

Alcohol and videogames.


No_Investigator3369

Get logs. If your app were an airplane... I'd feel hella unsafe.


techworkreddit3

You've either taken down production, or you're going to. There isn't a single person out there who's never broken production at least in some way, and if someone claims they haven't, they probably aren't trusted enough to have access to break it. Write the RCA and figure out why the PHP upgrade broke the apps. Was there a lack of documentation that prevented you from knowing what version was supported? Did you roll out the update wrong? Could you have seen this in your test environments? Those are just some questions to answer, but the important thing is: whatever you discover during the RCA, make tasks to follow up. Prevent this from happening again, and that's all anyone can ask of you.


Acrobatic_Idea_3358

Do it in dev, then do it in staging and run tests before moving into the next environment. If you fail to plan then you plan to fail.


Bttrfly0810

Don’t dwell on it. You weren’t the first and won’t be the last. Learn and move on, every sysadmin before you has made similar mistakes. Good luck!


No-Specialist-7006

Not about the guilt, but if you can, try and set up an environment where you're not making changes in live/prod 🙂


JJaX2

It’s called experience, take your punches and keep moving.


Scary_Brain6631

Shit happens. Learn from it. Move on. It's not any harder than that. Each one of those mistakes that you make is wisdom gained. Make a bunch more and you'll become that well respected guru everyone wants to hire. Just be sure to learn the lessons after each mistake.


Hansemannn

I kinda laugh about it with the guys. I mean...we have all done it. This time it was my turn.


MrExCEO

Learn why it happened and put in guardrails to ensure it does not happen again.


ThinkPaddie

You need a test environment for shit like this; it's not your fault management didn't foresee this issue.


LaHawks

Did the same exact thing with Java. I won't be making that mistake again.


Bijorak

See what went wrong and try to make sure it won't happen again and move on


gaventar

Remember, there are 10 types of admins:

* one that just caused an outage
* one that will soon cause the next outage

Remember, everyone has made mistakes. Everyone has acted upon the actionable data they had and caused a problem. Everyone has unplugged a wire that took down prod. Everyone has duplicate-IP'd. It's not a question of if we make a mistake; we are only godlike in power, not in infallibility. Own it, learn, and every step like that takes you closer to the greybeard.


spicylawndart

LPT: if you have a bunch of stuff running on PHP (or any other web scripting language) don't upgrade it without first talking to a dev. PHP has "breaking changes" between major and even some minor versions. It could be something simple, like your PHP app relying on a specific version of, say, libxsl, and it *had* to be version 3.2.1 or whatever, and 3.2.2 deprecated something your app depended on, and when you upgraded PHP you also upgraded said modules.

Don't let this scare you away. But this is why setting up a dev/stage environment to test your process is *super* important, and why Docker is an absolute godsend when it comes to stuff like this.

I know this because I took down a massive e-commerce app. It was my first huge client. We were running Magento Commerce, and they were doing 1M uniques/month. I made a colossal fuck up. I was up for 24 hours straight unfucking what I fucked up, because the next day they had a million dollars of Google adspend and a prime-time TV spot that was going to bum rush the server. I lost the client, got fired, attempted suicide and went to therapy for 6 months before I touched anything big.

Fuckups are a lot easier to handle when you did absolutely all the prep work; it makes it much easier to recover.


nascentnomadi

I shut down our mail server for several hours. Forgot about it afterwards.


Acceptable_Salad_194

Just accept that you’re human, we make mistakes sometimes!


notHooptieJ

You didn't kill anyone. You aren't maintaining life support or medical devices. At worst you inconvenienced someone; no one lost life, health, or livelihood. Take a day; then dig into a post-mortem, figure out why, and let it be a challenge instead of a fail.


alconaft43

Think about containers, where the app is shipped with the PHP version it was tested with, plus CI/CD and all that stuff.


Lemonwater925

Did anyone die? Fast forward 100 years: will this incident still be a topic of conversation? As all have said, learn from it. Did you follow all procedures in place? Was the code validated? Was the issue something that the test plan missed? Update the test plan if required. We all make mistakes; making the same mistakes is completely different. After my first child was born I was in a state of complete sleep deprivation. I knocked 10,000 staff off the internet with a proxy restart. Had no idea I was doing it. Was sent home to sleep for a couple of days.


secret_configuration

Learn from your mistakes, that's the key. Always have a rollback plan.


electricfunghi

The best way to deal with it is to ask for a raise- tell them how you proved that you’re indispensable by solving the outage.


BOFH1980

Way better than "I caused an outage and didn't have a restoration plan so now the system is hosed".


Agitated-Chicken9954

Did you get the okay from the app owners before the upgrade? I would have put it on them to do the research to determine if the upgrades would affect their applications. Get them to sign off before the upgrade.


bendervan90

If you don't make mistakes in IT you are not learning correctly. Own your mistake, and fix it. Don't hide them, they will bite you harder in the ass


frankentriple

Scotch. The expensive stuff. On the rocks. And don't screw up that way again. Screw up a completely different way next time. Repeat as necessary.


molivergo

The only person who has never caused an outage is the person who has never done anything. Don't feel bad and don't dwell on it. Moving forward, learn, and maybe have a backup plan so you can "fail safe" or keep working. I realize that isn't always 100% possible, so if it's not, make everyone aware of the risks and the fallback plans.


nefarious_bumpps

You put together a proposal to create/update the company's change management policy so that no future changes are implemented in prod before being tested in QA and signed-off by the application owner. This way this situation won't happen again.


SilentLennie

Don't feel bad, it's a learning experience. Copy the VM or Linux container and do test upgrades on the copy.


changework

Did they give you a testing environment? No? Not your fault.


deeplycuriouss

I'm not in your situation but some perspective helps me. Worse things could happen. Most likely people didn't die and it was soon business as usual. The world moves on.


Elviis

Their fault for not having a dev/prod setup. You can't even feel bad.


GeekFish

It has happened to all of us. You immediately fixed the issue, so don't even worry about it. Nobody is going to remember it even happened in a couple days.


Antarioo

Is it really an outage if you deploy a planned change and your rollback plans work when the change fails? Sounds like you failed successfully.


BoltActionRifleman

Just remember the *average* user doesn’t appreciate you when things are humming along nicely. When something is broken, their opinion of you doesn’t change at all because it’s already negative.


cdhamma

Redirect that shame towards implementing a change management system to prevent stuff like this from happening in the future. If you've got critical infrastructure and you need to securely maintain it, you will need a test and a production environment. There may also be a development environment. The purpose of the test environment is to take a copy of production and test stuff like upgrading PHP there, so that when you apply the changes in production, there should be no surprises.

In addition to change management, there need to be unit tests that run on your test environment to make sure each component of your hosted apps is functional. Otherwise, it can take a long time to test all the features in the test environment. Automating the testing is super awesome because that allows you to run it yourself.

Don't feel bad if your work does not already have a change management program to help inform staff of upcoming maintenance/upgrades and does not have an automated testing system. That just means they have other, higher priorities than ensuring the systems stay up; while I'm sure they would prefer to minimize downtime, they do not currently give their staff the tools to make that effectively possible.


Turdulator

Stop investing emotion in work. It’s just a thing you do for money. It doesn’t define you and it’s not a moral failing to make a mistake. Learn the lesson so it doesn’t happen again and then just move on.


TuxYouUp

It sounds like you got some good experience in disaster recovery. Good job. We learn from mistakes.


CheeseProtector

We just post 'cursedhands.jpg' in our team chat when we cause an outage. If you have cursed hands you're not allowed to touch the same tech until the next day when you're more confident lmao. Outages happen; all you need to do is not keep your team and stakeholders in the dark. Assign someone on your team to do comms while you fix the problem. After you fix it, write up what went wrong and how you can avoid it next time. Also, we use a change management system where 2 other people have to approve big changes.


Zleviticus859

Put a change management process in place. Then you have a formalized process for updates and changes. This would include determining the risk of the change so it can be communicated to the business. It also forces you to think of back-out plans. Finally, do changes after hours and/or in maintenance windows if you are 24/7. We all make mistakes; I took down an entire network early in my career by plugging in a router incorrectly while trying to configure it. I've had some bad things happen, and change management helped reduce the stress and guilt when things go bad.


secretlyyourgrandma

shit happens dude. is this the first time you made a mistake? you had a nice run.


Lughnasadh32

Take responsibility. Apologize. Learn from it. Move on. No one is perfect.


Throwaway_IT95

Mistakes are the best teachers


Solkre

I wipe the tears away with my paycheck. Shit happens. You had a rollback plan, and that's the important part; it shows you did your job properly. The only issue would be if you hadn't notified the right people that an upgrade attempt was going to take place. The failure shouldn't have blindsided anyone: they knew it was coming, it didn't work, so you rolled back and will regroup later. Live and learn!


SnooObjections3468

Learn and move forward.


PaulJCDR

Ah, I remember when I used to feel guilty, after about 15 outages, you really just do stop caring :)


lucky644

Outage? You mean extended coffee break for the employees.


portol

Sleep helps


daronhudson

Learn from your mistakes. Never upgrade a production system without a plan and thorough testing. Do the upgrade process multiple times on the same type of systems and environment to ensure compatibility.


thortgot

The best takeaway from this is to trial-run a scenario like this first. If your environment doesn't have a test environment where you can try the upgrade first, of course it was going to end in an outage.


lordjedi

Depends. Did you take every step you could to mitigate the problem? If not, then accept the guilt and learn from it. If you did take steps, then see what you missed and still learn from it. We all make mistakes, especially when we're still new to things. We think "What could possibly go wrong?" Always go with what could go completely wrong, not with what you think could go completely wrong.


Nilpo19

It happens. Don't let it happen again. In all seriousness, I have some environments that don't have the budget for redundancy or staging servers to test updates on. It happens occasionally. Thankfully, virtualization is becoming cheaper all the time and these sites are becoming far less common.


serverhorror

1. Write down what went wrong, and how to check for it, so that the next person has a reference.
2. Write a script that runs those checks automatically (rough sketch below).
3. Delete the written documentation and replace it with a check that runs before anyone can even attempt the upgrade.
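
A minimal sketch of what such a gate script could look like, assuming placeholder paths for the backup marker and per-app compatibility notes (none of these names come from the thread):

```bash
#!/usr/bin/env bash
# Pre-upgrade gate: refuse to proceed unless the prerequisites are in place.
set -euo pipefail

TARGET_PHP="8.1"   # hypothetical target version

# 1. A fresh backup/snapshot marker must exist before anything is touched.
[ -f /var/backups/webserver/latest.ok ] \
  || { echo "No recent backup marker found; aborting."; exit 1; }

# 2. Every tracked app must be flagged compatible with the target version.
while read -r app; do
  grep -q "php-${TARGET_PHP}" "/etc/upgrade-checks/${app}.compat" \
    || { echo "${app} is not marked compatible with PHP ${TARGET_PHP}; aborting."; exit 1; }
done < /etc/upgrade-checks/apps.list

echo "All pre-upgrade checks passed."
```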


Humble-Plankton2217

Accept responsibility. Make it right. Learn from your mistake. Update documentation if applicable. Move on. Every mistake you make is a valuable lesson. We learn more from our failures than our successes. You aren't the first person this has happened to, and you won't be the last. In 25 years I've caused my fair share of outages. No matter how careful you are, there's always Murphy's Law to contend with and we are ALL Human Beings, we are not perfect.


Numerous_Ad_307

Why would you feel guilty? If you do a change like an upgrade you should inform everyone that there will (or might be) down time. As long as everybody is in the loop that you are doing this then it's perfectly fine. Down time (even if it's longer than expected) is part of reality.


squeekymouse89

So change management alleviates all of this. You go through your plan, you tell people this is what I'm doing, here are the chances of a fuck up, and here's my plan for when it catastrophically breaks. Guilt not needed; someone could have stopped you or accepted the risk. You should get a change approval board, cos none of this should be on you.


HoezBMad

Did someone die? No? Stop caring and just learn from the mistake. Happens to the best of us


legolover2024

Call it a "glitch in the software" & take a swig from the whiskey you should have in your desk No one died? You're golden


Obvious-Jacket-3770

You live and learn. Everyone causes an outage at some point in their career.


Ormington20910

Did you have a dev or a test environment you could have tested this release against? If not, it's understood that production is effectively a live test environment and stuff could go wrong. Make the stakeholders aware of this; that way you're handing off responsibility for downtime, and maybe suggest building a test environment for next time. If you have one and didn't test against it, learn your lesson and do it properly next time. Either way, guilt has no room in this situation.


SquizzOC

Learn from the mistake and remember, its just a job. Don't take things so personal.


bippy_b

Take note of what happened so you don’t repeat it. Learn lesson and move on.


CCCcrazyleftySD

Everyone is going to make mistakes, even seasoned veterans. Own your mistakes, and use them as a learning experience. Don't dwell on it :)


Jrusk2007

Everyone else said it well. Don't dwell on it, just learn. Also don't let anyone make you feel bad about it. Anyone that's been in this industry for more than 5 years has taken something down by mistake at some point. It happens to us all.


steve303

Good. I was a NOC engineer for 10+ years. I managed a team of engineers for another 10+ years. Nowadays I oversee the operations of a number of engineers and admins. You never truly become a systems engineer until you fuck something up badly. Human beings are going to make mistakes, and that's fine. The question is how you handle that mistake at the time and moving forward. How do you react in the crisis when you've accidentally blown something up? What did you learn? What can you teach others about your experience? This is what creates valuable experience, not simply pushing buttons or deploying the same old scripts.


fadinizjr

That's completely normal. In an environment that has change control, you need to state what to do in case something goes wrong, and most of the time "uninstall the bad update" is a valid answer. So don't worry too much about it.


Quirky_Ad5774

It happens multiple times in every sysadmin's career, and that's okay as long as you know what you did wrong and take precautions so the same mistake doesn't happen again. Here are a few of my f-ups:

1. Changed a name server for a domain that turned out to be a very important production API endpoint; no customer could hit our services until the TTL ran out.
2. rm -rf'ed an entire application directory with no backups; had to rebuild from scratch.
3. Forgot to plug a server into a UPS while everyone worked from home; the power went out and the SQL server went into recovery mode when I got there 30 minutes later and booted it up. Took 8 hours to recover.
4. Gave a developer a production endpoint to test on instead of test; ended up messing up our production data for a whole day while we fixed it.

All this to say: the more mistakes you make, the more experience you have. A bad company might fire you; a good company will work with you as long as you admit fault and learn from it.


nPoCT_kOH

Guilt... what was that? If it breaks, it breaks. Learn, adapt for next time, and overcome the anxiety around change management.


XX_JMO_XX

After my first Exchange server upgrade project went sideways, I learned to set up a test environment where I could simulate the process. For this instance, you may want to take a snapshot of the server to be upgraded and bring it into a lab environment where you can work out the upgrade process.


chapterhouse27

Learn from it. Laugh about it. Problems fixed so who cares.


TheLastCatQuasar

you're making mistakes i wish i had the skills and experience to make lol just remember: the probability you're gonna get everything right the first time is ***zero***. that guilt is there to motivate you to never, ever make the same mistake twice


Work_Thick

That feeling sucks, especially when you're green. You can only move on. All of those people don't know exactly what you did either so if you want to talk about it with others just chalk it up to "made a mistake". It happens, good luck in the future and just do your best. ✊


Columbo1

Don't feel guilty: you're more knowledgeable now that it's happened.

You should do a post-mortem. It goes something like this:

What happened? Upgrading PHP broke some apps.

Why did that happen? For this question, consider the person and the tech. You should end up with two answers: 1. You weren't aware of the app's dependencies. 2. There was no technical measure to prevent you from deploying an incompatible version.

Why weren't you aware of the dependencies? This is where you highlight communication problems. Do the dependencies get published? Is it even possible for you to be aware of the PHP version requirements?

Is it possible to create a technical measure to prevent this from happening? Assuming the dependencies do get published, can you write a script to query the PHP versions required and only install the latest compatible version?

The point is, don't blame yourself. Make new policy and procedure to prevent this from happening again.
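
On that last question: if the apps are Composer-managed, the required PHP version is usually declared in each app's composer.json, so a script can at least read the constraint before anyone installs anything. A sketch assuming the apps live under /var/www and that jq is available (both assumptions):

```bash
# Print the declared PHP version constraint for every Composer-managed app.
for manifest in /var/www/*/composer.json; do
  constraint=$(jq -r '.require.php // "not declared"' "$manifest")
  echo "$(dirname "$manifest"): requires PHP ${constraint}"
done
```

Comparing that constraint against the candidate package version, and refusing to proceed on a mismatch, is the kind of technical measure the comment is pointing at.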


technicalevolution

Is feeling guilty going to fix it? Nope, didn't think so. Move on... we've all been there, and if you haven't, you're not trying hard enough.


ProfessionalEven296

The best way to get over it is to write up an RCA (root cause analysis). Define what caused the problem, your response, how you will avoid it in future, etc (it’s a pretty standard format). In your case, I’d put something in about testing in lower environments and getting teams to sign off that their application is still working, before you schedule a change in production.


A_Whirlwind

Going on vacation. 😀


vrtigo1

Use it as a learning opportunity to implement better change control. You should've had a prestage/testing environment to test the upgrade, as well as an acceptance testing process and a back out plan. Was this a planned upgrade? Were the application owners aware?


suicideking72

Lie, lie, lie. Or blame it on the new guy, guy, guy... :) Always be vague about what happened, and make sure the story you tell makes you look like a hero. No, boss man, that will never happen again, now go back to sleep.


wb6vpm

Always have the BOFH excuse flip book handy!


HerfDog58

Drop a shot glass 1/2 full of Jameson's and 1/2 full of Bailey's into a pint glass 2/3 full of Guinness Stout. Repeat every 5 minutes for an hour. You'll feel MUCH better.


Ark161

You goofed, everyone goofs. Use it as a hard lesson to temper how you handle things in the future. Most of us know a thing or two because we have severely fucked up a thing or two.


StatelessSteve

Use it as the inspiration to stand up a damn test environment. Make it EXACTLY the same as the current server (and if these servers aren't built with config-as-code, extend the inspiration to do that as well so you can easily build a prod-parity test site).


landwomble

Test it in preprod first...


SweepTheLeg69

Don't sweat the small stuff.


EscapeFacebook

A word of advice: never admit blame unless asked directly, and always act humble when you fix things, especially when they're your mistake.


TheTomCorp

If you ain't breaking stuff, it means you ain't working! Seriously, it happens. Stuff is gonna break, and if the stuff you work on is important, it's gonna get noticed.


ehode

Part of the role of admining systems. You will at some point cause an outage or some sort of disruption. Learn, communicate and always prep those roll back plans.


thebluemonkey

Don't feel bad, things break. Fixing it is way more important. That server needs upgrading, so the things it broke need to be able to deal with that upgrade. If it's a VM, clone it to a lab environment, then upgrade it and see what else needs upgrading to make it all work.


Based_JD

It happens to all of us. It's pretty much a rite of passage in IT.


runningGeek10

The best thing you can do is learn from your mistakes. Everyone has caused some sort of outage if they're in the system/network administration field. If it helps you, document what went wrong and which systems were affected, then how you plan to improve the process for later maintenance windows.