Mementos from bygone software releases

Back at the height of traditional enterprise software development (most of the 1990s), major releases of software were 18 months to two years in the making. That time was made up of requirements gathering and vetting, design documentation (typically both high level and low level), design reviews and approvals, coding, multiple test phases, an Alpha and Beta test phase, then some sort of final quality sign-off before release. Many would argue that the software didn’t actually get released; it escaped. 😉

I’ve written before of enterprise software test cycles. They would often involve multiple styles of tests, from unit tests and automated functional tests to long-running system and performance tests. “Long runs” would typically last five days or more, during which we were cautioned not to go anywhere near those servers for fear of disrupting the test and needing to start over. Clean tests were required before moving to the next phase. We know now that’s a terrible idea, and that such testing is an anti-pattern that bears little resemblance to how the real world works.

All of these processes were organized in a waterfall fashion, now anathema to agile practices and iterative development. We know today this is a bad idea, but in the 1990s this is all we knew. The cloud didn’t exist yet to drive better efficiencies. Agility was something you strove for in recreational sports, not software engineering.

And besides working two years on a new release, we also had to support the releases that were already out there, possibly several different releases, and their associated code streams. All of this is a labor and cost nightmare, we know now. Then, it was a labor of love and the drumbeat rhythm that defined traditional enterprise software development. Yeah, it may have been hard, and costly, but it worked, time and time again. We put out truly great software (and still do, of course).

Once software releases actually did escape – er, release – we would celebrate as a team. I remember those celebrations well. We all worked in the same location, so massive parties were commonplace. We’d give out plaques like the ones you see below to commemorate the event. I was cleaning out my office when I stumbled on these, and they made me sit down and recall those great memories, and the friends I shared them with. Obviously, we don’t release software the same way anymore, and as such we don’t celebrate our achievements nearly as often as we used to. That’s a shame.

Engineers newer to my organization can hardly imagine what software development was like back then. Truthfully, many of them hadn’t even been born yet! They can’t figure out how we ever got anything done. The practice of tracking software releases using version, release, modification and fix (VRMF) values has given way to date and even timestamp tracking for continuously deployed updates. So if you ask one of these newer, more agile engineers what release is live in production, they’ll likely say “Tuesday’s.” Some of my now-deceased senior technical leaders from back then would surely roll over in their graves upon hearing such a thing.

The Power of a Good Listener

My mom was a great listener. She had that way of making you feel like the most important person in the room. People loved talking to her, because she was genuinely interested in what you had to say. You probably know someone like that, and know how great it feels to be with that person. Effective listening not only improves lines of communication, but it endears you to people. It makes people actually want to be around you, work with you, seek your advice and counsel. So, effective listening actually helps your professional development in that it inherently helps build your network and grow your sphere of influence.

Unfortunately, most of us are not good listeners. We are in a constant mode of vying for airtime, struggling to be heard in a sea of voices, which makes it difficult to afford ourselves the patience to be quiet and listen. That’s the only way to learn something new, by the way – listening. You can’t learn while talking (unless you happen to learn something new about yourself). I think we can all relate to being in meetings where there are lots of opinions being shared, and you have your own just waiting for the opportunity to be voiced. So, you do what I call “listening to respond” instead of listening to understand. You wait for that split-second pause while someone is catching their breath to butt in with your idea, only to find it was the very idea just discussed. You weren’t actually listening.

Think about someone you really admire that you work with. Perhaps a technical leader or executive. Now think about their interaction style. Do they talk over everyone else? Do they spout an endless trail of thoughts without pausing? Or do they patiently listen, and respond thoughtfully and succinctly? People tend to tire of listening to “blowhards” and often don’t admire them much. But when thoughtful, good listeners talk, they tend to be listened to, and followed more closely. I think we all want that.

But there are two other really good business reasons to be an effective listener – one I always knew and one I learned recently. First, as a technical leader, I learned early the importance of listening to the customer first, to really understand their concerns and challenges before offering solutions. I’ve seen way too many sales pitches where the solution was presented before the customer’s needs were fully known. “The answer is X. What’s the question?” Really listening to the customer first not only better prepares you to help build a solution, but it again endears you to the customer. It is a great way to build trust and authenticity with your client. It also positions you as a Servant Leader, which is a rare quality in leadership. Servant Leaders typically start a conversation with “How can I help you?” When was the last time one of your leaders asked you that?

Whole Brain Thinking – Ned Herrmann

The other good business reason is in the area of Situational Leadership, where you lead according to the situation and people involved. I recently co-facilitated a technical leadership development course where the lead facilitator introduced all of us to the notion of Whole Brain Thinking. This concept was developed by creativity researcher Ned Herrmann, who broke down human thinking styles into four categories: analytical, practical, relational and experimental. As individuals, we normally align with one of the four, but occasionally exhibit characteristics of the others. Mr. Herrmann defines Whole Brain Thinking as “the awareness of one’s own thinking style and the thinking styles of others combined with the ability to act outside one’s preferred thinking style.” He postulates that if you can understand the thinking style of another, and emulate it in your communication with them, you’ll be that much more successful in your intended outcomes. Therefore, an added benefit of being an effective listener for your customer is that you afford yourself the opportunity to learn their thinking style and structure your solution accordingly. Do they want all the facts and numbers (analytical)? Do they want to understand schedule and cost (practical)? Do they want to know the impact it will have on their users (relational)? Or are they looking for the big idea, the big picture (experimental)? Listening first allows you to find ways to relate to your client, and for them to relate to you.

Effective, genuine listening is very hard. It takes infinite patience, and a suppression of that fear of being left out of the conversation. Yet the rewards are great, both interpersonally and professionally.

Hybrid cloud redefines enterprise software

IBM announced its hybrid cloud strategy several years ago. With this strategy, together with the acquisition of Red Hat and its OpenShift multi-cloud software delivery platform, IBM embraces the fact that IBM Cloud is not the only cloud platform out there, and that many of our customers have chosen other platforms to run our software. Makes sense. Let’s help our customers do business where they wish.

If you read between the lines of this strategy, IBM is placing a renewed focus on traditional enterprise software, which in my world means “you buy the software and run it where you want to.” This is often also called the “on-prem” (on-premises) software model. It is a 180-degree pivot from IBM’s previous strategy, which was all about cloud – our cloud – and software-as-a-service (SaaS) delivery, a trajectory IBM had been on since about 2008.

But this pivot is not an “instead of” model. We are not forsaking the SaaS delivery model. Many of our customers still want to subscribe to software-as-a-service. But for those customers wanting to manage solutions built on our enterprise software themselves, the hybrid approach gives them the flexibility to decide where that software runs, whether it is indeed on-premises, in IBM Cloud, or on any combination of cloud vendors they choose.

So now that the marketing pitch is over, I want to take a step back and reflect on the profound effect, and pressure, this hybrid cloud strategy has had on software development and delivery.

Looking back in time to the 1990s and early 2000s, I was part of several enterprise software development teams, where we would put out releases and versions about every 18 to 24 months, on CDs and DVDs. In fact, I remember distributing my first software product, IBM SNA Services for AIX, on 3.5 inch floppy disks! Man, those were the times. Back then, the emphasis in development was the business function of the product – ensuring it did what it was designed to do, solving the business challenges it was meant for. Far less emphasis was placed on the installation, management, and general lifecycle ownership of the product, and therefore these aspects of the product unnecessarily grew in complexity. It’s embarrassing, really. Not that we didn’t care about those things – I personally cared quite a bit, and framed much of my career around plumbing instead of painting. It was just that, well, administration and operations were the customer’s problem, not ours.

Obviously, this is a failed view. Higher total cost of ownership (TCO) of a software solution erodes its value to the customer. I personally spent a lot of time in support situations with customers because our software was difficult to manage and debug when something went wrong. Then along comes cloud.

In about 2008, IBM told all of us we were to take our software products to the cloud, and deliver them as SaaS solutions. Suddenly, we started feeling the pain our customers complained about incessantly. We had to deploy, administer, update, and troubleshoot not only our own software, but all the other middleware our software depended on as well. It was a nightmare early on, to say the least. Non-functional requirements (NFRs), the term we used for operational requirements that were not necessarily business facing, were rising in priority. We witnessed that even though we demonstrated the ability to make money in SaaS, the TCO of the solution eroded our profit margin to the point where something just had to be done.

Then along came Docker, then Kubernetes, and ultimately OpenShift. Together with the DevOps and SRE (site reliability engineering) movements, these new cloud-native platform technologies taught us how to build software better. They force us to build products from an outside-in viewpoint. You don’t have access to individual machine consoles anymore. You can’t log in and inspect configurations, view logs, or restart services. We now have to figure out how to manage the entire lifecycle of our products centrally, from the outside. The side effect of this was a dramatically lower TCO for our software, and thereby higher profit margins on our SaaS solutions.
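
To make that outside-in shift concrete, here is a minimal sketch (a hypothetical service, not code from any IBM product) of what it implies at the code level: the service reports its own health over HTTP and writes structured logs to stdout, so the platform – Kubernetes, OpenShift, or otherwise – can observe, restart, and manage it without anyone logging in to a machine.

    # Minimal sketch of a service built to be managed "from the outside".
    # Names and checks are illustrative, not from an actual product.
    import logging
    import sys

    from flask import Flask, jsonify

    app = Flask(__name__)

    # Structured logs go to stdout; the platform collects them, so no one
    # needs console access to read log files on a box.
    logging.basicConfig(
        stream=sys.stdout,
        level=logging.INFO,
        format='{"ts": "%(asctime)s", "level": "%(levelname)s", "msg": "%(message)s"}',
    )

    def check_database():
        return True  # placeholder for a real dependency check

    def check_message_broker():
        return True  # placeholder for a real dependency check

    @app.route("/healthz")
    def healthz():
        # Liveness: is the process alive? The orchestrator restarts it if not.
        return jsonify(status="ok"), 200

    @app.route("/readyz")
    def readyz():
        # Readiness: should traffic be routed here right now?
        ready = check_database() and check_message_broker()
        return jsonify(ready=ready), (200 if ready else 503)

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)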

Every new organic product we built was properly engineered as microservices in Docker containers, with automated deployments using Helm Charts and Operators, and optimized to be monitored and managed centrally. We also refactored or rebuilt hundreds of other existing enterprise software products to get as close to this architectural standard as possible. The products we couldn’t do this to, we retired. It wasn’t easy. We had to work very closely with product management and our senior leadership to sell them on the value of this level of investment, because it meant less focus, at least up front, on customer-facing function. The payoff would come in more frequent updates that could be deployed with great agility, less risk, and little to no downtime. Quite the game changer from what I was used to.

So back to the Hybrid Cloud strategy. All this great innovation to improve the TCO story for our SaaS solutions now has to be passed on to our customers, so they can appreciate it as well. On the surface, it makes a hugely compelling story – our customers can run our software just like we run it ourselves, and get the same administrative advantage and cost savings. But the great challenge before us is how to actually deliver the same software to SaaS as well as to our customers, at the same velocity and update frequency. In fact, it is impossible.

In our most mature SaaS solutions, we deploy updates to our code in production, safely, non-disruptively, many times a day. There are no software versions or release numbers anymore. In fact, when asked what version of software we were running in SaaS, we’d joke, “Tuesday’s.” When you insert a customer into the mix, however, that customer is responsible for determining when and under what conditions their software solutions can be updated. They have their own operational and business priorities, and risks to manage. We simply cannot expect our customers to deploy updates to our software at the same rate and pace as we do ourselves. There is therefore a schedule gap between when we do it and when our customers do it. Furthermore, that gap is unpredictable. Some customers may have a strict weekly regimen for updating software. Others may update quarterly, or bi-yearly.

So how in the world are we to support both an hourly/daily deployment pace and a customer-chosen, arbitrary update pace? In SaaS, data models and code are continuously evolving in small increments. When customers decide to update our software, we have to be able to support going from arbitrary code and data point A to arbitrary code and data point B. This dramatically steps up the discipline we as software developers must have, not only in how we write code, but in how we plan for it to evolve. We must be able to separate data model evolution from code evolution (something I talk about in greater detail in an earlier post). Furthermore, and more importantly, we must help our customers build trust that updating our software as frequently as we need them to is not only possible, but healthy, and less risky than the quarterly or yearly big-bang weekend deployments (that never go well).
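
What this looks like in practice, at least on the data model side, is a set of small, ordered, forward-only migrations that any installation can replay from wherever it happens to be. The sketch below is purely illustrative (a toy schema and SQLite, not any IBM product’s upgrade mechanism), but it captures the discipline: record the level an installation is at, and apply only the migrations it hasn’t seen, whether it is one level behind or twenty.

    # Illustrative sketch: ordered, forward-only schema migrations so an
    # installation at any older level can catch up to the current level.
    import sqlite3

    MIGRATIONS = [
        # (level, SQL to apply) -- append-only; entries are never edited
        (1, "CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)"),
        (2, "ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'"),
        (3, "CREATE INDEX idx_orders_currency ON orders (currency)"),
    ]

    def current_level(conn):
        conn.execute("CREATE TABLE IF NOT EXISTS schema_level (level INTEGER)")
        row = conn.execute("SELECT MAX(level) FROM schema_level").fetchone()
        return row[0] or 0

    def upgrade(conn):
        """Apply every migration newer than this installation's level."""
        level = current_level(conn)
        for seq, sql in MIGRATIONS:
            if seq > level:
                conn.execute(sql)
                conn.execute("INSERT INTO schema_level (level) VALUES (?)", (seq,))
                conn.commit()

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        upgrade(conn)               # works the same from level 0, 1, or 2
        print(current_level(conn))  # -> 3

The code side then has to tolerate both the old and new shape of the data while a rollout is in flight, which is exactly why separating the two evolutions matters.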

To me, this is really what the IBM Hybrid Cloud strategy is about. It is not only about distributing software that can be run in any cloud, private or public, but also about helping our customers learn what it means to properly run a cloud-based solution, building in them the trust necessary to realize cost reduction goals and improved capability adoption velocity. I think IBM is uniquely positioned to do both.

VANs need an upgrade

A vast amount of the world’s gross domestic product travels over a huge global mesh of interlinked network providers called Value Added Networks, or VANs for short. These VANs work very much like our postal service: all you need to know is the recipient’s address and you can connect to them, anywhere in the world. But with VANs, the address is called an EDI ID, which is mapped to one of a million enterprises connected to the VAN network somewhere in the world. And instead of mail messages, this network exchanges commerce data: orders, shipments, invoices, payments, etc.

The VAN networks got their start in the 1980s when dialup modems were prevalent and the Internet was really just a relatively closed collection of government and academic networks. Each VAN network provider would be responsible for exchanging data with a group of businesses, many of which were in a particular industry (manufacturing, retail, pharmaceuticals, etc). And so by interlinking the VAN providers, one business could basically transact with nearly anyone in the world. But because the network was rather slow and unreliable, enterprises would mostly exchange data in large batches, once a day, and keep retrying until the batches of data were successfully transmitted.

Amazingly enough, this mode of batch-oriented transaction processing is still largely prevalent in the VAN networks today. Indeed, most business systems, like Enterprise Resource Planning (ERP) systems, evolved around the premise that business happens in batch. Shipments happen days later. Invoices even later than that. Business naturally happens asynchronously, especially as it still involves human beings needing to make decisions. But all of that is changing, and rather rapidly.

Because of what is sometimes called the “Amazon Effect,” consumers are developing expectations of having up-to-the-minute status of any order, from the point of sale to the ability to track the delivery of my package on a map. That expectation as a private consumer now pervades the business world. Why not know where my order is at any given moment? Why do I have to wait weeks to get my shipment? Why is my check not clearing the bank right this minute? Why is my credit card payment taking days to post?

This lack of speed isn’t, strictly speaking, the VAN providers’ fault, but they aren’t helping either. Newer ERP systems, many of them custom-built by large companies, are designed with the ability to obtain and deliver up-to-the-minute order and delivery status using application programming interfaces (APIs). The VAN network’s continued reliance on file transfer protocols and batch-oriented processing isn’t well suited to carrying such transactions. As a result, more ad-hoc networks are being created based on integration platforms as a service (iPaaS), supporting API-driven, more real-time point-to-point workloads between companies. Newer API-driven secure networks like those based on blockchain technology are rapidly increasing in number as well. None of these use the VAN network, and eventually all of these other networks will begin to suffer the scaling issues that VAN networks solved decades ago. So how do we keep VAN networks relevant and valuable in this changing business network landscape?

My friend Sami Tähtinen from Youredi and I, along with other members of our teams (Mats von Weissenberg from Youredi, and Liviu Rodean from IBM) have been discussing this challenge for some time now. We believe VAN networks need an upgrade, and have come up with a specification for how to leverage common Internet technologies pervasive today in a different way to help evolve the VAN networks to support more modern, real-time network transactions. I invite you to read his related blog on the technical points of the specification.
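
To give a flavor of the kind of interaction we have in mind – and this is purely an illustrative sketch, not an excerpt from the specification itself – the difference is essentially pushing each business event to a trading partner’s API as it happens, rather than accumulating events into a nightly batch file and retrying the transfer:

    # Illustrative sketch: push one commerce event in near real time instead
    # of queueing it into a nightly batch file. The endpoint, token, and
    # field names are made up for this example.
    import requests

    def push_event(partner_api_url, api_token, event):
        response = requests.post(
            f"{partner_api_url}/events",
            json=event,
            headers={"Authorization": f"Bearer {api_token}"},
            timeout=10,
        )
        response.raise_for_status()  # surface failures immediately, not tomorrow
        return response.json()

    if __name__ == "__main__":
        shipment_update = {
            "type": "shipment_status",
            "order_id": "PO-12345",
            "status": "departed_warehouse",
        }
        # push_event("https://partner.example.com/api", "TOKEN", shipment_update)

The hard part, of course, is doing this with the scale, reliability and any-to-any reach the VANs already provide – which is the problem the specification takes on.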

The VAN networks still provide a huge benefit in supporting world commerce, and that value can’t be easily recreated. So it only makes sense to figure out a way to help VAN network providers support enterprises wishing to increase the speed of business. We call this effort Project Avanto and are in the process of socializing it with supply chain industry leaders and stakeholders. Let us know if you want to join the discussion!

Disaster Recovery for Supply Chains

Thanks to the global COVID-19 pandemic, the term “supply chain” has become a household word, and that’s not a good thing. I was on the road not long ago and stopped at a fast food restaurant for lunch. As I approached the counter, I saw a sign taped to the cash register stating that “due to supply chain issues, fried chicken sandwiches are temporarily unavailable.” That really gave me pause, not because I was planning to order the chicken sandwich, but because suddenly the masses are becoming aware of this thing called a supply chain, and problems with it are starting to impact our daily lives. Supply chains are like plumbing in the house: no one really notices when it is working properly, and everyone takes for granted that the water it supplies will be there when needed. But when plumbing breaks down, everyone notices, painfully.

The socio-economic impacts that the pandemic has caused have taught us a lot, especially about the fragility of our supply chain infrastructure. Just as IT infrastructure and its operations are often required by industry and corporate standards to have a Business Continuity Plan (BCP), it is obvious that our supply chains need one as well. Certainly there is a degree of overlap between IT infrastructure and the supply chain, but only at the point where the enterprise controls its own segment of the supply chain. Supply chains are stitched-together patchwork quilts of systems, applications, processes and human-based workflows, none of which is entirely under the control of a single enterprise. So what would a supply chain BCP look like? What are the elements of it?

Alternate Suppliers: The greatest impact the pandemic had on the supply chain was to the manufacturing suppliers. They were forced to shut their doors and make their employees go home. Almost all retail distribution relies on manufacturing to some degree, so as upstream suppliers closed down and on-hand inventory depleted, shelves emptied and items were back-ordered. Look at what happened to personal protection equipment (PPE) and the much-sought-after N95 mask. When stock was depleted, the world went into crisis mode. Several large manufacturers reconfigured their assembly lines to start manufacturing N95 masks and other critical health equipment, creating a sudden influx of alternative suppliers that then created the next challenge: how to connect these suppliers to the demand.

Consider how IT operations prepares for disruptions. They typically distribute multiple instances of their applications across multiple geographically-separated data centers and establish disaster recovery protocols. This practice has been in place for decades, and more modern cloud-native technologies are allowing for such infrastructures to mature to an active-active state where customers are served out of multiple instances simultaneously, creating instantaneous disaster recovery at the region level. The same practice should be applied to supplier management. Instead of reacting to such disruptions by creating and connecting to new suppliers, enterprises should plan for disruptions and establish multiple alternative suppliers and use all of them in a rotating fashion to ensure a productive and resilient relationship with each supplier.
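
As a thought experiment (the supplier names and the rotation policy here are invented, not a product feature), the supplier-side equivalent of active-active routing can be as simple as keeping several qualified suppliers in a rotation and skipping any that are currently disrupted:

    # Illustrative sketch: rotate orders across pre-qualified suppliers,
    # skipping any that are flagged as disrupted -- analogous to routing
    # traffic across active-active data centers.
    from itertools import cycle

    class SupplierPool:
        def __init__(self, suppliers):
            self.suppliers = list(suppliers)
            self.rotation = cycle(self.suppliers)  # round-robin order
            self.disrupted = set()                 # e.g. flagged by a control tower

        def mark_disrupted(self, name):
            self.disrupted.add(name)

        def mark_recovered(self, name):
            self.disrupted.discard(name)

        def next_supplier(self):
            # Try each supplier at most once per call before giving up.
            for _ in range(len(self.suppliers)):
                candidate = next(self.rotation)
                if candidate not in self.disrupted:
                    return candidate
            raise RuntimeError("No healthy suppliers available")

    if __name__ == "__main__":
        pool = SupplierPool(["acme_masks", "globex_ppe", "initech_supply"])
        pool.mark_disrupted("globex_ppe")  # e.g. the factory had to shut down
        print([pool.next_supplier() for _ in range(4)])
        # -> ['acme_masks', 'initech_supply', 'acme_masks', 'initech_supply']

The point is not the code; it is that every supplier in the pool stays exercised and productive, so switching load to one of them during a disruption is routine rather than a crisis project.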

Digitization: It is uncanny just how much of today’s modern supply chains still depend on human-centric workflows and data management. Much of our retail supply chains are still based on paper, spreadsheets, email and faxes, and the human element requires a high degree of in-person collaboration to keep it all going. So when a pandemic hits and employees are forced to work from home, those workflows break down.

Any CTO will tell you that a BCP must have an infrastructure element and a human element. For the latter: if an enterprise loses a building, say to a power trunk severed by nearby construction (it happens all the time), can the people work from home? Do they have the tools they need to make the transition and keep operations moving smoothly? And the plan must be practiced.

One of the most important things an enterprise can do to ensure the business continuity of its supply chain is to fully digitize the process. Spreadsheets and paper forms should be replaced with web forms that can be accessed securely from anywhere over the Internet on any device, and supply chain specialists should be required to work from home regularly. The “new normal” following the pandemic will most likely make the latter commonplace.

Visibility: Critical to the successful operation of IT infrastructure is the ability to monitor and assess its health and predict when problems might or will happen based on observed data. A supply chain is no different, and in fact end-to-end visibility of the supply chain’s status and health remains the number one challenge for enterprises today.

But visibility of supply chain status is only part of the challenge. The pandemic shows us just how important it is to be able to assess the impact of outside events on the supply chain, which is not easy to do. External influences on supply chain efficiency and health come in many forms, the most obvious of which being weather events. That’s pretty straightforward. Weather events like hurricanes or typhoons have geospatial coordinates and zones of influence, which can be easily layered on top of a map of suppliers, manufacturers, distribution centers and shipping lanes/routes, to predict when weather will impact the supply chain so an enterprise can take corrective measures, such as enabling an alternative supplier (per above).
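
The overlay itself is not exotic math. As a rough sketch (the coordinates, node names and radius below are invented), it boils down to a distance check from the event’s center to every node in the supply chain, flagging whatever falls inside the zone of influence so corrective measures can start early:

    # Illustrative sketch: flag supply chain nodes inside a weather event's
    # zone of influence. Locations and radius are invented for the example.
    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two points, in kilometers."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * asin(sqrt(a))

    def nodes_at_risk(nodes, event_lat, event_lon, radius_km):
        return [name for name, (lat, lon) in nodes.items()
                if haversine_km(lat, lon, event_lat, event_lon) <= radius_km]

    if __name__ == "__main__":
        supply_chain_nodes = {
            "houston_dc":    (29.76, -95.37),
            "memphis_hub":   (35.15, -90.05),
            "shanghai_port": (31.23, 121.47),
        }
        # A hurricane centered in the Gulf with a 400 km zone of influence
        print(nodes_at_risk(supply_chain_nodes, 28.5, -94.0, 400))
        # -> ['houston_dc']

Layer several such sources – weather events, infection rates – over the same node map, and you have the raw material a control tower needs to predict disruptions rather than just report them.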

But the pandemic has taught us that there are socio-medical impacts to a supply chain as well. COVID-19 infection rate and location data is readily available from many sources, including our IBM-owned Weather Channel, and all of it has geospatial data associated with it. There is a huge opportunity in the supply chain control tower space to overlay these various sources of external stimuli onto an enterprise’s supply chain landscape and apply artificial intelligence to predict disruptions, very similar to how IT operations monitors its infrastructure and predicts potentially impacting trends and disruptions.

TTR vs TTS: Time to recovery (TTR) is a common metric in IT operations, and measures the average time it takes to recover from an infrastructure-impacting event. IT operations leadership’s goal is to drive TTR to as small a figure as possible. The exact same applies to supply chain operations. When there is an impacting event, regardless of the size or scope of impact, minimizing TTR is essential to ensure business continuity.

But unlike IT operations, supply chains also introduce another metric called time to survive (TTS), which basically measures how many days of inventory an enterprise has on hand. If the upstream supply chain is broken, those days of inventory represent just how many days that enterprise has until it must shut down as well. Keeping excess inventory on hand is expensive. It either consumes valuable storage space or creates tax liabilities, so for decades enterprises have striven to minimize the amount of inventory required to be on hand. Just-in-time manufacturing and drop-shipping directly from the vendor are common examples of instruments used to balance low inventory requirements against on-time/in-full order fulfillment. So in most cases, TTS is not very long – a few days at most. And if TTR exceeds TTS, then “Houston, we have a problem.”
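
The arithmetic behind TTS is simple, which is exactly what makes the comparison with TTR so stark. A back-of-the-envelope sketch (the figures are invented):

    # Illustrative sketch: time to survive (TTS) vs. time to recover (TTR).
    # All numbers are invented for the example.
    def time_to_survive(units_on_hand, units_consumed_per_day):
        """Days of inventory left before operations must stop."""
        return units_on_hand / units_consumed_per_day

    def assess(units_on_hand, units_consumed_per_day, estimated_ttr_days):
        tts = time_to_survive(units_on_hand, units_consumed_per_day)
        if estimated_ttr_days > tts:
            return f"TTS {tts:.1f} days < TTR {estimated_ttr_days} days: Houston, we have a problem."
        return f"TTS {tts:.1f} days covers the estimated {estimated_ttr_days}-day recovery."

    if __name__ == "__main__":
        # 12,000 units on hand, burning 4,000 a day, supplier quotes 5 days to recover
        print(assess(12_000, 4_000, 5))
        # -> TTS 3.0 days < TTR 5 days: Houston, we have a problem.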

Enterprises cannot afford to keep excess inventory on hand just in case another global pandemic happens. Neither can IT infrastructures afford to keep excess capacity lying around unused just in case some sudden demand occurs. Modern IT infrastructures have matured to adapt to sudden demand increases by scaling capacity in real time, using monitoring of KPIs like memory, CPU and storage usage to determine where and how to scale and keep ahead of the demand curve. Supply chains must be able to do the same.

On any given day, TTS should be low, to keep costs down and drive higher profit margins. But coupling improved visibility and prediction with alternative supplier onboarding should improve an enterprise’s ability to build additional inventory ahead of a supply-chain-impacting event. Home improvement stores have been doing this for years, suddenly building and shifting inventory of plywood, generators and other emergency preparation and repair materials ahead of a hurricane. They aren’t just trying to serve their community; they don’t want to lose business to stock-outs either.

So as you can see, there are a lot of business continuity planning similarities between IT infrastructure and supply chains. Many of the same disciplines, technologies and practices apply. In both cases, it is about being prepared and about practicing that preparedness. Without practice, a BCP remains a really expensive insurance plan that an enterprise would hesitate to cash in.

Plumbers vs. Painters

If you are like me, you have difficulty sometimes explaining to others what you do for a living. You commonly employ metaphors to describe your job function, or the type of work you do, day to day. Not that the other person is stupid; it’s just that they can’t relate, and you don’t want their eyes glazing over when you talk about your work.

I often describe myself as a plumber, versus a painter. I’ve spent the greater part of my career in networking and network protocols, the original IT plumbing. In fact, when I joined IBM back in 1991, I started with a small team working on a Systems Network Architecture (SNA) protocol stack implementation for DOS. Yes, DOS, as in Disk Operating System, the basic x86 operating system originally designed to run straight off a 5.25″ floppy disk, well before there was Windows or Mac OS. I found myself having a natural affinity for the things running in the backend, and likewise found I didn’t really have a knack for fancy UI or API development. So, I describe myself as a plumber instead of a painter.

I think the metaphor works really well. Most programmers are drawn to either plumbing or painting, but rarely both. Typically you’ll do one only to facilitate the other. However, painters get all the glory. It is easier to show off a good paint job, and really, really nice jobs get you public acclaim and a constant feed of clients. However, a bad paint job won’t condemn a house.

Plumbers don’t get any glory or respect. When you enter a building, you just assume the plumbing works: faucets turn on, hot water is hot, and toilets flush. And a good paint job sells office buildings and homes better than awesome pipe work does. If plumbers do their job right, you’ll never see the pipes. But if plumbing doesn’t work, or fails, terrific disasters typically ensue. Therefore, logically, I would argue that while a good paint job may help attract the buyer, an excellent plumbing job will help keep them as tenants.

It is also a far more difficult and expensive job to maintain plumbing than it is to fix up the walls and slap on some more paint. Sure, a good paint job is labor intensive and disruptive, but it is far more difficult to upgrade plumbing while living in the house. It would be akin to changing a ship’s propeller in high seas. So excellent original design and engineering, as well as ongoing maintenance, are key to keeping the tenants happy and in their homes.

So to all you “plumbers” out there, even though what you do day to day is rarely seen outside your four walls, rest assured that your work is appreciated, valued, and necessary. You can sell an ugly house, but you can’t sell a house without plumbing, not inside any incorporated city limits anyway. Building codes wouldn’t allow it.

Sometimes you just have to quit

I hate saying the word “quit” because it has such negative connotations, but that is really what I mean. In almost any task, there is a point where you know it is time to stop, because any additional effort would not only yield a diminishing return on investment, but could also actually do harm. I came to terms with this on a very large cloud-based SaaS solution project not too long ago.

I was part of a large team building a SaaS solution based on a series of well-established enterprise software solutions. There is plenty of literature and opinion on this topic, most of it saying “don’t do it!” But we did, and made a good business out of it, though as you can imagine there are a lot of cost issues with such solutions. We were in a constant state of finding ways to lower costs and improve profit margin. I have written about costs in the cloud before, and the most obvious hot spots involve human labor, as humans are the single largest cost component in SaaS delivery. So things like automation were first to come to mind.

You can probably relate. After a couple of years, with proper investment, you’ve probably done just about all you can. What you have left would likely take more effort than the resulting savings are worth. Most likely what you are still left with is a large monolithic application (or applications) that still needs to be tested as a system and moved as a unit from test environments, through staging, and into production, so there goes any sense of continuous integration and deployment. Just not possible.

And the act of safely deploying the solution into any environment probably requires the heroic effort of well-trained operations personnel to ensure all the parts are deployed in the right order, successfully, and then tested fully. And it probably still requires a weekend outage window. So much for DevOps.

A peer of mine and I had a conversation on this topic on a fall afternoon. If we had beers in front of us, we’d be crying in them. We were lamenting the fact that we just couldn’t seem to get over the hump, to solve specific automation problems, get humans out of the process flow, and otherwise get our costs down to target levels (admittedly they weren’t all that bad, but just needed to be better). It wasn’t for lack of trying; it was just that we weren’t seeing a lot of success. We came to the realization right then and there that our technology choices were holding us back from making any further forward progress.

Large monolithic application designs, enterprise-grade data storage, lack of continuous availability, and strong ties between application and data schema levels – all prevented us from really reaching the cost nirvana of a well-oiled delivery pipeline. It was time for us to quit trying, to realize our cost structures were what they were, and that the only way to make the really game-changing improvements was to start over and build a cloud-native solution using cloud-appropriate technology, to take what we had learned over the years, especially about how NOT to do things, and do it right from the beginning.

The second unfortunate reality is that the business rarely affords engineering the opportunity to do just that.

The show must go on

“The show doesn’t go on because it’s ready; it goes on because it’s 11:30.” -Lorne Michaels

I saw this quote recently and it made me stop and think. Then I read Tina Fey’s reflection on that very same quote in an excerpt from her book Bossypants. I think she nailed it. “…I think it’s a great lesson about not being too precious about your writing. You have to try your hardest to be at the top of your game and improve every joke you can until the last possible second, and then you have to let it go.”

Hopefully you are already sensing where I’m going with this.

Engineers will tell you that they hate deadlines. We know effort sizings and the resulting plans are flawed to begin with, that no deadline is real, and we thus hate being told when something must be done. I’ve blogged before that we are never actually done with something, but we have to be able to know when something is good enough to go out – to just let it go.

What would actually happen if we told engineering there was no date for delivering the initial release, the MVP (minimum viable product)? Just release it when it’s ready. Well, I’ll tell you what would happen: it would never go out. Engineers would tweak and test it forever, perpetually unsatisfied with the current state. There would also be no sense of excitement that comes with looming deadlines, or accomplishment when they are met. There would be no ability to see what customers do with the result – what they like and dislike, what works and what doesn’t. There would be no sense of what to do next. There would also be no sales, and thus no revenue.

Apply the DevOps and Agile practice of continuous integration and delivery and now you appreciate the benefits of many, many deadlines and deliveries in quick succession – the impact of failure is small and course corrections are faster. Ms. Fey says in that excerpt that “No matter how badly an improv set goes, you will still be physically alive when it’s over.” Yes, we will survive our mistakes and near misses when something goes wrong with a release, and as engineers we will inevitably blame “the date” as one reason it happened. The code simply wasn’t ready. Well guess what – it never will be, given the engineering definition of “ready.” It simply isn’t possible. If I’ve learned only one thing in my nearly three decades of software development, it’s that if we keep testing, we will keep finding bugs and consequently more work to do. The truly hard part is deciding it is “good enough” to go out.

This is why engineering is never, ever celebrating the release of software. It is ALWAYS an anticlimactic event. We KNOW how much work is left before us – the technical debt accumulated from the problems we simply had to punt on, the incoming set of requirements based on feedback from customers who are using the stuff we just released, and the backlog of unaddressed requirements from the previous release we had to defer. The only ones celebrating are the Product Managers, because now they can FINALLY sell something. (Development Managers try to appear like they are celebrating too, because they are good team players, but they, like the engineers, know what’s ahead.)

So for you engineers out there reading this, realize that while you hate having deadlines based on unrealistic dates, they are a necessary forcing function. They force us to make the hard decisions about putting valuable capabilities into our customers’ hands and making money for the business. They force us to be iterative and agile. They force us to realize that perfection is impossible, and as Tina Fey says, it is also overrated and boring.

Transparency in communication

I was stuck in traffic the other day on a two-lane road, at a dead standstill or crawl for almost 1.5 hours. I really didn’t have another way to my destination, so I had to stay the course. Naturally, I began to wonder what in the world was going on up ahead to cause such a setback. Gradually, drivers started getting out of their cars, crossing the street to the opposite shoulder, trying to see what they could, talking to each other to share what little details they had. One or two of them started flagging down drivers coming from the other direction. Most of them had just turned around in frustration, to try to escape in a different direction, so had no additional information to share.

But all this got me wondering: what difference does it make? How does knowing why we were stuck in traffic change the fact that we are stuck? The only thing we could possibly do about it would be what others did: turn around and find another route. I guess knowing whether the road is blocked indefinitely would help make such a decision, but really, there is nothing else you can do but wait it out, especially if that is the only route you can take.

I got to reflecting on that, given the copious amount of time I had on my hands to think. We as humans generally don’t like information voids. I want to understand the reason for something happening, especially if it is impacting me negatively. I may not be able to do anything about it, but understanding the reason is comforting and informs future decisions, such as turning the car around. I simply don’t like not knowing. And this wouldn’t be an entry in my blog if I didn’t somehow draw a parallel between this revelation and the Enterprise Cloud world.

When a cloud service I have worked on experienced a customer-impacting event, or outage, one of the most common pieces of feedback we got from customers wasn’t frustration over the impact of the event, but over the lack of communication during and following the event. I believe there are two facts about engineering nature that get in the way of clear and early communication:

  1. Software engineers don’t like to admit error. Really, this is human nature, but software engineers are particularly prideful and hesitant to divulge details about something wrong with their system.
  2. Software engineers don’t want to communicate misinformation or incomplete information. They want the facts given to customers to be straight.

These two facts result in terse or delayed communications, which often frustrate customers much more than the incident itself. With many cloud vendors, if you want incident details (often termed a Root Cause Analysis, or RCA), you have to formally request it, and then often wait days or weeks for a response, only adding to the frustration.

Having interacted with many such frustrated customers in my career, I find they often feel like they are stuck in traffic, at a dead standstill, and are thirsty for any bit of information available to help them make decisions and possibly take corrective action on their end to work around the disruption. Another important fact that we in the cloud tend to forget is that our customer contacts are typically responsible for hundreds or thousands of other users, and it reflects badly on them when they don’t have the information necessary to respond to their users’ complaints and support tickets.

Coming from traditional on-premises enterprise software delivery, I am painfully aware of the effects of missing or incomplete information around a service disruption. I have spent countless hours in “war rooms” at customer locations with the customers themselves pacing the floor and looking over my shoulder, asking for status every 5 minutes as I debugged memory leaks or runaway CPU issues. It isn’t that they are curious as to the cause – it is that they have executives and users to report status to, often, and need the most up-to-date information possible. I think in the cloud industry we have lost sight of the notion of the customer in our focus on the user. Striving for more self-service, putting more capability in the hands of our customers and administrators, has resulted in a detachment between the solution provider and the customer. While the solution may have changed from traditional on-premises software deployment to cloud delivery, customer expectations around support and transparency in communication certainly have not.

Most of our customers run their own data centers and software solutions, and all customers implicitly understand (though often won’t admit it) that software and hardware solutions are not perfect and failures do happen. I once read a joke that there are only two types of disk drives in the world: ones that have failed, and ones that will fail. I recently said that in front of a customer when explaining the cause of a recent failure. He sardonically replied that there are only two types of data centers too. Indeed.

The lesson here is that frequent and transparent communication about a customer-impacting event goes a long, long way to allay customer fear, uncertainty and doubt (FUD). When an impacting event happens, customers want to be comforted to know:

  1. We are aware of the problem. (You don’t want customers making you aware of an issue with your own service. That’s really, really bad.)
  2. We are actively and aggressively working on it.
  3. Any information available about an estimated time to resolution.

Typically, I have found that if 1 and 2 can be communicated quickly and regularly, 3 can be unknown, for a few iterations at least. In the end, our customers don’t like being stuck in traffic any more than we do. Just like us, they need as much information as possible so a decision can be made, to assert some measure of control in a situation that they otherwise have no control over.

Decisive decisions

This image resembles what the Neches River recently looked like from my seat in an airplane on my way back to my hometown in S.E. Texas. That meandering is what rivers do when the land is really flat, without the force of gravity to encourage the river to cut a straighter course. In this part of the Neches, the water is very slow, and with so many twists and turns it takes forever to get anywhere by boat. The river will change its shape over time, as slight forces act on the turns to curve the river even further, even causing bends to merge and change direction.

While gazing at this view, it occurred to me how closely that resembles what it feels like in an organization that lacks clear direction. Team members wander aimlessly between projects, unable to make decisions because there is no vision to act as a yardstick against which to measure options and make decisions better and faster.

Clarity of vision requires crisp decision making. Leaders need the ability to gather accurate information, weigh the options quickly, make a decision, and stand by it. Decisions don’t necessarily have to be correct, mind you; they just need to be made. There is a lot of value in making a quick decision, putting a stake in the ground and trying that path, even if it ends in failure. We are so conditioned to avoid failure at all costs, even at the risk of indecision. At high velocity, decisions are made quickly, failures are realized sooner rather than later, and course corrections can be made even quicker. We all know we learn more when we fail than when we succeed. In the process, you build an organization that is nimble, forgiving, and wiser in the end.

Crisp, fast decision making is like the force of gravity straightening out that river. Your team moves faster as a result, pushes obstacles out of the way more easily, and course corrects faster. Don’t be afraid to make a bad decision – just make a decision.