Karney Li – CTO @ Wealthsimple

A Century Ago

Currently, I’m listening to Ken Follett’s Fall of Giants. I really love historical fiction and in particular Ken Follett’s writing. I read Pillars of the Earth years ago and recently finished listening to World Without End. Fall of Giants takes place during WWI, and the best thing about the work is realizing how much the world has changed in the last 100 years. I’ve grown more grateful to be living in a time with so many luxuries and creature comforts.

Our governments might seem incompetent in the moment, and we should expect better. But compared to today, the governments of 100 years ago were truly incompetent and in turmoil. In 1914, Europe was largely run by the landed aristocracy, and democracy was rare. England ruled over many countries as territories. Universal suffrage was a dream for many countries; even in England, women were only just getting the right to vote. The aristocracy was cruel and poverty commonplace. Racism and classism were the way of life. The world literally fell into the Great War because of silly chains of bilateral agreements to backstop allies. Our governments today are not without their faults, but there’s been objective improvement. The spread of democracy has improved the lives of many, but there are cracks showing. There are social riots in the US, but in general the world is relatively geopolitically stable. Nevertheless, it’s the improvement in life brought on by industrialization and technology that stands out for me.

WWI was largely fought with horse and cart to supply front-line troops. This was a war fought largely on the ground, in the trenches, not the air. Soldiers would charge at machine guns hoping to overwhelm the shooter before they could reload. The Great War was fought by people who didn’t truly understand the power of the weapons they wielded, many of which had never been tested in battle.

The understanding of physiology was primitive at best. At this moment, I sit at home healing from minimally invasive foot surgery. I was put under general anesthesia and they used live x-rays during the procedure to guide a power drill. Such a procedure would have been technologically unfathomable 100 years ago; mine took less than an hour!

In today’s age, I watch YouTube on my flat-screen TV, control it with my smartphone, and write this blog entry on a laptop. A century ago, the only forms of entertainment were theatre, tea time, drink, and sex. Maybe our endless diversions are the things that keep us from going to war.

Today, I often prefer to listen to audiobooks using a wireless headset. It frees up my hands and eyes to work on other things like woodworking with power tools. In 1914, electric lighting inside the home was uncommon except for the most wealthy. Putting on a record or the radio was a great luxury. Today, we expect to stream media and get entertainment on demand. How challenging it would be to pass a pandemic without these advances!

When it gets cold, my smart thermostat knows to turn up the heat. And now I have the equivalent of a printing press in my home, the size of a shoe box. When I need goods, I don’t even necessarily need to leave the house; everything is deliverable through my smartphone. The wonders of the modern age are really amazing when we look at how much the world has changed in the last 100 years.

In Canada, we gripe about our slow COVID-19 vaccination rollout. In 1918, the Spanish Flu killed 20+ million worldwide (somewhere between 1-5% of the world population, depending on your favourite source), and vaccines couldn’t be developed in anything like the timelines we see now. We should be grateful for the advances we’ve seen in science and technology.

Food was rationed during the Great War. Today we suffer from a different problem: overindulgence. During the lead-up to the Russian Revolution, Russians were lining up at midnight to buy rationed bread. Today, instead of starving, we are more likely to eat ourselves to death.

Sometimes technology seems to take longer than we thought it would; people say it took 20 years for the internet to really take off. But it took smartphones less than half that time to become ubiquitous. Today we have vaccines that can be developed in < 12 months instead of years. It’s possible that in the near future, after a virus is sequenced, we could devise a vaccine within weeks. With advances in computing and biosciences this is not unfathomable.

I turn 40 this month, and I hope I can live long enough to see what the next 50 years will bring. It is an exciting time to be alive, and I’m grateful for it.

Cross-Functional Squads

A few years ago, Spotify popularized the idea of organizing engineering teams into squads and guilds. This isn’t actually a new concept; it’s an old one given a new name. Amazon has been organizing development teams into two-pizza teams along the same principles at least as far back as the early 2000s. Organizational design is something that most engineering managers need to tackle at various points in their careers.

When teams are small, they often index for hiring full-stack developers (i.e. jack-of-all-trades developers). This often serves them well, especially when all they can afford is 2-3 developers. As teams grow, individuals tend to specialize. You’d be really hard pressed to find a developer that’s good at mobile, web, backend, distributed systems, devops, data engineering, systems engineering, etc. I’ve never met a developer that was actually excellent across all these engineering areas. I’ve met some that were fearless and willing to dive in and learn, but in most cases, they’re just average in the areas that aren’t their main areas of expertise.

The purpose of a cross-functional team is to outfit the team with the right tools to complete the job without the need to depend on outside help. The more a team needs to collaborate and coordinate externally, the slower it gets. Decoupling teams gives each team autonomy and the ability to own the situation.

Analogy time! Let’s take a basketball team. Everyone on the team should be able to dribble, rebound and perform layups; these are basic skills. But individual players will have their areas of expertise: some will be excellent 3-point shooters, some excellent at posting up under the net because of their size, and others great at play-making because of their speed and agility. A team with 5 centers in the starting lineup would likely be a pretty bad team. They’d be slow to move down the court, have their passes constantly intercepted, and have balls stolen whenever they dribbled against a smaller, more agile player. Each person on a team is filling a position; a good team is a team of complementary players that help each other be effective at what they do best.

We want developers that in a pinch could debug a web form or look at why an API is suddenly slow. This isn’t necessarily a full-stack developer; it’s just a willingness to do something when called upon, and a lot of it is just basic debugging skills and an understanding of how software works. In a pinch, a center can shoot a 3-pointer, but you probably don’t want them to do that all the time. If you want your teams to be effective, get folks to focus on what they do best.

Making bad decisions under stress

Sometimes people get stressed out about their job and want to quit. The stress is real and can consume their waking thoughts and keep them up at night. That lack of recuperation is a death spiral: their bodies and minds don’t rest and they end up in an even more stressed state. In most cases, when in this state of mind, they probably don’t even end up doing their best work.

Many are able to realize that it’s hurting their ability to function, either socially or professionally. They snap at their colleagues and family. They know it isn’t healthy, and they want it to just stop. They might give off hints that they’re stressed, ask for less responsibility, or start getting sick. In the end, they usually submit a resignation.

Making a decision to resign is a cry for help. To a manager, it definitely raises the alarm bells, especially if you believe that person to be worth saving.

As a leader, I try to help them see the bigger picture, take a step back from the ledge, and see there are other options available to them, usually options that I can offer and that they might have thought impossible: a sabbatical, a change of team, a different working arrangement, additional resources, etc. Offering additional compensation is not the solution; it will not solve the cause of the work-related stress. I usually present a few options and tell them to take a weekend or an evening to think things over. Sometimes taking in the options to consider can be stressful too. Ask them if working at the company is something they still want and, if so, in what capacity. If they come back and still insist on resigning, it’s probably something else.

Understand that people in stressed-out states usually only see a limited set of outs. They put it all on themselves: work harder, work longer, or quit. They become hyper-focused but myopically so, and seeing the wider gamut of options is difficult. As leaders, we can help them see those options and that the problem is solvable.

Calendar is like a hard drive

I recently noticed how my calendar is very similar to a magnetic hard drive.

Back when magnetic hard drives were more commonplace, a drive head read sectors of a magnetic disk as it spun. Sequential reads were performant, whereas non-sequential reads were very expensive because they would require the disk to spin around again before the head could read that sector to retrieve the block. When a file was written to disk, the disk controller would try its best to find a contiguous block where it could fit the entire file. If it couldn’t, it would need to split the file into multiple sub-blocks; ideally, these would still be in sequence as the disk turns. This type of write strategy leads to file fragmentation over time if contiguous blocks can’t be found to write the files.

Calendars are somewhat like this, but they also have some characteristics of a ticker tape (things that happened in the past are less important than upcoming events or those in the immediate future).

In order to do effective work, people need contiguous blocks of time. Two hours is generally enough time to work on a task and get it to a state of some semblance of coherence that can be shared with others for feedback. If you break two hours into eight 15-minute blocks, chances are the output will be scattered, a lot less voluminous, and of low quality.

Sometimes people like breaks in their calendar because they find back-to-back meetings for hours on end to be exhausting. So they leave holes in their calendars. I generally avoid this. I found that if I’m going to do 1:1s, I’d rather be in the 1:1 flow and just do a whole string of them back-to-back. Over the years, my stamina for paying attention has improved and I don’t find it exhausting at all.

To enable sequential reads, I often book off time for concentrated work. Thoughtful work takes focused time. Multi-tasking is an illusion, and context switching is an expensive tax. With computers, we can compute the cost of their context switches, but humans are poor at quantifying it.

One strategy I often employ at the beginning of the day is the calendar defrag. I look through my calendar, and inevitably there are some random 1:1s or meetings smack in the middle of an otherwise potential 2-hr block of free time. In those cases, I’ll request to move the event and reserve the 2-hr block for focused work.
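The defrag pass can be sketched in a few lines of code. This is only a minimal sketch, assuming meetings are non-overlapping and expressed in minutes since midnight; the names `Meeting` and `find_defrag_candidates` and the 9-to-5 workday bounds are hypothetical and not tied to any real calendar API. It flags a meeting when moving it would merge two smaller gaps into a free block of at least two hours.

```python
from dataclasses import dataclass

FOCUS_BLOCK_MIN = 120  # the 2-hr block from the text, in minutes


@dataclass(frozen=True)
class Meeting:
    title: str
    start: int  # minutes since midnight
    end: int    # minutes since midnight


def find_defrag_candidates(meetings, day_start=9 * 60, day_end=17 * 60):
    """Return meetings that, if moved, would open up a >= 2-hr free block.

    Assumes meetings do not overlap; the workday bounds are illustrative.
    """
    ordered = sorted(meetings, key=lambda m: m.start)
    candidates = []
    for i, meeting in enumerate(ordered):
        # Free time runs from the end of the previous meeting
        # to the start of the next one (or the workday bounds).
        prev_end = ordered[i - 1].end if i > 0 else day_start
        next_start = ordered[i + 1].start if i + 1 < len(ordered) else day_end
        gap_before = meeting.start - prev_end
        gap_after = next_start - meeting.end
        merged = next_start - prev_end
        # Flag the meeting only if it is what breaks up the block:
        # neither side is a full focus block, but together they are.
        fragments = gap_before < FOCUS_BLOCK_MIN and gap_after < FOCUS_BLOCK_MIN
        if fragments and merged >= FOCUS_BLOCK_MIN:
            candidates.append(meeting)
    return candidates


calendar = [
    Meeting("Standup", 9 * 60, 9 * 60 + 30),
    Meeting("Random 1:1", 10 * 60 + 30, 11 * 60),
    Meeting("Staff meeting", 11 * 60 + 30, 12 * 60 + 30),
    Meeting("Planning", 12 * 60 + 30, 13 * 60 + 30),
]
to_move = find_defrag_candidates(calendar)
print([m.title for m in to_move])  # ['Random 1:1']
```

Here only the 1:1 at 10:30 is flagged, because moving it frees 9:30 to 11:30. The afternoon already has a long open stretch, so nothing there needs to move.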

Give it a try. I’m sure you’ll find you become less randomized, more focused, and produce higher-quality work.

It doesn’t scale

“It doesn’t scale” is something I say often. How do we know something doesn’t scale? Software and organizations are systems; systems have inputs and create outputs. The amount of output in relation to the inputs gives an idea of the throughput of the system. It’s not necessary to get super scientific or mathematical about this, just understand the thought process. If we increase one of the input variables and the system can’t keep up, this is a failure to scale.

It’s the role of a manager/operator to understand the bottlenecks in a system and devise solutions and to have the foresight to plan for addressing future scaling bottlenecks.

A simple exercise is to take each of the input variables and increase it individually by the next order of magnitude (e.g. 10x the customers, 10x the page requests, or 10x the transactions), then work out what would happen to the system. Then identify the parts of the system that would need to compensate for the increase in input and determine if the compensation is linear.

Parts of the system that scale linearly in relation to the inputs are the constraints of the system. Perhaps your input is “customers” and you discover that scaling customers requires a corresponding linear increase in employees (e.g. each customer requires 5 additional staff). If you want to 10x the number of customers, it would require a corresponding 10x increase in employees. To illustrate why this is not a good relationship, let’s assume you have 10 customers and 50 employees. Would you be able to hire an additional 450 employees to support 100 customers? Maybe you could, but what’s the lead time for that, and what systems and processes would you need to put in place for it to happen? Growing from 50 employees to 500 employees is fraught with problems you have never encountered before. It’s for you to determine if that’s reasonable, or if there’s a better solution.
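The arithmetic behind this example is easy to work through in code. The following is a minimal sketch that assumes the 5-staff-per-customer ratio given above; the function name `staff_needed` is hypothetical and exists only for illustration.

```python
STAFF_PER_CUSTOMER = 5  # ratio taken from the example in the text


def staff_needed(customers: int, staff_per_customer: int = STAFF_PER_CUSTOMER) -> int:
    """Headcount required when staffing grows linearly with customers."""
    return customers * staff_per_customer


current_customers = 10
current_staff = staff_needed(current_customers)

# Apply the 10x exercise to the "customers" input.
target_customers = current_customers * 10
target_staff = staff_needed(target_customers)
additional_hires = target_staff - current_staff

print(f"{current_customers} customers -> {current_staff} staff")
print(f"{target_customers} customers -> {target_staff} staff")
print(f"additional hires required: {additional_hires}")

# Linear scaling means the constrained part grows by the same
# factor as the input that drives it.
print(f"staff growth factor: {target_staff / current_staff:.0f}x")
```

Running the same calculation for each input in turn is a quick way to spot which parts of the system grow in lockstep with demand.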

Solving for bottlenecks involves understanding the fundamental relationship between parts of the system. Merely doing something faster may not yield the increase in output desired. I often seek leverage, which occurs when effort has a multiplier effect on the output.

Bespoke things generally do not scale, but if those bespoke things can have leverage then there’s a multiplier effect. Let’s take a tailor on Savile Row. A tailor that creates bespoke suits might produce a product of very high quality, but it relies on their singular skill to design a suit for each client. It requires hands-on measuring, fitting and re-fittings. The only way they can grow their business is either by charging more for their services or by working longer hours; this has limits and is a real-world bottleneck. Technology has always provided avenues to solve bottlenecks: it can help decouple parts of the process, or introduce parallelism or automation. As an exercise, think of how you might scale this business. Determine what you are trying to solve for and what you are willing to compromise on to get a result that meets those objectives.

TIL – You can’t wire from TD to TD

Over the weekend, I tried to wire funds from my TD bank account to another TD account. Yesterday I learned it had bounced back.

At the time I sent the wire, I noted to the teller that it was going to another TD account, and she didn’t seem to have any concerns. This morning, I went to the branch at Brookfield Place to ask the branch manager to investigate.

While it’s nice that I get to save on the wire transfer fees, this seems like a broken abstraction. There’s no reason why TD’s wire department couldn’t detect this case and just do the intra-bank account-to-account transfer. To them, this should just be an internal accounting issue. Another thing that I learned through this process was that this limitation was little known at TD, across multiple branches.

Educating tellers and staff is not high leverage; humans easily forget edge cases they don’t encounter often. Their systems should either have abstracted away this limitation or informed the staff that this should be done via an account-to-account transfer. Catching exceptions early or making them transparent to the end client leads to better end-user experiences.
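One way to picture hiding the limitation behind the abstraction is a routing step that inspects the destination before choosing a transfer method. This is a purely hypothetical sketch and not a description of how TD’s systems work; the `Account` type, the institution identifiers, and `route_transfer` are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class TransferMethod(Enum):
    INTERNAL = "account-to-account transfer"
    WIRE = "wire transfer"


@dataclass(frozen=True)
class Account:
    institution: str  # hypothetical institution identifier, e.g. "TD"
    number: str


def route_transfer(source: Account, destination: Account) -> TransferMethod:
    """Pick the transfer method so staff and clients never hit the edge case."""
    if source.institution == destination.institution:
        # Same bank: handle it as an internal transfer instead of
        # sending a wire that will bounce back.
        return TransferMethod.INTERNAL
    return TransferMethod.WIRE


mine = Account("TD", "0001")
theirs = Account("TD", "0002")
elsewhere = Account("OTHER_BANK", "0003")

print(route_transfer(mine, theirs).value)     # account-to-account transfer
print(route_transfer(mine, elsewhere).value)  # wire transfer
```

With the check in the system rather than in the teller’s memory, the rare case is handled the same way every time.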

Deconstructing Agile

To really understand something, we need to read beyond the surface, beyond the rituals, and understand the underlying motivations. The Agile Manifesto for developing software isn’t prescriptive about methodologies. The manifesto states only four core values:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

There are many methodologies that teams can adopt to develop software in an Agile manner; Scrum, XP and Lean are amongst the more popular methods. Often teams adopt practices they like with varying degrees of success.

Software development is often over-complicated. The goal of creating software is to build something that people want. To determine what people want, we need to get customer feedback. Time spent creating things that the customer doesn’t value is a waste of time and effort.

The whole point of Agile is to release software more frequently, because the more frequently software is released, the sooner we can get customer feedback.

The objective of Agile is to optimize for customer feedback.

Without loss of generality, Scrum is really a compressed waterfall, one compressed into a 1 to 2-week “sprint”. When time is compressed, it puts constraints on a team; this is by design. By constraining the amount of time allotted to sprint planning, the team is forced to find other ways of describing and estimating work. They no longer have time to collect detailed requirements, write comprehensive design docs, or come up with detailed work plans and effort estimates. The team is forced to optimize for what is deemed high-value for the customer. At the end of the sprint, the team gets feedback from the customer. At worst, they will have worked on something for 2 weeks that the customer doesn’t want, and they can then pivot the very next sprint. This is pretty good; it’s unequivocally better than working on something for months or years only to have the customer reject it.

If you try to optimize for customer feedback, you’ll notice that there’s still a potential 2 weeks’ worth of wasted time. The thing that gates customer feedback is a team’s frequency of releasing software. By releasing even more frequently, the team can get feedback even sooner. The feedback enables customer collaboration, a key value of Agile.

Lean can push this to the extreme. Assume developers are expected to deploy to production what they started on that morning. Now instead of releasing only at the end of a sprint, they release changes every day so that customer feedback can be solicited even sooner. This constraint means the changes deployed to production need to be small, and small changes also limit deployment risk. Instead of a worst-case scenario of wasting 2 weeks, the worst case is only a wasted day.

Stand-ups, sprint planning, kanban, kaizens, demos, velocity tracking, and retrospectives are just rituals; they don’t necessarily make your team Agile. I find many of these so-called Agile rituals downright wasteful. Understand the rationale behind the rituals and what to optimize for, and it will lead to appreciation and enlightenment.

One Piece Flow

Many people don’t realize that Lean Software Development is based on the Toyota Production System (TPS). Toyota’s way of engineering cars is radically different from how most companies build cars. Instead of batch and queue (i.e. mass-production assembly line), their manufacturing process is optimized for one-piece flow and pull processes. Toyota’s way of building cars has been shown to produce higher quality products, increased productivity, faster time to market, and happier, more accountable employees. Assembly-line mass production, by contrast, often results in over-production, lots of waste, and demoralized employees.

Achieving one-piece flow is about figuring out how to create a single quality unit as fast as possible while minimizing inventory, wait time, transportation, hand-offs, etc. Through this process, the overall time from beginning to end is minimized to contain only value-add steps. When there are defects in the manufacturing, the entire process must halt and the problem must be fixed immediately to ensure quality is built in and there are no defects in the end product. Many organizations have problems implementing a true one-piece flow because it requires constant improvement, or kaizen.

Continuous Integration & Continuous Delivery (CI/CD) is the Toyota Production System’s concept of Flow applied to software development. At Toyota, the one piece they’re making is a single car. In software, the analogous one piece is a deployable software feature/task. Assuming trunk-based development and a modern containerized CI/CD setup, a developer typically writes code, runs unit tests locally and then commits it to a pull request. As the pull request is committed against, the new code is continuously integrated and tested by the continuous integration system (i.e. built-in quality). After the pull request is reviewed by a peer, the system permits it to be merged into the master branch. At this point, a container is automatically built by the build system. Once built, the container is sent to a staging environment, and service tests are run against it. If it passes the tests, the container is blessed and deployed to production. Deployments to production must take minimal time but be fail-safe (e.g. blue-green); if they fail, they must roll back automatically. This process is a true one-piece flow: the build contains only the feature that was asked for. If something breaks during any of the steps, the developer is right there to fix it immediately. They don’t move on to another task until the feature is deployed and running successfully in production; they stay engaged the entire time.
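The flow described above can be sketched as a chain of gates where any failure stops the line. This is a toy model rather than a real CI/CD configuration: the stage names follow the paragraph, while `Stage`, `run_pipeline`, and the boolean lambdas are hypothetical stand-ins for real build, test, and deploy steps.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the stage succeeds


def run_pipeline(stages: List[Stage], deploy: Callable[[], bool],
                 rollback: Callable[[], None]) -> str:
    """Run one change through every gate; stop the line on the first failure."""
    for stage in stages:
        if not stage.run():
            # Built-in quality: nothing downstream runs until this is fixed.
            return f"halted at {stage.name}"
    if not deploy():
        # A failed production deploy is rolled back automatically.
        rollback()
        return "rolled back"
    return "deployed"


events = []  # records rollbacks so the example can be inspected

stages = [
    Stage("unit tests", lambda: True),
    Stage("peer review", lambda: True),
    Stage("container build", lambda: True),
    Stage("service tests on staging", lambda: True),
]

happy_path = run_pipeline(stages, deploy=lambda: True,
                          rollback=lambda: events.append("rollback"))
failed_deploy = run_pipeline(stages, deploy=lambda: False,
                             rollback=lambda: events.append("rollback"))

broken = [Stage("unit tests", lambda: False)] + stages[1:]
failed_tests = run_pipeline(broken, deploy=lambda: True,
                            rollback=lambda: events.append("rollback"))

print(happy_path)     # deployed
print(failed_deploy)  # rolled back
print(failed_tests)   # halted at unit tests
```

The point of the sketch is the shape, not the details: one change moves through the gates in order, and the developer hears about a failure at the exact step where it happened.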

There are pitfalls that people sometimes fall into while developing software. Sometimes people bundle multiple features into that deployable one piece. Batching actually increases risk: more new code increases the chance that something is defective. When something is defective, it now requires rolling back the batch of features instead of the single feature.

Sometimes people cut corners and don’t write unit tests, integration tests or service tests. This is a false economy; you can’t have built-in quality if you don’t employ the right processes and safety nets. The engineers at Toyota say, “the right process will create the right results”. At Amazon.jp warehouses, they have a saying plastered everywhere, “safety is #1”. The same concepts apply to software development. The tests are the safeties.

There’s a lot of tooling that needs to be invested in to make one-piece flow work. Sometimes people don’t invest the time to put the right machinery in place or to keep improving it to make the processes more efficient (reducing waste). Good hygiene and exercise are necessary if you don’t want to get slow and fat.

Software can be complicated, but the process of developing software doesn’t need to be.

Thinking Distributed

The best analogy I can think of for scaling an organization to think distributed is the contrast between communist and capitalist methods of government. Objectively speaking, one is not necessarily better than the other, but in practice one scales better in our current world.

Communism is centralized command-and-control; it is a non-distributed organization. In theory, if you had perfect knowledge and the ability to react in real time, there would be minimal waste and demand would always be met with the right amount of supply. In practice, it hasn’t played out this way, probably because we don’t live in a world of perfect data and instantaneous data processing. Capitalism, in contrast, relies on the invisible hand; it argues that in efficient markets, supply and demand will balance themselves through price discovery. There might be some degree of loss in the beginning, but the market will come to an equilibrium fast and the overall oversupply or undersupply will be minimized.

In capitalism, when the government wants people to spend money, it lowers interest rates. Instead of directing, it guides. It creates incentives for the market to move in the desired direction but it doesn’t direct. In communism, there’s a required feedback loop to central command so it can process the data and tell each type of producer the next steps. This often leads to either over-production or under-production, because it’s easy to have incomplete information or data processing delays.

Leading a distributed organization is similar to working with monetary policy. We want to create the right incentives, and we want the default thing to be the obvious thing. We want to build the necessary tools and safety nets so that people can focus on their objectives and not worry about duplicating work. We want teams to operate independently, instead of relying on upper management (central command) for next steps.

Leadership cannot scale to have perfect knowledge of everything teams do; otherwise the leadership team ends up with a massive reporting apparatus just to track work output, and teams end up idle waiting for the next instructions. Instead, leadership needs to empower teams to want to act in their own best interests. Leadership can guide teams by helping them clearly define KPIs, but how they go about achieving those KPIs should be left in their hands. Doing anything differently will only disempower them.

Disagree and commit

We live in a world of inter-subjective realities. Humans created religions, economic models, monetary systems, government, corporations, and moral values. None of these things exist in nature, they were created from our imagination and we applied human value to them.

When a person joins a new company, they will often discover that the company has a set of values. These values were chosen by the leaders of the company because they want these values to guide the company’s actions. These values are frequently repeated and made highly visible to keep people aligned. The company’s values might actually differ from an employee’s own personal values, and this leads to some degree of cognitive dissonance. The employee may disagree with some of the values, but the company doesn’t need them to agree, only to commit to them.

Humans are masters of overcoming cognitive dissonance. When someone converts from one religion to another, seldom do they buy in 100%. There are likely beliefs, values, or rituals they don’t agree with or understand, but they still converted to this new religion and go through the rituals because they’ve committed to doing so.

Imagine someone immigrating to a new country. This person might immigrate because they like the way of life, the natural landscape, the employment opportunities, the people or the social benefits. They might not agree with the taxation system and how that money is spent (cognitive dissonance) but the government still expects them to commit to paying their taxes.

As a leader, it’s important to recognize the difference between personal emotional comfort vs operational need. We often want people to agree with our values when it is more imperative that we get people to commit and deliver.