MHCI x Meta
Utilizing a Multi-Modal, Large Language Model-based assistant to support people and knowledge systems in Meta’s Hyperscale Data Centers.
The Team
Meta MHCI Capstone
Client
Meta
My Role
Experience Design Lead
Duration
8 Months
THE PROBLEM
Current methods of knowledge sharing within and across data centers are among the key obstacles to automation and scale.
People are a core component of the data center ecosystem, but current systems do not properly support their workflows.
Technicians perform cognitively challenging tasks while dividing their attention between the server rack and their laptop. They may also spend more time searching for solutions than solving problems, all while under strict time pressures from SLAs.
Large, distributed systems are difficult to maintain and even more difficult to access and parse efficiently.
As equipment changes rapidly in data centers, documentation ends up lagging behind. This perpetuates disuse of formal documentation and encourages reliance on a tight-knit, local network.
Inconsistencies in processes lead to infrastructure downtime, hindering scalability.
Critical data in one area of the data center may not be shared at scale, so technicians end up solving the same problems over and over again. This is unnecessary downtime that Meta desperately wants to avoid if it is to scale in response to increasing computational demands.
How might we improve the capture, retrieval, and execution of information to enable engineers within and across data centers to perform at a higher level and learn from others’ experiences?
INTRODUCING...
Nova
An LLM-based smart assistant, trained on various sources of documentation from within Meta's systems, that aims to provide fast, real-time, and relevant support to technicians throughout their ticket resolution process.
DESIGN & IMPLEMENTATION
The Overarching Principles Behind Nova's Design
Relevant & Preemptive
We want our system to act on behalf of the user rather than simply responding to the user, offloading some of the burden of the capture, retrieval, and execution of information.
Integrated & Multi-functional
An LLM's strength lies in sourcing information from a dense database and generating content. We're using this to connect previously isolated knowledge sources.
Invisible & Seamless
Creating documentation does not interrupt a technician's workflow. In fact, most of it is automatic, only needing user intervention for small edits.
Flexible & Adaptable
A technician's ability to interact easily with their laptop varies through the repair process. As a result, flexibility in the modality of interactions drastically improves usability.
DESIGN PRINCIPLE #1
Nova must be flexible and adapt to the engineer's workflow.
sit·u·a·tion·al a·ware·ness
noun
design based on an understanding of how a stakeholder or a user's context influences their physical and mental capabilities at any point in their workflow
(Image Courtesy of Seegrid)
(Image Courtesy of Adobe)
The decision to implement a multi-modal and flexible system traces back to a concept called "Situational Awareness" which we learned about when exploring analogous domains like AMRs in warehousing and physicians in examination rooms. We saw how the modality of the interactions needed to support user capabilities in different situations.
The environment (either the office or the data hall) and the context of a technician's work have varying demands. Therefore, designing interactions for both voice and text enables technicians to maintain focus on their work to avoid errors that occur when under extreme cognitive stress.
DESIGN PRINCIPLE #2
To enable effective information retrieval, information within Nova must be relevant, digestible, and searchable.
"A core capability of a large language model, or LLM, is its ability to recognize, summarize, predict, and generate text and other forms of content based on knowledge gained from massive datasets."
We ran user tests where we asked participants to document important steps after they completed a task. We consolidated documentation across all participants using ChatGPT and provided this information as instructions to a new set of participants. Through this test, we wanted to see if the time taken to complete the task was reduced while maintaining consistency in task replicability. Even with a setup as simple as this, we saw a 16% reduction in the time taken to finish the task using the generated set of steps while task replicability was consistent.
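The consolidation step in this test can be sketched as a simple prompt-assembly function. This is a minimal illustration, not the study's actual pipeline; the wording of the instructions and the `call_llm` client are assumptions.

```python
def build_consolidation_prompt(participant_docs: list[str]) -> str:
    """Assemble one prompt asking an LLM to merge several participants'
    free-form notes into a single, consolidated set of steps."""
    numbered = "\n\n".join(
        f"Participant {i + 1}:\n{doc}" for i, doc in enumerate(participant_docs)
    )
    return (
        "Below are notes from several people who completed the same task.\n"
        "Merge them into one concise, ordered list of steps, keeping any\n"
        "warnings or tips that appear in more than one set of notes.\n\n"
        + numbered
    )

# The assembled prompt would then go to a model, e.g.:
# steps = call_llm(build_consolidation_prompt(docs))  # hypothetical client
```

Keeping the consolidation logic in plain prompt text like this makes it easy to swap models without changing the surrounding workflow.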
Surfacing relevant information immediately builds trust between the technician and Nova. Moreover, it allows Nova to feel more like a helpful co-worker rather than a chatbot; Nova knows what you will need before you even need it.
Various commands accessed via the "/" key ensure information is always available and easily searchable. The commands also serve as a reminder to the user of what Nova is capable of. Additionally, these commands help to focus and optimize manual user searches to reduce the likelihood of the LLM hallucinating.
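A command registry of this kind can be sketched in a few lines. The command names here are illustrative assumptions, not Nova's actual command set; the point is that a recognized "/" command routes the query to a scoped handler instead of an open-ended model call.

```python
from typing import Callable

COMMANDS: dict[str, Callable[[str], str]] = {}

def command(name: str):
    """Register a handler for a /name command."""
    def register(fn: Callable[[str], str]):
        COMMANDS[name] = fn
        return fn
    return register

@command("search")
def search_docs(query: str) -> str:
    # Scoping the query to a known documentation index narrows what
    # the LLM can draw on, reducing room for hallucinated answers.
    return f"searching documentation for: {query}"

@command("ticket")
def ticket_info(query: str) -> str:
    return f"fetching ticket: {query}"

def dispatch(raw: str) -> str:
    """Route '/command args' input to its handler; anything else
    falls through to the general assistant."""
    if raw.startswith("/"):
        name, _, args = raw[1:].partition(" ")
        if name in COMMANDS:
            return COMMANDS[name](args)
    return f"free-form question for the assistant: {raw}"
```

Because the registry is a plain dictionary, the same structure can also generate the reminder list of available commands shown to the user.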
DESIGN PRINCIPLE #3
To enable the capture of knowledge, the documentation process must have as little friction as possible.
Currently, engineers will overwhelmingly prioritize resolution over documentation due to time pressures from SLAs and the effort required to document. We cannot change anything about the SLAs, but we can integrate documentation into their existing workflow to minimize the additional effort needed to document.
Automatic and frictionless documentation means technicians are not burdened with additional cognitive load. Additionally, real-time documentation means that technicians no longer need to rely on recall, which inevitably produces less detail and more errors.
Leveraging an LLM's capabilities of parsing through large quantities of data, we can also introduce methods of allowing technicians to effortlessly add to existing documentation. Our user testing showed that it was nearly 5 times faster for participants to make changes to automatically generated documentation than it was to document from scratch.
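One way to make the capture-then-edit flow concrete is a background log that timestamps actions as the technician works, then renders a draft for them to edit rather than write from scratch. This is a sketch under our own assumptions, not Meta's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class WorkLog:
    """Collects timestamped actions in the background, then renders a
    draft document the technician edits instead of writing from scratch."""
    entries: list = field(default_factory=list)

    def record(self, action: str, ts: Optional[datetime] = None) -> None:
        # Each action is stamped as it happens, so the draft reflects
        # real-time capture rather than after-the-fact recall.
        stamp = (ts or datetime.now()).strftime("%H:%M")
        self.entries.append((stamp, action))

    def draft(self) -> str:
        lines = [f"{i + 1}. [{t}] {a}" for i, (t, a) in enumerate(self.entries)]
        return "Draft repair log (edit as needed):\n" + "\n".join(lines)
```

The draft is the artifact handed to the technician for small edits, matching the finding that editing generated documentation is far faster than authoring it.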
During our research, we found that when participants documented freely, the resulting content did not support others effectively. As the quote below shows, only one sentence points to an actionable item.
"I first started to look for a main function, which was not present. So then I saw a loop which seemed to modify a bunch of values, so I figured that's where the crux of the logic is. So I went through that, and it mostly seemed fine. Then finally, I checked the constants again. Usually, I will check the constants first, but I don't know why I did this in a reverse way."
- Free Documentation
We needed a way of structuring their methods of sharing information. We found questions like "What are the most important things someone else attempting to solve this problem should keep in mind?" to be incredibly helpful for users to reflect on their process. Their responses in turn provided the system with knowledge that it could then include in its generated responses and distribute to future users. Ultimately, this allows Nova to capture not just what a technician did, but why they did it.
"There was a comment at the top of the code that said the variables referred to the pin number on the Arduino board. For example, the variable that defined 'blue' was 3, and then I saw that the variable did not match the pin. I could either change the variable or the position of the pin."
- Structured Documentation
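Pairing the captured "what" with the technician's "why" can be sketched as a small record builder. The first reflection question is taken from our study; the second is a hypothetical addition for illustration only.

```python
REFLECTION_QUESTIONS = [
    "What are the most important things someone else attempting "
    "to solve this problem should keep in mind?",
    "What would you check first next time, and why?",  # hypothetical
]

def structured_entry(steps: list, answers: dict) -> dict:
    """Pair the captured steps ('what') with the technician's answers
    to the reflection prompts ('why') in one retrievable record."""
    return {
        "steps": steps,
        "rationale": [
            {"question": q, "answer": answers[q]}
            for q in REFLECTION_QUESTIONS
            if q in answers
        ],
    }
```

Because the rationale is stored alongside the steps, generated responses to future users can cite not only what was done but why.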
MOTIVATIONS
Emphasizing Long-Term Learning & Growth
Fogg's behavior model asserts that for a target behavior to occur, a person must have the ability to perform the behavior and be sufficiently motivated.
Ability
(Ease of Use)
+
Motivation
(Willingness to Use)
=
Target Behavior
(Nova Usage)
Nova's capabilities like automatic documentation capture, easy information retrieval, and the command prompt window all contribute to ease of use ("ability").
To ensure technicians will be willing to use Nova ("motivation"), we want users to feel like they're growing with the system. We have achieved this by allowing technicians to export reports to share with managers for performance reviews, learn about new topics, and feel as though they are contributing to the knowledge community. Our design shifts the culture away from an environment focused on task-based outputs to one that emphasizes learning and growth.
IMPACT
Generating Value for the Business.
Business Impact #1
Nova enables a significant reduction in task resolution time as well as greater utilization of human resources.
Business Impact #2
Nova offers an opportunity to effectively use LLMs to integrate existing knowledge, develop documentation, and capture human thought processes.
Business Impact #3
Nova provides a foundation for future autonomous data centers by capturing user actions and thought processes.
RESEARCH
A Peek Into Our Research Methods & Processes
Between January and May, our team set out on a journey to understand the intricate data center ecosystem through methods including expert interviews, visits to data centers, literature reviews, and analogous domain research. This is a peek into our methods and process.
STAGE #1
Sense Making
What is the ecosystem like in data centers? What are standard procedures? Who plays what roles? With every little thing we learned, there was a flurry of questions that followed. We quickly scheduled visits with local data centers and began conducting interviews with everyone and anyone with data center experience.
Our visit to CMU's data center
Mapping out relationships and responsibilities between relevant stakeholders
“How might you hide the complexity of the system to support ease of use; keeping in mind the limitations and capabilities of the people who would be using the system regularly?”
- AMR Designer
We learned that data centers are highly fluid and dynamic ecosystems, comprising a wide range of people, software, and hardware. And, what makes the space unique are the contextual challenges like the size of the space, ambient noises, and high visual information density.
One of the many activities we designed and conducted to learn about the data center ecosystem in the early stages of the project
STAGE #2
Connecting the Dots
With multiple data center visits under our belt, extensive background research, and discussions with a variety of robotics experts, we needed to begin bringing these isolated learnings together and thinking deeply about how they connect with one another. Untangling the mass of information actually helped us broaden our scope of opportunities. Some things we began to hone in on included facilitating efficient repair and maintenance through HRI and just-in-time delivery of training.
Affinity Diagramming
Finding the problem(s) worth solving alongside users
A few notable patterns and takeaways began to emerge...
Key Takeaway #1
Keeping updated digital records requires considerable effort.
“Everything that happens within the data center is incredibly dynamic. Things are changing all the time, so it’s really hard to have an up-to-date model stored in a digital twin.”
- Data Center Engineer
Key Takeaway #2
Introducing robotics or automation may require significant investment in shifting worker perception of the tool.
“It was important for us to reframe the story of automation as getting robots, instead of people, to do the ‘dumb, dangerous, and dirty’ tasks.”
- AMR Designer
Key Takeaway #3
Improving upon remote diagnosis processes is of utmost importance.
“If there is an opportunity to automate diagnosis, we try to do it. We invest a lot into improving our tools so we can diagnose issues without human interaction. But it is an ongoing process because we keep adding new platforms.”
- Data Center Engineer
STAGE #3
Scoping In
Knowledge capture, distribution, and execution are challenges prevalent across a multitude of other industries, especially those involving complex, high-stakes environments with a worker hierarchy. The obvious examples include healthcare and hospitals, but it also extends to warehousing, aviation, and manufacturing.
Through interviews with experts in each of these fields, alongside research methods and activities, like the Abstraction Ladder and Body Storming, with two on-site engineers, we identified points of commonality, probed on what has worked and failed in each of these analogous domains, and synthesized these learnings to apply them to data centers.
Up and down the Abstraction Ladder
To design an effective solution that addresses the issues regarding the capture of information, we must ensure that our solution is real-time, integrated, and time-efficient.
To address the issues outlined with knowledge retrieval within data centers, our solution must allow knowledge to be easily digestible, accessible, transparent, and searchable.
In terms of using knowledge, this system needs to ensure knowledge is up-to-date, easily accessible, and centralized to enable users to quickly and easily access the latest and most relevant information.
We proposed that knowledge within the data center exists in three different states and it's the people working within the data center that tie them all together. So, we focused our story on how the lack of documentation of information within informal channels not only leads to increased downtime but also directly affects the replicability of efficient processes.
A model of where knowledge is located within a data center
"A simple question could be asked and answered multiple times, but because it is within a work chat, that information is never made available unless explicitly disseminated."
STAGE #4
Stepping into Design & Implementation
As we move into the summer months, which is the “Design” Phase of our project, there is a lot of ambiguity in how we want to implement our designs. In order to mitigate the long-term risk of our product not generating value for the users and business, our team is committed to iterating and testing quickly.
But, before even thinking about what to build, the more important question was: “Why do we need to build it?” There were key unanswered questions that posed risks to our project, so it was important that we were building the right things. To help us focus on these questions, we developed a series of metrics and design principles.
- The physical comfort of the design should not impact the user’s ergonomic work experience and the quality of work that they produce.
- Users should not feel as though they might be compromised or reprimanded for their actions. They should feel emotionally secure that their skills are being valued.
- The design should not introduce additional mental effort for users.
- Users should feel like they can use designs despite contextual constraints such as noise level, size of space, or PPE requirements.
Our Metrics, which we paired with our design principles to evaluate concepts and prototypes
Finding creative ways of taking a sketch/concept to a prototype that people can interact with
Following testing, our team reconvenes to reflect on the successes and failures. This is an incredibly important part of our process as combing through the qualitative and quantitative data we gathered during testing informs us on how to move forward.
Notes from a discussion about the Roses, Buds, and Thorns of various concepts
Our Team
There will be plenty more to share in the coming weeks, so, stay tuned! Follow our progress on Medium as well!