Procedural Activity Representation and Online Mistake Detection from Egocentric Videos (2024)

Luigi Seminara  Giovanni Maria Farinella  Antonino Furnari

Department of Computer Science, University of Catania
luigi.seminara@phd.unict.it,{giovanni.farinella,antonino.furnari}@unict.it

Abstract

Procedural activities are sequences of key-steps aimed at achieving specific goals. They are crucial to build intelligent agents able to assist users effectively. In this context, task graphs have emerged as a human-understandable representation of procedural activities, encoding a partial ordering over the key-steps. While previous works generally relied on hand-crafted procedures to extract task graphs from videos, in this paper, we propose an approach based on direct maximum likelihood optimization of edges’ weights, which allows gradient-based learning of task graphs and can be naturally plugged into neural network architectures. Experiments on the CaptainCook4D dataset demonstrate the ability of our approach to predict accurate task graphs from the observation of action sequences, with an improvement of +16.7% over previous approaches. Owing to the differentiability of the proposed framework, we also introduce a feature-based approach, aiming to predict task graphs from key-step textual or video embeddings, for which we observe emerging video understanding abilities. Task graphs learned with our approach are also shown to significantly enhance online mistake detection in procedural egocentric videos, achieving notable gains of +19.8% and +7.5% on the Assembly101 and EPIC-Tent datasets. Code for replicating experiments is available at https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.

1 Introduction

Procedural activities are fundamental for humans to organize tasks, improve efficiency, and ensure consistency in the desired outcomes, but they require time and effort to be learned and executed effectively. This makes the design of intelligent artificial agents able to assist users in correctly performing a task appealing [19, 27]. Achieving these abilities requires building a flexible representation of a procedure, encapsulating knowledge on the partial ordering of key-steps arising from the specific context at hand. For example, a virtual assistant needs to understand that it is necessary to break eggs before mixing them, or that a bike's brakes need to be released before removing the wheel. Importantly, for a system to be scalable, this representation should be automatically learned from observations (e.g., humans making a recipe many times) rather than explicitly programmed by an expert.

Previous approaches focused on directly tackling tasks requiring procedural knowledge, such as action anticipation [15, 13, 30] and mistake detection [33, 12, 6, 37, 14], without developing explicit representations of the procedure. Other works proposed neural models able to develop implicit representations of the procedure by learning how to recover missing actions [39, 25], discover key-steps [10, 4, 5], or ground them in video [9, 22]. A different approach [3, 9, 16] consists in representing the structure of a procedure in the form of a task graph, i.e., a Directed Acyclic Graph (DAG) in which nodes represent key-steps and directed edges impose a partial ordering over key-steps, encoding dependencies between them (see Figure 1(a)). Graphs provide an explicit representation which is readily interpretable by humans and easy to incorporate in downstream tasks such as detecting mistakes or validating the execution of a procedure. While graphs have historically been used to represent constraints in complex tasks and to design optimal sub-task scheduling [34], graph-based representations mined from videos [3], key-step sequences [35, 18], or external knowledge bases [40] have only recently emerged as a powerful representation of procedural activities able to support downstream tasks such as key-step recognition or forecasting [3, 40]. Despite these efforts, current methods rely on meticulously crafted graph mining procedures rather than setting graph generation in a learning framework, limiting the inclusion of task graph representations in end-to-end systems.

[Figure 1]

In this work, we propose an approach to learn task graphs from demonstrations, in the form of sequences of key-steps performed by real users in a video while executing a procedure. Given a directed graph represented as an adjacency matrix and a set of key-step sequences, we provide an estimate of the likelihood of observing the set of sequences given the constraints encoded in the graph. We hence formulate task graph learning under the well-understood framework of Maximum Likelihood (ML) estimation, and propose a novel differentiable Task Graph Maximum Likelihood (TGML) loss function which can be naturally plugged into any neural-based architecture for direct optimization of task graphs from data. Intuitively, our TGML loss generates positive gradients to strengthen the weights of directed edges $B \to A$ when observing the $\langle \ldots, A, \ldots, B, \ldots \rangle$ structure, while pushing down the weights of all other edges in a contrastive manner (see Figure 1(b)). To evaluate the effectiveness of the proposed framework, we propose two approaches to task graph learning. The first approach, "Direct Optimization (DO)", uses the proposed TGML loss to directly optimize the weights of the adjacency matrix, which constitute the only parameters of the model. The output of the optimization procedure is the learned graph. The second approach, termed Task Graph Transformer (TGT), is a feature-based model which uses a transformer encoder and a relation head to predict the adjacency matrix from either text or video key-step embeddings.

We validate the ability of our framework to learn meaningful task graphs on the CaptainCook4D dataset [26]. Comparisons with state-of-the-art approaches show superior performance of both proposed approaches on task graph generation, with boosts of up to +16.7% over prior methods. On the same dataset, we show that our feature-based approach implicitly gains video understanding abilities on two fundamental tasks [42]: pairwise ordering and future prediction. We finally assess the usefulness of the learned graph-based representation on the downstream task of online mistake detection in procedural egocentric videos. To tackle this task, we observe that procedural errors mainly arise from the execution of a given key-step without the correct execution of its pre-conditions. We hence design an approach which uses the learned graph to check whether the pre-conditions of the current action are satisfied, signaling a mistake when they are not, obtaining significant gains of +19.8% and +7.5% on the online mistake detection benchmark recently introduced in [12] on Assembly101 [33] and EPIC-Tent [17], showcasing the relevance and quality of the learned graph-based representations.

The contributions of this work are: 1) We introduce a novel framework for learning task graphs from action sequences, which relies on maximum likelihood estimation to provide a differentiable loss function which can be included in end-to-end models and optimized with gradient descent; 2) We propose two approaches to task graph learning, based on direct optimization of the adjacency matrix and on processing key-step text or video embeddings, which offer significant improvements over previous methods in task graph generation and show emerging video understanding abilities; 3) We showcase the usefulness of task graphs in general, and of the learned graph-based representations in particular, on the downstream task of online mistake detection from video, where we improve over competitors. The code to replicate the experiments is available at https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.

2 Related Work

Procedure Understanding Previous investigations considered different tasks related to procedure understanding, such as inferring key-steps from video in an unsupervised way [41, 43, 11, 4, 5, 10], grounding key-steps in procedural video [22, 8, 9, 24], recognizing the performed procedure [21], inferring key-step orderings [4, 5, 22, 9, 39], and procedure structure verification [25]. Recently, task graphs, mined from video or external knowledge such as WikiHow articles, have been investigated as a powerful representation of procedures and proved advantageous for learning representations useful for downstream tasks such as key-step recognition and forecasting [40, 3].

Differently from previous works [25, 39], we aim to develop an explicit and human-readable representation of the procedure which can be directly plugged in to enable downstream tasks [3], rather than an implicit representation obtained with pre-training objectives [40, 25]. As a departure from previous paradigms which carefully designed task graph construction procedures [3, 40, 35, 18], we frame task graph prediction in a general learning framework, enabling models to learn task graphs directly from input sequences, and propose a differentiable loss function based on maximum likelihood.

Task Graph Construction A line of works investigated the construction of task graphs from natural language descriptions of procedures (e.g., recipes) using rule-based graph parsing [32, 9], defining probabilistic models [20], fine-tuning language models [31], or proposing learning-based approaches [9] involving parsers and taggers trained on text corpora of recipes [7, 38]. While these approaches do not require any action sequence as input, they depend on the availability of text corpora including procedural knowledge, such as recipes, which often fail to encapsulate the variety of ways in which a procedure may be executed [3]. Other works proposed hand-crafted approaches to infer task graphs by observing sequences of actions depicting task executions [18, 35]. Recent work designed procedures to mine task graphs from videos and textual descriptions of key-steps [3] or by cross-referencing visual and textual representations from corpora of procedural text and videos [40].

Differently from previous efforts, we rely on action sequences grounded in video, rather than natural language descriptions of procedures or recipes [31, 9], and frame task graph generation as a learning problem, providing a differentiable objective rather than resorting to hand-designed algorithms and task extraction procedures [18, 35, 3, 40].

Online Mistake Detection in Procedural Videos Despite the interest in procedural learning, mistake detection has been systematically investigated only recently. Some methods considered fully supervised scenarios in which mistakes are explicitly labeled in video and mistake detection is performed offline [33, 37, 26]. Other approaches considered weak supervision, with mistakes being labeled only at the video level [14]. Finer-grained spatial and temporal annotations are exploited in [6] to build knowledge graphs, which are then leveraged to perform mistake detection. Recently, the authors of [12] proposed an online mistake detection benchmark incorporating videos from the Assembly101 [33] and EPIC-Tent [17] datasets, as well as PREGO, an approach to online mistake detection in procedural egocentric videos.

Rather than addressing online mistake detection with implicit representations [12] or carefully designed knowledge bases [33], we design a simple approach which relies on learned explicit task graph representations. As we show in the experiments, this leads to significant performance gains over previous methods, even when the predicted graphs are suboptimal, while the best results are obtained with task graphs learned within the proposed framework.

3 Technical Approach

3.1 Task Graph Maximum Likelihood Learning Framework

Preliminaries Let $\mathcal{K} = \{K_0 = S, K_1, \ldots, K_n, K_{n+1} = E\}$ be the set of key-steps involved in the procedure, where $S$ and $E$ are placeholder "start" and "end" key-steps denoting the start and end of the procedure. We define the task graph as a directed acyclic graph, i.e., a tuple $G = (\mathcal{K}, \mathcal{A}, \omega)$, where $\mathcal{K}$ is the set of nodes (the key-steps), $\mathcal{A} = \mathcal{K} \times \mathcal{K}$ is the set of possible directed edges indicating ordering constraints between pairs of key-steps, and $\omega : \mathcal{A} \to [0,1]$ is a function assigning a score to each of the edges in $\mathcal{A}$. An edge $(K_i, K_j) \in \mathcal{A}$ (also denoted as $K_i \to K_j$) indicates that $K_j$ is a pre-condition of $K_i$ (for instance $\text{mix} \to \text{crack egg}$) with score $\omega(K_i, K_j)$. We assume normalized weights for outgoing edges, i.e., $\sum_j \omega(K_i, K_j) = 1 \;\forall i$. We also represent the graph $G$ as the adjacency matrix $Z \in [0,1]^{n \times n}$, where $Z_{ij} = \omega(K_i, K_j)$.
For ease of notation, we will denote the graph $G = (\mathcal{K}, \mathcal{A}, \omega)$ simply with its adjacency matrix $Z$ in the rest of the paper. We assume that a set of $N$ sequences $\mathcal{Y} = \{y^{(k)}\}_{k=1}^{N}$ showing possible orderings of the key-steps $\mathcal{K}$ is available, where the generic sequence $y \in \mathcal{Y}$ is defined as a sequence of indexes to key-steps $\mathcal{K}$, i.e., $y = \langle y_0, \ldots, y_t, \ldots, y_{m+1} \rangle$, with $y_t \in \{0, \ldots, n+1\}$. We further assume that each sequence starts with key-step $S$ and ends with key-step $E$, i.e., $y_0 = 0$ and $y_{m+1} = n+1$ (in practice, we prepend/append $S$ and $E$ to each sequence), and note that different sequences $y^{(i)}$ and $y^{(j)}$ have in general different lengths. Since we are interested in modeling key-step orderings, we assume that sequences do not contain repetitions (since sequences may in practice contain repetitions, we map each sequence containing repetitions to multiple sequences with no repetitions, e.g., $ABCAD \to (ABCD, BCAD)$). We frame task graph learning as determining an adjacency matrix $\hat{Z}$ such that sequences in $\mathcal{Y}$ can be seen as topological sorts of $\hat{Z}$. A principled way to approach this problem is to provide an estimate of the likelihood $P(\mathcal{Y}|Z)$ and choose the maximum likelihood estimate $\hat{Z} = \arg\max_Z P(\mathcal{Y}|Z)$.
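
Since the mapping from sequences with repetitions to repetition-free ones is only given by example, the following Python sketch shows one possible reading of it (keep one occurrence of each repeated key-step at a time); the function name and the enumeration strategy are ours, not necessarily the authors' exact procedure.

```python
from itertools import product
from collections import defaultdict

def split_repetitions(seq):
    """Map a key-step sequence with repetitions to repetition-free sequences,
    keeping one occurrence of each repeated key-step at a time, e.g.
    "ABCAD" -> ["ABCD", "BCAD"]. Sketch based on the example in the paper."""
    positions = defaultdict(list)
    for t, step in enumerate(seq):
        positions[step].append(t)
    sequences = []
    # Choose, for every key-step, which single occurrence to keep.
    for kept in product(*positions.values()):
        kept = set(kept)
        sequences.append([s for t, s in enumerate(seq) if t in kept])
    return sequences

print(split_repetitions(list("ABCAD")))  # [['A','B','C','D'], ['B','C','A','D']]
```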

Modeling Sequence Likelihood for an Unweighted Graph Let us consider the special case of an unweighted graph, i.e., $\bar{Z} \in \{0,1\}^{n \times n}$. We wish to estimate $P(y|\bar{Z})$, the likelihood of the generic sequence $y \in \mathcal{Y}$ given graph $\bar{Z}$. Formally, let $Y_t$ be the random variable related to the event "key-step $K_{y_t}$ appears at position $t$ in sequence $y$". We can factorize the conditional probability $P(y|\bar{Z})$ as:

$$P(y|Z) = P(Y_0, \ldots, Y_{|y|} \mid Z) = P(Y_0 \mid Z) \cdot P(Y_1 \mid Y_0, Z) \cdot \ldots \cdot P(Y_{|y|} \mid Y_0, \ldots, Y_{|y|-1}, Z). \qquad (1)$$

We assume that the probability of observing a given key-step $K_{y_t}$ at position $t$ in $y$ depends on the previously observed key-steps ($K_{y_{t-1}}, \ldots, K_{y_0}$), but not on their ordering, i.e., the probability of observing a given key-step depends on whether its pre-conditions are satisfied, regardless of the order in which they have been satisfied. Under this assumption, we write $P(Y_t \mid Y_{t-1}, \ldots, Y_0, Z)$ simply as $P(K_{y_t} \mid K_{y_{t-1}}, \ldots, K_{y_0}, Z)$. Without loss of generality, in the following, we denote the current key-step as $K_i = K_{y_t}$, the indexes of key-steps observed at time $t$ as $\mathcal{J} = \mathcal{O}(y, t) = \{y_{t-1}, \ldots, y_0\}$, and the corresponding set of observed key-steps as $K_{\mathcal{J}} = \{K_i \,|\, i \in \mathcal{J}\}$. Similarly, we define $\bar{\mathcal{J}} = \overline{\mathcal{O}(y, t)} = \{0, \ldots, n+1\} \setminus \mathcal{O}(y, t)$ and $K_{\bar{\mathcal{J}}}$ as the sets of indexes and corresponding key-steps unobserved at position $t$, i.e., those which do not appear before $y_t$ in the sequence. Given the factorization above, we are hence interested in estimating the general term $P(K_{y_t} \mid K_{y_{t-1}}, \ldots, K_{y_0}) = P(K_i \mid K_{\mathcal{J}})$. We can estimate the probability of observing key-step $K_i$ given the set of observed key-steps $K_{\mathcal{J}}$ and the constraints imposed by $\bar{Z}$, following Laplace's classic definition of probability [23] as "the ratio of the number of favorable cases to the number of possible cases". Specifically, if we were to randomly sample a key-step from $\mathcal{K}$ following the constraints of $\bar{Z}$, having observed key-steps $K_{\mathcal{J}}$, sampling $K_i$ would be a favorable case if all pre-conditions of $K_i$ were satisfied, i.e., if $\sum_{j \in \bar{\mathcal{J}}} \bar{Z}_{ij} = 0$ (there are no pre-conditions among the unobserved key-steps $K_{\bar{\mathcal{J}}}$). Similarly, sampling a key-step $K_h$ is a "possible case" if $\sum_{j \in \bar{\mathcal{J}}} \bar{Z}_{hj} = 0$. We can hence define the probability of observing key-step $K_i$ after observing all key-steps $K_{\mathcal{J}}$ in a sequence as follows:

$$P(K_i \mid K_{\mathcal{J}}, \bar{Z}) = \frac{\text{number of favorable cases}}{\text{number of possible cases}} = \frac{\mathds{1}\big(\sum_{j \in \bar{\mathcal{J}}} \bar{Z}_{ij} = 0\big)}{\sum_{h \in \bar{\mathcal{J}}} \mathds{1}\big(\sum_{j \in \bar{\mathcal{J}}} \bar{Z}_{hj} = 0\big)} \qquad (2)$$

where $\mathds{1}(\cdot)$ denotes the indicator function and, in the denominator, we are counting the number of key-steps that have not appeared yet and are "possible cases" under the given graph $\bar{Z}$. The likelihood $P(y|\bar{Z})$ can be obtained by plugging Eq. (2) into Eq. (1).
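
As a minimal illustration of Eq. (2), the Python sketch below computes $P(K_i \mid K_{\mathcal{J}}, \bar{Z})$ for a binary adjacency matrix; variable names are our choice and this is not the authors' implementation.

```python
import numpy as np

def prob_unweighted(Z_bar, i, observed):
    """Eq. (2): probability of sampling key-step i next, given the indexes of
    already-observed key-steps and a binary adjacency matrix Z_bar, where
    Z_bar[i, j] = 1 means "K_j is a pre-condition of K_i"."""
    n = Z_bar.shape[0]
    unobserved = [j for j in range(n) if j not in observed]
    # Favorable case: i has no pre-conditions among unobserved key-steps.
    favorable = int(Z_bar[i, unobserved].sum() == 0)
    # Possible cases: unobserved key-steps whose pre-conditions are all observed.
    possible = sum(int(Z_bar[h, unobserved].sum() == 0) for h in unobserved)
    return favorable / possible
```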

Modeling Sequence Likelihood for a Weighted Graph In practice, to enable gradient-based learning, we consider the general case of a continuous adjacency matrix $Z \in [0,1]^{n \times n}$. We generalize the concept of "possible cases" discussed in the previous section with the concept of "feasibility of sampling a given key-step $K_i$, having observed a set of key-steps $K_{\mathcal{J}}$, given graph $Z$", which we define as the sum of the weights of all edges between the observed key-steps $K_{\mathcal{J}}$ and $K_i$: $f(K_i \mid K_{\mathcal{J}}, Z) = \sum_{j \in \mathcal{J}} Z_{ij}$. Intuitively, if key-step $K_i$ has many satisfied pre-conditions, we are more likely to sample it as the next key-step. We hence define $P(K_i \mid K_{\mathcal{J}}, Z)$ as "the ratio of the feasibility of sampling $K_i$ to the sum of the feasibilities of sampling any unobserved key-step":

$$P(K_i \mid K_{\mathcal{J}}, Z) = \frac{f(K_i \mid K_{\mathcal{J}}, Z)}{\sum_{h \in \bar{\mathcal{J}}} f(K_h \mid K_{\mathcal{J}}, Z)} = \frac{\sum_{j \in \mathcal{J}} Z_{ij}}{\sum_{h \in \bar{\mathcal{J}}} \sum_{j \in \mathcal{J}} Z_{hj}} \qquad (3)$$
[Figure 2]

Figure 2 illustrates the computation of the likelihood in Eq. (3). Plugging Eq. (3) into Eq. (1), we can estimate the likelihood of a sequence $y$ given graph $Z$ as:

$$P(y|Z) = P(S|Z) \prod_{t=1}^{|y|} P(K_{y_t} \mid K_{\mathcal{O}(y,t)}, Z) = \prod_{t=1}^{|y|} \frac{\sum_{j \in \mathcal{O}(y,t)} Z_{y_t j}}{\sum_{h \in \overline{\mathcal{O}(y,t)}} \sum_{j \in \mathcal{O}(y,t)} Z_{hj}}. \qquad (4)$$

where we set $P(K_{y_0}|Z) = P(S|Z) = 1$, as sequences always start with the start node $S$.
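
The sketch below shows how Eq. (4) can be evaluated with NumPy: at each position $t > 0$, the weight of edges from the observed key-step towards already-observed key-steps is normalized by the same quantity accumulated over all not-yet-observed key-steps. Function and variable names are ours.

```python
import numpy as np

def sequence_likelihood(Z, y):
    """Eq. (4): likelihood of a key-step index sequence y (starting with S and
    ending with E) under a weighted adjacency matrix Z, where Z[i, j] is the
    weight of the pre-condition edge K_i -> K_j."""
    n = Z.shape[0]
    likelihood = 1.0                                      # P(S | Z) = 1
    for t in range(1, len(y)):
        observed = y[:t]                                  # O(y, t)
        unobserved = [h for h in range(n) if h not in observed]
        feas_obs = Z[y[t], observed].sum()                # feasibility of the observed step
        feas_all = Z[np.ix_(unobserved, observed)].sum()  # feasibility of any candidate step
        likelihood *= feas_obs / feas_all
    return likelihood
```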

Task Graph Maximum Likelihood Loss Function Assuming that sequences $y^{(i)} \in \mathcal{Y}$ are independent and identically distributed, we define the likelihood of $\mathcal{Y}$ given graph $Z$ as follows:

$$P(\mathcal{Y}|Z) = \prod_{k=1}^{|\mathcal{Y}|} P(y^{(k)}|Z) = \prod_{k=1}^{|\mathcal{Y}|} \prod_{t=1}^{|y^{(k)}|} \frac{\sum_{j \in \mathcal{O}(y^{(k)},t)} Z_{y_t j}}{\sum_{h \in \overline{\mathcal{O}(y^{(k)},t)}} \sum_{j \in \mathcal{O}(y^{(k)},t)} Z_{hj}}. \qquad (5)$$

We can find the optimal graph $Z$ by maximizing the likelihood in Eq. (5), which is equivalent to minimizing the negative log-likelihood $-\log P(\mathcal{Y}|Z)$, leading to the following loss:

$$\mathcal{L}(\mathcal{Y}, Z) = -\sum_{k=1}^{|\mathcal{Y}|} \sum_{t=1}^{|y^{(k)}|} \Big( \log \sum_{j \in \mathcal{O}(y^{(k)},t)} Z_{y_t j} \;-\; \beta \cdot \log \sum_{\substack{h \in \overline{\mathcal{O}(y^{(k)},t)} \\ j \in \mathcal{O}(y^{(k)},t)}} Z_{hj} \Big) \qquad (6)$$

where $\beta$ is a hyper-parameter. We refer to Eq. (6) as the Task Graph Maximum Likelihood (TGML) loss function. Since Eq. (6) is differentiable with respect to all $Z_{ij}$ values, we can learn the adjacency matrix $Z$ by minimizing the loss with gradient descent to find the estimated graph $\hat{Z} = \arg\min_Z \mathcal{L}(\mathcal{Y}, Z)$. Eq. (6) works as a contrastive loss in which the first logarithmic term aims to maximize, at every step $t$ of each input sequence, the weights $Z_{y_t j}$ of edges $K_{y_t} \to K_j$ going from the current key-step $K_{y_t}$ to all previously observed key-steps $K_j$, while the second logarithmic term (contrastive term) aims to minimize the weights of edges $K_h \to K_j$ between key-steps yet to appear $K_h$ and already observed key-steps $K_j$. The hyper-parameter $\beta$ regulates the influence of the summation in the contrastive term which, including many more addends, can dominate gradient updates.
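
A hedged PyTorch sketch of the TGML loss in Eq. (6) is reported below; tensor names, the β default, and the ε stabilizer are our choices rather than the authors' reference implementation.

```python
import torch

def tgml_loss(Z, sequences, beta=1.0, eps=1e-8):
    """Task Graph Maximum Likelihood loss (Eq. (6)).
    Z: (n+2, n+2) adjacency matrix with values in [0, 1] (rows softmax-normalized).
    sequences: list of key-step index sequences, each starting with S and ending with E."""
    n = Z.shape[0]
    loss = Z.new_zeros(())
    for y in sequences:
        for t in range(1, len(y)):
            observed = torch.tensor(y[:t])
            mask = torch.ones(n, dtype=torch.bool)
            mask[observed] = False
            unobserved = torch.arange(n)[mask]
            # Positive term: edges from the current key-step to observed key-steps.
            pos = Z[y[t], observed].sum()
            # Contrastive term: edges from not-yet-observed key-steps to observed ones.
            neg = Z[unobserved][:, observed].sum()
            loss = loss - (torch.log(pos + eps) - beta * torch.log(neg + eps))
    return loss
```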

3.2 Models

Direct Optimization (DO) The first model aims to directly optimize the parameters of the adjacency matrix by performing gradient descent on the TGML loss (Eq. (6)). We define the parameters of this model as an edge scoring matrix $A \in \mathbb{R}^{(n+2) \times (n+2)}$, where $n$ is the number of key-steps, the two additional rows and columns account for the placeholder start ($S$) and end ($E$) nodes, and $A_{ij}$ is a score assigned to edge $K_i \to K_j$. To prevent the model from learning edge weights eluding the assumptions of directed acyclic graphs, we mask the black cells in Figure 2 with $-\infty$. To constrain the elements of $Z$ to the $[0,1]$ range and obtain normalized weights, we softmax-normalize the rows of the scoring matrix to obtain the adjacency matrix $Z = \mathrm{softmax}(A)$. Note that elements masked with $-\infty$ are automatically mapped to $0$ by the softmax function, similarly to [36]. We train this model by performing batch gradient descent directly on the score matrix $A$ with the proposed TGML loss. We train a separate model per procedure, as each procedure is associated with a different task graph. As many applications require an unweighted graph, we binarize the adjacency matrix with the threshold $\frac{1}{n}$, where $n$ is the number of nodes. We also employ a post-processing stage in which we remove redundant edges and loops, and add obvious missing connections to the $S$ and $E$ nodes (see the supplementary material for more details).
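
A minimal sketch of the DO model under these assumptions is given below, reusing the tgml_loss defined above; the masked cells (here only self-loops and edges pointing to the end node) and the post-processing are simplified with respect to the paper, and the hyper-parameters are illustrative.

```python
import torch

def train_do(sequences, n_keysteps, epochs=1000, lr=0.1, beta=1.0):
    """Direct Optimization (DO) sketch: learn an (n+2)x(n+2) adjacency matrix
    by gradient descent on the TGML loss. Index 0 is the start node S and
    index n_keysteps + 1 is the end node E."""
    n = n_keysteps + 2
    A = torch.zeros(n, n, requires_grad=True)   # edge scoring matrix
    mask = torch.zeros(n, n)
    mask.fill_diagonal_(float("-inf"))          # no self-loops
    mask[:, n - 1] = float("-inf")              # E is nobody's pre-condition
    optimizer = torch.optim.Adam([A], lr=lr)
    for _ in range(epochs):
        Z = torch.softmax(A + mask, dim=1)      # masked cells map to weight 0
        loss = tgml_loss(Z, sequences, beta=beta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    Z = torch.softmax(A + mask, dim=1).detach()
    return (Z > 1.0 / n).int()                  # binarize with threshold 1/n
```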

[Figure 3]

Task Graph Transformer (TGT) Figure 3 illustrates the proposed model, which is termed Task Graph Transformer (TGT). The model can take as input either $D$-dimensional embeddings of textual descriptions of key-steps or $D$-dimensional video embeddings of key-step segments extracted from video. In the first case, the model takes as input the same set of embeddings at each forward pass, while in the second case, at each forward pass, we randomly sample a video embedding per key-step from the training videos (hence each key-step embedding can be sampled from a different video). We also include two $D$-dimensional learnable embeddings for the $S$ and $E$ nodes. All key-step embeddings are processed by a transformer encoder, which outputs $D$-dimensional vectors enriched with information from other embeddings. To prevent representation collapse, we apply a regularization loss encouraging distinctiveness between pairs of different nodes. Let $X$ be the matrix of embeddings produced by the transformer model. We L2-normalize features, then compute pairwise cosine similarities $Y = X \cdot X^T \cdot \exp(T)$ as in [29]. To prevent the transformer encoder from mapping distinct key-step embeddings to similar representations, we enforce the values outside the diagonal of $Y$ to be smaller than the values on the diagonal. This is done by encouraging each row of the matrix $Y$ to be close to a one-hot vector with a cross-entropy loss. Regularized embeddings are finally passed through a relation transformer head which considers all possible pairs of embeddings and concatenates them in an $(n+2) \times (n+2) \times 2D$ matrix $R$ of relation vectors. For instance, $R[i,j]$ is the concatenation of vectors $X[i]$ and $X[j]$. Relation vectors are passed to a transformer layer which aims to mine relationships among relation vectors, followed by a multilayer perceptron to reduce dimensionality to $16$ units, and another pair of transformer layer and multilayer perceptron to map relation vectors to scalar values, which are reshaped to size $(n+2) \times (n+2)$ to form the score matrix $A$. We hence apply the same optimization procedure as in the DO method to supervise the whole architecture.
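
For concreteness, a simplified sketch of the TGT architecture is reported below; layer counts, hidden sizes, and the omission of the distinctiveness regularization are our simplifications, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TGTSketch(nn.Module):
    """Simplified Task Graph Transformer: encode key-step embeddings, build all
    pairwise relation vectors, and map them to an (n+2)x(n+2) adjacency matrix."""
    def __init__(self, dim=768):
        super().__init__()
        self.start = nn.Parameter(torch.randn(1, dim))   # learnable S embedding
        self.end = nn.Parameter(torch.randn(1, dim))     # learnable E embedding
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        rel = nn.TransformerEncoderLayer(2 * dim, nhead=8, batch_first=True)
        self.rel_encoder = nn.TransformerEncoder(rel, num_layers=1)
        self.to_score = nn.Sequential(nn.Linear(2 * dim, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, keystep_emb):                      # (n, dim) text or video embeddings
        x = torch.cat([self.start, keystep_emb, self.end], dim=0)   # (n+2, dim)
        x = self.encoder(x.unsqueeze(0)).squeeze(0)                  # contextualized nodes
        x = F.normalize(x, dim=-1)
        m = x.shape[0]
        # Relation vectors R[i, j] = concat(x_i, x_j), flattened to (m*m, 2*dim).
        rel = torch.cat([x.repeat_interleave(m, 0), x.repeat(m, 1)], dim=-1)
        rel = self.rel_encoder(rel.unsqueeze(0)).squeeze(0)
        A = self.to_score(rel).view(m, m)                # edge scoring matrix
        return torch.softmax(A, dim=1)                   # optimized with the TGML loss
```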

4 Experiments and Results

4.1 Graph Generation

Problem Setup We evaluate the ability of our approach to learn task graph representations on CaptainCook4D [26], a dataset of egocentric videos of 24 cooking procedures performed by 8 volunteers. Each procedure is accompanied by a task graph describing key-step constraints. We tackle task graph generation as a weakly supervised learning problem in which models have to generate valid graphs by only observing labeled action sequences (weak supervision) rather than relying on task graph annotations (strong supervision), which are not available at training time. All models are trained on videos that are free from ordering errors or missing steps, so as to provide a likely representation of the procedures. We use the two methods proposed in the previous section to learn 24 task graph models, one per procedure, and report average performance across procedures.

Table 1: Task graph generation results on CaptainCook4D.

Method          | Precision  | Recall     | F1
MSGI [35]       | 11.9       | 14.0       | 12.8
LLM             | 52.9       | 57.4       | 55.0
Count-Based [3] | 66.9       | 56.1       | 61.0
MSG² [18]       | 70.9       | 71.6       | 71.1
TGT-text (Ours) | 71.7 ± 3.6 | 72.9 ± 2.8 | 72.1 ± 3.2
DO (Ours)       | 86.4 ± 0.8 | 89.7 ± 0.5 | 87.8 ± 0.6
Improvement     | +15.5      | +18.1      | +16.7

Table 2: Pairwise ordering and future prediction results.

Method      | Ordering | Fut. Pred.
Random      | 50.00    | 50.00
TGT-video   | 63.11    | 62.19
Improvement | +13.11   | +12.19

Compared Approaches We compare our methods with previous approaches to task graph generation, and in particular with MSGI [35] and MSG² [18], which are approaches for task graph generation based on Inductive Logic Programming (ILP). We also consider the recent approach proposed in [3], which generates a graph by counting co-occurrences of matched video segments. Since we assume labeled actions to be available at training time, we do not perform video matching and use the ground truth segment matching provided by the annotations. This approach is referred to as "Count-Based". Given the popularity of large language models as reasoning modules, we also consider a baseline which uses a large language model (we base our experiments on ChatGPT [1]) to generate a task graph from key-step descriptions, without any access to key-step sequences (see the supplementary material for more details). We refer to this model as "LLM".

Graph Generation Results Results in Table 1 highlight the complexity of the task, with classic approaches based on inductive logic, such as MSGI, achieving poor performance (12.8 F1), language models and count-based statistics reconstructing only basic elements of the graph (55.0 and 61.0 F1 for LLM and Count-Based respectively), and even more recent methods based on inductive logic and heuristics only partially predicting the graph (71.1 F1 for MSG²). The proposed Direct Optimization (DO) approach outperforms all other methods, achieving the highest scores across all measures, with improvements in the [+15.5, +18.1] range with respect to the best competitor MSG². This result highlights the effectiveness of the proposed framework in learning task graph representations from key-step sequences, especially considering the simplicity of the DO method, which performs gradient descent directly on the adjacency matrix. We obtain a slightly higher recall as compared to precision (89.7 vs 86.4), showing that our approach tends to retrieve most ground truth edges, while hallucinating some pre-conditions, probably due to the dataset being unbalanced towards the most common ways of completing a procedure. Second-best results are consistently obtained by our feature-based TGT approach, showing the generality of our learning framework and the potential of integrating it into complex neural architectures. Tight confidence intervals for DO highlight the stability of the proposed loss. The lower performance of TGT, as compared to DO, may be due to the relatively small size of the dataset, which makes it hard for a complex architecture to generalize.

Video Understanding Results Table 2 reports the performance of TGT trained on videos on two fundamental video understanding tasks [42]: pairwise clip ordering and future prediction (see the supplementary material for more details). For pairwise ordering, we feed our TGT model with video embeddings of two clips and sort them according to the predicted adjacency matrix, placing first the clip identified as a pre-condition. For future prediction, given an anchor clip, we have to choose which among two other clips is the correct future. In this case, we feed the model with the three clips and select as future the clip with the smallest anchor → clip edge weight (future clips are not likely pre-conditions). Despite TGT not being explicitly trained for pairwise ordering and future prediction, it exhibits promising emerging video understanding abilities, surpassing the random baseline.
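
Both probes can be read directly off the adjacency matrix predicted by TGT for the clips in question; the sketch below (names ours) mirrors the decision rules described above.

```python
def pairwise_order(Z, a, b):
    """Order two clips a, b: place first the one more likely to be a
    pre-condition of the other (Z[i, j] is the weight of edge K_i -> K_j,
    i.e., 'K_j is a pre-condition of K_i')."""
    return (a, b) if Z[b, a] >= Z[a, b] else (b, a)

def choose_future(Z, anchor, cand1, cand2):
    """Future prediction: select as 'future' the candidate with the smallest
    anchor -> candidate edge weight (future clips are unlikely pre-conditions)."""
    return cand1 if Z[anchor, cand1] <= Z[anchor, cand2] else cand2
```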

4.2 Online Mistake Detection

Problem Setup We follow the PREGO benchmark recently proposed in [12], in which models are tasked to perform online mistake detection from procedural egocentric videos. To evaluate the usefulness of task graphs on this downstream task, we design a system which flags the current action as a mistake if its pre-conditions in the predicted graph do not appear among previously observed actions (see the supplementary material for more details).
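
A sketch of this decision rule, applied online over a (ground truth or predicted) action sequence, is shown below; the binarized graph Z_bin comes from the learned task graph, and all names are ours.

```python
def detect_mistake(Z_bin, current, observed):
    """Flag the current key-step as a mistake if any of its pre-conditions in
    the binarized task graph (Z_bin[i, j] = 1 iff K_j is a pre-condition of
    K_i) has not been observed yet."""
    missing = [j for j in range(Z_bin.shape[0])
               if Z_bin[current, j] == 1 and j not in observed]
    return len(missing) > 0, missing          # (is_mistake, unmet pre-conditions)

def online_mistake_detection(Z_bin, sequence):
    """Scan the action sequence online, flagging each step as correct or mistake."""
    observed, flags = {0}, []                 # the start node S (index 0) is observed by construction
    for step in sequence:
        is_mistake, _ = detect_mistake(Z_bin, step, observed)
        flags.append(is_mistake)
        observed.add(step)
    return flags
```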

Competitors We compare our approach with respect to the PREGO model proposed in [12], which detects mistakes based on the comparison between the currently observed action and an action predicted by a forecasting module. We note that PREGO is based on an implicit representation of the procedure (the forecasting module), while our approach is based on the explicit task graph representation, learned with the proposed framework. We also compare our approach with respect to baselines based on all graph prediction approaches compared in Table 1, to assess how the ability to predict accurate graphs affects downstream performance. For all methods, we report results based on ground truth action segments and on action sequences predicted by a MiniRoad [2] instance, a state-of-the-art online action detection module trained on each target dataset.

Results Results in Table 3 highlight the usefulness of the learned task graphs for downstream applications. The proposed DO method achieves significant gains over prior art, with improvements of +19.8 and +7.5 in average F1 score on Assembly101 and EPIC-Tent respectively when ground truth action sequences are used to make predictions. While TGT is the second-best performer on Assembly101, it obtains the best results on EPIC-Tent (64.1 vs 58.3 in average F1 score). This is due to the nature of action annotations in the two datasets. Indeed, while key-step names are informative in EPIC-Tent (e.g., "Place Vent Cover", "Open Stake Bag", or "Spread Tent"), they are less distinctive in Assembly101 (e.g., "attach cabin", "attach interior", or "screw chassis"). This highlights the flexibility of the proposed learning framework, which can work in purely abstract, symbolic settings with the DO approach, but can also leverage semantics with TGT when beneficial. Interestingly, the second-best performers are graph-based approaches, with MSG² achieving an average F1 of 56.1 on Assembly101 and the simple Count-Based approach obtaining an average F1 score of 56.6 on EPIC-Tent. In contrast, PREGO obtains average F1 scores of 39.4 and 32.1 on Assembly101 and EPIC-Tent respectively, suggesting the potential of explicit graph-based representations for mistake detection versus the implicit one of PREGO. Breaking down performance into correct and mistake F1 scores reveals some degree of imbalance of our approaches and of the main competitor MSG² towards identifying correct actions rather than mistakes. This suggests that the related graph-based representations tend to detect some spurious pre-conditions, probably due to the limited demonstrations included in the videos, while the implicit PREGO model exhibits a skew with respect to mistakes. Further breaking down F1 scores into the related precision and recall values highlights that the main failure modes are due to large imbalances between precision and recall. For instance, the Count-Based method achieves a precision of only 5.1 with a recall of 86.2 in predicting correct segments on Assembly101. In contrast, the proposed approach obtains balanced precision and recall values in detecting correct segments in Assembly101 (98.2/83.4) and EPIC-Tent (94.8/92.4), and in detecting mistakes in EPIC-Tent (33.2/35.7), while the prediction of mistakes on Assembly101 is more skewed (46.7/90.4). Results based on action sequences predicted from video (bottom part of Table 3) highlight the challenging nature of the task when considering noisy action sequences.
While the explicit task graph representation may not accurately reflect the predicted noisy action sequences, we still observe improvements over previous approaches of +7.3 and +1.3 in average F1 score on Assembly101 and EPIC-Tent. Remarkably, the best competitors are still graph-based methods, such as MSG² and the Count-Based approach, with significant improvements over the implicit representation of the PREGO model (32.5 average F1 versus 53.5 of the proposed DO model). Also in this case, we observe that graph-based methods tend to be skewed towards detecting correct action sequences. In this regard, our TGT model only achieves a 38.2 mistake F1 score, a drop of 5.5 points with respect to the best performer, the Count-Based method, which, on the other hand, only achieves an F1 score of 2.5 when predicting correct segments.

Table 3: Online mistake detection results. Top: ground truth action sequences; bottom (+): action sequences predicted from video.

                 |                  Assembly101                   |                   EPIC-Tent
                 |  Avg |      Correct       |      Mistake       |  Avg |      Correct       |      Mistake
Method           |  F1  |  F1  | Prec | Rec  |  F1  | Prec | Rec  |  F1  |  F1  | Prec | Rec  |  F1  | Prec | Rec
Count-Based [3]  | 26.2 |  9.5 |  5.1 | 86.2 | 42.9 | 97.8 | 27.5 | 56.6 | 92.5 | 92.8 | 92.2 | 20.7 | 20.0 | 21.4
LLM              | 29.3 | 15.1 |  8.3 | 87.2 | 43.4 | 96.7 | 27.9 | 47.7 | 86.3 | 82.4 | 90.6 |  9.1 | 13.3 |  6.9
MSGI [35]        | 33.1 | 22.7 | 13.1 | 84.4 | 43.5 | 93.4 | 28.3 | 44.5 | 66.9 | 51.6 | 95.2 | 22.0 | 73.3 | 12.9
PREGO [12]       | 39.4 | 32.6 | 89.7 | 19.9 | 46.3 | 30.7 | 94.0 | 32.1 | 45.0 | 95.7 | 29.4 | 19.1 | 10.7 | 86.7
MSG²* [18]       | 56.1 | 63.9 | 51.5 | 84.2 | 48.2 | 73.6 | 35.8 | 54.1 | 92.9 | 94.1 | 91.7 | 15.4 | 13.3 | 18.2
TGT-text (Ours)  | 62.8 | 69.8 | 56.8 | 90.6 | 55.7 | 84.1 | 41.7 | 64.1 | 93.8 | 94.1 | 93.5 | 34.5 | 33.3 | 35.7
DO (Ours)        | 75.9 | 90.2 | 98.2 | 83.4 | 61.6 | 46.7 | 90.4 | 58.3 | 93.5 | 94.8 | 92.4 | 23.1 | 20.0 | 27.3
Improvement      |+19.8 |+26.3 |      |      |+13.4 |      |      | +7.5 | +0.9 |      |      |+12.5 |      |
Count-Based+ [3] | 23.1 |  2.5 |  1.3 | 60.0 | 43.7 | 97.8 | 28.1 | 38.1 | 68.3 | 54.9 | 90.1 |  7.9 | 26.7 |  4.7
LLM+             | 28.1 | 15.1 |  7.8 | 65.5 | 42.3 | 89.5 | 27.7 | 35.9 | 61.6 | 46.7 | 90.4 | 10.2 | 40.0 |  5.8
MSGI+ [35]       | 28.4 | 14.0 |  7.8 | 67.9 | 42.7 | 90.7 | 28.0 | 40.4 | 59.2 | 42.9 | 95.5 | 21.6 | 80.0 | 12.5
PREGO+ [12]      | 32.5 | 23.1 | 68.8 | 13.9 | 41.8 | 27.8 | 84.1 | 29.4 | 41.6 | 97.9 | 26.4 | 17.2 |  9.5 | 93.3
MSG²+ [18]       | 46.2 | 59.1 | 51.2 | 70.0 | 33.2 | 44.5 | 26.5 | 45.2 | 67.5 | 52.4 | 95.1 | 22.9 | 73.3 | 13.6
TGT-text (Ours)+ | 53.0 | 67.8 | 62.3 | 74.5 | 38.2 | 46.2 | 32.6 | 43.8 | 69.5 | 55.8 | 92.1 | 18.2 | 53.3 | 11.0
DO (Ours)+       | 53.5 | 78.9 | 85.0 | 73.5 | 28.1 | 22.5 | 37.3 | 46.5 | 69.3 | 54.4 | 95.2 | 23.7 | 73.3 | 14.1
Improvement+     | +7.3 |+19.8 |      |      | -5.5 |      |      | +1.3 | +1.2 |      |      | +1.2 |      |

5 Limitations

The proposed approach requires the availability of key-step sequences, a common assumption of works addressing other video understanding tasks such as action recognition. While our method is applicable to any fully supervised video understanding dataset, future work should focus on overcoming this limitation and taking advantage of the vast amount of unlabeled video and textual data. While the proposed TGT method has shown promising results when trained directly on video features, the investigation of task graph learning in the absence of labeled key-step sequences is beyond the scope of this paper. We noted a reduced ability of our approach to work with noisy action sequences and a tendency to hallucinate pre-conditions, likely due to the limited expressivity of key-step sequences arising from videos showing the most common ways to perform a procedure.

6 Conclusion

We considered the problem of learning task graph representations of procedures from video demonstrations. Framing task graph learning as a maximum likelihood estimation problem, we proposed a differentiable loss which allows direct optimization of the adjacency matrix through gradient descent and can be plugged into more complex neural network architectures. Experiments on three datasets show that the proposed approach can learn accurate task graphs, develop video understanding abilities, and improve the downstream task of online mistake detection, surpassing state-of-the-art methods. We release our code at the following URL: https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.

7 Acknowledgments

This research is supported in part by the PNRR PhD scholarship “Digital Innovation: Models, Systems and Applications” DM 118/2023, by the project Future Artificial Intelligence Research (FAIR) – PNRR MUR Cod. PE0000013 - CUP: E63C22001940006, and by the Research Program PIAno di inCEntivi per la Ricerca di Ateneo 2020/2022 — Linea di Intervento 3 “Starting Grant” EVIPORES Project - University of Catania.

We thank the authors of[12] and in particular Alessandro Flaborea and Guido D’Amely for sharing the code to replicate experiments in the PREGO benchmark.

References

  • [1]Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal.Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023.
  • [2]Joungbin An, Hyolim Kang, SuHo Han, Ming-Hsuan Yang, and SeonJoo Kim.Miniroad: Minimal rnn framework for online action detection.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10341–10350, 2023.
  • [3]Kumar Ashutosh, SanthoshKumar Ramakrishnan, Triantafyllos Afouras, and Kristen Grauman.Video-mined task graphs for keystep recognition in instructional videos.Advances in Neural Information Processing Systems, 36, 2024.
  • [4]Siddhant Bansal, Chetan Arora, and CVJawahar.My view is the best view: Procedure learning from egocentric videos.In European Conference on Computer Vision, pages 657–675. Springer, 2022.
  • [5]Siddhant Bansal, Chetan Arora, and CVJawahar.United we stand, divided we fall: Unitygraph for unsupervised procedure learning from videos.In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6509–6519, 2024.
  • [6]Guodong Ding, Fadime Sener, Shugao Ma, and Angela Yao.Every mistake counts in assembly.arXiv preprint arXiv:2307.16453, 2023.
  • [7]Lucia Donatelli, Theresa Schmidt, Debanjali Biswas, Arne Köhn, Fangzhou Zhai, and Alexander Koller.Aligning actions across recipe graphs.In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 6930–6942, 2021.
  • [8]Mikita Dvornik, Isma Hadji, KonstantinosG Derpanis, Animesh Garg, and Allan Jepson.Drop-dtw: Aligning common signal between sequences while dropping outliers.Advances in Neural Information Processing Systems, 34:13782–13793, 2021.
  • [9]Nikita Dvornik, Isma Hadji, Hai Pham, Dhaivat Bhatt, Brais Martinez, Afsaneh Fazly, and AllanD Jepson.Graph2vid: Flow graph to video grounding for weakly-supervised multi-step localization.In Proceedings of the European Conference on Computer Vision (ECCV), 2022.
  • [10]Nikita Dvornik, Isma Hadji, Ran Zhang, KonstantinosG Derpanis, RichardP Wildes, and AllanD Jepson.Stepformer: Self-supervised step discovery and localization in instructional videos.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18952–18961, 2023.
  • [11]Ehsan Elhamifar and Dat Huynh.Self-supervised multi-task procedure learning from instructional videos.In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 557–573. Springer, 2020.
  • [12]Alessandro Flaborea, Guido Maria D'Amely di Melendugno, Leonardo Plini, Luca Scofano, Edoardo De Matteis, Antonino Furnari, Giovanni Maria Farinella, and Fabio Galasso.Prego: online mistake detection in procedural egocentric videos.In International Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [13]Antonino Furnari and GiovanniMaria Farinella.Rolling-unrolling lstms for action anticipation from first-person video.IEEE transactions on pattern analysis and machine intelligence, 43(11):4021–4036, 2020.
  • [14]Reza Ghoddoosian, Isht Dwivedi, Nakul Agarwal, and Behzad Dariush.Weakly-supervised action segmentation and unseen error detection in anomalous instructional videos.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10128–10138, 2023.
  • [15]Rohit Girdhar and Kristen Grauman.Anticipative video transformer.In Proceedings of the IEEE/CVF international conference on computer vision, pages 13505–13515, 2021.
  • [16]Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, etal.Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives.arXiv preprint arXiv:2311.18259, 2023.
  • [17]Youngkyoon Jang, Brian Sullivan, Casimir Ludwig, Iain Gilchrist, Dima Damen, and Walterio Mayol-Cuevas.Epic-tent: An egocentric video dataset for camping tent assembly.In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
  • [18]Yunseok Jang, Sungryull Sohn, Lajanugen Logeswaran, Tiange Luo, Moontae Lee, and Honglak Lee.Multimodal subtask graph generation from instructional videos.arXiv preprint arXiv:2302.08672, 2023.
  • [19]Takeo Kanade and Martial Hebert.First-person vision.Proceedings of the IEEE, 100(8):2442–2453, 2012.
  • [20]Chloé Kiddon, GanesaThandavam Ponnuraj, Luke Zettlemoyer, and Yejin Choi.Mise en place: Unsupervised interpretation of instructional recipes.In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 982–992, 2015.
  • [21]Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, and Lorenzo Torresani.Learning to recognize procedural activities with distant supervision.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13853–13863, 2022.
  • [22]Zijia Lu and Ehsan Elhamifar.Set-supervised action learning in procedural task videos via pairwise order consistency.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19903–19913, 2022.
  • [23]PierreSimon Marquisde Laplace.Théorie analytique des probabilités, volume7.Courcier, 1820.
  • [24]Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman.End-to-end learning of visual representations from uncurated instructional videos.In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9879–9889, 2020.
  • [25]Medhini Narasimhan, Licheng Yu, Sean Bell, Ning Zhang, and Trevor Darrell.Learning and verification of task structure in instructional videos.arXiv preprint arXiv:2303.13519, 2023.
  • [26]Rohith Peddi, Shivvrat Arya, Bharath Challa, Likhitha Pallapothula, Akshay Vyas, Jikai Wang, Qifan Zhang, Vasundhara Komaragiri, Eric Ragan, Nicholas Ruozzi, etal.Captaincook4d: A dataset for understanding errors in procedural activities.arXiv preprint arXiv:2312.14556, 2023.
  • [27]Chiara Plizzari, Gabriele Goletto, Antonino Furnari, Siddhant Bansal, Francesco Ragusa, Giovanni Maria Farinella, Dima Damen, and Tatiana Tommasi.An outlook into the future of egocentric vision.International Journal of Computer Vision, 2023.
  • [28]Shraman Pramanick, Yale Song, Sayan Nag, KevinQinghong Lin, Hardik Shah, MikeZheng Shou, Rama Chellappa, and Pengchuan Zhang.Egovlpv2: Egocentric video-language pre-training with fusion in the backbone.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5285–5297, 2023.
  • [29]Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, etal.Learning transferable visual models from natural language supervision.In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [30]Debaditya Roy, Ramanathan Rajendiran, and Basura Fernando.Interaction region visual transformer for egocentric action anticipation.In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6740–6750, 2024.
  • [31]Keisuke Sakaguchi, Chandra Bhagavatula, Ronan LeBras, Niket Tandon, Peter Clark, and Yejin Choi.proScript: Partially ordered scripts generation.In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2138–2149, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  • [32]Pol Schumacher, Mirjam Minor, Kirstin Walter, and Ralph Bergmann.Extraction of procedural knowledge from the web: A comparison of two workflow extraction approaches.In Proceedings of the 21st International Conference on World Wide Web, pages 739–747, 2012.
  • [33]Fadime Sener, Dibyadip Chatterjee, Daniel Shelepov, Kun He, Dipika Singhania, Robert Wang, and Angela Yao.Assembly101: A large-scale multi-view video dataset for understanding procedural activities.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21096–21106, 2022.
  • [34]StevenS Skiena.The algorithm design manual, volume2.Springer, 1998.
  • [35]Sungryull Sohn, Hyunjae Woo, Jongwook Choi, and Honglak Lee.Meta reinforcement learning with autonomous inference of subtask dependencies.arXiv preprint arXiv:2001.00248, 2020.
  • [36]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, AidanN Gomez, Łukasz Kaiser, and Illia Polosukhin.Attention is all you need.Advances in neural information processing systems, 30, 2017.
  • [37]Xin Wang, Taein Kwon, Mahdi Rad, Bowen Pan, Ishani Chakraborty, Sean Andrist, Dan Bohus, Ashley Feniello, Bugra Tekin, FelipeVieira Frujeri, etal.Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world.In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20270–20281, 2023.
  • [38]Yoko Yamakata, Shinsuke Mori, and JohnA Carroll.English recipe flow graph corpus.In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5187–5194, 2020.
  • [39]Yiwu Zhong, Licheng Yu, Yang Bai, Shangwen Li, Xueting Yan, and Yin Li.Learning procedure-aware video representation from instructional videos and their narrations.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14825–14835, 2023.
  • [40]Honglu Zhou, Roberto Martín-Martín, Mubbasir Kapadia, Silvio Savarese, and JuanCarlos Niebles.Procedure-aware pretraining for instructional video understanding.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10727–10738, 2023.
  • [41]Luowei Zhou, Chenliang Xu, and Jason Corso.Towards automatic learning of procedures from web instructional videos.In Proceedings of the AAAI Conference on Artificial Intelligence, volume32, 2018.
  • [42]Yipin Zhou and TamaraL Berg.Temporal perception and prediction in ego-centric video.In Proceedings of the IEEE International Conference on Computer Vision, pages 4498–4506, 2015.
  • [43]Dimitri Zhukov, Jean-Baptiste Alayrac, RamazanGokberk Cinbis, David Fouhey, Ivan Laptev, and Josef Sivic.Cross-task weakly supervised learning from instructional videos.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3537–3545, 2019.

Appendix A Evaluation Measures

This appendix details the evaluation measures used to assess performance experimentally for the two considered tasks of task graph generation and online mistake detection.

Task Graph Generation

Task graph generation is evaluated by comparing a generated graph $\hat{G}=(\hat{\mathcal{K}},\hat{\mathcal{A}})$ with a ground truth graph $G=(\mathcal{K},\mathcal{A})$. Since task graphs aim to encode ordering constraints between pairs of nodes, we evaluate task graph generation as the problem of identifying valid pre-conditions (hence valid graph edges) among all possible ones. We hence adopt classic detection evaluation measures such as precision, recall, and F1 score. In this context, we define True Positives (TP) as all edges included in both the predicted and ground truth graphs (Eq. (7)), False Positives (FP) as all edges included in the predicted graph but not in the ground truth graph (Eq. (8)), and False Negatives (FN) as all edges included in the ground truth graph but not in the predicted one (Eq. (9)). Note that true negatives are not required to compute precision, recall, and F1 score.

$TP=\hat{\mathcal{A}}\cap\mathcal{A}$ (7)    $FP=\hat{\mathcal{A}}\setminus\mathcal{A}$ (8)    $FN=\mathcal{A}\setminus\hat{\mathcal{A}}$ (9)
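As a concrete illustration, the following minimal Python sketch (our own helper, not the released evaluation code) computes precision, recall, and F1 from predicted and ground-truth edge sets following Eqs. (7)-(9); the edge sets in the example are purely illustrative.

```python
# Minimal sketch: precision, recall and F1 for task graph generation,
# treating each graph as a set of directed edges following Eqs. (7)-(9).

def graph_f1(pred_edges, gt_edges):
    pred_edges, gt_edges = set(pred_edges), set(gt_edges)
    tp = len(pred_edges & gt_edges)   # edges in both graphs              (Eq. 7)
    fp = len(pred_edges - gt_edges)   # predicted edges missing from GT   (Eq. 8)
    fn = len(gt_edges - pred_edges)   # GT edges missing from prediction  (Eq. 9)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Illustrative edge sets (node names are placeholders); each edge (u, v)
# means "v is a pre-condition of u".
pred = {("B", "A"), ("C", "A"), ("C", "B")}
gt = {("B", "A"), ("C", "B"), ("D", "B")}
print(graph_f1(pred, gt))  # (0.666..., 0.666..., 0.666...)
```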

Online Mistake Detection

We follow previous works on mistake detection from procedural egocentric videos [33, 37, 12] and evaluate online mistake detection with standard precision, recall, and F1 scores. We break down metrics by the "correct" and "mistake" classes, as well as report average values.
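For completeness, the per-class breakdown can be computed as in the sketch below, which uses scikit-learn for illustration (the benchmark code may compute the metrics differently); the labels and predictions in the example are made up.

```python
# Illustrative per-class precision/recall/F1 with scikit-learn.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["correct", "correct", "mistake", "correct", "mistake"]
y_pred = ["correct", "mistake", "mistake", "correct", "correct"]

prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=["correct", "mistake"], zero_division=0)
for cls, p, r, f in zip(["correct", "mistake"], prec, rec, f1):
    print(f"{cls}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
print(f"average F1: {f1.mean():.2f}")
```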

Appendix B Implementation Details

This appendix provides implementation details to replicate the experiments discussed in Section 4.

B.1 Data Augmentation

In procedural tasks, it is common for certain actions to be repeated multiple times throughout the execution of a task. For example, in the EPIC-Tent dataset [17], an operation such as "reading the instructions" may be performed repeatedly at any point during the task. To model key-step orderings within the framework of topological sorts, our approach assumes that sequences do not contain such repetitions. Since repetitions indicate that a specific action can appear at different stages of a procedure, we expand each sequence containing repetitions into all distinct sequences obtained by dropping repeated actions. This data augmentation strategy enhances the robustness of our model on Assembly101 [33] and EPIC-Tent [17], while it was not necessary for the CaptainCook4D dataset [26], as its sequences do not contain any repetitions.
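The sketch below illustrates one reading of this augmentation, namely enumerating all distinct sequences in which exactly one occurrence of each repeated key-step is kept; the helper name and the example sequence are ours.

```python
# Sketch of the augmentation under the assumption that "dropping repeated actions"
# means keeping exactly one occurrence of each repeated key-step and enumerating
# every distinct choice of which occurrence to keep.
from itertools import product

def expand_sequence(seq):
    positions = {}  # key-step -> indices at which it occurs
    for i, step in enumerate(seq):
        positions.setdefault(step, []).append(i)
    variants = set()
    for kept in product(*positions.values()):   # one surviving index per key-step
        kept = set(kept)
        variants.add(tuple(s for i, s in enumerate(seq) if i in kept))
    return sorted(variants)

# Hypothetical EPIC-Tent-style sequence with a repeated "read instructions" step:
# the expansion yields one variant per surviving occurrence of the repeated step.
print(expand_sequence(["read instructions", "open tent bag", "read instructions", "pitch tent"]))
```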

B.2 Early Stopping

The learning process was conducted without the use of a validation set. To avoid overfitting and save computation, we defined a "Sequence Accuracy (SA)" score used to determine when the model reaches a learning plateau. We early stop models when an SA value of at least 0.95 is reached and the model shows no SA improvement for 25 consecutive epochs. The SA score is defined as follows:

$\text{SA}=\frac{1}{|\mathcal{Y}|}\sum_{y\in\mathcal{Y}}\frac{1}{|y|}\sum_{i=0}^{|y|-1}c(y_{i},y[:i],pred(y_{i},Z))$    (10)

where $\mathcal{Y}$ denotes the set of sequences in the training set, $y$ is a sequence from $\mathcal{Y}$, $y_{i}$ is the $i$-th element of sequence $y$, $y[:i]$ are the predecessors of the $i$-th element in the sequence $y$, and $pred(y_{i},Z)$ are the predicted predecessors of $y_{i}$ according to the current binarized adjacency matrix $Z$. The function $c$ is defined as:

$c(y_{i},y[:i],pred(y_{i},Z))=\begin{cases}1&\text{if }|y[:i]|=0\text{ and }|pred(y_{i},Z)|=0\\\frac{|y[:i]\cap pred(y_{i},Z)|}{|pred(y_{i},Z)|}&\text{if }|y[:i]|>0\text{ and }|pred(y_{i},Z)|>0\\0&\text{otherwise}\end{cases}$    (11)

The SA score measures the compatibility of each sequence with the current task graph, as the ratio of correctly predicted predecessors of the current symbol $y_{i}$ of the sequence to the total number of predicted predecessors for $y_{i}$ in the current task graph.
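A compact Python sketch of the SA computation of Eqs. (10)-(11) follows; the graph representation (a dict-of-dicts binarized adjacency matrix) and the helper names are our own assumptions.

```python
# Sketch of the Sequence Accuracy (SA) score of Eqs. (10)-(11). We assume Z is a
# dict-of-dicts binarized adjacency matrix where Z[u][v] = 1 means
# "v is a pre-condition of u".

def predicted_preconditions(step, Z):
    return {v for v, on in Z.get(step, {}).items() if on == 1}

def sequence_accuracy(sequences, Z):
    total = 0.0
    for y in sequences:
        score = 0.0
        for i, step in enumerate(y):
            observed = set(y[:i])                          # key-steps already executed
            predicted = predicted_preconditions(step, Z)   # pre-conditions from the graph
            if not observed and not predicted:
                score += 1.0                               # first case of Eq. (11)
            elif observed and predicted:
                score += len(observed & predicted) / len(predicted)
            # otherwise the term is 0
        total += score / len(y)
    return total / len(sequences)

# Toy example: "mix" requires "break eggs"; the single training sequence is fully compatible.
Z = {"break eggs": {}, "mix": {"break eggs": 1}}
print(sequence_accuracy([["break eggs", "mix"]], Z))  # 1.0
```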

B.3 Hyperparameters

Table 4 details the hyperparameters employed in the experiments for task graph generation on the CaptainCook4D dataset [26]. During the training of TGT, we utilized a pre-trained EgoVLPv2 [28] to extract text and video embeddings. The temperature value T used in the cross-entropy distinctiveness loss was set to 0.9 as in [29].

For the downstream task of online mistake detection within the DO model framework, we extended the maximum number of training epochs to 1200 for Assembly101 [33]. This change was necessary because, even after 1000 epochs, the model continued to exhibit many cycles among its 86 nodes; extending training gives the model additional time to learn and minimize these cycles, which is crucial given the complexity of the graph. In the TGT configuration, we kept the hyperparameters consistent with those used for CaptainCook4D, except for setting β to 1.

We refer the reader to the code for additional implementation details.

Table 4: Hyperparameters used for the DO and TGT models.

Hyperparameter | DO | TGT
Learning Rate | 0.1 | 0.000001
Max Epochs | 1000 | 1000
Optimizer | Adam | Adam
β | 0.005 | 0.55
Dropout Rate | - | 0.1

B.4 LLM Prompt

Below is the prompt that was employed to instruct the model on its task, which involves identifying pre-conditions for given procedural steps.

I would like you to learn to answer questions by telling me the steps that need to be performed before a given one.
The questions refer to procedural activities and these are of the following type:
Q - Which of the following key steps is a pre-condition for the current key step "add brownie mix"?
- add oil
- add water
- break eggs
- mix all the contents
- mix eggs
- pour the mixture in the tray
- spray oil on the tray
- None of the above
Your task is to use your immense knowledge and your immense ability to tell me which preconditions are among those listed that must necessarily be carried out before the key step indicated in quotes in the question.
You have to give me the answers and a very brief explanation of why you chose them.
Provide the correct preconditions answer inside a JSON format like this:
{ "add brownie mix": ["add oil", "add water", "break eggs"] }
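For illustration, a small helper (ours, not part of the released code) that parses the JSON answer requested by the prompt into (key-step, pre-condition) pairs could look as follows.

```python
# Turn the requested JSON answer into (key-step, pre-condition) pairs.
import json

raw_answer = '{"add brownie mix": ["add oil", "add water", "break eggs"]}'

def parse_preconditions(raw):
    answer = json.loads(raw)
    return [(step, pre) for step, preconditions in answer.items() for pre in preconditions]

print(parse_preconditions(raw_answer))
# [('add brownie mix', 'add oil'), ('add brownie mix', 'add water'), ('add brownie mix', 'break eggs')]
```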

B.5 Data Split

The CaptainCook4D dataset [26] comprises various error types, including order errors, timing errors, temperature errors, preparation errors, missing step errors, measurement errors, and technique errors. Of these, missing step and order errors directly impact sequence integrity. Consequently, for task graph generation we utilized only those sequences of actions free from these specific types of errors. Table 5 reports statistics on the CaptainCook4D subsets used for task graph generation.

For Online Mistake Detection, we considered the datasets defined by the authors of PREGO[12].

In the context of pairwise ordering and forecasting, we employed the subset of the CaptainCook4D dataset designated for task graph generation (see Table 5) and divided it into training and testing sets, using a 50% split managed so that the scenarios are equally represented in both sets.

Table 5: Statistics of the CaptainCook4D subsets used for task graph generation.

Scenario | Videos | Segments | Duration
Microwave Egg Sandwich | 5 | 60 | 0.9h
Dressed Up Meatballs | 8 | 128 | 2.7h
Microwave Mug Pizza | 6 | 84 | 1.2h
Ramen | 11 | 165 | 2.7h
Coffee | 9 | 144 | 2.2h
Breakfast Burritos | 8 | 88 | 1.5h
Spiced Hot Chocolate | 7 | 49 | 0.9h
Microwave French Toast | 11 | 121 | 2.2h
Pinwheels | 5 | 95 | 0.8h
Tomato Mozzarella Salad | 13 | 117 | 1.3h
Butter Corn Cup | 5 | 60 | 1.4h
Tomato Chutney | 5 | 95 | 2.6h
Scrambled Eggs | 6 | 138 | 2.6h
Cucumber Raita | 12 | 132 | 2.7h
Zoodles | 6 | 78 | 1.1h
Salted Mushrooms | 7 | 126 | 2.9h
Blender Banana Pancakes | 10 | 140 | 2.4h
Herb Omelet with Fried Tomatoes | 8 | 120 | 2.4h
Broccoli Stir Fry | 10 | 250 | 5.2h
Pan Fried Tofu | 9 | 171 | 3.6h
Mug Cake | 9 | 180 | 3.0h
Cheese Pimiento | 7 | 77 | 1.6h
Spicy Tuna Avocado Wraps | 9 | 153 | 2.6h
Caprese Bruschetta | 8 | 88 | 2.4h
Total | 194 | 2859 | 53.0h

B.6 Pairwise ordering and future prediction

We set up the pairwise ordering and future prediction video understanding tasks following [42].

Pairwise Ordering

Models take as input two randomly shuffled video clips and are tasked with recognizing the correct ordering of the corresponding key-steps. We sample all consecutive triplets of labeled segments from test videos, discard the middle one, and consider the first and third ones as the input pair. We evaluate models using accuracy.

Future Prediction

Models take as input an anchor video clip and two randomly shuffled video clips and are tasked with selecting which of the two clips is the correct future of the anchor clip. We sample all consecutive triplets of labeled segments from test videos and consider the middle clip as the anchor and the remaining two clips as the two options. We evaluate models using accuracy.
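The sample construction for the two tasks can be summarized with the sketch below; the helper names and the toy segment list are ours.

```python
# Build ordering pairs and future-prediction triplets from consecutive labeled segments.
def pairwise_ordering_samples(segments):
    # drop the middle segment of each consecutive triplet; (first, third) is the ordered pair
    return [(segments[i], segments[i + 2]) for i in range(len(segments) - 2)]

def future_prediction_samples(segments):
    # the middle segment is the anchor; the outer two are the candidate futures
    # (the third segment of the triplet is the correct future)
    return [(segments[i + 1], segments[i], segments[i + 2])
            for i in range(len(segments) - 2)]

video = ["break eggs", "mix eggs", "add oil", "pour mixture"]
print(pairwise_ordering_samples(video))   # [('break eggs', 'add oil'), ('mix eggs', 'pour mixture')]
print(future_prediction_samples(video))   # [('mix eggs', 'break eggs', 'add oil'), ('add oil', 'mix eggs', 'pour mixture')]
```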

Model

We trained our TGT model using video embeddings extracted with a pre-trained EgoVLPv2 [28]. During training, if multiple video embeddings are associated with the same key-step across the training sequences, one embedding per key-step is randomly selected. The model is trained for graph generation on the training videos and tested for pairwise ordering and future prediction on the test set. For pairwise ordering, we feed our model with the two clips and obtain a 2×2 adjacency matrix, then select the ordering associated with the largest off-diagonal weight: this weight denotes an A → B edge, suggesting a (B, A) order (B is a pre-condition of A). For future prediction, we feed the three clips and obtain a 3×3 adjacency matrix. We then inspect the weights of the edges anchor → A and anchor → B and choose as the future clip the one associated with the smallest weight (a small weight indicates that the selected clip is not a pre-condition of the anchor).
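The selection rules described above can be sketched as follows; variable names and the toy adjacency matrix are ours, and the sketch assumes adj[i, j] stores the predicted weight of the edge i → j (clip j being a pre-condition of clip i).

```python
import numpy as np

def pairwise_order(adj):
    """adj is a 2x2 matrix; the largest off-diagonal weight i -> j suggests the order (j, i)."""
    if adj[0, 1] > adj[1, 0]:
        return (1, 0)   # clip 1 comes before clip 0
    return (0, 1)

def future_clip(adj, anchor, options):
    """Pick the option with the smallest anchor -> option weight: a small weight
    indicates the option is not a pre-condition of the anchor, so it can follow it."""
    weights = {opt: adj[anchor, opt] for opt in options}
    return min(weights, key=weights.get)

# Toy example with made-up edge weights.
adj = np.array([[0.0, 0.7, 0.1],
                [0.2, 0.0, 0.6],
                [0.3, 0.4, 0.0]])
print(pairwise_order(adj[:2, :2]))                 # -> (1, 0)
print(future_clip(adj, anchor=0, options=(1, 2)))  # -> 2
```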

Considered Data

To evaluate performance in the presence of high-quality graph representations, for these experiments we consider the top-5 scenarios of CaptainCook4D according to the graph prediction performance obtained by our TGT model, namely: "Microwave Egg Sandwich", "Ramen", "Tomato Mozzarella Salad", "Microwave French Toast", and "Coffee".

B.7 Graph Post-processing

We binarize the adjacency matrix with threshold 1/n, where n is the number of nodes. After this thresholding phase, it is possible to encounter situations like the one illustrated in Figure 4, where node A depends on nodes B and C, and node B depends on node C. Due to the transitivity of pre-conditions, we can remove the edge connecting node A to node C, as node B must precede node A in any case. Sometimes, a node may not serve as a pre-condition for any other node; in such cases, the END node is directly connected to this node. Conversely, if a node has no pre-conditions, an edge is added from that node to the START node.

At the end of the training process, it is also possible to obtain a graph containing cycles. In such cases, all cycles within the graph are considered, and the edge with the lowest score within each cycle is removed. This ensures that the final graph is a Directed Acyclic Graph (DAG).
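A rough sketch of this post-processing pipeline is given below, using networkx for convenience; the ordering of the steps and the helper names are our assumptions and may differ from the released implementation.

```python
# Post-processing sketch: threshold, cycle breaking, transitive reduction, START/END wiring.
# Edges point from a key-step to its pre-conditions (u -> v means "v is a pre-condition of u").
import networkx as nx

def postprocess(adj, nodes):
    n = len(nodes)
    G = nx.DiGraph()
    G.add_nodes_from(nodes)
    # 1) binarize with threshold 1/n, keeping edge scores for later cycle breaking
    for i, u in enumerate(nodes):
        for j, v in enumerate(nodes):
            if i != j and adj[i][j] > 1.0 / n:
                G.add_edge(u, v, score=adj[i][j])
    # 2) break cycles by removing the lowest-score edge of each remaining cycle
    while not nx.is_directed_acyclic_graph(G):
        cycle = nx.find_cycle(G)
        worst = min(cycle, key=lambda e: G.edges[e]["score"])
        G.remove_edge(*worst)
    # 3) transitive reduction: drop A -> C when A -> B -> C already implies it
    G = nx.transitive_reduction(G)
    # 4) a node with no pre-conditions points to START;
    #    a node that is nobody's pre-condition is pointed to by END
    for v in list(G.nodes):
        if v in ("START", "END"):
            continue
        if G.out_degree(v) == 0:
            G.add_edge(v, "START")
        if G.in_degree(v) == 0:
            G.add_edge("END", v)
    return G
```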

B.8 Details on Online Mistake Detection

Given the noisy sequences in Assembly101 [33] and EPIC-Tent [17], a distinct approach was adopted during the post-processing phase of task graph generation. Specifically, if a key-step in the task graph has only two pre-conditions and one of them is the START node, the other pre-condition is removed regardless of its score; otherwise, we apply the transitive dependency reduction described above. This results in a graph with fewer pre-conditions in the initial steps.

B.9 Qualitative Examples

Figure 5 reports a qualitative example of a prediction.

B.10 Experiments Compute Resources

The experiments involving the training of the DO model on symbolic data from the CaptainCook4D dataset proved to be highly efficient: we were able to generate all the task graphs in approximately half an hour using a Tesla V100S-PCI GPU, which allowed us to run up to 8 training processes simultaneously. In contrast, training the TGT models for all scenarios in the CaptainCook4D dataset required about 24 hours, with the same GPU supporting the concurrent training of up to 2 models. Once the task graphs were obtained, executing the PREGO benchmarks for mistake detection was significantly faster, requiring only online action prediction, which could be performed in real time on a Tesla V100S-PCI GPU.

[Figure 4: illustration of the transitive pre-condition reduction discussed in Appendix B.7.]
[Figure 5: qualitative example of a prediction, discussed in Appendix B.9.]

Appendix C Societal Impact

Reconstructing task graphs from procedural videos may enable the construction of agents able to assist users during the execution of the task. Learning task graphs from videos may be affected by geographical or cultural biases appearing in the data (e.g., specific ways of performing given tasks), which may limit the quality of the feedback returned to the user, potentially leading to harm. We expect that training data of sufficient quality should limit such risks.
