PROCESSING IMAGES USING SELF-ATTENTION BASED NEURAL NETWORKS


DIVISIONAL PCT NATIONAL PHASE APPLICATION

Published


Filed on 7 November 2024

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing images using self-attention based neural networks. One of the methods includes obtaining one or more images (102) comprising a plurality of pixels; determining, for each image of the one or more images (102), a plurality of image patches (112a-n) of the image, wherein each image patch comprises a different subset of the pixels of the image; processing, for each image of the one or more images (102), the corresponding plurality of image patches (112a-n) to generate an input sequence comprising a respective input element at each of a plurality of input positions, wherein a plurality of the input elements correspond to respective different image patches (112a-n); and processing the input sequences using a neural network (130) to generate a network output (152) that characterizes the one or more images (102), wherein the neural network (130) comprises one or more self-attention neural network layers (140). FIG. 1 is the representative figure.

Patent Information

Application ID: 202428085385
Invention Field: COMPUTER SCIENCE
Date of Application: 07/11/2024
Publication Number: 49/2024

Inventors

Name | Address | Country | Nationality
HOULSBY, Neil Matthew Tinmouth | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | U.K. | U.K.
GELLY, Sylvain | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | France | France
USZKOREIT, Jakob D. | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | U.S.A. | U.S.A.
ZHAI, Xiaohua | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | China | China
HEIGOLD, Georg | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | Switzerland | Switzerland
BEYER, Lucas Klaus | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | Germany | Germany
KOLESNIKOV, Alexander | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | Russia | Russia
MINDERER, Matthias Johannes Lorenz | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | Germany | Germany
WEISSENBORN, Dirk | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | Germany | Germany
DEGHANI, Mostafa | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | Iran | Iran
DOSOVITSKIY, Alexey | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | Russia | Russia
UNTERTHINER, Thomas | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | Italy | Italy

Applicants

Name | Address | Country | Nationality
GOOGLE LLC | 1600 Amphitheatre Parkway, Mountain View, California 94043, United States of America | U.S.A. | U.S.A.

Specification

EXTRACTED FROM WIPO
PROCESSING IMAGES USING SELF-ATTENTION BASED NEURAL NETWORKS

BACKGROUND

This specification relates to processing images using neural networks.

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, i.e., the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters.

SUMMARY

This specification describes a system implemented as computer programs on one or more computers in one or more locations that executes a self-attention based neural network that has been configured through training to process one or more images to generate a network output that characterizes the one or more images.

The self-attention based neural network can be configured to process an input sequence representing an image by applying a self-attention mechanism across the elements of the input sequence, generating an output sequence. At least some of the elements of the input sequence can correspond to respective patches of the input image. That is, the system can segment the image into patches and process the pixels of each patch to generate a respective element of the input sequence. By applying a self-attention mechanism to these elements, the self-attention based neural network can attend over the entire image, leveraging both local and global information to generate the output sequence.

The subject matter described in this specification can be implemented in particular embodiments so as to realize one or more of the following advantages.

Some existing systems use self-attention based neural networks for natural language processing (NLP) use cases, processing a text sequence to generate a prediction about the text sequence. An advantage of self-attention based neural networks in the NLP domain is scalability; generally, the performance of a self-attention based neural network improves as the size of the neural network grows. However, in existing systems that apply self-attention based neural networks to images, the same has not been true; generally, the self-attention based neural networks have been unable to scale to larger architectures and therefore do not perform as well as other computer vision systems, e.g., convolutional neural networks. For example, some such existing systems do not apply self-attention across an entire input image and instead apply self-attention to local neighborhoods of the input image. Therefore, a first local neighborhood of the image cannot attend to a second local neighborhood of the image.

Using techniques described in this specification, a system can process images directly using a self-attention based neural network and enjoy high performance even as the size of the neural network grows. In particular, techniques described in this specification leverage the parallelization that is possible using self-attention based neural networks to permit large scale training, leading to improved accuracy in image processing tasks. As a particular example, systems described in this specification may be trained on datasets comprising 14 million to 300 million images. Furthermore, example implementations described in this specification apply global self-attention to full-size images. That is, the self-attention based neural network applies self-attention across an entire input image, and so any region of the image can attend to any other region of the image.

As described in this specification, a self-attention based neural network configured to process images can require far fewer computations to achieve the same performance as a state-of-the-art convolutional neural network. That is, for a fixed compute budget, the self-attention based neural network performs better than the convolutional neural network. This is because applying self-attention is generally more computationally efficient than convolving a kernel across an entire image, as the self-attention mechanism is able to attend to different regions of the image with fewer computations than convolution. As a particular example, a self-attention based neural network as described in this specification can achieve comparable or superior performance to large-scale convolutional neural networks while requiring 2x, 5x, 10x, 100x, or 1000x fewer computations.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram of an example neural network system.

FIG. 2 is a diagram of an example self-attention based neural network.

FIG. 3 illustrates example images segmented into image patches.

FIG. 4 is a diagram of an example training system.

FIG. 5 is a flow diagram of an example process for generating a prediction about one or more images using a self-attention based neural network.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

This specification describes a system implemented as computer programs on one or more computers in one or more locations that is configured to execute a self-attention based neural network configured to process one or more images, i.e., to process the intensity values of the pixels of the one or more images, to generate a network output that characterizes the one or more images.

FIG. 1 is a diagram of an example neural network system 100. The neural network system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.

The neural network system 100 is configured to process an image 102 and to generate a network output 152 that represents a prediction about the image. The neural network system 100 can be configured to perform any appropriate machine learning task using the image 102. Example machine learning tasks are discussed below.

The image can be any appropriate type of image. For example, the image can be a two-dimensional image, e.g., a two-dimensional image that has multiple channels (e.g., an RGB image). As another example, the image 102 can be a hyperspectral image that represents a continuous spectrum of wavelengths, e.g., by identifying, for each pixel in the image 102, a distribution over the spectrum. As another example, the image 102 can be a point cloud that includes multiple points, where each point has a respective coordinate, e.g., in a three-dimensional or a higher-dimensional coordinate space; as a particular example, the image 102 can be a point cloud generated by a LIDAR sensor. As another example, the image 102 can be a medical image generated by a medical imaging device; as particular examples, the image 102 can be a computer tomography (CT) image, a magnetic resonance imaging (MRI) image, an ultrasound image, an X-ray image, a mammogram image, a fluoroscopy image, or a positron-emission tomography (PET) image.

Although the below description refers to generating image patches of the image 102 that each include respective "pixels" of the image 102, it is to be understood that the neural network system 100 can generate image patches that include components of the image 102 that are of any appropriate type. For example, if the image 102 is a point cloud, then each image patch of the image 102 can include a subset of the points in the point cloud. As another example, if the image 102 is an MRI image that includes multiple voxels in a three-dimensional voxel grid, then each image patch of the image 102 can include a subset of the voxels in the voxel grid.

The neural network system 100 includes an image patch generation system 110, an image patch embedding system 120, and a neural network 130. As is described in more detail below, the neural network 130 is a self-attention based neural network that includes a self-attention based subnetwork 140.

A self-attention based neural network is a neural network that includes one or more self-attention neural network layers. A self-attention neural network layer is configured to receive as input a sequence of layer input elements and to apply an attention mechanism over the sequence of layer input elements to generate a sequence of layer output elements. In particular, for each layer input element, the self-attention neural network layer applies the attention mechanism over the layer input elements in the sequence of layer input elements using one or more queries derived from the layer input element to generate a respective output element.

In the example depicted in FIG. 1, the neural network 130 is configured to process, using the self-attention based subnetwork 140, an input sequence that includes input elements representing respective patches of the image 102. Thus, the neural network 130 can apply an attention mechanism to the input sequence in order to attend to different patches at different locations in the image 102. It will be understood that the patches of the image 102 may be processed by the self-attention based subnetwork 140 using parallel processing, i.e. at least part of the processing may be performed in parallel.

The image patch generation system 110 is configured to process the image 102 and to generate n different patches 112a-n of the image 102. In this specification, an image patch of an image is a strict subset of the pixels of the image. Generally, each image patch 112a-n includes multiple contiguous pixels of the image 102. That is, for each particular image patch 112a-n and for any pair of pixels in the particular image patch 112a-n, there exists a path from the first pixel of the pair to the second pixel of the pair where the path only includes pixels in the particular image patch 112a-n.

In some implementations, each pixel in the image 102 is included in exactly one of the image patches 112a-n. In some other implementations, one or more image patches 112a-n can include the same pixel from the image 102, i.e., two or more of the image patches can overlap. Instead or in addition, one or more pixels from the image 102 can be excluded from each of the image patches 112a-n, i.e., one or more pixels are not included in any of the image patches.

The image patches 112a-n can be represented in any appropriate way. For example, each image patch 112a-n can be represented as a two-dimensional image that includes the pixels of the image patch 112a-n, e.g., an image that maintains the spatial relationships of the pixels in the image patch 112a-n.

As another example, each image patch 112a-n can be represented as a one-dimensional sequence of the pixels of the image patch 112a-n. As a particular example, if the image patch 112a-n is a two-dimensional region of the image 102, then the image patch 112a-n can be a flattened version of the two-dimensional region, as is described in more detail below. As another particular example, if the image patch 112a-n includes only pixels that share the same column or row of the image 102 (i.e., if the image patch 112a-n is a one-dimensional region of the image 102), then the image patch 112a-n can be represented as a one-dimensional sequence that maintains the relative positions of the pixels.

As another example, each image patch 112a-n can be represented as an unordered set of the pixels of the image patch 112a-n.

Example image patches are described in more detail below with reference to FIG. 3.
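
For illustration only, the following is a minimal sketch of how an image patch generation system of this kind might segment an image into non-overlapping square patches. It assumes PyTorch; the function name, the fixed patch size, and the channel-first tensor layout are illustrative assumptions rather than details from the specification.

```python
import torch

def extract_patches(image: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Segment a (C, H, W) image into non-overlapping (C, P, P) patches.

    Returns a tensor of shape (num_patches, C, patch_size, patch_size),
    with patches ordered in raster order (left-to-right, top-to-bottom).
    """
    c, h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    # Unfold the two spatial dimensions into a grid of patches:
    # (C, H/P, W/P, P, P)
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # Reorder to (H/P * W/P, C, P, P)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch_size, patch_size)

# Example: a 224x224 RGB image split into 16x16 patches yields 196 patches.
image = torch.randn(3, 224, 224)
print(extract_patches(image, 16).shape)  # torch.Size([196, 3, 16, 16])
```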

The image patch embedding system 120 is configured to obtain the n image patches 112a-n of the image 102, and to generate a respective embedding 122a-n of each of the n image patches 112a-n. Each image patch embedding 122a-n represents the pixels of the corresponding image patch 112a-n and can be generated by processing the pixels of the corresponding image patch 112a-n. In this specification, an embedding is an ordered collection of numeric values that represents an input in a particular embedding space. For example, an embedding can be a vector of floating point or other numeric values that has a fixed dimensionality.

In some implementations in which each image patch 112a-n is represented as a two-dimensional sub-image of the image 102, each image patch embedding 122a-n is a reshaped version of the corresponding image patch 112a-n. For example, the image patch embedding system 120 can "flatten" each image patch 112a-n to generate an image patch embedding 122a-n that is a one-dimensional tensor that includes each pixel in the image patch 112a-n. As a particular example, if each image patch 112a-n has dimensionality L x W x C, where C represents the number of channels of the image (e.g., C = 3 for an RGB image), then the image patch embedding system 120 can generate an image patch embedding 122a-n that has dimensionality 1 x (L · W · C).

In some other implementations, the image patch embedding system 120 can process a one-dimensional tensor that includes the pixels of the image patch 112a-n (e.g., a flattened version of the image patch 112a-n) to generate the corresponding image patch embedding 122a-n. As described in more detail below, the image patch embeddings 122a-n are to be processed by the neural network 130, which has been configured through training to accept inputs having a particular format, e.g., a particular size and shape. Thus, the image patch embedding system 120 can project each image patch 112a-n into a coordinate space that has the dimensionality required by the neural network 130.

For example, the image patch embedding system 120 can process each image patch 112a-n using a linear projection:

z_i = x_i E_i + b_i

where z_i ∈ ℝ^D is the ith image patch embedding 122a-n, D is the input dimensionality required by the neural network 130, x_i is the one-dimensional tensor including the ith image patch 112a-n, N is the number of pixels in the ith image patch 112a-n, E_i ∈ ℝ^(N×D) is a projection matrix, and b_i ∈ ℝ^D is a linear bias term.

In some implementations, the image patch embedding system 120 uses a respective different projection matrix E_i to generate each image patch embedding 122a-n; in some other implementations, the image patch embedding system 120 uses the same projection matrix E to generate each image patch embedding 122a-n. Similarly, in some implementations, the image patch embedding system 120 uses a respective different bias term b_i to generate each image patch embedding 122a-n; in some other implementations, the image patch embedding system 120 uses the same bias term b to generate each image patch embedding 122a-n.
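
For illustration, a minimal sketch of the flattening and linear projection described above, assuming PyTorch and a single projection matrix E (with bias b) shared across all patches; the class name and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flatten each patch and project it to the model dimension D (z_i = x_i E + b)."""

    def __init__(self, patch_size: int, channels: int, dim: int):
        super().__init__()
        n = patch_size * patch_size * channels  # number of pixel values per flattened patch
        # One shared projection matrix E (N x D) and bias b; a per-patch variant
        # would instead keep a separate nn.Linear for each patch position.
        self.proj = nn.Linear(n, dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, C, P, P) -> flattened to (num_patches, C*P*P)
        flat = patches.flatten(start_dim=1)
        return self.proj(flat)  # (num_patches, D)

embed = PatchEmbedding(patch_size=16, channels=3, dim=768)
patch_embeddings = embed(torch.randn(196, 3, 16, 16))  # shape (196, 768)
```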

In some implementations, the linear projection is machine-learned. For example, during training of the neural network 130, a training system can concurrently update the parameters of the linear projection (e.g., the parameters of the projection matrices E_i and bias terms b_i). As a particular example, the training system can update the parameters of the linear projection by backpropagating a training error of the neural network 130 through the neural network 130 and to the image patch embedding system 120, and determining the update using stochastic gradient descent on the backpropagated error. Example techniques for training the neural network 130 are discussed in more detail below with reference to FIG. 4.

Instead of or in addition to processing the one-dimensional tensors corresponding to the image patches 112a-n with a linear projection, the image patch embedding system 120 can process the one-dimensional tensors using an embedding neural network. For instance, the embedding system 120 can be considered a component of the neural network 130. That is, the embedding system 120 can be an embedding subnetwork of the neural network 130 that includes one or more neural network layers that are configured to process the one-dimensional tensors and to generate the image patch embeddings 122a-n.

For example, the embedding neural network can include one or more feedforward neural network layers that are configured to process a one-dimensional tensor corresponding to the image patch 112a-n.

As another example, the embedding neural network can include one or more self-attention neural network layers that are configured to process each one-dimensional tensor corresponding to a respective image patch 112a-n concurrently using a self-attention mechanism. Self-attention is discussed in more detail below.

As another example, the embedding neural network can include one or more convolutional neural network layers that are configured to process an image patch 112a-n using a convolutional filter. As a particular example, if the image patches 112a-n are represented as two-dimensional images, the image patch embedding system 120 can process each (unflattened) image patch 112a-n using one or more convolutional neural network layers to generate a feature map of the image patch 112a-n. The image patch embedding system 120 can then flatten the feature map and process the flattened feature map using a linear projection, as described above, to generate the corresponding image patch embedding 122a-n.

As another particular example, the image patch embedding system 120 can process the entire image 102 using one or more convolutional neural network layers to generate a feature map of the image 102. The feature map can be two-dimensional (or, like the image 102, can be two-dimensional where each element has multiple channels). The neural network system 100 can then determine n patches of the feature map of the image 102, where each patch includes one or more elements of the feature map. That is, instead of segmenting the image 102 itself into the image patches 112a-n, the image patch generation system 110 can segment the feature map of the image 102 generated by the embedding neural network of the image patch embedding system 120. As a particular example, each patch can include a single element of the feature map. The image patch embedding system 120 can then generate the image patch embeddings 122a-n from the n patches of the feature map, e.g., by applying a linear projection to the patches of the feature map as described above.
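
For illustration, a minimal sketch of the convolutional variant described above, in which a small convolutional stem produces a feature map whose spatial elements are then projected into input elements. It assumes PyTorch; the stem architecture, strides, and dimensions are illustrative assumptions, not values from the specification.

```python
import torch
import torch.nn as nn

class ConvStemEmbedding(nn.Module):
    """Process the whole image with convolutional layers, then treat each
    spatial element of the resulting feature map as one "patch" and project it."""

    def __init__(self, channels: int, feature_dim: int, dim: int):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(channels, feature_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.Conv2d(feature_dim, feature_dim, kernel_size=3, stride=2, padding=1),
        )
        self.proj = nn.Linear(feature_dim, dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (1, C, H, W) -> feature map (1, F, H', W')
        fmap = self.stem(image)
        # Each of the H'*W' feature-map elements becomes one input element.
        tokens = fmap.flatten(start_dim=2).transpose(1, 2)  # (1, H'*W', F)
        return self.proj(tokens)  # (1, H'*W', D)

tokens = ConvStemEmbedding(3, 64, 768)(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # (1, 784, 768) for a 224x224 input with total stride 8
```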

After the image patch embedding system 120 generates the image patch embeddings 122a-n, the neural network system 100 can generate the input sequence to be provided as input to the neural network 130 from the image patch embeddings 122a-n. Generally, the input sequence includes one or more input elements corresponding to respective image patch embeddings 122a-n. For example, the input sequence can include a respective input element corresponding to each of the n image patch embeddings 122a-n. As a particular example, the input elements corresponding to the n image patch embeddings 122a-n can be sorted in the input sequence in the raster order of the corresponding image patches 112a-n.

In some implementations, the input element in the input sequence corresponding to an image patch embedding 122a-n is equal to the image patch embedding 122a-n itself.

In some other implementations, to generate the input element of the input sequence corresponding to an image patch embedding 122a-n, the neural network system 100 can combine (i) the image patch embedding 122a-n and (ii) a positional embedding that represents the position within the image 102 of the image patch 112a-n corresponding to the image patch embedding 122a-n. For example, the neural network system 100 can append the positional embedding to the image patch embedding 122a-n. By incorporating the positional embeddings, the neural network system 100 can encode spatial information, e.g., the relative positioning of each image patch in the image, that can be leveraged by the neural network 130 to generate the network output 152.

In some implementations, the positional embedding corresponding to each image patch 112a-n of the image 102 is an integer. For example, a first image patch at the top left of the image 102 can have a positional embedding of '1', a second image patch immediately to the right of the first image patch can have a positional embedding of '2', and so on.

In some other implementations, the positional embeddings are machine-learned. For example, during the training of the neural network 130, a training system can concurrently learn the positional embeddings by backpropagating a training error of the neural network 130 through the neural network 130 and to the positional embeddings. In some such implementations, the training system can generate a respective different positional embedding for each image patch (e.g., assuming every image 102 received by the neural network system 100 is segmented into the same number of patches).

In some other implementations, the training system can incorporate two-dimensional information into the positional embeddings by learning, for both dimensions of the image 102, a respective positional embedding for each coordinate along the dimension. For example, if the image 102 is segmented into a two-dimensional grid of image patches 112a-n, the training system can generate two sets of positional embeddings: a first set that includes a respective positional embedding for each index along the vertical axis of the grid and a second set that includes a respective embedding for each index along a horizontal axis of the grid. To generate the positional embedding for a particular image patch 112a-n, the neural network system can combine, e.g., by concatenating, (i) the positional embedding corresponding to the index of the particular image patch 112a-n along the vertical axis, and (ii) the positional embedding corresponding to the index of the particular image patch 112a-n along the horizontal axis.
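
For illustration, a minimal sketch of factorized two-dimensional positional embeddings of this kind, assuming PyTorch; adding the combined positional embedding to the patch embedding is one of the combination options mentioned above (concatenation or appending are alternatives), and all names, sizes, and the random initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Learned2DPositionalEmbedding(nn.Module):
    """One learned embedding table per grid axis; the row and column embeddings
    are concatenated to form the positional embedding of each patch."""

    def __init__(self, grid_h: int, grid_w: int, dim: int):
        super().__init__()
        assert dim % 2 == 0
        self.rows = nn.Parameter(0.02 * torch.randn(grid_h, dim // 2))
        self.cols = nn.Parameter(0.02 * torch.randn(grid_w, dim // 2))

    def forward(self) -> torch.Tensor:
        h, w = self.rows.shape[0], self.cols.shape[0]
        row = self.rows.unsqueeze(1).expand(h, w, -1)  # (h, w, dim/2)
        col = self.cols.unsqueeze(0).expand(h, w, -1)  # (h, w, dim/2)
        # (h*w, dim), one positional embedding per patch in raster order
        return torch.cat([row, col], dim=-1).reshape(h * w, -1)

pos = Learned2DPositionalEmbedding(grid_h=14, grid_w=14, dim=768)()
patch_embeddings = torch.randn(196, 768)  # stand-in for the embeddings above
inputs = patch_embeddings + pos           # combine embeddings with positions
```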

In some implementations, one or more of the input elements in the input sequence do not correspond to any image patch 112a-n of the image 102. For example, the input sequence can include a class embedding 124 that is the same for all received images 102. For example, the class embedding 124 can be a tensor having the same dimensionality as the image patch embeddings 122a-n. As a particular example, the class embedding 124 can be a tensor of all '0's or all '1's.

The class embedding 124 can be inserted at any position in the input sequence; e.g., the class embedding 124 can be the first input element of the input sequence, or the last input element of the input sequence.

In some implementations, the class embedding 124 is machine-learned. For example, during the training of the neural network 130, a training system can concurrently learn parameters for the class embedding 124 by backpropagating a training error of the neural network 130 through the neural network 130 and to the class embedding 124.

In implementations in which the input element corresponding to each image patch 112a-n includes a positional embedding corresponding to the image patch 112a-n, the neural network system 100 can append a positional embedding to the class embedding 124 as well, e.g., a machine-learned positional embedding or a predetermined positional embedding (e.g., a positional embedding of all '0's or all '1's).
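
For illustration, a minimal sketch of building the input sequence by prepending a learned class embedding and combining each position with a learned positional embedding, assuming PyTorch; the class name, dimensions, and initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InputSequenceBuilder(nn.Module):
    """Prepend a learned class embedding to the patch embeddings and add a
    learned positional embedding for every sequence position (class included)."""

    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.class_embedding = nn.Parameter(0.02 * torch.randn(1, dim))
        self.positions = nn.Parameter(0.02 * torch.randn(num_patches + 1, dim))

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (num_patches, dim); class embedding goes first
        seq = torch.cat([self.class_embedding, patch_embeddings], dim=0)
        return seq + self.positions  # (num_patches + 1, dim)

input_sequence = InputSequenceBuilder(num_patches=196, dim=768)(torch.randn(196, 768))
print(input_sequence.shape)  # torch.Size([197, 768])
```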

After generating the input sequence, the neural network system 100 can provide the input sequence as input to the neural network 130. The neural network 130 can process the input sequence to generate the network output 152.

In particular, the neural network 130 can process the input sequence using the self-attention based subnetwork 140 to generate an output sequence. In some implementations, the neural network 130 generates an output sequence of the same length as the input sequence, i.e., that includes a respective output element for each input element in the input sequence. In particular, the output sequence can include a class output 144 generated from the class embedding 124 and a respective image patch output 142a-n corresponding to each image patch embedding 122a-n in the input sequence.

The self-attention based subnetwork 140 can include one or more self-attention neural network layers that each receive a layer input sequence and apply a self-attention mechanism to the layer input sequence to generate a layer output sequence. In some such implementations, the self-attention based subnetwork 140 includes a sequence of multiple network blocks that are each configured to receive a respective block input sequence that includes a respective element corresponding to each input element in the input sequence, and process the block input sequence to generate a respective block output sequence that includes a respective element for each input element in the input sequence. Each network block can include one or more self-attention neural network layers. An example self-attention based neural network is described in more detail below with reference to FIG. 2.
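
For illustration, a minimal sketch of one such network block, assuming PyTorch and a pre-normalization arrangement of layer normalization, multi-head self-attention, residual connections, and a position-wise feedforward layer; the layer sizes and number of blocks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """One network block: layer norm, multi-head self-attention with a residual
    connection, then layer norm and a position-wise feedforward layer with a
    second residual connection."""

    def __init__(self, dim: int, num_heads: int, mlp_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out                 # first residual connection
        x = x + self.mlp(self.norm2(x))  # second residual connection
        return x

blocks = nn.Sequential(*[SelfAttentionBlock(768, 12, 3072) for _ in range(12)])
out = blocks(torch.randn(1, 197, 768))  # output sequence, same length as input
```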

After the self-attention based subnetwork 140 generates the output sequence, the neural network 130 can provide one or more elements of the output sequence to a head subnetwork 150.

For example, the head subnetwork 150 can be configured to process the n image patch outputs 142a-n. As a particular example, the head subnetwork 150 can combine the n image patch outputs 142a-n (e.g., using global average pooling) to generate a combined patch output, then process the combined patch output to generate the network output 152. For instance, the head subnetwork 150 can process the combined patch output using one or more feedforward neural network layers and/or a linear classifier.

As another example, the head subnetwork 150 can be configured to process only the class output 144 to generate the network output 152. That is, the class output 144 can represent a final representation of the image 102, and the head subnetwork 150 can process the class output 144 to generate the network output 152 that represents the prediction about the image 102. For example, the head subnetwork 150 can include a multi-layer perceptron with one or more feedforward neural network layers.
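
For illustration, a minimal sketch of a head subnetwork that either takes only the class output or pools the image patch outputs before a small MLP classifier, assuming PyTorch; the names, sizes, and the Tanh nonlinearity are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Pool the output sequence (class output or global average over patch
    outputs) and map the pooled vector to per-category scores."""

    def __init__(self, dim: int, num_classes: int, use_class_output: bool = True):
        super().__init__()
        self.use_class_output = use_class_output
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, num_classes))

    def forward(self, output_sequence: torch.Tensor) -> torch.Tensor:
        # output_sequence: (sequence_length, dim), class output at position 0
        if self.use_class_output:
            pooled = output_sequence[0]           # class output only
        else:
            pooled = output_sequence[1:].mean(0)  # global average pooling over patch outputs
        return self.mlp(pooled)                   # scores over the categories

scores = ClassificationHead(dim=768, num_classes=1000)(torch.randn(197, 768))
```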

In some implementations, the self-attention based subnetwork 140 and the head subnetwork 150 have been trained concurrently end-to-end on a single machine learning task. For example, a training system can execute a supervised training process using a training data set that includes multiple training examples that each include a training input sequence (representing respective training images) and a corresponding ground-truth network output, i.e., an output that represents the network output 152 that the neural network 130 should generate in response to processing the training input sequence. The training system can process the training input sequences using the neural network 130 to generate respective predicted network outputs, and determine a parameter update to the head subnetwork 150 and the self-attention based subnetwork 140 according to an error between (i) the predicted network outputs and (ii) the corresponding ground-truth network outputs. For instance, the training system can determine the parameter update by backpropagating the error through both the head subnetwork 150 and the self-attention based subnetwork 140 and performing stochastic gradient descent.
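
For illustration, a minimal sketch of one supervised training step of this kind, assuming PyTorch; the stand-in model, optimizer settings, and cross-entropy loss are illustrative assumptions rather than details from the specification.

```python
import torch
import torch.nn as nn

# Stand-in for a model composing the embedding system, self-attention based
# subnetwork, and head subnetwork; each example pairs inputs with labels.
model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 1000))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

def training_step(inputs: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    predictions = model(inputs)          # predicted network outputs
    loss = loss_fn(predictions, labels)  # error vs. ground-truth outputs
    loss.backward()                      # backpropagate through head and subnetwork
    optimizer.step()                     # stochastic gradient descent update
    return loss.item()

print(training_step(torch.randn(8, 768), torch.randint(0, 1000, (8,))))
```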

In some other implementations, the self-attention based subnetwork 140 has been trained using transfer learning, using one or more other head subnetworks that are different from the head subnetwork 150, e.g., that are configured to perform respective different machine learning tasks than the head subnetwork 150. For example, a training system can concurrently train the self-attention based subnetwork 140 and the one or more other head subnetworks, then remove the one or more other head subnetworks and replace them with the head subnetwork 150 to generate the neural network 130. The training system can then fine-tune the neural network 130 to generate trained parameters for the head subnetwork 150. Example techniques for training the neural network 130 using transfer learning are discussed in more detail below with reference to FIG. 4.

In some implementations, the neural network includes one or more additional subnetworks, e.g., one or more subnetworks directly preceding the self-attention based subnetwork 140 (e.g., a subnetwork that includes one or more recurrent neural network layers configured to process the input sequence) or directly following the self-attention based subnetwork 140 (e.g., a subnetwork that includes one or more recurrent neural network layers configured to process the input sequence).

In some implementations, the neural network 130 does not include the head subnetwork 150. For example, the neural network system 100 may be configured to generate an embedding of the image 102, where the embedding includes (or is generated from) one or more of the image patch outputs 142a-n and/or the class output 144. The neural network system 100 can then provide the embedding of the image 102 to a downstream system for storage or further processing, e.g., by one or more other neural networks.

For example, the neural network system can be configured to receive images 102 from an external system and to provide embeddings of the images 102 back to the system, e.g., by providing the image patch outputs 142a-n and/or the class output 144 back to the external system. The external system can be configured to process, for each image, the embedding of the image using a neural network to generate a prediction about the image; e.g., the external system can include the head subnetwork 150. As a particular example, the neural network system 100 can be configured to receive images from edge devices, e.g., mobile phones, tablet computers, or autonomous vehicles. The edge device can then execute the head subnetwork 150 to generate the prediction about the image.

As described in more detail below with reference to FIG. 4, in some implementations, the self-attention based subnetwork 140 includes many more parameters than the head subnetwork 150, and so can be more computationally expensive to execute. Thus, an edge device may not have the computational resources to execute the self-attention based subnetwork 140. Accordingly, the neural network system 100 can be configured to execute the self-attention based subnetwork 140 (e.g., using one or more parallel processing devices such as GPUs or TPUs), while the edge device can perform the relatively computationally-inexpensive task of executing the head subnetwork 150. For instance, the neural network system 100 can be deployed on the cloud and can be communicatively connected to multiple different edge devices.

The neural network system 100 can be configured to perform any appropriate machine learning task with respect to the image 102, e.g., a classification task, a regression task, or a combination thereof.

As a particular example, the neural network system 100 can be configured to generate a classification output that includes a respective score corresponding to each of multiple categories. The score for a category indicates a likelihood that the image belongs to the category. In some cases, the categories may be classes of objects (e.g., dog, cat, person, and the like), and the image may belong to a category if it depicts an object included in the object class corresponding to the category. In some cases, the categories may represent global image properties (e.g., whether the image depicts a scene in the day or at night, or whether the image depicts a scene in the summer or the winter), and the image may belong to the category if it has the global property corresponding to the category.

As another particular example, the neural network system 100 can be configured to generate a pixel-level classification output that includes, for each pixel in the image, a respective score corresponding to each of multiple categories. For a given pixel, the score for a category indicates a likelihood that the pixel belongs to the category. In some cases, the categories may be classes of objects, and a pixel may belong to a category if it is part of an object included in the object class corresponding to the category. That is, the pixel-level classification output may be a semantic segmentation output.

As another particular example, the neural network system 100 can be configured to generate a regression output that estimates one or more continuous variables (i.e., that can assume infinitely many possible numerical values) that characterize the image. In a particular example, the regression output may estimate the coordinates of bounding boxes that enclose respective objects depicted in the image. The coordinates of a bounding box may be defined by (x, y) coordinates of the vertices of the bounding box. For example, the system may output the (x, y) coordinates of two of the vertices of the bounding box, or can output the coordinates of the center of the bounding box and the height and width of the bounding box.

In some implementations, the neural network system 100 can be configured to perform a video analysis task. For example, the neural network system 100 can receive multiple images 102 that are video frames of a video, and can process each video frame as described above to generate an output that characterizes the video frames, e.g., by characterizing whether the video frames depict a person performing a particular action.

In some such implementations, the neural network system 100 processes each video frame at respective different time points to generate a respective network output 152 for each video frame that characterizes a prediction for the video frame. For example, the neural network system 100 can generate a network output 152 that predicts a classification of the video frame. In some such implementations, the neural network system 100 combines the multiple network outputs 152 corresponding to respective video frames to generate a final network output that characterizes the video. For example, the neural network system 100 can process the respective network outputs 152 using a downstream neural network, e.g., a recurrent neural network.



CLAIMS

4. The method of claim 2, wherein generating a respective input element using the respective initial input element comprises processing the initial input element using a second neural network.

5. The method of claim 4, wherein the second neural network comprises one or more fully-connected neural network layers.

6. The method of claim 1, wherein processing the plurality of image patches corresponding to an image to generate an input sequence comprises:
    processing the plurality of image patches to generate respective intermediate input elements; and
    combining, for each intermediate input element, the intermediate input element with a positional embedding representing a position of the corresponding image patch in the image to generate a respective input element.

7. The method of claim 6, wherein each positional embedding is an integer.

8. The method of claim 6, wherein each positional embedding is machine-learned.

9. The method of claim 1, wherein a particular input element in the input sequence is a machine-learned tensor.

10. The method of claim 1, wherein a plurality of network parameters of the neural network have been updated during training of the third neural network.

11. The method of claim 1, wherein the third neural network is a multi-layer perceptron.

12. The method of claim 1, wherein, for a respective input sequence:
    a particular input element in the input sequence is a machine-learned tensor; and
    processing one or more output elements using the third neural network comprises processing the output element corresponding to the particular input element using the third neural network to generate a prediction for the image.

13. The method of claim 1, wherein one or more of the self-attention neural network layers are multi-head self-attention neural network layers.

14. The method of claim 1, wherein the neural network comprises a sequence of one or more subnetworks, each subnetwork configured to receive a respective subnetwork input for each of the plurality of input positions and to generate a respective subnetwork output for each of the plurality of input positions, wherein each subnetwork comprises a self-attention neural network layer and a position-wise feedforward neural network layer.

15. The method of claim 14, wherein each subnetwork further comprises one or more of:
    a first layer normalization layer that applies layer normalization to the subnetwork inputs for each of the plurality of input positions;
    a first residual connection layer that combines an output of the self-attention neural network layer with the subnetwork inputs for each of the plurality of input positions;
    a second layer normalization layer that applies layer normalization to an output of the first residual connection layer; or
    a second residual connection layer that combines an output of the position-wise feedforward neural network layer with the output of the first residual connection layer.

16. The method of claim 1, wherein:
    the network output comprises a classification output that includes a respective score corresponding to each of multiple categories, a score for a category indicating a likelihood that the image belongs to the category;
    the network output comprises a pixel-level classification output that includes, for each pixel in the image, a respective score corresponding to each of multiple categories, wherein the score for a category indicates a likelihood that the pixel belongs to the category;
    the network output comprises coordinates for one or more bounding boxes that enclose respective objects depicted in the image; or
    the neural network receives multiple images that are video frames of a video, and the network output comprises an output that characterizes the video frames.

17. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising:
    obtaining one or more images comprising a plurality of pixels;
    determining, for each image of the one or more images, a plurality of image patches of the image, wherein each image patch comprises a different subset of the pixels of the image;
    processing, for each image of the one or more images, the corresponding plurality of image patches to generate an input sequence comprising a respective input element at each of a plurality of input positions, wherein a plurality of the input elements correspond to respective different image patches; and
    processing the input sequences using a neural network to generate a network output that characterizes the one or more images, wherein the neural network comprises one or more self-attention neural network layers and wherein the processing comprises:
        processing the input sequence using the neural network to generate a respective output element for each input element in the input sequence; and
        processing one or more of the output elements using a third neural network to generate the network output, wherein the third neural network is configured to generate network outputs of a first type and the neural network has been trained concurrently with a fourth neural network to generate network outputs of a second type that is different from the first type.

18. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
    obtaining one or more images comprising a plurality of pixels;
    determining, for each image of the one or more images, a plurality of image patches of the image, wherein each image patch comprises a different subset of the pixels of the image;
    processing, for each image of the one or more images, the corresponding plurality of image patches to generate an input sequence comprising a respective input element at each of a plurality of input positions, wherein a plurality of the input elements correspond to respective different image patches; and
    processing the input sequences using a neural network to generate a network output that characterizes the one or more images, wherein the neural network comprises one or more self-attention neural network layers and wherein the processing comprises:
        processing the input sequence using the neural network to generate a respective output element for each input element in the input sequence; and
        processing one or more of the output elements using a third neural network to generate the network output, wherein the third neural network is configured to generate network outputs of a first type and the neural network has been trained concurrently with a fourth neural network to generate network outputs of a second type that is different from the first type.

19. The system of claim 17, wherein processing the plurality of image patches corresponding to an image to generate an input sequence comprises, for each image patch:
    generating a respective one-dimensional initial input element that includes the pixels of the image patch; and
    generating a respective input element using the respective initial input element.

20. The system of claim 19, wherein generating a respective input element using the respective initial input element comprises processing the initial input element using a second neural network.

21. A method performed by one or more computers, the method comprising:
    obtaining one or more images comprising a plurality of pixels;
    determining, for each image of the one or more images, a plurality of image patches of the image, wherein each image patch comprises a different subset of the pixels of the image;
    processing, for each image of the one or more images, the corresponding plurality of image patches to generate a respective input sequence comprising a respective input element at each of a plurality of input positions, wherein a plurality of the input elements correspond to respective different image patches, and wherein the processing comprises, for each image of the one or more images:
        processing the plurality of image patches of the image to generate respective intermediate input elements for each of the image patches; and
        combining, for each intermediate input element, the intermediate input element with a positional embedding representing a position of the corresponding image patch in the image to generate the respective input element for the corresponding image patch; and
    processing the respective input sequences for the one or more images using a neural network to generate a network output that characterizes the one or more images, wherein the neural network comprises one or more self-attention neural network layers.

22. The method of claim 21, wherein processing the plurality of image patches of the image to generate respective intermediate input elements for each of the image patches comprises, for each image patch:
    generating a respective one-dimensional initial input element that includes the pixels of the image patch; and
    generating the respective intermediate input element for the image patch using the respective initial input element.

23. The method of claim 22, wherein each image patch has dimensionality L × W × C, wherein C represents a number of channels of the image, and wherein each initial input element has dimensionality 1 × (L · W · C).

24. The method of claim 22, wherein generating a respective intermediate input element using the respective initial input element comprises processing the initial input element using a second neural network.

25. The method of claim 24, wherein the second neural network comprises one or more fully-connected neural network layers.

26. The method of claim 25, wherein the second neural network is a single fully-connected layer.

27. The method of claim 21, wherein each positional embedding is an integer.

28. The method of claim 21, wherein each positional embedding is machine-learned.

29. The method of claim 21, wherein a particular input element in the input sequence is a machine-learned tensor.

30. The method of claim 21, wherein processing the respective input sequences for the one or more images using a neural network to generate a network output that characterizes the one or more images comprises:
    processing respective input sequences for the one or more images using the neural network to generate a respective output element for each input element in the respective input sequences; and
    processing one or more of the output elements using a third neural network to generate the network output.

31. The method of claim 30, wherein:
    the third neural network is configured to generate network outputs of a first type; and
    the neural network has been trained concurrently with a fourth neural network to generate network outputs of a second type that is different from the first type.

32. The method of claim 31, wherein a plurality of network parameters of the neural network have been updated during training of the third neural network.

33. The method of claim 30, wherein the third neural network is a multi-layer perceptron.

34. The method of claim 30, wherein, for a respective input sequence:
    a particular input element in the input sequence is a machine-learned tensor; and
    processing one or more output elements using the third neural network comprises processing the output element corresponding to the particular input element using the third neural network to generate the prediction of the image.

35. The method of claim 21, wherein one or more of the self-attention neural network layers are multi-head self-attention neural network layers.

36. The method of claim 21, wherein the neural network comprises a sequence of one or more subnetworks, each subnetwork configured to receive a respective subnetwork input for each of the plurality of input positions and to generate a respective subnetwork output for each of the plurality of input positions, wherein each subnetwork comprises a self-attention neural network layer and a position-wise feedforward neural network layer.

37. The method of claim 36, wherein each subnetwork further comprises one or more of:
    a first layer normalization layer that applies layer normalization to the subnetwork inputs for each of the plurality of input positions;
    a first residual connection layer that combines an output of the self-attention neural network layer with the subnetwork inputs for each of the plurality of input positions;
    a second layer normalization layer that applies layer normalization to an output of the first residual connection layer; or
    a second residual connection layer that combines an output of the position-wise feedforward neural network layer with the output of the first residual connection layer.

38. The method of claim 21, wherein:
    the network output comprises a classification output that includes a respective score corresponding to each of multiple categories, a score for a category indicating a likelihood that the image belongs to the category;
    the network output comprises a pixel-level classification output that includes, for each pixel in the image, a respective score corresponding to each of multiple categories, wherein the score for a category indicates a likelihood that the pixel belongs to the category;
    the network output comprises coordinates for one or more bounding boxes that enclose respective objects depicted in the image; or
    the neural network receives multiple images that are video frames of a video, and the network output comprises an output that characterizes the video frames.

39. A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to perform operations comprising:
    obtaining one or more images comprising a plurality of pixels;
    determining, for each image of the one or more images, a plurality of image patches of the image, wherein each image patch comprises a different subset of the pixels of the image;
    processing, for each image of the one or more images, the corresponding plurality of image patches to generate a respective input sequence comprising a respective input element at each of a plurality of input positions, wherein a plurality of the input elements correspond to respective different image patches, and wherein the processing comprises, for each image of the one or more images:
        processing the plurality of image patches of the image to generate respective intermediate input elements for each of the image patches; and
        combining, for each intermediate input element, the intermediate input element with a positional embedding representing a position of the corresponding image patch in the image to generate the respective input element for the corresponding image patch; and
    processing the respective input sequences for the one or more images using a neural network to generate a network output that characterizes the one or more images, wherein the neural network comprises one or more self-attention neural network layers.

40. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
    obtaining one or more images comprising a plurality of pixels;
    determining, for each image of the one or more images, a plurality of image patches of the image, wherein each image patch comprises a different subset of the pixels of the image;
    processing, for each image of the one or more images, the corresponding plurality of image patches to generate a respective input sequence comprising a respective input element at each of a plurality of input positions, wherein a plurality of the input elements correspond to respective different image patches, and wherein the processing comprises, for each image of the one or more images:
        processing the plurality of image patches of the image to generate respective intermediate input elements for each of the image patches; and
        combining, for each intermediate input element, the intermediate input element with a positional embedding representing a position of the corresponding image patch in the image to generate the respective input element for the corresponding image patch; and
    processing the respective input sequences for the one or more images using a neural network to generate a network output that characterizes the one or more images, wherein the neural network comprises one or more self-attention neural network layers.

41. A method performed by one or more computers, the method comprising:
    obtaining training data for a first machine learning task, the training data comprising a plurality of first training images;
    training a base neural network on the training data for the first machine learning task, wherein the base neural network comprises (i) a self-attention based subnetwork that comprises one or more self-attention neural network layers and (ii) a base head subnetwork, and wherein the base neural network is configured to, for each of the first training images:
        determine a plurality of image patches of the first training image, wherein each image patch comprises a different subset of the pixels of the first training image;
        process the plurality of image patches to generate an input sequence comprising a respective input element at each of a plurality of input positions, wherein a plurality of the input elements correspond to respective different image patches;
        process the input sequence using the self-attention based subnetwork to generate a respective output element for each input element in the input sequence; and
        process one or more of the output elements using the base head subnetwork to generate a network output for the first machine learning task;
    obtaining training data for a second machine learning task, the training data comprising a plurality of second training images; and
    training a task neural network on the training data for the second machine learning task, wherein the task neural network comprises (i) the self-attention based subnetwork that comprises one or more self-attention neural network layers and (ii) a task head subnetwork.
42. The method of claim 41, wherein the base neural network is configured to, for each of
the second training images:
determine a second plurality of image patches of the second training image,
wherein each image patch comprises a different subset of the pixels of the second training
image;20
process the second plurality of image patches of the second training image to
generate a second input sequence comprising a respective second input element at each of a
plurality of second input positions, wherein a plurality of the second input elements correspond
to respective different image patches from the second plurality of image patches;
process the second input sequence using the self-attention based subnetwork to25
generate a respective second output element for each second input element in the second input
sequence; and
process one or more of the second output elements using the task head
subnetwork to generate a network output for the second machine learning task.
43. The method of claim 42, wherein the second plurality of image patches has a different
number of image patches than the first plurality of image patches.
44. The method of claim 42, wherein the task neural network includes a second neural
network and wherein processing the second plurality of image patches of the second training
image to generate a second input sequence comprising a respective second input element at
each of a plurality of second input positions comprises, for each image patch of the second
training image:
generating a respective one-dimensional initial input element that includes the pixels of
the image patch; and
generating a respective second input element using the respective initial input element,
comprising processing the respective initial input element using the second neural network.
45. The method of claim 44, wherein the base neural network includes the second neural
network and wherein processing the plurality of image patches of the first training image to
generate an input sequence comprising a respective input element at each of a plurality of input
positions comprises, for each image patch of the first training image:
generating a respective one-dimensional initial input element that includes the pixels of
the image patch; and
generating a respective input element using the respective initial input element,
comprising processing the respective initial input element using the second neural network.
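Claims 44 and 45 can be illustrated with a short sketch of the "second neural network" as a projection applied independently to each one-dimensional (flattened) patch. This is an assumption for illustration only, assuming PyTorch; the name patch_projection and the sizes used are not from the specification. Because the projection acts on each patch separately, the same network handles images whose patch counts differ (claim 43).

```python
import torch
import torch.nn as nn

patch_size, channels, embed_dim = 16, 3, 768
# Illustrative "second neural network": a single per-patch linear projection.
patch_projection = nn.Linear(channels * patch_size * patch_size, embed_dim)

def to_input_elements(patches):
    """patches: (num_patches, channels, patch_size, patch_size) for one image."""
    # One-dimensional initial input element containing the pixels of each patch.
    initial_elements = patches.reshape(patches.shape[0], -1)
    # Generate the input elements by processing the initial elements with the shared projection.
    return patch_projection(initial_elements)

first_task_elements = to_input_elements(torch.randn(196, channels, patch_size, patch_size))
second_task_elements = to_input_elements(torch.randn(576, channels, patch_size, patch_size))
```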
46. The method of claim 44, wherein:
generating a respective second input element using the respective initial input element
comprises generating the respective second input element using a second positional embedding
representing a position of the corresponding image patch in the second training image.
47. The method of claim 46, wherein:
generating a respective input element using the respective initial input element
comprises generating the respective input element using a first positional embedding
representing a position of the corresponding image patch in the first training image.
48. The method of claim 47, wherein the first positional embeddings representing positions
of the image patches of the first training image are machine-learned during the training of the
base neural network.
49. The method of claim 48, further comprising:
determining the second positional embeddings representing the positions of the image
patches in the second training image using the first positional embeddings representing
positions of the image patches of the first training image.
50. The method of claim 49, wherein determining the second positional embeddings
representing the positions of the image patches in the second training image using the first
positional embeddings representing positions of the image patches of the first training image
comprises:
performing two-dimensional interpolation on the first positional embeddings
representing positions of the image patches of the first training image.
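One common way to realize the two-dimensional interpolation of claim 50 is to lay the learned positional embeddings out on their 2-D grid of patch positions and resample that grid to the new patch layout. The sketch below assumes PyTorch and bilinear interpolation; the grid sizes and function name are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def interpolate_positional_embeddings(first_embeddings, old_grid, new_grid):
    """first_embeddings: (old_grid * old_grid, embed_dim), learned on the first task."""
    embed_dim = first_embeddings.shape[-1]
    # Arrange the embeddings on the 2-D patch grid: (1, embed_dim, old_grid, old_grid).
    grid = first_embeddings.reshape(old_grid, old_grid, embed_dim).permute(2, 0, 1).unsqueeze(0)
    # Two-dimensional (bilinear) interpolation to the patch grid of the second training images.
    resized = F.interpolate(grid, size=(new_grid, new_grid), mode="bilinear", align_corners=False)
    # Flatten back to a sequence of second positional embeddings: (new_grid * new_grid, embed_dim).
    return resized.squeeze(0).permute(1, 2, 0).reshape(new_grid * new_grid, embed_dim)

second_embeddings = interpolate_positional_embeddings(torch.randn(14 * 14, 768), 14, 24)
```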
51. The method of claim 49, wherein training the task neural network on the training data
for the second machine learning task comprises fine-tuning the first positional embeddings
representing positions of the image patches of the first training image.
52. The method of claim 41, wherein the first machine learning task is a self-supervised
machine learning task.
53. The method of claim 41, wherein the first machine learning task is an image
classification task.
54. The method of claim 41, wherein the second machine learning task is an image
classification task or an object detection task.
55. The method of claim 41, wherein training the task neural network on the training data
for the second machine learning task comprises training the task neural network starting from
parameter values of the self-attention based subnetwork determined by training the base neural
network on the training data for the first machine learning task.
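Claim 55's initialization could look like the following, assuming PyTorch and the HeadedNetwork / SelfAttentionSubnetwork classes sketched earlier; here the task network is given its own copy of the self-attention based subnetwork, initialized from the parameter values learned on the first task. This is a sketch under those assumptions, not the claimed implementation.

```python
# Start the second-task subnetwork from the first-task parameter values.
finetune_subnetwork = SelfAttentionSubnetwork()
finetune_subnetwork.load_state_dict(base_network.subnetwork.state_dict())
task_network = HeadedNetwork(finetune_subnetwork, embed_dim=768, num_outputs=10)
```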
56. The method of claim 41, wherein the first training images have a different resolution
from the second training images.
57. The method of claim 41, wherein training a task neural network on the training data for
the second machine learning task comprises:
updating parameters of the task head subnetwork while freezing parameters of the self-
attention based subnetwork.
58. The method of claim 41, wherein training a task neural network on the training data for
the second machine learning task comprises:
updating parameters of the task head subnetwork and parameters of the self-attention
based subnetwork.
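The two fine-tuning regimes of claims 57 and 58 could be expressed as below, again assuming PyTorch and the task_network sketched earlier; the optimizer choice and learning rates are illustrative assumptions.

```python
import torch

# Claim 57: freeze the self-attention based subnetwork and update only the task head.
for parameter in task_network.subnetwork.parameters():
    parameter.requires_grad = False
head_only_optimizer = torch.optim.Adam(task_network.head.parameters(), lr=1e-3)

# Claim 58: update both the task head and the self-attention based subnetwork.
for parameter in task_network.subnetwork.parameters():
    parameter.requires_grad = True
full_optimizer = torch.optim.Adam(task_network.parameters(), lr=1e-4)
```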
59. A system comprising one or more computers and one or more storage devices storing
instructions that when executed by the one or more computers cause the one or more computers to
perform operations comprising:
obtaining training data for a first machine learning task, the training data comprising a
plurality of first training images;
training a base neural network on the training data for the first machine learning task,
wherein the base neural network comprises (i) a self-attention based subnetwork that comprises
one or more self-attention neural network layers and (ii) a base head subnetwork, and
wherein the base neural network is configured to, for each of the first training images:
determine a plurality of image patches of the first training image, wherein each
image patch comprises a different subset of the pixels of the first training image;
process the plurality of image patches to generate an input sequence comprising
a respective input element at each of a plurality of input positions, wherein a plurality of the
input elements correspond to respective different image patches;
process the input sequence using the self-attention based subnetwork to generate
a respective output element for each input element in the input sequence; and
process one or more of the output elements using the base head subnetwork to
generate a network output for the first machine learning task;
obtaining training data for a second machine learning task, the training data comprising
a plurality of second training images; and
training a task neural network on the training data for the second machine learning task,
wherein the task neural network comprises (i) the self-attention based subnetwork that
comprises one or more self-attention neural network layers and (ii) a task head subnetwork.
60. One or more non-transitory computer storage media storing instructions that when
executed by one or more computers cause the one or more computers to perform operations
comprising:
obtaining training data for a first machine learning task, the training data comprising a
plurality of first training images;
training a base neural network on the training data for the first machine learning task,
wherein the base neural network comprises (i) a self-attention based subnetwork that comprises
one or more self-attention neural network layers and (ii) a base head subnetwork, and
wherein the base neural network is configured to, for each of the first training images:
determine a plurality of image patches of the first training image, wherein each
image patch comprises a different subset of the pixels of the first training image;
process the plurality of image patches to generate an input sequence comprising
a respective input element at each of a plurality of input positions, wherein a plurality of the
input elements correspond to respective different image patches;
process the input sequence using the self-attention based subnetwork to generate
a respective output element for each input element in the input sequence; and
process one or more of the output elements using the base head subnetwork to
generate a network output for the first machine learning task;
obtaining training data for a second machine learning task, the training data comprising
a plurality of second training images; and
training a task neural network on the training data for the second machine learning task,
wherein the task neural network comprises (i) the self-attention based subnetwork that
comprises one or more self-attention neural network layers and (ii) a task head subnetwork.

Documents

Name | Date
Abstract1.jpg | 04/12/2024
202428085385-Proof of Right [02-12-2024(online)].pdf | 02/12/2024
202428085385-FORM-26 [19-11-2024(online)].pdf | 19/11/2024
202428085385-COMPLETE SPECIFICATION [07-11-2024(online)].pdf | 07/11/2024
202428085385-DECLARATION OF INVENTORSHIP (FORM 5) [07-11-2024(online)].pdf | 07/11/2024
202428085385-DRAWINGS [07-11-2024(online)].pdf | 07/11/2024
202428085385-FIGURE OF ABSTRACT [07-11-2024(online)].pdf | 07/11/2024
202428085385-FORM 1 [07-11-2024(online)].pdf | 07/11/2024
202428085385-FORM 18 [07-11-2024(online)].pdf | 07/11/2024
202428085385-REQUEST FOR EXAMINATION (FORM-18) [07-11-2024(online)].pdf | 07/11/2024
