Ziad Doughan, Rola Kassem, Ahmad M. El-Hajj, Ali M. Haidar
Decades have passed since the world first became fascinated by the machine revolution. Through the years, pioneers in the field tried to imagine how such progress in artificial intelligence (AI) could influence our lives. Were we going to witness sentient machines among us? What would their capabilities, superpowers, and limitations be? Were there ghosts in the machines? Many of these questions confounded the greatest minds of our era.
Since the late 1970s, the movie industry has striven to produce the most exciting and inspiring fictional robots, from R2-D2 and C-3PO in the Star Wars movies to Optimus Prime in the Transformers cartoon series. A great deal of effort has gone into providing audiences with extraordinary machines that still convince crowds with their exceptional abilities.
We still remember the day we watched the second Terminator movie. That metallic skeleton enclosed in a skin-mimicking membrane was fascinating. The robot had superhuman physical abilities, especially its vision. One scene lets the spectator see through the eyes of the machine. The plot shows how the robot can recognize facial features and identify people. It can analyze a person's vital signs and query a huge data bank to retrieve personal information, making the Terminator a dangerous environmental scanning machine. Figure 1 shows how a movie may depict AI vision.
Fig. 1. An example showing how the movie industry introduced artificial intelligence vision.
Inspired by this style, pioneers in the field of computer vision are pushing the industry to pursue a similar capability in robotics. However, we will not see such robots wandering the streets just yet, simply because robots cannot see the world, neither as in the movies nor as we humans do. So what is fake and what is real when it comes to AI vision?
The future of computer vision is directly dependent on progress in AI. The near future will bring easier training approaches as well as more accurate image detection. Moreover, AI vision in conjunction with other systems may enable further AI applications. Imagine combining generative image captioning with natural language generation to build and interpret an artificial memory grid. Such hybrid systems could tackle the challenge of replicating the human audiovisual system. Computers will play a role in associating objects with their surroundings (sound, smell, touch, and so on), leading the way to a major new perspective in AI. This revolutionary approach to information processing will therefore play a vital role in the design of artificial general intelligence (AGI). Furthermore, in some industrial applications, there is the prospect of surpassing humans and reaching a higher state of artificial super intelligence (ASI).
However, before reaching human-like AI, much research aims to build the so-called cognitive architecture that will ignite the age of AGI. This architecture is deeply inspired by the physiological and psychological aspects of the brain. The goal is to build a knowledge-based processing system that operates on procedural memory as well as on previous experiences stored in episodic memory units. Here, progress in AI vision will lead the way to image comprehension, in which robots can apply deep learning and then plan or react as needed.
Despite the success of Transformer networks, especially in knowledge association, the pioneer Geoffrey Hinton expects capsule networks to push error rates to a new state of the art. In this design, a capsule is a set of neurons in which each neuron handles a specific trait of the data. In addition, the parallel nature of capsules allows multiple operations to be carried out simultaneously.
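To make the grouping idea concrete, the sketch below (assuming PyTorch; the shapes and the squash nonlinearity follow the common capsule formulation, and all names are illustrative, not the authors' design) groups scalar activations into capsules whose vector length encodes how strongly a trait is present.

```python
# A minimal capsule sketch: neurons are grouped into vectors ("capsules"), and the squash
# nonlinearity maps each vector's length into [0, 1) so it can act as a presence score.
import torch

def squash(capsules: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # capsules: (batch, num_capsules, capsule_dim)
    norm_sq = (capsules ** 2).sum(dim=-1, keepdim=True)
    scale = norm_sq / (1.0 + norm_sq)
    return scale * capsules / torch.sqrt(norm_sq + eps)

activations = torch.randn(32, 64)        # a batch of 32 feature vectors
capsules = activations.view(32, 8, 8)    # 8 capsules of 8 neurons, each handling one trait
capsules = squash(capsules)              # vector length now encodes trait presence
```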
Another promising approach is the combination of artificial neural networks (ANNs) with evolutionary algorithms. This can be applied in simulations and in video generation for natural events that require exhaustive iterations. Moreover, such hybrids are especially useful in swarm intelligence when trying to mimic microorganisms or crowd behavior. Additionally, researchers may embed these models into tiny cooperating hardware units, applying repeated crossover and mutation across the entire system.
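As a toy illustration of the evolutionary side of such hybrids, the sketch below (assuming NumPy; the fitness function is a hypothetical stand-in for any simulation or vision score) evolves a small weight vector through selection, crossover, and mutation.

```python
# A hedged neuroevolution sketch: a population of weight vectors is scored, the fitter half
# is kept, and children are produced by crossover plus small mutations.
import numpy as np

rng = np.random.default_rng(0)

def fitness(weights: np.ndarray) -> float:
    # Hypothetical objective: how closely the weights approximate a target pattern.
    target = np.linspace(-1.0, 1.0, weights.size)
    return -np.mean((weights - target) ** 2)

population = [rng.normal(size=16) for _ in range(20)]
for generation in range(100):
    population.sort(key=fitness, reverse=True)   # selection: fittest first
    parents = population[:10]
    children = []
    for _ in range(10):
        a, b = rng.choice(len(parents), size=2, replace=False)
        mask = rng.random(16) < 0.5              # uniform crossover between two parents
        child = np.where(mask, parents[a], parents[b]) + rng.normal(scale=0.05, size=16)
        children.append(child)                   # mutation: small Gaussian noise
    population = parents + children
```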
These advances in computer vision will directly affect edge computing and the Internet of Things (IoT). Collaborative image processing at the edge will let devices monitor information collectively. Decentralized processing of first-level information can relieve the strain on cloud and data-center resources. Such an approach can secure the service, reduce network outages, and free server capacity for more complicated applications.
Using advanced AI vision will provide instantaneous responses and insightful reactions, leading the way for ASI applications. Vision-guided robots will change manufacturing methods and quality inspection. In some applications, the robot must guarantee fault-free products, which will require investment in 3D inspection models with high-quality picture analysis. Fault detection is not limited to products; it may also be useful in financial transfers. Detecting vulnerabilities in the financial sector can prevent fraud and reduce hacking risks. This can be realized by adding an extra layer of graphical authentication, for which adversarial training lays the groundwork and could, possibly, provide extraordinary security.
Emulating human intelligence is still extremely hard to achieve because of the limits of biological exploration of the brain and its operation. Therefore, AGI will remain hypothetical for the near future. However, some ASI models have easily surpassed humans in specific classification tasks or games, such as AlphaGo. Aiming for a revolutionary progression in ANN vision requires some untraditional observation of natural intelligence, so ANN vision research demands a great deal of brainstorming and preparation. We therefore advise undertaking a predesign phase in which three examinations are necessary before starting the research.
To conclude, the imminent advances in AI vision will pave the way for progress in machine learning (ML). It is the destiny of young minds to decipher the secrets of natural intelligence, overcome the challenges, and reveal unexplored potential.
Whether you are aiming to mimic science fiction movies or planning to research the latest trends, there are many applications to consider in computer vision. Many labs are adopting novel approaches to solve hard problems and deliver a tremendous experience to their customers. Some examples follow.
Today, ANN applications, especially in autonomous driverless car systems, are the most promising route to making movie-like AI vision a reality. Whereas ANN systems provide state-of-the-art applications for image and video processing in which objects have well-defined structures, increasing attention is being devoted to applying this type of network to nongeometric data sets. Despite the good performance of ANNs, training remains a frustrating mission; especially in back-propagation, the whole mechanism sometimes becomes meaningless. Earlier works made breakthroughs in understanding the evolving nature of ANNs. Notably, many studies disregarded the complex connection between training and data features: regardless of the nature of the ANN, which works on generalizing sensitive measures, the main concern of optimization has been the size of the network.
Furthermore, training an ANN without human supervision interests most researchers in the field. Automatic ANN construction algorithms usually follow one of two conventions: reinforcement learning or evolutionary algorithms. In the reinforcement learning method, a sequence of actions defines the architecture of the ANN, while, in evolutionary algorithms, mutation of the components determines which parts survive to the next iteration based on the best performance.
The inspiration behind reinforcement learning is to operate on information-based objectives. This method utilizes the shape of the information distribution to find an optimal solution: exploring the most influential element of a given situation reveals the optimal ANN model. Reinforcement learning therefore becomes progressively more necessary when working in diverse situations. From the information-theory point of view, the automatic formulation of a study is important to identify a powerful agent that influences the future value of the system. Many methods constrain the designer with limitations, such as having no internal state or providing no feedback loops. Reinforcement learning, in contrast, provides the needed influential measures, taking into consideration the provision of a large action set by discovering meaningful distribution spaces.
On the other hand, fully unsupervised learning features a built-in optimization behavior, especially in image matching and object detection. It has established a promising benchmark with multiple exploratory paths for tackling small features. Prefilter systems, which focus on analyzing features for neighborhood and region selection, simplify multiobject detection. In addition, symmetric models cooperate to examine undirected graphs and recognize deep structures that yield better results. Continuous improvement can ensure the durability to handle larger data sets with variable clusters.
Well, with a good understanding of an ANN application and a generative adversarial network (GAN) model, fooling AI may be quite easy. ML techniques remain incomplete in many fields that deal with real-world data. To date, modeling an ANN for pattern recognition requires a great deal of caution. The relational nature between actions and effects provides channels of data for a system in operation; this means that, to design an ANN that controls an environment, the model must be able to exploit and understand this relation.
The multilayer structure of ANNs provides coarse filtering at the very early layers of the network and highly specific feature maps at the end. The internal mechanism therefore updates thousands of parameters simultaneously. To be more specific, different ANNs for image processing may need to focus on completely different features depending on the application. For example, if a network analyzes human faces, it will likely center its interest on the eyes, nose, and so on. To recognize similarities reliably, ANNs should be resistant to data-set variability.
Even with the outstanding results of deep neural networks (DNNs), these networks are not able to generalize to strange inputs. Regardless of whether the inputs are natural samples or generated 3D samples, a DNN fails to harness the parameters that evaluate objects in different situations. Therefore, 3D scanning proves to be extremely hard. Furthermore, reliably classifying a person as a person is very difficult, and providing a database of humans in dangerous situations is a huge challenge. The potential of 3D model generation is therefore tremendously important, especially for ANN applications related to safety.
On the other hand, natural language processing (NLP) based on DNNs performs quite well. Text recognition plays an imperative role in modern NLP applications. A variety of applications, such as automatic data entry and code reading, are embedded in many devices, including mobile and handheld units. In addition, many spam filters rely on text classification.
However, DNNs are not immune to noise and distortion. Many works show that DNNs are weak when exposed to rough image alterations: the pictures will most likely be misclassified. Artificial perturbation can generate "adversarial images" that mislead DNNs easily. The main concept of this mechanism is to add a tiny, carefully crafted perturbation to a natural image. This pushes the DNN to assign the object to an incorrect class. Therefore, input perturbation attacks can turn an optimal network into an infeasible solution even with a slight change in the data.
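A minimal sketch of this additive-perturbation idea is shown below (assuming PyTorch; the fast gradient sign method is used here as one concrete instance, and `model` stands for any differentiable classifier, not a specific system from the text).

```python
# A hedged sketch of an additive adversarial perturbation (fast gradient sign method).
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.03):
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Take a tiny step in the direction that increases the classification loss the most.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()   # keep pixels in a valid range
```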
Figure 2 shows how easily a patch can mislead an ANN in classifying letters. The clutter in the actual input confuses the network, and the targeted ANN generates an unusual response to a well-known letter. Unfortunately, this scheme works well in fooling ANNs. Other efforts aim to generate universal patches that can mislead many networks at once. To strengthen the attack, the optimization covers a wide diversity of poses so that the patch can target a wider range of systems.
Fig. 2. An attack example that involves placing a patch in the image frame.
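To illustrate only the overlaying step (not the patch optimization itself), the sketch below, assuming PyTorch tensors shaped (channels, height, width), pastes a precomputed patch into a frame; all names are illustrative.

```python
# A hedged sketch of overlaying an adversarial patch onto an input frame.
import torch

def apply_patch(image: torch.Tensor, patch: torch.Tensor, top: int, left: int) -> torch.Tensor:
    # image: (C, H, W), patch: (C, ph, pw); the patch simply replaces a region of the frame.
    patched = image.clone()
    _, ph, pw = patch.shape
    patched[:, top:top + ph, left:left + pw] = patch
    return patched
```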
Recently, adversarial attacks have become influential in real-world applications, where familiar objects become targets for the patch-overlaying mechanism. To reduce the effect of adversarial perturbation, researchers have proposed many countermeasures. For example, a method called network distillation transfers the knowledge of a large network to a smaller one through softened outputs, reducing sensitivity to small variations in the detected object and leading to more robust performance. When a DNN learns to operate on the MNIST and CIFAR data sets, the network is ready to perform perfectly on characters in their canonical form. The DNN output alone, however, is not sufficient to map a correct disparity in deceptive areas with low resolution or occluded parts.
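As a hedged sketch of that distillation idea (assuming PyTorch; the temperature value and the KL-divergence form follow the common distillation recipe, not necessarily the exact defense discussed above), a student network can be trained against a teacher's softened outputs.

```python
# A minimal knowledge-distillation loss: the student matches the teacher's softened probabilities.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=20.0):
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the softened distributions, rescaled by T^2 as is customary.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```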
To improve the disparity of an analyzed item, a postprocessing model can follow the DNN to refine the output. Such approaches include cross-based cost aggregation, pixel enhancement, and bidirectional models. For example, a classic problem with low-resolution pictures is image noise rectification. This active subject aims to improve the observation of an object, where additive patches clean perturbed regions. An assortment of filter-, diffusion-, and variation-based methods establishes a sparse representation of the deteriorated regions built on a similarity assessment. Other methods use optimization tools to improve their accuracy with different comparison schemes.
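A very small example of the filter-based branch is sketched below (assuming NumPy and SciPy; the synthetic image and the 5% corruption rate are stand-ins): a median filter removes salt-and-pepper-style perturbations from a noisy, low-resolution image.

```python
# A hedged sketch of filter-based noise rectification using a median filter.
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(0)
image = rng.random((64, 64))                       # stand-in for a low-resolution image
noisy = image.copy()
mask = rng.random((64, 64)) < 0.05                 # corrupt roughly 5% of the pixels
noisy[mask] = rng.choice([0.0, 1.0], size=mask.sum())
denoised = median_filter(noisy, size=3)            # each pixel replaced by its local median
```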
In contrast, gradient-based methods target the structure and parameters of the network itself. By nature, they are simpler and require less computation; the attacker does not even need to attack the network structure directly, as random transformation samples are sufficient to reduce the accuracy dramatically. To overcome several such obstacles in ANNs, GANs can provide object patches that correct noise. Using two models concurrently, the generative model produces samples to train the discriminative model. As the samples become harder to discriminate, the system becomes more resistant to distortion: an attacker must generate better samples than the generative network to fool the discriminator.
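The two-model loop can be sketched as below (assuming PyTorch; `generator` and `discriminator` are placeholders for any modules whose output shapes match the comments, and the latent size is arbitrary).

```python
# A hedged sketch of one GAN training step: the discriminator learns to separate real from
# generated samples, and the generator learns to produce samples the discriminator accepts.
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, real, latent_dim=64):
    batch = real.size(0)
    # Discriminator update: real samples labeled 1, generated samples labeled 0.
    fake = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(real), torch.ones(batch, 1))
              + F.binary_cross_entropy_with_logits(discriminator(fake), torch.zeros(batch, 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Generator update: try to make the discriminator output "real" for generated samples.
    fake = generator(torch.randn(batch, latent_dim))
    g_loss = F.binary_cross_entropy_with_logits(discriminator(fake), torch.ones(batch, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```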
This leads to the conclusion that adversarial training is essential to strengthen ANNs and address safety and security concerns. However, the network must be inspected appropriately in natural settings. Adversarial perturbations can appear in many scenarios: while some training approaches focus on one variant, others use multivariant setups to challenge ANNs. This relates to the nature of ANNs, which require robustness along many dimensions. Combining multiple transformations armors the network with cumulative resistance. This highlights the necessity of current research on adversarial training to ensure a natural response to different arising circumstances.
Understanding the points of attraction in a scene is the basic task in automatic frame prediction. The main goal in computer vision is to identify which objects exist in the frame and to determine the relations between the elements of the scene. Mimicking a natural visual system is therefore a hard problem: humans are vastly superior at transforming visual perception into an imaginative representation adapted to the circumstances.
Structural similarity, feature sharpness, and noise ratio provide a strong level of assessment for many frame prediction and image classification schemes. In contrast to traditional image processing systems, an external model can optimize an ANN; therefore, many hybrid systems deal with the optimization of ANN classifiers, which rely on periodic supervised training plans. Picture processing systems usually aim to reduce the complexity of the model, so it is beneficial to optimize feature representation and spatial resolution. As networks become deeper, pooling layers are essential to maintain confined zones with a certain level of amplitude. To analyze the effect of the pooling layers, the modeling process must therefore preserve the features of interest after traversing these layers.
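As a tiny illustration of that downsampling step (assuming PyTorch; the tensor shape is arbitrary), max pooling halves the spatial resolution while keeping the strongest activation in each confined 2 x 2 zone.

```python
# A minimal pooling sketch: spatial resolution shrinks, but the strongest local response survives.
import torch
import torch.nn.functional as F

features = torch.randn(1, 8, 32, 32)             # (batch, channels, height, width)
pooled = F.max_pool2d(features, kernel_size=2)   # -> (1, 8, 16, 16)
```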
Most optimizations performed during the design phase of a DNN relate to the weight values and activation results. For example, the conventional back-propagation mechanism, which consists of a forward phase and a backward propagation phase, can be optimized by constraining the weights to binary values to eliminate costly multiplications. Limiting the weights to "zero or one" proved very effective for applications on MNIST and CIFAR-10. While characters vary only slightly in most data sets, a distortion of the whole letter can be confused with many structures, so optimization often struggles to reach the best representations. The optimization process therefore focuses on pattern implanting, which contributes most to areas of similarity in the targeted picture. ANNs learn to fine-tune different features, and the most important part is generalizing the network, which uses diverse ensembling schemes to improve the subregions of the addressed picture.
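A minimal sketch of that binarization idea is given below (assuming PyTorch; the {0, 1} constraint follows the text, while many binarized networks use {-1, +1}, and the straight-through trick used to keep gradients flowing is an assumption of this sketch, not necessarily the authors' exact scheme).

```python
# A hedged sketch: the forward pass uses binary weights, the backward pass updates real weights.
import torch

class BinaryLinear(torch.nn.Linear):
    def forward(self, x):
        binary_w = (self.weight > 0).float()                  # constrain weights to {0, 1}
        # Straight-through estimator: gradients flow to the real-valued weights unchanged.
        w = self.weight + (binary_w - self.weight).detach()
        return torch.nn.functional.linear(x, w, self.bias)

layer = BinaryLinear(64, 10)
logits = layer(torch.randn(8, 64))   # multiplications reduce to selective additions
```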
For example, multimodal convolutional feature extraction, used by many technology giants in recent models, can decode many features to boost performance. The comparison operation works on single and group classifications, leading to many performance improvements. The diagram of a multimodal DNN example is shown in Fig. 3. The fusion layer combines the results of two convolution channels, which extract high-level features from different sources. The prediction certainty obtained from this architecture is higher than that attained from two separate DNNs.
Fig. 3. An example of a multimodal DNN that processes sound and image simultaneously.
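The sketch below, assuming PyTorch, mirrors the structure of Fig. 3 at toy scale: an image branch and an audio (spectrogram) branch each extract features, and a fusion layer concatenates them before a shared classifier. All layer sizes are illustrative rather than taken from the figure.

```python
# A hedged multimodal fusion sketch: two convolutional branches feed one fused classifier.
import torch
import torch.nn as nn

class MultimodalNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Fusion layer: concatenate high-level features from both sources.
        self.fusion = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, num_classes))

    def forward(self, image, audio):
        features = torch.cat([self.image_branch(image), self.audio_branch(audio)], dim=1)
        return self.fusion(features)

model = MultimodalNet()
out = model(torch.randn(4, 3, 32, 32), torch.randn(4, 1, 32, 32))   # (4, 10) class scores
```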
As a result, such DNN models often require large amounts of memory, which makes deploying them in embedded systems difficult. Optimization methods that compress a network can effectively reduce the memory requirements, with weight pruning being the main transformation used for this purpose. Loss functions play a major role in these techniques, and encoding filters with a probabilistic configuration can reduce storage space and make the ANN more fault tolerant.
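A minimal sketch of magnitude-based weight pruning is shown below (assuming PyTorch; the 90% sparsity level is arbitrary): the smallest weights are zeroed so the tensor can be stored in a compressed, sparse form.

```python
# A hedged pruning sketch: zero the smallest weights by absolute value.
import torch

def prune_by_magnitude(weight: torch.Tensor, sparsity: float = 0.9) -> torch.Tensor:
    k = max(1, int(sparsity * weight.numel()))            # number of weights to drop
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    mask = (weight.abs() > threshold).float()
    return weight * mask                                   # roughly `sparsity` of entries become 0

pruned = prune_by_magnitude(torch.randn(128, 64))
```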
Recent works have proven that many popular ANNs are not resistant to perturbations and attacks on input data. Clearly, interpreting robustness is an important task for future research. Much research today has fooled ANNs in active, passive, quantitative, and qualitative ways. Notably, a fooled ANN can generalize its misled classification to the entire validation set, meaning that the attack effectively fooled the understanding procedures. Not limited to specific inputs, research has demonstrated that this confusion is transferable between output classes. Therefore, the robustness of a model also relates to its resistance to transferring the fooling effect.
In general, optimization methods that deal with ANN generalization have bounds directly correlated with the error value and the increase in the number of hidden neurons. In the training stage, the key insight is to characterize the complexity and the shrinking speed at the hidden-neuron level. The optimization is subject to the values of the top layer and the variance of the hidden-layer weights from their initialization state. As the network size increases, the weights decrease.
Recently, generative models have aimed to map object distributions in the assorted data space, where inconsistencies can lead to a challenging learning exercise. The model must capture both the generated and target data sets together in the distribution space. The autoencoders used for this purpose regularize the generator toward the target distribution. The resulting mechanism trains the generative model with maximum likelihood and resolves ANNs' inconsistent dimensionalities.
Finally, as the current progress in ANNs shows, especially in AI vision applications, human insight and imagination have been the primary inspiration for many advances in the field of robotics. For sure, many science fiction movies will keep amazing audiences with countless surprises, especially with the innovative rendering tools available to the movie industry.
• M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond Euclidean data,” IEEE Signal Process. Mag., vol. 34, no. 4, pp. 18–42, 2017, doi: 10.1109/MSP.2017.2693418.
• S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, "Similarity of neural network representations revisited," in Proc. 36th Int. Conf. Mach. Learn., 2019, vol. 97, pp. 3519–3529. [Online]. Available: http://proceedings.mlr.press/v97/kornblith19a.html
• R. Luo, F. Tian, T. Qin, E. Chen, and T. Y. Liu, "Neural architecture optimization," in Proc. 32nd Int. Conf. Neural Inf. Process. Syst. (NIPS), 2018, pp. 7827–7838. [Online]. Available: https://proceedings.neurips.cc/paper/2018/file/933670f1ac8ba969f32989c312faba75-Paper.pdf
• S. Thys, W. V. Ranst, and T. Goedeme, “Fooling automated surveillance cameras: Adversarial patches to attack person detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2019, pp. 49–55, doi: 10.1109/CVPRW.2019.00012.
• M. A. Alcorn et al., “Strike (With) a pose: Neural networks are easily fooled by strange poses of familiar objects,” in Proc. 2019 IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pp. 4840–4849, doi: 10.1109/CVPR.2019.00498.
• D. Li, D. V. Vargas, and S. Kouichi, “Universal rules for fooling deep neural networks based text classification,” in Proc. 2019 IEEE Congr. Evol. Comput., pp. 2221–2228, doi: 10.1109/CEC.2019.8790213.
• B. Liang, H. Li, M. Su, P. Bian, X. Li, and W. Shi, “Deep text classification can be fooled,” in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, pp. 4208–4215, doi: 10.24963/ijcai.2018/585.
• J. Su, D. V. Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,” IEEE Trans. Evol. Comput., vol. 23, no. 5, pp. 828–841, 2019, doi: 10.1109/TEVC.2019.2890858.
• J. Heo, S. Joo, and T. Moon, "Fooling neural network interpretations via adversarial model manipulation," in Proc. Adv. Neural Inf. Process. Syst., 2019, vol. 32, pp. 2925–2936. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/7fea637fd6d02b8f0adf6f7dc36aed93-Paper.pdf
• J. Zbontar and Y. LeCun, "Stereo matching by training a convolutional neural network to compare image patches," J. Mach. Learn. Res., vol. 17, no. 65, pp. 2287–2318, 2016. [Online]. Available: https://www.jmlr.org/papers/v17/15-535.html
Ziad Doughan (ziaddoughan@yahoo.com) earned his M.E. degree in electrical and computer engineering from Beirut Arab University in 2016. He is a Ph.D. student at Beirut Arab University, Beirut, 11072809, Lebanon. He worked as a software engineer at the Ministry of Finance between 2006 and 2008 and then in a senior role at LACECO Architects and Engineers between 2008 and 2012. Currently, he is the Cost Control Team leader for the Electrical Digital Signal Processor project in Lebanon and a lecturer at Beirut Arab University. His research interests include artificial intelligence, machine learning, biomimetic systems, neural networks, and data science.
Rola Kassem (r.kassem@bau.edu.lb) earned her Ph.D. degree in control and applied computing from Nantes University in 2010. She is an assistant professor at Beirut Arab University, Beirut, 11072809, Lebanon. She was a teaching assistant between 2006 and 2009 at Institut Universitaire de Technologie de Nantes and from 2009 to 2010 at École Polytechnique de l’Université de Nantes. Her research interests include software development, embedded hardware modeling and simulations, real-time system simulation, cellular network performance, and machine learning.
Ahmad M. El-Hajj (a.elhajj@bau.edu.lb) earned his Ph.D. degree in electrical and computer engineering from the American University of Beirut in January 2014, where he held a postdoctoral fellow position between April 2014 and August 2016. He is an assistant professor at Beirut Arab University, Beirut, 11072809, Lebanon. He also worked as a lecturer from September 2015 to May 2016. His research interests include communications theory, wireless communication planning and optimization, next-generation wireless networks, bioinformatics and neuroengineering, and machine learning.
Ali M. Haidar (ari@bau.edu.lb) earned his Ph.D. degree in computer engineering in March 1995 from Saitama University, Japan. He is a professor in the Department of Electrical and Computer Engineering, Beirut Arab University, Beirut, 11072809, Lebanon. He joined Hiroshima City University in April 1995 as an assistant professor and then joined Beirut Arab University in October 1997. His research interests include logic theory and its applications, machine learning, artificial neural networks, Petri nets, computing architecture, cloud computing, digital innovation, and embedded systems.
Digital Object Identifier 10.1109/MPOT.2022.3179123