I am trying to create a recommendation system based on images and NLP. I have published it at: A picture is worth a Thousand Words - Let's figure out the relevant ones

An amalgamation of image and NLP based neural networks to generate meaningful image descriptions.

Image captioning refers to the process of generating a textual description of an image, based on the objects and actions in the image.

This is a 3-part series implementing image captioning as presented by Andrej Karpathy in his PhD thesis at Stanford.

(Figure: Computer-generated captions using neural nets)

In the process, we will learn the basics of neural networks, create a Convolutional Neural Network (CNN) in Keras (a wrapper around TensorFlow), explore state-of-the-art NLP models (Sequence to Sequence, GloVe, BERT, etc.) and stack the CNN and NLP models together using an LSTM to generate captions for an image.

We will take it from there and create recommender systems based on pre-trained image and caption vectors, followed by a live web app as the testing ground for caption generation as well as recommendations.

(Figure: Recommendations based on the pattern of the first footwear)

Table of Contents (for Part 1):
Basics of Neural Networks
Convolutional Neural Network for Image Recognition
Setting up a Google Colab Notebook
Creating a Neural Network in Keras for Image Classification

Neural Network Basics:

A neural network is a type of machine learning model that is loosely modeled on the human brain. It is an artificial network of neurons, driven by an algorithm that allows the computer to learn by incorporating new data.

It takes several inputs, processes them through multiple neurons in multiple hidden layers, and returns the result through an output layer. This result estimation process is technically known as "Forward Propagation".

Next, we compare the result with the actual output. The task is to make the output of the neural network as close as possible to the actual (desired) output.
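The forward pass and the comparison against the desired output described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sigmoid activation, the tiny 2-2-1 layer sizes, the weight values and the helper names (`dense`, `forward`) are all made up for the example and do not come from the article.

```python
import math

def sigmoid(z):
    # Assumed activation for this sketch; it squashes any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, then the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Forward propagation: feed the inputs through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),  # hidden layer (weights, biases)
    ([[1.0, -1.0]], [0.0]),                     # output layer (weights, biases)
]

prediction = forward([1.0, 2.0], layers)
target = [1.0]

# Compare the prediction with the desired output using a squared error.
error = sum(0.5 * (t - p) ** 2 for t, p in zip(target, prediction))
print("prediction:", prediction, "error:", error)
```

The error computed on the last lines is the quantity that the training steps discussed next try to drive down by adjusting the weights.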
Each neuron contributes some error to the final output. How do you reduce the error?

We adjust the weights of the neurons that contribute more to the error. This happens while traveling back through the neurons of the network and finding where the error lies. This process is known as "Backward Propagation". Backward Propagation (BP) updates the weights to minimize the error resulting from each neuron.

To reduce the number of iterations needed to minimize the error, neural networks use a common algorithm known as "Gradient Descent", which helps to optimize the task quickly and efficiently.

The aim of multiple epochs (rounds of Forward and Backward Propagation) is simply to optimize the weights and biases of the layers so as to minimize the error.

Various Categories of Neural Nets:
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
LSTM and GRUs

Let's delve deep into CNNs here; I will cover the other categories in subsequent posts.

Convolutional Neural Network (CNN):

CNNs are used largely for image processing tasks (classification, object detection, localization, etc.) and consist of 4 major operations, namely Convolution, Non-Linearity, Pooling and Classification, as described below:

1. Convolution:

The primary purpose of convolution in a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. Every image can be considered as a matrix of pixel values.
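To make the "image as a matrix of pixel values" idea concrete, here is a minimal pure-Python sketch of the sliding-window computation that the following paragraphs walk through. The particular 5 x 5 image and 3 x 3 filter values are illustrative assumptions (the figures from the original post are not reproduced here), and `convolve2d` is a hypothetical helper name that assumes a square image, a square filter and no padding.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; at each position multiply element-wise and sum."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature_map = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Assumed 5 x 5 binary image (values are only 0 and 1).
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]

# Assumed 3 x 3 filter (weight matrix).
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)  # stride of 1 gives a 3 x 3 feature map
for row in feature_map:
    print(row)
```

With a stride of 1 the 3 x 3 filter fits into the 5 x 5 image at 3 x 3 positions, which is why the resulting feature map is smaller than the input.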
Consider a 5 x 5 image whose pixel values are only 0 and 1:

(Figure: Input image as a matrix)

Also, consider another 3 x 3 matrix:

(Figure: Randomly initialized weight matrix)

Then the convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as follows:

(Figure: Image * Weight Matrix = Convolved Feature)

We slide the 3 x 3 matrix (orange) over our original image (green) by 1 pixel at a time (this step size is called the 'stride'). At every position, we compute the element-wise multiplication between the two matrices and add up the products to get a single integer, which forms one element of the output matrix (pink). Note that the 3 x 3 matrix "sees" only a part of the input image at each position.

In CNN terminology, the 3 x 3 matrix is called a 'filter', and the matrix formed by sliding the filter over the image and computing the dot product is called the 'Convolved Feature', 'Activation Map' or 'Feature Map'.

(Figure: Convolution operation on an image)

Different filters generate different feature maps from the same original image.

In practice, a CNN learns the values of these filters on its own during the training process, and the more image features it extracts, the better the network becomes at recognizing patterns in unseen images.

2. Non-Linearity (ReLU):

ReLU stands for Rectified Linear Unit and is a non-linear operation. The purpose of ReLU is to introduce non-linearity into our ConvNet, since most of the real-world data we would want our ConvNet to learn is non-linear.

Its output is given by: Output = max(0, Input)

(Figure: ReLU operation)

ReLU replaces all negative pixel values in the feature map with zero.

3. Pooling (to reduce the dimensions of an image):

Spatial pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial pooling can be of different types: Max Pooling, Average Pooling, etc.

(Figure: Max Pooling and Average Pooling)

4. Fully Connected (Dense) Layer for Classification:

The term "Fully Connected" implies that every neuron in the previous layer is connected to every neuron in the next layer.

The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features to classify the input image into various classes based on the training dataset.

We will use the Softmax function for the final classification. Softmax takes a vector of arbitrary real-valued scores and squashes it into a vector of values between zero and one that sum to one.

The overall training process of the Convolutional Network may be summarized as follows:

Step 1: We initialize all filters and parameters / weights with random values.

Step 2: The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations, along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.

Step 3: Calculate the total error at the output layer (summation over all 4 classes in this example):

Total Error = ∑ ½ (target probability − output probability)²

Step 4: Use Backpropagation to calculate the gradients of the error with respect to all weights in the network, and use gradient descent to update all filter values / weights and parameter values to minimize the output error.

Step 5: Repeat steps 2-4 with all images in the training set.

The above steps train the ConvNet. This essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.

When a new (unseen) image is input into the ConvNet, the network goes through the forward propagation step and outputs a probability for each class. For a new image, the output probabilities are calculated using the weights that were optimized to correctly classify the training examples.

Setting up a Google Colab Notebook:

All you need is a Google Drive account.

Colab, as described by the Google research team, is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.

With Colab you can write and execute code, save and share your analyses, and access powerful computing resources (including GPUs), all for free from your browser.

Let's now create a CNN model for image classification with Keras in Google Colab using a GPU:

Go to https://colab.research.google.co...
Click Runtime > Change runtime type and select GPU for a GPU-enabled environment.
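After switching the runtime type, it can be worth confirming that the notebook actually sees a GPU. The following is a minimal sketch, assuming the notebook runs TensorFlow 2.x (where `tf.config.list_physical_devices` is available). The helper name `gpu_status` is hypothetical, and the guard around the import is only there so the snippet also runs in an environment without TensorFlow.

```python
import importlib.util

def gpu_status():
    """Return a short message describing whether TensorFlow can see a GPU."""
    # Guard so the snippet still runs where TensorFlow is not installed.
    if importlib.util.find_spec("tensorflow") is None:
        return "TensorFlow is not installed in this environment"
    import tensorflow as tf
    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        return f"{len(gpus)} GPU(s) visible to TensorFlow"
    return "No GPU visible; check Runtime > Change runtime type"

print(gpu_status())
```

If the message reports no GPU, the runtime is most likely still set to CPU.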
On my test runs I found the GPU to be at least 10X faster than the CPU on similar system configurations.

(Figure: New notebook on Google Colab in a GPU-powered environment)

Building a CNN model in Keras for image classification (good old MNIST database):

    # Keras model creation
    import keras
    from keras.models import Sequential
    from keras.layers import Dense, Conv2D, Flatten

    # create model
    model = Sequential()

    # add model layers
    model.add(Conv2D(64, kernel_size=3, activation='relu', input_shape=(28,28,1)))
    model.add(Conv2D(32, kernel_size=3, activation='relu'))
    # flatten the 2D feature maps into a vector before the Dense output layer
    model.add(Flatten())
    model.add(Dense(10, activation='softmax'))

The model type that we will be using is Sequential. It allows you to build a model layer by layer.

We use the 'add()' function to add layers to our model.

Our first 2 layers are Conv2D layers. These are convolution layers that deal with our input images, which are seen as 2-dimensional matrices.

Between the Conv2D layers and the Dense layer there is a Flatten layer, which turns the 2-dimensional feature maps into the 1-dimensional vector that the Dense layer expects.

'Dense' is the layer type we will use for our output layer. Dense is a standard layer type that is used in many cases for neural networks.

We will have 10 nodes in our output layer, one for each possible outcome (the 10 digits, 0-9).

The activation is 'softmax'. Softmax makes the outputs sum to 1, so the output can be interpreted as probabilities. The model then makes its prediction based on which option has the highest probability.

    # Keras model compile
    # compile model using accuracy to measure model performance
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Compiling the model takes three parameters: optimizer, loss and metrics.

The optimizer controls the learning rate. We will be using 'adam' as our optimizer since it adjusts the learning rate throughout training.

The learning rate determines how fast the optimal weights for the model are calculated. A learning rate that is too high can overshoot good weights and leave the model under-fitted, while a low learning rate means the model takes a long time to fit.

(Figure: Learning Rate vs Loss)

We will use 'categorical_crossentropy' for our loss function.
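To make the softmax output and the categorical cross-entropy loss a little more concrete, here is a minimal pure-Python sketch of both. It illustrates the underlying math only and is not Keras's actual implementation; the 3-class scores, the one-hot target and the small `eps` guard against `log(0)` are assumptions made for the example.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum does not change the result but avoids overflow.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities, eps=1e-12):
    """Negative log of the probability the model assigned to the true class."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot_target, probabilities))

scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs for 3 classes
target = [1, 0, 0]        # one-hot label: the true class is class 0

probs = softmax(scores)
loss = categorical_crossentropy(target, probs)
predicted_class = probs.index(max(probs))  # the option with the highest probability

print("probabilities:", probs)
print("loss:", loss, "predicted class:", predicted_class)
```

The more probability the model puts on the true class, the closer the loss gets to zero, which is why a lower loss indicates a better fit.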
Categorical cross-entropy is the most common choice for classification. A lower score indicates that the model is performing better. The 'accuracy' metric is used to see the accuracy score on the validation set when we train the model.

    # train the model
    # X_train, y_train, X_test, y_test are assumed to be the MNIST images reshaped to (n, 28, 28, 1) with one-hot encoded labels
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3)

    # make predictions on the first 4 images in the test set
    model.predict(X_test[:4])

Based on this, we are able to see 97% accuracy on our test dataset.

My Colab notebook with the code: Google Colaboratory (colab.research.google.com)

In the next part, we will work on the Flickr 30K dataset (http://bryanplummer.com/), which has 5 captions associated with each image. We will build an Encoder-Decoder network, followed by supervised training on the captions, to generate custom captions for our image inputs.

Thanks a lot for reading.

Happy Learning!!!

References:
https://github.com/karpathy/neur...
https://towardsdatascience.com/a...
https://www.analyticsvidhya.com/...
https://towardsdatascience.com/i...
https://ujjwalkarn.me/2016/08/11...
https://medium.com/nybles/create...
https://towardsdatascience.com/i...
https://towardsdatascience.com/b...
https://towardsdatascience.com/c...
https://towardsdatascience.com/g...