Jekyll2021-01-23T21:40:12+00:00https://gbpcosta.github.io/feed.xmlGabriel B. Paranhos da Costa’s WebsitePersonal websiteGabriel B. Paranhos da CostaData Science 101 - Part 22020-04-27T00:00:00+01:002020-04-27T00:00:00+01:00https://gbpcosta.github.io/blog/data-science-101-part-2<p>This is Part 2 of a series of posts that cover basic concepts of Data Science. In this post, I will focus on Machine Learning and define the main learning strategies that these methods use. I will also try to illustrate these strategies by giving examples of when each one is used. The main objective here is to give an overview of each learning strategy and not to go into details (yet).</p>
<p>Part 1 of this series covered the main areas within Data Science, a few commonly used terms and applications. Part 1 can be found <a href="/blog/2020-04-14-data-science-101-part-1/">here</a>.</p>
<hr />
<h1 id="machine-learning---learning-strategies">Machine Learning - Learning strategies</h1>
<p>Machine learning’s main goal is to develop algorithms that are capable of creating models that can make non-trivial decisions without explicit instructions. This is usually done by learning through past experience or by making assumptions as to how each decision should be made. Many different approaches can be used to achieve this. The most common approaches can be put into one of these four categories: Supervised Learning; Semi-Supervised Learning; Unsupervised Learning; and Reinforcement Learning.</p>
<h2 id="supervised-learning">Supervised Learning</h2>
<p><strong>Supervised Learning</strong> focuses on creating models when there is data available stating which decisions should have been made on past observations. That is, Supervised Learning methods require a dataset with <strong>labelled examples</strong>, where each labelled example consists of the information used to make the decision (commonly called <em>X</em>) and the label that indicates the decision that should be made (usually represented by <em>y</em>). This approach can be seen as a parallel to a person learning a certain topic by studying a set of questions and answers on that topic and then proceeding to answer never before seen questions on the same topic.</p>
<p><strong>Classification</strong> is a task usually addressed with Supervised Learning algorithms. In this task, the machine learning model (<em>classifier</em>) is given an example (also called an observation) and is asked to place this example in one (or more) of a set of predefined categories (a.k.a. <em>classes</em>). The dataset used to train the classifier needs to contain a representative sample of every class. Binary classification tasks focus on predicting a single binary output (e.g. Yes / No), while multi-class classification allows more than two categories. Another problem tackled by Supervised learning methods is <strong>Regression</strong>. This is similar to Classification; however, the output of the model is continuous instead of categorical. For example, consider a supervised model trained on a dataset of dog images to decide, based on appearance, each dog’s breed. In this scenario: each image is the information that leads to a decision (<em>X</em>); the decision itself is the dog’s breed (<em>y</em>); and the set of possible classes is defined by all the dog breeds included in the training dataset. Another example would be to train a supervised model to predict the price of a house given the house’s characteristics, such as number of bedrooms, neighbourhood and size. In this case, the list of characteristics is the information used to make a decision (<em>X</em>); and the decision is each house’s price (<em>y</em>). These examples illustrate a classification and a regression task, respectively.</p>
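<p>The supervised workflow above can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is available and using its bundled iris dataset in place of the dog-breed example; the split, model and dataset are all illustrative choices, not a prescription:</p>

```python
# Minimal supervised learning sketch: fit a classifier on labelled
# examples (X, y) and evaluate it on examples it has never seen.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)            # 150 labelled examples, 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)      # classifier: categorical output
clf.fit(X_train, y_train)                    # learn from past (labelled) experience
accuracy = clf.score(X_test, y_test)         # answer never before seen questions
print(f"Accuracy: {accuracy:.2f}")
```

<p>For a regression task such as house prices, the structure is identical: swap the classifier for a regressor (e.g. <code>LinearRegression</code>) and use a continuous <em>y</em>.</p>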
<p>Another learning approach I consider to be part of Supervised Learning is <strong>Collaborative Filtering</strong>. Collaborative Filtering is used by many <strong>Recommendation Systems</strong> as a way to predict a user’s behaviour based on the behaviour of similar users and the user’s behaviour towards similar objects. I will go into more detail on how this works and on other ways to train Recommendation Systems in another post.</p>
<h2 id="unsupervised-learning">Unsupervised Learning</h2>
<p>While Supervised Learning uses labelled examples to learn from past experience, <strong>Unsupervised Learning</strong> focuses on cases where labels are not available and finds patterns in the data based only on the information available for each observation. This is usually done by making assumptions about how the data is organised, such as assuming certain characteristics about the statistical distribution the data is sampled from. Unsupervised learning algorithms are especially useful when labels are unavailable or hard to obtain, when doing exploratory analysis of datasets, and when making changes to a dataset with the goal of simplifying or cleaning it.</p>
<p>The most common application of Unsupervised learning algorithms is <strong>Clustering</strong>, which creates groups (clusters) of examples based on similarity. That is, Clustering algorithms group examples so that examples in the same cluster are more similar to each other than to examples in different clusters; in other words, these methods maximise <em>intra-cluster</em> similarity while minimising <em>inter-cluster</em> similarity. <strong>Anomaly Detection</strong> is also commonly addressed as an Unsupervised learning problem (even though it can be treated as a Supervised or Semisupervised problem too), where the algorithm tries to create a model that defines what <em>normal</em> behaviour looks like, and any examples that do not fit that model are considered <em>anomalous</em>. These anomalous examples are often called anomalies, outliers, novelties or exceptions. Anomaly detection is frequently used for problems such as bank fraud, medical diagnosis and fault detection.</p>
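<p>As a concrete sketch of clustering, the snippet below groups unlabelled points with k-means. It assumes NumPy and scikit-learn; the two synthetic “blobs” stand in for any real dataset, and k-means is just one of many clustering algorithms:</p>

```python
# Clustering sketch: no labels are given, yet k-means recovers the
# two groups purely from similarity in the feature space.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated blobs of 2-D points, stacked into one unlabelled dataset.
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_          # cluster assignment for each example
print(labels[:3], labels[-3:])   # points from the same blob share a cluster id
```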
<p>Lastly, most Machine Learning methods are not able to deal with raw data directly, for example a text or an image. These methods need a numerical representation (commonly called a <em>feature vector</em>) for each example, which can be created in several different ways (which I will address in a future post). Creating these feature vectors generates a <em>feature space</em> into which all the examples in a dataset are projected. However, the created feature space might contain redundant information and/or noise. Unsupervised learning algorithms can be used to make changes to the feature space, making it easier to understand or visualise (<em>dimensionality reduction</em>), or to remove unwanted characteristics or information contained in that space. A very well known unsupervised learning method that can be used for these purposes is <strong>Principal Component Analysis</strong> (PCA), which finds new dimensions for the feature space, sorted by how much of the data’s variability they capture. Again, I will go into more detail on how this method works in a future post, when talking about <em>feature projection</em>.</p>
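<p>A small PCA sketch makes the redundancy point concrete. Assuming NumPy and scikit-learn, the third feature below is deliberately built as a linear combination of the first two, so two principal components are enough to capture essentially all the variability:</p>

```python
# Dimensionality reduction sketch: a 3-D feature space with a redundant
# dimension is projected down to 2-D without losing information.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Third feature = first + second: pure redundancy in the feature space.
X = np.column_stack([X, X[:, 0] + X[:, 1]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # components sorted by variance captured
print(pca.explained_variance_ratio_)   # the two components capture ~all variance
```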
<h2 id="semisupervised-learning">Semisupervised Learning</h2>
<p><strong>Semisupervised Learning</strong> is a grey area between Supervised learning and Unsupervised learning where methods use weaker types of supervision (e.g. noisy or imprecise labels), a mix of labelled and unlabelled examples, or information that is not part of the dataset as a proxy for labels (e.g. descriptions of each label). These methods are usually a good option when a large quantity of unlabelled samples is available but only a small amount of labelled ones, especially when labelling is too expensive. Semisupervised learning algorithms usually tackle problems similar to Supervised learning, like <strong>Classification</strong> and <strong>Regression</strong>. However, these methods can also be used in other tasks, such as automatically extending the available labels to unlabelled examples (for example, by using <strong>Label Propagation</strong>), which can then be used by a Supervised method.</p>
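<p>A minimal Label Propagation sketch, assuming scikit-learn (whose convention is to mark unlabelled examples with <code>-1</code>): most labels are hidden, and the algorithm spreads the few remaining ones to the unlabelled examples. The 90% masking rate and the iris dataset are illustrative choices:</p>

```python
# Semisupervised sketch: propagate a handful of labels to the rest
# of the dataset through similarity between examples.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

y_partial = np.copy(y)
unlabelled = rng.random(len(y)) < 0.9   # hide ~90% of the labels
y_partial[unlabelled] = -1              # -1 means "no label available"

model = LabelPropagation().fit(X, y_partial)
# transduction_ holds the labels inferred for every example,
# including the ones that started without a label.
acc = (model.transduction_[unlabelled] == y[unlabelled]).mean()
print(f"Recovered labels correctly for {acc:.0%} of unlabelled examples")
```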
<h2 id="reinforcement-learning">Reinforcement Learning</h2>
<p>Finally, <strong>Reinforcement Learning</strong> focuses on goal-centred algorithms that maximise the rewards obtained, based on policies, over several steps. That is, one or more reward functions are predefined based on the task at hand, and the model is trained to maximise the rewards gained. These algorithms are used in several different applications but became particularly famous for their applications in robotics (<a href="https://deepmind.com/blog/article/producing-flexible-behaviours-simulated-environments">like Deep Mind’s models that learn to walk and overcome several obstacles</a>) and gaming (e.g. <a href="https://www.youtube.com/watch?v=qv6UVOQ0F44">the AI trained to play Super Mario Bros</a>).</p>
<p>There are different ways of designing Reinforcement Learning algorithms, the most common being <strong>Temporal Difference</strong>, which updates the value estimates and evaluates the decision making at every <em>step</em> (instead of evaluating entire <em>episodes</em> - from start to end state). Within Temporal Difference, two types of learning take prominence: (1) <strong>State-Action-Reward-State-Action</strong> (SARSA); (2) <strong>Q-Learning</strong>. SARSA is an <em>on-policy</em> temporal difference approach: its updates use the next state-action pair actually chosen by the current policy. Q-Learning is an <em>off-policy</em> temporal difference approach, very similar to SARSA. However, Q-Learning updates do not use the action the policy will actually take next; instead, they bootstrap with a greedy policy, always using the next action with the highest estimated reward at that moment.</p>
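<p>The off-policy update at the heart of Q-Learning fits in one line. Below is a toy sketch, assuming NumPy: an agent in a five-state corridor is rewarded only for reaching the rightmost state, and the environment, learning rate and episode count are all illustrative choices:</p>

```python
# Tabular Q-Learning sketch on a 1-D corridor: states 0..4,
# actions 0 (left) / 1 (right), reward 1 for reaching state 4.
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))     # value estimate per state-action pair
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration
rng = np.random.default_rng(0)

for _ in range(500):                    # episodes
    s = 0
    while s != n_states - 1:
        # Behaviour policy: epsilon-greedy over the current estimates.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Off-policy temporal-difference update: bootstrap with the GREEDY
        # next action, regardless of what the behaviour policy does next.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q[:-1].argmax(axis=1))  # greedy action per non-terminal state (1 = right)
```

<p>Replacing <code>Q[s_next].max()</code> with the value of the action actually taken in <code>s_next</code> would turn this into SARSA.</p>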
<hr />
<p>In this post, I went over the four main learning approaches used in Machine Learning, giving a quick overview of each and simple examples of when they are used. In my next post, I will go over other learning strategies that are commonly used in Machine Learning. These learning strategies are complementary to the ones described in this post and often define a subset of methods within them.</p>Gabriel B. Paranhos da CostaData Science 101 - Part 12020-04-14T00:00:00+01:002020-04-14T00:00:00+01:00https://gbpcosta.github.io/blog/data-science-101-part-1<p>It is not uncommon for people to use different data science terms interchangeably, without really knowing that most of those terms have (sometimes slightly, most times completely) different meanings. In a series of posts, I will try to define commonly used terms and give an overview of how I approach data science projects.</p>
<hr />
<p><strong>Data Science</strong> itself is a broadly used term, often used with many different meanings in mind. The main definition of Data Science is an interdisciplinary field that uses knowledge from Computer Science, Statistics, Information Sciences and application-specific fields (among others) to extract insights and create models based on previously observed phenomena, all while using scientific methods. The responsibilities of a data scientist vary a lot from company to company, and even within the same company, across different teams or projects.</p>
<p>Data Science is often used as a synonym for Data Mining, Data Analysis, Artificial Intelligence and Machine Learning. This is mostly because data scientists do use methods and knowledge from each of these fields in their work; however, they are not exactly the same. <strong>Data Analysis</strong> consists of the methods and tools used to create a better understanding of a phenomenon from recorded examples, computing measurements and creating visualisations that allow the user to assess quantities of information and data that would otherwise be unmanageable. These methods are regularly used by data scientists in many different situations, such as: when the main objective is to extract insights from stored information; when starting a new project; and when assessing the quality and biases of machine learning models. <strong>Data Mining</strong> is a field that focuses on how to combine, filter and transform existing information in order to generate new, clearer information. These methods are commonly used as part of Data Analysis projects.</p>
<p><strong>Artificial Intelligence</strong> (AI) is the name given to the Computer Science field that studies and develops computer systems that are capable of performing tasks which require some kind of <em>human</em> intelligence. These programs or models are able to solve complex problems such as translating texts to different languages, identifying objects in a video and understanding requests made through voice commands. Among different Artificial Intelligence sub-fields, <strong>Machine Learning</strong> (ML) has taken a position of prominence. This sub-field focuses on the development of computer programs capable of making complex decisions by learning from past experience, that is, without explicit instructions. This is usually done through statistical and optimisation models and can be applied to many different problems, the only constraint being that data that is relevant to the underlying task needs to be available to train the model.</p>
<p>As with Artificial Intelligence, Machine Learning also has many different sub-fields, like Statistical Learning Theory, Clustering and Artificial Neural Networks. <strong>Artificial Neural Networks</strong> (ANN, sometimes just called Neural Networks - NN) have existed for a long time but recently became famous due to the results achieved by Deep Learning methods. Artificial Neural Networks are models that are <em>vaguely</em> inspired by biological neural networks (brains!) and are usually composed of a set of computing units (neurons) connected to each other, creating a network of units from input to output. When these units are organised in a series of layers (and the number of layers starts to grow), these models are known as <strong>Deep Learning</strong> (DL). Deep Learning methods create a hierarchy of concepts (each concept learned by a computing unit) by learning simpler concepts in units in the first layers (closer to the input), and combining those into more complex concepts the deeper into the network you go.</p>
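<p>The layered structure described above can be sketched with plain NumPy. This is an illustrative forward pass only (no training), with made-up sizes, where each layer is a weighted sum followed by a non-linearity:</p>

```python
# Sketch of a tiny feed-forward neural network: two layers of units,
# each neuron computing a weighted sum of its inputs plus a non-linearity.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                      # one input example, 4 features

W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)    # hidden layer: 8 units
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)    # output layer: 2 units

h = np.maximum(0.0, x @ W1 + b1)  # first layer: simpler concepts near the input
out = h @ W2 + b2                 # next layer combines them into complex ones
print(out.shape)                  # (1, 2)
```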
<p>All these methods can be used in many different applications, such as Natural Language Processing, Computer Vision, Genomics and Financial Services. Since most of my experience is in Natural Language Processing and Computer Vision, I will focus on these two when describing applications and creating examples. <strong>Natural Language Processing</strong> (NLP) combines Linguistics and Computer Science with the aim of allowing computers to interact with human languages. Applications explored by Natural Language Processing include Translation, Summarisation, Text Generation and Conversational Agents (a.k.a. chatbots). <strong>Computer Vision</strong>’s (CV) main goal is to enable computers to understand visual cues, typically in images or videos. That is, Computer Vision methods automate tasks usually performed by the human visual system. Some of the tasks these methods solve include Face Recognition, Object Detection, Motion Analysis and Image Restoration.</p>
<p>I created a Venn diagram to illustrate how some of these fields and sub-fields interact with each other. In this diagram, other application-related fields could replace Natural Language Processing or Computer Vision. Several other subfields could also be included; however, I did not want to overcomplicate it.</p>
<p><img src="/assets/images/data_science_venn.png" alt="Venn diagram showing several data science related areas" class="center-image" /><em>Venn diagram showing several data science related areas.</em></p>
<hr />
<p>In this post, I tried to go over the areas within or related to data science, giving a simple definition of each and showing how they relate to each other. In my next post, I will focus on Machine Learning specifically by defining different learning approaches.</p>Gabriel B. Paranhos da CostaInstalling CUDA 7.5 in Ubuntu 14.042016-03-22T00:00:00+00:002016-03-22T00:00:00+00:00https://gbpcosta.github.io/blog/installing-cuda-7-5-in-ubuntu-14-04<p>After some problems installing CUDA 7.5 in Ubuntu 14.04, facing login loops and other issues, I found a solution on an NVIDIA forum (link at the end of the post). This procedure sets up the graphics card for CUDA purposes only, without enabling it for rendering.</p>
<ol>
<li>Install the build-essentials package by running:</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> <span class="nb">sudo </span>apt-get <span class="nb">install </span>build-essential</code></pre></figure>
<ol>
<li>
<p>Download the CUDA installation (.run) file from the NVIDIA website (<a href="https://developer.nvidia.com/cuda-downloads">Link</a>).</p>
</li>
<li>
<p>If you have an xorg.conf file, move it out of the way (the rename below keeps a backup, to avoid any problems):</p>
</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> <span class="nb">sudo mv</span> /etc/X11/xorg.conf /etc/X11/xorg.conf.bkp</code></pre></figure>
<ol>
<li>Create the /etc/modprobe.d/blacklist-nouveau.conf file containing:
<pre>
blacklist nouveau
options nouveau modeset=0
</pre>
</li>
<li>Then run the following command and reboot your computer.</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> <span class="nb">sudo </span>update-initramfs <span class="nt">-u</span></code></pre></figure>
<ol>
<li>
<p>When the login screen appears, press Ctrl + Alt + F1 to access the text interface and log in with your user.</p>
</li>
<li>
<p>Go to the directory where the CUDA .run file was downloaded and run:</p>
</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> <span class="nb">chmod </span>a+x CUDA_RUN_FILE</code></pre></figure>
<ol>
<li>Stop the lightdm service and run the CUDA installation file (<strong>OpenGL libs should not be installed!</strong>):</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> <span class="nb">sudo </span>service lightdm stop
<span class="nb">sudo </span>bash CUDA_RUN_FILE <span class="nt">--no-opengl-libs</span></code></pre></figure>
<ol>
<li>
<p>During the installation you should <strong>ACCEPT</strong> the EULA conditions, say <strong>YES</strong> to installing the NVIDIA driver, <strong>YES</strong> to installing the CUDA Toolkit and Driver, and <strong>YES</strong> to installing the CUDA Samples. You should say <strong>NO</strong> when asked whether you want to rebuild any Xserver configurations.</p>
</li>
<li>
<p>After the installation is complete, check if device nodes are present (if nothing happens, it means everything is ok!):</p>
</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> <span class="nb">sudo </span>modprobe nvidia</code></pre></figure>
<ol>
<li>Set the environment path variables (these lines can be added to the end of your shell startup file, e.g. ~/.bashrc, so that there is no need to rerun these commands every time you restart your computer). Remember to change YOUR_CUDA_PATH to the right path, depending on your CUDA version (e.g. /usr/local/cuda-7.5):</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> <span class="nb">export </span><span class="nv">PATH</span><span class="o">=</span>YOUR_CUDA_PATH/bin:<span class="nv">$PATH</span>
<span class="nb">export </span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span>YOUR_CUDA_PATH/lib64:<span class="nv">$LD_LIBRARY_PATH</span></code></pre></figure>
<ol>
<li>Verify the NVIDIA driver version and the CUDA driver version:</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> <span class="nb">cat</span> /proc/driver/nvidia/version
nvcc <span class="nt">-V</span></code></pre></figure>
<ol>
<li>(Optional) Now, you can switch the lightdm back on. You should then be able to login through the GUI without any problems:</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> <span class="nb">sudo </span>service lightdm start</code></pre></figure>
<p><strong>To test the CUDA installation:</strong></p>
<ol>
<li>Build the CUDA Samples by entering their folder and running:</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> make</code></pre></figure>
<ol>
<li>Go to the release folder inside the CUDA Samples folder (e.g. NVIDIA_CUDA-7.5_Samples/bin/x86_64/linux/release/) and run two standard checks (the first to see your graphics card specs, the second to check that it is operating correctly). Both tests should output ‘PASS’:</li>
</ol>
<figure class="highlight"><pre><code class="language-sh" data-lang="sh"> ./deviceQuery
./bandwidthTest</code></pre></figure>
<p>Everything should be ok after you reboot.</p>
<p><em>Source: <a href="https://devtalk.nvidia.com/default/topic/878117/-solved-titan-x-for-cuda-7-5-login-loop-error-ubuntu-14-04-/">Titan X for CUDA 7.5 login-loop error</a></em></p>Gabriel B. Paranhos da Costa