Machine Learning (ML) and Natural Language Processing (NLP) have been grabbing a lot of headlines lately. I have been keeping up with these topics, but mostly on an abstract level, enough to pass a test on the subject. As a software developer, however, I felt it important to get some hands-on experience writing code. This will allow me to develop a deeper understanding of this topic and provide value to any future team working in this area.
With this in mind, I present the following curriculum that can help you, too, achieve this goal:
- Data science trail on RealPython.com (for brushing up on Python)
- Hugging Face NLP tutorial
- A Cloud Guru tutorial for TensorFlow
- Review the literature on NLP, especially “Attention Is All You Need”, which details the development of the Transformer architecture
The idea here is to take myself through a series of learning tasks that build on one another. After walking this path, I can say confidently that I have learned a number of practical lessons that I will actually use, not least of which is a better understanding of the philosophy of Python and some of the particulars of its use. The following sections are a narrative account of my experiences and observations.
At the beginning of my self-directed study, I was excited to start with some basic Python and Data Science with Python skills, as these are things that I may actually be able to put directly into practice at my daily job. On the first day, I started Real Python, got an idea of what’s available, and started on the basics. I went over my proposal for the project, started some notes, and acquired the subscription to RealPython.
The first lecture I chose to do was Reading and Writing CSV Files. I’ve done this before in Python in real life, so this was a way to warm up. Setting up my development environment was also part of this exercise. I decided against PyCharm for getting started because I felt this would be overkill for now. Instead, I spent some time getting familiar with IDLE and getting it to work on my machine, which involved some wrestling with pipenv. Pipenv and environment standardization in general have always been a weak point for Python. I feel like I do not even have to qualify that with an “in my opinion” because the experience when compared to my daily driver language, Java, is night and day. That said, having the time now to sort out the mechanics a bit better, I got IDLE working in a reasonable amount of time and once that was done I breezed through the CSV lecture.
Most of my Python experience up to this point had taken place either using IntelliJ and the command line or in LeetCode. I had not even really been aware of the fact that code completion exists in Python. In the ensuing days I found the level of code completion in Python to be less than amazing, but it’s still really helpful, so that was a bit of an eye opener. That said, I would quickly move past IDLE.
The first listed item in RealPython in the data science path is, as of this writing, the Using Jupyter Notebooks lecture. I’ve seen Jupyter Notebooks before and had mostly encountered them as hosted resources, but this course showed me how easy it is to run locally. After some fiddling around, I managed to install Jupyter Notebooks and run it in a project using pipenv, which I found satisfying because now it could be managed as a git repository.
I used Jupyter Notebooks as my development platform for the remainder of the two weeks and found it robust enough to be able to follow along with the code in all of the coming materials. There are some things I would consider essential to running Jupyter notebooks locally and in a git project:
- Jupyter Notebooks buffers the contents of a running notebook in the browser and does not immediately reflect any changes in the file system like some IDEs. This can lead to bad commits if you are not paying attention.
- Jupyter Notebooks saves the output of the cells to the notebook format as well, meaning that if you have a bunch of data on the screen, including any images, it will encode those items into a textual format and save them to the disk. This makes things awkward in git sometimes. Generally I clear the outputs before I commit.
- If running with pipenv, you will need to install dependencies and restart Jupyter Notebooks as you discover them. I developed a template script to pre-install some of the usual suspects, but I later found that pipenv is saving a brand new copy of every dependency for each pipenv project, so I’d say one needs to be judicious about how many projects to start this way.
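As a rough illustration of the "clear the outputs before committing" habit, here is a minimal sketch that strips outputs from a notebook file using only the standard library. It assumes the version 4 notebook JSON layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"); the function names are made up for this example and are not part of Jupyter.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []             # drop printed text, tables, images
            cell["execution_count"] = None   # reset the In[n] counter
    return cleaned


def clear_outputs_in_file(path: str) -> None:
    """Rewrite the .ipynb file at `path` with its outputs stripped."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(clear_outputs(notebook), f, indent=1)
        f.write("\n")
```

Save the notebook in the browser first, since the file on disk is what this reads and rewrites.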
Jupyter Notebooks is not a solution for production code either, of course. It is inherently a learning and demonstration tool, but I found it quite useful. The version I used has code completion even.
Also on the first day, I went through the Python vs Java: Object Oriented Programming lecture. This short lecture was very useful in making connections for me, and I felt like I understood some of the structure of Python better. I had only a vague notion of dunder methods before this, and I practiced some of the common ones in a LeetCode exercise later that evening: __hash__ and __eq__ for getting custom objects into a set in a meaningful way, and __lt__ for overriding the < operator so that heapq could order custom objects.
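To make the dunder methods concrete, here is a small sketch of the pattern. The Task class and its fields are hypothetical, invented for this example rather than taken from the lecture or the LeetCode exercise.

```python
import heapq


class Task:
    """Hypothetical example class: a named task with a numeric priority."""

    def __init__(self, name: str, priority: int):
        self.name = name
        self.priority = priority

    def __eq__(self, other):
        if not isinstance(other, Task):
            return NotImplemented
        return (self.name, self.priority) == (other.name, other.priority)

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash((self.name, self.priority))

    def __lt__(self, other):
        # heapq only needs "<" to order items.
        return self.priority < other.priority

    def __repr__(self):
        return f"Task({self.name!r}, {self.priority})"


# The two equal "build" tasks collapse into a single set entry.
unique = {Task("build", 2), Task("build", 2), Task("test", 1)}

# heapq pops the smallest item first, as defined by __lt__.
heap = list(unique)
heapq.heapify(heap)
order = [heapq.heappop(heap).name for _ in range(len(heap))]
```

Without __eq__ and __hash__, the set would hold both "build" tasks, because the default comparison is by object identity.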
After some of the tooling and basic mechanics were out of the way on the first day, it was time to start looking properly at the data science tools. Pandas was first on the list and I found the tool to be a little like using a normalized database except in memory in Python. The ability to get aggregate statistics, slice up, and display data in this framework is quite impressive and can often be the first step in ingesting data for use in an ML pipeline or for conventional statistical analysis. Thinking back to some of the scripts I have written for data processing in the past, sometimes I think I may have been replicating functionality that exists in Pandas. That said, Pandas is clearly a complicated tool with a learning curve, so perhaps sometimes the choice between reinventing the wheel and learning a tool is not as simple as it sounds for an infrequent task.
The spaCy Library
After some more basic mechanics courses like working with JSON, I moved on to the first NLP specific lecture: Natural Language Processing with spaCy in Python. The spaCy framework has an awkwardly formatted name. I do not claim to be familiar with the exact thinking behind this, but it seems pretty obvious that they are emphasizing the C and that this is likely because they want the developers that use it to know that they have not implemented it completely in Python.
I was discussing the progress of this course of study I had put myself on with my brother, who has experience adjacent to these topics. We discussed spaCy and he pointed out that Python would be a pretty poor language for vector math, which all of this machine learning activity ultimately boils down to. The spaCy framework’s website notes that it “is written from the ground up in Cython,” a.k.a. C-Extensions for Python. This is a pretty common pattern for Python, where the high-level tasks are arranged in Python code and the tasks that most require optimal execution are handed over to a lower-level language like C. Indeed the most common Python interpreter, CPython, is itself written in C, and its built-in and standard library functions often call into C internally in this same way. The Recursion in Python lecture demonstrated just how often this might happen without any external indication by implementing a factorial function in Python, timing it, then timing Python’s native factorial function, noting that the native function took orders of magnitude less time to execute. Contrasting this with some implementations of Java, like OpenJDK, from experience I’ve seen that a lot of the JRE is implemented in Java itself, but I would imagine this can differ by implementation.
The spaCy framework does provide quite an interesting starting point for NLP. The casual starting developer has no need to engage in any model training, but instead loads a language model by name when calling the load() function on the library. When the developer calls the object returned by load() on a piece of text, the framework populates a document object, which contains a set of baseline functionality: the document is broken down into sentences, which are composed of tokens. Each token has attributes relating to what it is and the relationship between the token and its context.
Another interesting feature of the spaCy framework I looked at is the sentence structure graphs. This subject deserves a more thorough explanation than I can give, but the framework provides a tree structure to represent the tokens it finds in a sentence in natural language. The nodes are tokens and the edges are relationships like ‘nsubj’ and ‘comp’, which are word features and relationships that have been identified by scholars in the field of linguistics and modeled in the framework (https://spacy.io/usage/linguistic-features). There are a lot of ways to build custom analysis and processing into spaCy, and I’ve only taken a quick look at the very most basic parts. It is clear that you can go pretty deep with just spaCy and the ability to programmatically navigate sentences seems to have tantalizing possibilities, but my survey of NLP technologies must continue.
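The comparison can be reproduced with a few lines. This is a sketch and not the lecture's code; it assumes the native function in question is math.factorial, which CPython implements in C. The input size and repetition count are arbitrary choices.

```python
import math
import timeit


def factorial_recursive(n: int) -> int:
    """Plain-Python recursive factorial, for comparison only."""
    return 1 if n <= 1 else n * factorial_recursive(n - 1)


n = 200  # well below CPython's default recursion limit
assert factorial_recursive(n) == math.factorial(n)

# Time 1,000 calls of each version; actual numbers vary by machine.
python_time = timeit.timeit(lambda: factorial_recursive(n), number=1_000)
native_time = timeit.timeit(lambda: math.factorial(n), number=1_000)
print(f"recursive: {python_time:.4f}s  math.factorial: {native_time:.4f}s")
```

Both calls look identical from the Python side, which is the point: nothing in the call site reveals that one of them runs as compiled C.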
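A minimal sketch of what this looks like in code, assuming spaCy is installed (pip install spacy) and the small English model has been downloaded (python -m spacy download en_core_web_sm). The function name and the choice of model are assumptions made for illustration.

```python
def describe_sentences(text: str, model_name: str = "en_core_web_sm"):
    """Return, per sentence, (token, part of speech, dependency label, head) tuples."""
    import spacy  # imported here so the sketch loads even without spaCy installed

    nlp = spacy.load(model_name)  # load a pretrained pipeline by name
    doc = nlp(text)               # calling the pipeline returns a Doc
    return [
        # token.dep_ labels the edge from the token to its head in the parse tree
        [(token.text, token.pos_, token.dep_, token.head.text) for token in sent]
        for sent in doc.sents
    ]
```

Calling describe_sentences("The cat chased the mouse. It escaped.") should yield one list per sentence. The exact labels depend on the model version, so none are shown here.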
The last thing I did on day 2 was to read a couple of articles on RealPython regarding gradient descent. I read one on an annotated implementation of stochastic gradient descent, which reminded me of how long it has been since I did any calculus. The basics were familiar enough: you need a loss function, and its gradient with respect to the model’s parameters gives you the ‘direction’ of better answers on your training data. Stochastic gradient descent achieves this roughly by shuffling the training data, computing the gradient of the loss on one sample or small batch at a time, and nudging the parameters a small step against that gradient, repeating the process over many epochs. At a super high level, this is the standard approach taken by most machine learning algorithms that utilize neural networks. Thankfully, there were no complex manual partial derivative problems in sight for me, since this has been implemented by the base layers of the existing tensor manipulation frameworks.
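To make the idea concrete, here is a small sketch of stochastic gradient descent fitting a straight line in plain Python. It is not the RealPython implementation. The toy data, learning rate, and epoch count are arbitrary values chosen for illustration, and the gradient is written out by hand only because the model is so simple.

```python
import random


def sgd_linear_fit(points, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)                # the "stochastic" part
        for x, y in data:                # one sample per update
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b
            grad_w, grad_b = error * x, error
            w -= learning_rate * grad_w  # step against the gradient
            b -= learning_rate * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1
points = [(x, 2.0 * x + 1.0) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = sgd_linear_fit(points)
```

On this toy data w and b end up close to 2 and 1. With real data the learning rate and the number of epochs are exactly the kind of knobs that get set by trial and error.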
Natural Language Processing
On the third day, I started the Hugging Face NLP course. The course seems not to be complete as of this writing, missing some parts from the advanced section, but two of the three missing seem related to applications that are secondary to my interests for this training. HuggingFace offers hosted Jupyter Notebooks as part of its course materials, but I decided to run locally. I cloned their main repository, which has the course materials on it and went forward from there. The only challenge involved there was downloading the multi-hundred megabyte model files, but the HuggingFace pipeline allowed me to do this through the notebooks, so it was really just a matter of waiting out the downloads.
In some ways HuggingFace acts as a dependency manager akin to what Maven is for Java, fetching models, tokenizers, and other components as needed by name in a project, although the analogy is not perfect. HuggingFace maintains the ‘transformers’ library, and it’s accessible like any other pip library. The organization and back end infrastructure backing this library are necessarily tightly controlled by HuggingFace, a for-profit organization, so some level of awareness is required there.
HuggingFace has set up this concept of a “pipeline” where you can reference a particular activity like “text-generation” by name, get a Python object already loaded up and ready, then start firing prediction requests at it. They have set up a very low barrier to entry for anyone that just wants to write some Python code to use a machine learning model. They do not even necessarily require you to know what tensor library you are using. HuggingFace seems to have broad support for PyTorch and TensorFlow, and you actually do need to know one or the other is there in order to run the code, but you do not need to directly engage with it to use a pipeline.
Although you might use HuggingFace just as a means to run ML models without looking under the hood, they have created the means to drill down to almost any level of the underlying abstraction. This is helpful, as I would discover later, as directly configuring things like this in TensorFlow, even with the help of Keras, can be daunting. One of the early examples was naming a checkpoint (a set of weights associated with a particular model architecture, already trained up to some point), then using the AutoTokenizer class to identify the specifications required to convert raw inputs into tokens in the specific format required by the model. Calling the given tokenizer dutifully returns a reference to a tensor that can be pushed into a training or prediction call.
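The two levels described above can be sketched side by side. This assumes the transformers library and PyTorch are installed and that the checkpoint can be downloaded on first use. The checkpoint name and the sentiment-analysis task are illustrative choices, and the function name is made up for this example.

```python
def classify(texts, checkpoint="distilbert-base-uncased-finetuned-sst-2-english"):
    """Run a ready-made pipeline, then perform the tokenizer step by hand."""
    # Imported here so the sketch loads even without transformers installed.
    from transformers import AutoTokenizer, pipeline

    # High level: name a task, get a callable object, send it text.
    classifier = pipeline("sentiment-analysis", model=checkpoint)
    predictions = classifier(texts)

    # One level down: the tokenizer that matches the checkpoint turns raw
    # strings into padded tensors in the format the model expects.
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return predictions, encoded["input_ids"].shape
```

Passing return_tensors="pt" asks for PyTorch tensors; "tf" would request TensorFlow tensors instead, which is about as much as the pipeline user needs to know about the underlying tensor library.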
A lot of the first chapter was focused on general information, description of transformers and NLP, and some of the well defined NLP tasks that exist and are covered by HF resources. HF also often emphasizes the social and environmental aspects of machine learning. For example, they point out that the models are known to have biases, but the discussion does not go much deeper than that. They point out that training a large model from scratch consumes a lot of energy, and for this problem they suggest leveraging transfer learning so as to reuse resources already paid for. It is interesting because clearly this is not an ethics course, but it is good that they point out issues exist at least.
After some of the basic hello world style initiation from HF, day 4 was about getting down to some specifics on how to use transformers, with lots of low-level detail on how different NLP tasks work. I took the quiz questions seriously, and they often left me with questions of my own, which served as a prompt to go back to the material again. Occasionally there would be a question that seemed to cover material that appears later in the course than the question itself, particularly one about specific tokenizer strategies that appeared way before chapter 6 where those were explained, but these problems were rare enough that the questions seemed valuable.
A lot of this time was spent taking notes, executing the given code, and playing around with that code. I was particularly interested to go over the rationale for different tokenizer strategies, as they are traditional algorithms built in a manner I am most familiar with.
“Attention Is All You Need”
The days of the simple two-layer neural network are long gone, and there have been a few major iterations on the main components of the neural network architectures commonly used by the developers of machine learning systems. The 2017 paper Attention Is All You Need marked the emergence of one of the latest of those components, introducing the concept of the Transformer. The authors of the paper compare the performance of their work to that of other known model architectures, including convolutional and recurrent neural networks, the theretofore leading edge of model architectures for many purposes. Transformers operate using a concept referred to as ‘attention’ where the model learns the strength of the relationship between each token in the input and every other token in the input. This information can help subsequent layers train on more specific information with regard to the meaning of tokens in context instead of individually. Understanding the concepts in this paper fully is quite an undertaking and I can honestly say that I have not nearly reached that point.
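The core calculation is small enough to sketch in plain Python. This shows only scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, on made-up vectors. In a real Transformer the queries, keys, and values are learned linear projections of the token embeddings, and the operation is repeated across multiple heads, both of which are left out here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        row = softmax(scores)  # how strongly this token attends to each token
        weights.append(row)
        outputs.append([
            sum(w * v[j] for w, v in zip(row, values))
            for j in range(len(values[0]))
        ])
    return outputs, weights


# Three toy 2-dimensional token vectors (made-up numbers), used as Q, K and V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of weights sums to 1 and says how much every token contributes to the new representation of one token, which is the "strength of the relationship" described above.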
In the machine learning context, especially for NLP, transformers seem to be considered the architecture of choice, replacing recurrent neural networks and convolutional neural networks (again, for NLP tasks, one should probably point out). The case for Transformers seems mostly to do with algorithmic efficiency, not necessarily capability, given sufficient resources, but the authors also put forward claims that reducing “maximum path length” (i.e. having a flatter structure resulting in less information loss) has an impact on capability. I should point out that multi-head attention and the manner of encoding positional information also are widely held to be contributions made by this paper.
Computing power is, however, one of the primary limiters of machine learning performance, rivaled perhaps only by the accessibility of training data (https://ide.mit.edu/wp-content/uploads/2020/09/RBN.Thompson.pdf), so improvements even in algorithmic efficiency can effectively translate to more powerful models for the same dollar spent training. The authors of “Attention is All You Need” benchmarked their Transformer against the existing RNN and CNN model architectures and claim to have achieved something along the lines of an order of magnitude less computation required for comparable performance on the benchmark criteria (BLEU in this case, a metric for scoring machine translation output).
For the most part, the HuggingFace course kept the level of abstraction above the different parts within any given model architecture. They covered the different functionalities of the plethora of architectures they support, but the focus mainly seems to be on writing other pipeline components like tokenizers, training and tuning models on established architectures, and applying known NLP workflows. To be fair, there is plenty of complexity and range for different use cases there, but it is not the lowest level of abstraction.
The primary purpose of frameworks at the level of TensorFlow is to provide the tools needed to break down and distribute the computation tasks associated with crunching tensor calculations on specialized hardware. It can optionally run on a CPU in your local development environment, but this is not the use case the software was built for. The framework uses directed acyclic graphs to model the steps undertaken in performing the training activity so that calculations whose inputs do not depend on one another can happen in parallel.
Keras on the other hand is a part of TensorFlow that exists to facilitate ease of use, containing utilities for loading data, building model architectures, accessing common tokenization strategies, and such things. Keras is like HuggingFace in some ways and was the primary mode of interacting with TensorFlow that I used during the ACloudGuru course. Much of my learning in this course was by repetition of steps and code explained by the lecturer. Taking in data, ‘wrangling’ it into tensors, creating test and training datasets, arranging model layers, and decoding results were all steps performed in different iterations throughout. I learned a lot, but there are still many details I need to follow up on.
Machine learning processes do seem familiar in some respects to the kind of work I am accustomed to doing: writing code to extract and transform data is quite familiar, as are the troubleshooting sessions I occasionally found myself in, trying to debug a particular piece of code. I found Python to be quite inexplicable at times when things would fail, but perhaps that is a matter of familiarity as well. There are also factors in machine learning that I am not accustomed to, like how squishy the prediction generation process seems to be: if I write a piece of regular enterprise software functionality, I can show you it works because I can write a test for it or I can deploy it and let you use it. Proving a machine learning model works seems like a much trickier proposition to me, as it is inherently probabilistic. Also, while many of the mechanisms used in the process seem well thought out, I cannot escape the sense that a lot of it is hand waving: why exactly do we pick one tokenization strategy over another? Why is it that we are comfortable with the idea of stochastic gradient descent when the parameters for how far it looks and for how long are just our best guesses in any given situation? The same thing goes for how many layers, how many neurons, etc., all established by experimentation and intuition. Much of this process seems untamed to me compared to traditional software development.
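The scheduling idea can be illustrated without TensorFlow at all. The sketch below uses the standard library's graphlib module (Python 3.9+) to group the nodes of a small dependency graph into stages whose members do not depend on one another. The graph and its node names are hypothetical, and this is not how TensorFlow itself is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parallel_stages(dependencies):
    """Group DAG nodes into stages; nodes within a stage are independent."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()                # raises CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorter.get_ready()  # every node whose inputs are all done
        stages.append(sorted(ready))
        sorter.done(*ready)
    return stages


# Hypothetical computation: result = (x @ w1) + (x @ w2)
# Each key maps a node to the set of nodes it depends on.
graph = {
    "matmul_1": {"x", "w1"},
    "matmul_2": {"x", "w2"},
    "add": {"matmul_1", "matmul_2"},
}
stages = parallel_stages(graph)
```

Here the two matrix multiplications land in the same stage because neither needs the other's result, so a framework is free to run them at the same time on separate hardware.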
If your organization has an idea that could use ML or NLP, but you are not sure how to get started, contact us for a free consultation.