Creating Machine Learning Models with CreateML

I have been following the bare outlines of building, and using, machine learning models in Apple’s software ecosystem for a while. Most of my learning and personal research has been with foundational technologies – following some of the frameworks (TensorFlow, PyTorch, SciKit-Learn) and some of the advances in models and their results. Until this holiday, I had not applied myself to seeing the latest evolution of Apple’s machine learning frameworks and tooling. I scanned through websites last summer during Apple’s WWDC, but didn’t really clue in to the changes that were coming, and that are now available, with Big Sur (macOS 11).

Aside: The web site Papers With Code has been a great consolidated reference point for digging into the academic research behind quite a variety of machine learning advancements over the past couple of years, and includes both the research papers (often linked to ArXiv) and frequently references to the code matching the papers.

First off, the tooling to create simple models with CreateML has received quite a boost. Matched by the documentation, there are a large number of new models that can be easily generated with you “just” providing the data. Models to classify sounds, motion data, and tabular data – including regressors (predictors) as well as classifiers. Prior to macOS 11, they had models for classifying and tagging words and sentences, as well as image classification. Those all still exist, and Apple provides some ready-to-use models built-in to frameworks such as Natural Language.

The goal I chose to explore with was focused on natural language processing – more on that later. Apple has long had a pretty good natural language library available, the earlier stuff being a bit more focused on Latent Semantic Mapping, but recently turning to – and leveraging – quite a bit of the natural language processing advances that have happened using more recent machine learning techniques.

The most interesting win that I immediately saw was the effectiveness of using transfer learning with CreateML. When you’re making a model (at least the word tagger model), you have the option of using a CRF (conditional random field) or applying the data with a transfer learning over existing models that Apple has in place. They don’t really tell you anything about how these models are built, or what goes into choosing the internals, but the bare results from a few simple experiments are positive, and quite obviously so.

I made sample models that mapped language dependency mapping from some publicly available datasets provided by Universal Dependencies. Dependency maps specify how the words relate to each other, and the part that I was specifically interested in was leveraging the ability to identify a few of those specific relationships: nsubj:pass and aux:pass. These are the tell-tales for sentences that use passive voice. I did multiple passes at making and training models, and transfer learning was clearly more effective (for my models) in terms of the reported accuracy and evaluation.

Training a Model

One of my training runs used approximately 5000 labeled data points (from About 4200 of those points I allocated for training, and another separate file with 800 data points for testing. In this case, a “data point” is a full or partial sentence, with each word mapped to an appropriate dependency relationship. The CRF model topped out at 81% accuracy, while the transfer learning model reached up to 89% accuracy.

Conditional Random Field Model: 11 iterations planned, converging “early” at 7 iterations.
Transfer Learning: Dynamic Embedding (25 iterations)

To put the accuracy in perspective, the latest “state of the art” parts-of-speech tagging models are running around 97 to 98% accuracy, and around 96% for dependency parsing.

Timing the Training

I ran these experiments on one of the new M1 MacBooks with 16GB of memory. This training set took just over a minute to train the CRF model, and 7 minutes to train the transfer learning model. A similar run with more data (27,000 data points) took 20 minutes for the CRF model, and 35 minutes for the transfer learning model. Transfer learning takes longer, but – in my cases – resulted in better accuracy for predictions.

Evaluating the Model

Once trained, CreateML provides an evaluation panel that gives you precision and recall values for each value that in your tagging set. This information is ideal for understanding how your data actually played out versus what you were trying to achieve. Using it, you can spot weak points and consider how to resolve them. One way might be gathering more exemplar data for those specific classifications. For example, in the following data table, you can see that the recall for “csubj:pass” was 0%. This showed that I simply didn’t have any examples in the test data to validate it, and possibly only a few samples in my training data. If that was a tag I was interested in making sure I could predict from input text – I could find and improve the input and testing data to improve that accuracy.

Previewing Predictions from your Model

Probably the most fun of using CreateML is the preview mode. Since I was making a word tagging model, it provided a text entry area and then applied the tagger against the data, showing me an example of the predictions against what-ever I typed.

Since I’d done two different models in the same project, I could flip back and forth between them to see how they fared against this sample, showing me the predictions from the model and the associated confidence values of those predictions. (and yes, I have considered locking my brother into a kitchen, he’s a damned fine chef!)

Exporting a CoreML model from CreateML

CreateML includes an Output tab that shows you the tags (also known as class labels) that you trained. It also gives you a preview of the metadata associated with the model, as well as kinds of inputs and outputs that the model supports. This makes it nicely clear on what you’ll need to send, and accept back, when you’re using the model in your own code.

One of the details that I particularly appreciated was including a metadata field explicitly for the license of the data. The data I used to create this model is public, but licensed to by-nc-sa-4.0 (An attribution, non-commercial, share-alike license). Sourcing data, both quality and licensed use, is a major element in making models. I think it’s super important to pass that kind of information along clearly and cleanly, so I’m glad it’s a default metadata attribute on models from CreateML.

Notes on CreateML’s Rough Edges

While CreateML was super easy to use and apply, it definitely has some issues. The first is that the user interface just doesn’t feel very “Mac” like – things I expected to be able to “click on and rename” didn’t smoothly operate as such (although I could control-click and choose rename), the window doesn’t easily or cleanly resize – so the app dominates your screen wether you want it to or not. On top of that, a queueing feature, and related window, for lining up a bunch of training was quite awkward to use and see the results as it was progressing, and after it was done. I had a couple training runs fail due to poor data, but the queuing window didn’t show any sort “well, that didn’t work” information – it just looked like it completed, but there was no training on those models.

The other awkward bit was dealing with messy data through the lens of CreateML. I can’t really say what it did was “wrong”, but it could have been so much better. In one of my experiments, the data I had chosen for training had issues within it: missing labels where the training system expected to see data. That’s cool – but the error was reported by a small bit of red text at the bottom of a screen saying “Entry # 5071 has a problem”. Finding entry #5071 of a structured JSON data set is, bluntly, a complete pain in the ass. When I’d parsed and assembled the data per the online documentation for making a word tagging model, I’d dumped the data into a monster JSON, single-line data structure with no line breaks. That made finding a specific element using a text editor really rough. In the end, I re-did my JSON export to include pretty-printed JSON, and then also used VSCode’s “json outline” functionality (scaled up, since it defaults to 5000 items), to track down the specific item by position in a list. I found the offending data, and then looking around, noticed a bunch of other areas where the same “partially tagged data” existed. In the end I dealt with it by filtered it out if it wasn’t fully tagged up. It would have been much nicer, especially since CreateML clearly already had and could access the data, if it could have shown me the samples that were an issue – and notified me that it wasn’t just one line, but that a number were screwed up.

The evaluation details after you train a model aren’t readily exportable. As far as I can tell, if you want to share that detail with someone else, you’re either stuck making a screenshot or transcribing what’s in the windows. It seems you can’t export the evaluation data into CSV or or an HTML table format, or really even copy text by entry.

The details about the training process, its training and validation accuracy, number of iterations used, are likewise locked into non-copyable values. In fact, most of the text fields feel sort of “UIKit” rather than “AppKit” – in that you can’t select and copy the details, only get an image with a screenshot. This is a “not very Mac-like” experience in my opinion. Hopefully that will get a bit of product-feature-love to encourage sharing and collaboration of the details around CreateML models.

I filed a few feedback notices with Apple for the truly egregious flaws, but I also expect they’re known, given how easy they were to spot with basic usage. They didn’t stop the app from being effective or useful, just unfortunate and kind of awkward against normal “Mac” expectations.

I do wish the details of what CreateML was doing behind the scenes was more transparent – a lot of what I’ve read about in prior research is starting with descriptions of ML models in PyTorch, Keras, or Tensorflow. If I wanted to use those, I’d need to re-create the models myself with the relevant frameworks, and then use the CoreML tools library to convert the trained model into a CoreML model that I could use. By their very nature, the models and details are available if you take that path. It’s hard to know, by comparison, how that compares to the models created with CreateML.

Creating a Model with CreateML Versus a Machine Learning Framework

My early take-away (I’m very definitely still learning) is that CreateML seems to offer a quick path to making models that are very direct, and one-stage only. If you want to make models that flow data through multiple transforms, combine multiple models, or provide more directed feedback to the models, you’ll need to step into the world of PyTorch, Keras, and Tensorflow to build and train your models. Then in the end convert the trained models back to CoreML models for use within Apple platform applications.

Where the raw frameworks expose (require you to define) all the details, they also inflict the “joy” of hyper-parameter tuning to train them effectively. That same tuning/choosing process is (I think) happening auto-magically when you build models with CreateML. CreateML chooses the iterations, while paying attention to convergence, iterations, and epochs of applying data. It also appears to do a good job of segmenting data for training, evaluation, and testing – all of which you’d need to wrangle yourself (and ideally not screw up) while making a raw machine learning model. That’s a pretty darned big win, even if it does end up being more of a “trust me, I know what I’m doing” rather than a “see, here’s what I did for you” kind of interaction.

Final Thoughts on Learning CreateML

I’m still very much in “the early days” of learning how to build and apply machine learning models, but CreateML has already been an immense help in learning both what’s possible, and providing hints as to how I might structure my own thinking about using and applying models within apps.

The first is simply that they provide a lot of good, basic models that are directly available to use and experiment with – assuming you can source the data, and that’s a big assumption. Data management is not an easy undertaking: including correctly managing the diversity of the sourcing, data licensing, and understanding the bias’ inherent within the data. But assuming you get all that, you can get good – mostly interactive – feedback from CreateML. It gives you a strong hint to answer the question “Will this concept work, or not?” pretty quickly.

The second is that showing you the outputs from the model makes the inputs you provide, and what you expect to get out, more clear. I think of it as providing a clear API for the model. Models expose a “here’s what the model thinks it’s likely to be – and how likely” kind of answer rather than a black-and-white answer with complete assurance. Not that the assurance is warranted with any other system, just that exposing the confidence is an important and significant thing.

If I were tackling a significant machine learning project, I’d definitely expect to need to include some data management tooling as a part of that project, either browser/web based or app based. It’s clear that viewing, managing, and diagnosing the data used to train machine learning models is critical.

I’m also looking forward to see what Apple releases in the future, especially considering the more complex ways machine learning is being used within Apple’s existing products. I imagine it to include more tooling, advances to existing tooling, and maybe some interesting visualization assistance to understand model efficacy. I would love for something related to data management tooling – although I suspect it’s nearly impossible to provide since everyone’s data is rather specific and quite different.

Published by heckj

Developer, author, and life-long student. Writes online at

%d bloggers like this: