Word clouds are a very attractive way of visually representing the relative importance of words. At a glance you can guess how important a word is relative to the others in the cloud. We will build two components:
- A text parser that will calculate the relative weight of each word in a given text (we’ll use some classical texts as source). The text parser will also expose a JSON REST interface.
- An AngularJS app that will get the JSON data and then pass it to timdream’s WordCloud component in a suitable manner.
The final result is like this:
You can find the complete source code here: https://github.com/MindFlavor/word_count_rust_angular2. Due to the nature of this blog we will discuss here the data management parts only. Our logic is pretty simple: count each occurrence of a word and use it to determine how big the word will appear in the cloud.
AngularJs 2 App
There are many open source JavaScript libraries that render word clouds, for this post we’ll use timdream’s one: http://timdream.org/wordcloud/. Timdream’s library comes complete with Typings which is helpful working with TypeScript. This time I will skip the steps required to startup a TypeScript project in Visual Studio Code. If you need more help please refer to my previous post: How to render SQL Server acyclic blocking graphs using Visual Studio Code, TypeScript, NodeJS and TreantJS – Part 1.
Service
We will start with the service. The service is responsible to retrieve the data (an HTTP GET) and to map it to the strongly typed class. The class for our word weight is a two-liner:
The service itself is:
@Injectable
attribute. It’s required so Angular knows that our service can be injected in our component. We are also requesting an instance of the Http
class in the constructor (via dependency injection). We can then use Http
member to get the JSON from our service.Using Promise
s, along with the TypeScript’s arrow operator, makes the code simple and easy to follow.
Component
The Angular component requests the service as prerequisite. It does it in the class decorator under the name providers
:
setText
method. Here, again, we exploit the Promise
in order to get rid of the async HTTP call:Now just inject the mock service instead of the real one and test the app.
WordCount service
The WordCount service should simply count the occurrences of every word in a given file. This is basically the textbook example of map/reduce. To spice the things up a bit we also want to:
- Exclude noise words (based on a dictionary).
- Consolidate similar words (singular/plural, synonyms, etc…).
We also want to create a multithreaded word count service. If we are CPU-bound this should help to speed up the computation on multithreaded machines. This kind of programs are – this example excluded – hard to code. Fortunately there are new languages built from the ground up to tackle the concurrency problems. One I particularly like is Rust (https://www.rust-lang.org). While targeted to systems programming Rust can be successfully used for other tasks too. In our case it forces us to think about ownership of shared resources, which is good for multithreaded applications (you can find more about it in this blog post https://blog.rust-lang.org/2015/04/10/Fearless-Concurrency.html).
Let’s recap the serial logic:
- Load a file. For each word:
- Check if is a noise word.
- Lookup the most relevant synonym (if any). If found use it instead of the original word.
- Increase the word’s count by one.
- Present the total per word ordered by occurrence count.
If we were able to apply the serial logic in parallel we could come up with:
- Load the file. For each word:
- Find an available thread and dispatch the word to it.
- Each thread will, for each word received:
- Check if it is a noise word.
- Lookup the most relevant synonym (if any). If found use it instead of the original word.
- Increase the word’s count by one.
- When all the words are dispatched and processed, present the total per word ordered by occurrences.
The data race
The parallel algorithm is good in theory but poses some immediate questions: “How do I dispatch a word to a thread?” or “How can a thread know when there are no more words to process?”.
The biggest problem, however, lies in the data race between threads. The step Increase the word’s count by one will be performed by multiple threads in parallel and, unless we mediate the access to the counter somehow, problems will arise. This is nasty because, in most languages, your code will compile and run even if you were modifying the same object from multiple threads. The result can be a segfault if you’re lucky or, way worse, miscalculations.
Bottom line you’re own your own. If you are a superstar coder you will prevent problems and end up with an efficient concurrent algorithm. But if you’re a normal person like me, chances are you will end up scratching your head at the pseudo random errors.
The good news is, with Rust, you cannot forget about such things. Rust will force you to prevent data races or your code won’t even compile.
Even the apparent innocuous concurrent access to a shared vector should be safe. With Rust it must be safe (thankfully). Let’s see how.
Concurrent access to a shared resource
Let’s try to share an immutable vector across threads. A first approach could be:
Here we declare a vector (v
) and then we spawn 3 threads that will print the vector contents concurrently. Seems legitimate? Well, it isn’t. Rust will tell us why:
work
(the spawning function). But if the spawning function (work
) terminates, what happens to the vector v
? Since the ownership of v
belongs to work
and not the threads, as soon as work
finishes the vector would be deallocated. And the threads would end up keeping a reference to dellocated memory. Bad.Transferring ownership
Rust’s new error system will suggest you to move (transfer) the ownership of the vector to the thread. It makes sense: if work
relinquishes the ownership of v
to the thread it will no longer deallocate v
. v
would get deallocated at the end of the thread (as it should).
Let’s try it (notice the move
keyword before the closure):
Will this work? No. Rust error is like this one:
The error this time is harder to interpret. The hint is in the note: v does not implement `Copy` trait
. Why Rust tries to copy our vector instead of simply transferring ownership? Because we are transferring the ownership of one vector to three threads. Since it’s impossible Rust tries to send three separate copies of our vector to our three threads. But since by default our structures cannot be copied it doesn’t work (luckily because it’s not something we wanted to do in the first place).
Go on and try to remove the for loop
and spawn a single thread instead of three. In that case the move will work.
Reference-counted pointers
Clearly moving the ownership of the vector to our threads cannot work without copying it for each thread. This solution is a waste of memory since we know our vector to be immutable. We should be able to access it concurrently without mutexes of any kind. But without copies we cannot move the vector and without move the thread will outlast the vector.
So, how do we solve the lifespan problem?
One answer is to use reference counted pointers. These wrappers will keep count of how many references are present at any time and only deallocate the inner object when nobody else is referencing it. It’s roughly what a GC-based language would do. Only here we get deterministic deallocation (and also the circular reference problem).
In our case it means the inner object will be referenced by the work
function and by each thread. So there are 4 references (1 for work
+ 1 for each thread). When work
terminates the reference count will simply drop to 3. And the vector won’t be deallocated (yet).
You can read more about it here: https://doc.rust-lang.org/std/sync/struct.Arc.html and here https://doc.rust-lang.org/book/concurrency.html.
That’s our final, working, code:
channels
.Channels
Channels are an elegant way to send messages between threads. Rust’s channels are special: they are also able to transfer ownership of entities between threads. Let’s see how with an example.
We have this super dumb function that returns a Vector:
join
method of the thread and retrieve the result:Putting all together
Now that we are able to share safely an immutable structure between threads and to send ownership using channels we can implement our concurrent word counter method:
This example of parallel map => parallel reduce => serial reduce is surprisingly fast and, last but not least, cross platform.
I’m sure better performance can be achieved further optimizing the code but I’ll leave it up to you :).
Happy coding,