Synaptic Knowledge – Making sense of Twitter

Twitter is a platform of over 340 million users, producing over 500 million tweets each day. Even just an insight into 1% of those tweets has the potential to provide a decent understanding into what’s happening in the world. If something is in the public domain, it’s on Twitter.

This post explores a technique to digest tweets down into a data structure that allows for user interaction, breaking story identification or even brand sentiment analysis.

The process begins by processing data from Twitter – for-which there are a number of approaches.

Data Processing

Data can be processed in many ways – two common to analytical processing are Batch & Stream Processing. At a high-level, the distinction is that with Batch Processing, the dataset for processing exists before processing begins. With Stream Processing, the dataset is not known ahead of time but instead arrives ‘bit-by-bit’.

Batch Processing

Batch Processing is the most common technique deployed for analytical workloads – perhaps each evening you want to take the days sales from your store and identify trending products, or perhaps you want to analyse the output of a collection of sensors following a rocket test to understand mechanical stresses that are felt across the vehicle. Batch Processing takes a defined amount of data as input at a specific time (t) and performs a series of actions upon it to create an output after-which processing ends.

However, sometimes we don’t want to wait for the entirety of the dataset to be available before we start processing it. Perhaps it’s not possible to have the entire dataset available prior to processing as the data does not yet exist. Regardless, the questions we ask of our batch datasets we could also ask to a more realtime flow of data – this is achieved through Stream Processing.

Stream Processing

Whilst not necessarily a new approach to data processing, Stream Processing is the processing of data whereby the dataset is not a known quantity. There could be 5 pieces of data to analyse or 5,000,000; 3 pieces of data could arrive each second most of the time, and at other times 5,000 pieces of data could arrive each second. Stream Processing allows us to process data as and when it arrives in a realtime manner.

In the rocket example, we analyse the sensor data as values are produced, not once the test has finished and all sensor outputs collated. This allows us to make decisions during the test as opposed to afterwards which could be useful if we’re looking to avoid an unplanned disassembly!

With Batch Processing, as the dataset is known ahead of time, the input can be split and assigned to compute resources ahead of execution – an execution plan can be created (if interested, read into MapReduce). With Stream Processing, we’re a lot more reactive and as such these architectures can often seem more complicated. However, as you should see in this post, that isn’t always the case and shouldn’t put you off.

Given we want to process tweets in realtime, it seems we need to implement a form of Stream Processing to meet our requirements.

Twitter Streaming

It turns out Twitter have an API to stream a 1% sample of tweets – the question then is given this information, how do we make sense of it?

Extracting Knowledge

I wanted to focus on two core elements when processing tweets – relevance meaning exposing words that ‘mean something’, and confidence meaning how relevant words come together to confidently outline a story.

I’m no linguistic expert, but let’s work through an example:

Fantastic goal from Mane this evening

From this tweet, the words ‘fantastic’, ‘goal’, ‘mane’ and ‘evening’ are relevant to understanding what’s happening. The words ‘from’ and ‘this’ whilst meaningful, are arguabley not as useful for my use case, they’re typically known as stop words. Furthermore, these extremely common words will be found in many tweets not relating to a goal scored by Mane, so it’s probably best we discount them in our analytics to avoid noise.

Secondly, confidence. If one person tweets that Mane has scored are goal, are we confident that he has? Probably not. If 50 people tweet Mane has scored a goal, I’d argue it’s likely that he has. This is the approach I have taken. There are obviously other techniques such as trusting some Twitter accounts more than others – much like how backlinks work within search engine indexing algorithms.

Correctness is also worth a mention, particularly in todays world. It’s not something I’ve tried to guard against in this piece of work as my primary goal is not to present correct information, just information that’s ‘trending’ on Twitter (at a level of detail that does not rely on hashtags).

Once we’ve received data from Twitter, we’re going to need a data structure to support our use case so we can programmatically record relevance and confidence.

Synaptic Graph

Again, that common data structure, the Graph, provides the mechanism to store the analysis. A visual example can be seen below.

The boldness of the words depicts how often the word is mentioned in tweets and the lines indicate an association between words that meets the given confidence criteria.

You will be able to see some stories in the above graph, but let’s look at some examples in further detail.

Examples

Unfortunately, the week of testing was not a particularly great one for the world and so I apologise for using such sensitive events in my analysis.

Nice, France Attack

The recent events in Nice, France appeared in the analysis. Initially ‘nice’ and ‘attack’ became apparent on the graph, swiftly followed by more details of what was happening on the ground as people began to tweet.

You can see from the boldness of the text that we’re pretty confident there’s been an attack in Nice, France and that the Police are involved. Details are emerging that it could be terrorist related and that the police are associated with a shot. However this exemplifies an issue with this data structure – it appears the police have shot someone dead.

It may be that early tweets were suggesting that the police had shot someone dead and the correctness issues outlined earlier becomes apparent. Or perhaps the tweets just contain information about the police attending an incident where people had died and the police had fired shots. The graph records useful, relevant information, but it isn’t a source of truth.

US Election

As you would expect, the US election is accounting for a large quantity of tweets at present.

Labour Party

The recent EHRC report into the Labour Party reported that Jeremy Corbyn was suspended from the party.

Depending on the tweets provided within the 1% sample stream, you can end up with separate graphs which whilst related in the real world, have not yet been connected through analysis. This can be seen opposite. As processing continued, a connection was formed between these two graphs.

Conclusion

I mentioned at the start that Stream Processing doesn’t have to be complicated – this proof of concept used client side Javascript and Google Chrome to open up a persistent HTTP connection, processing 70 tweets per second. If you’re trying to solve a data analytics problem, don’t feel it’s out of reach and you need to stand up an Hadoop cluster. Start small and you’ll be surprised at how much you can prove and achieve.

For me, this project will continue and I’ll report back on future versions. My efforts will focus on weeding the graph of old news over time, refining the deletion of stop words and perhaps overhauling the UI altogether. If you have any ideas, please let me know.

Creating a LISP-like Interpreter (Introduction to ANTLR)

Whether trying to understand natural language processing or the intent of some software written in a given programming language, understand language syntax is a key Software Engineering challenge. Understanding the structure of a language is achieved through an understanding of the language grammar. But what is a grammar?

the whole system and structure of a language or of languages in general, usually taken as consisting of syntax and morphology (including inflections) and sometimes also phonology and semantics.
Google Dictionary

The Parse Lisp Expression LeetCode problem touches on this through a challenge that requires you to parse an expression that conforms to a given syntax and execute it – this is commonly known as interpretation. JavaScript is an example of an interpreted language, but even then, that’s often debatable. If we’re not executing machine instructions directly on a processor that correspond specifically to the given input, we must figure out what the intention of a certain input is, and then use instructions that are available for execution on the processor (from the host environment) to execute the intent; due to this additional interpretation step, interpreted code is therefore inherently slower than ‘native’ compiled code.

This post first explains the solution, but then touches on a grammar parsing tool, ANTLR, which offers a more robust, logical approach to language parsing. ANTLR was not used in the solution to the LeetCode problem as external libraries cannot be used in solutions.

Approaches

In solving this problem, I considered two approaches:

regular expressions, and;
iteration and / or recursion.

I intended to solve this problem using JavaScript; that constraint influenced the implementation approach. Specifically, the parsing of brackets within a statement was not possible with JavaScript’s regular expression handling ‘engine’.

(let x 2 (add (let x 3 (let x 4 x)) x))

For example, in the above statement, JavaScript regular expressions (which themselves, are interpreted) are limited in their ability to support look-aheads and look-behinds, meaning that the nested brackets in this statement cannot be understood correctly.

I therefore had to use an iterative / recursive solution.

Solution

We know from the problem description that the statements fit into the following syntax; it has the following grammar:

add <expr1> <expr2>
mult <expr1> <expr2>
let (<v1> <expr1>)* <returnExpr>

The approach to solving this problem is to read words (tokens) from left to right, executing as and when required. Note how brackets are parsed – they introduce the concept of a context which is explained below.

(let x 2 (add (let x 3 (let x 4 x)) x))

Parsing of the above statement will:

Recurse the expression (let x 2 (add (let x 3 (let x 4 x)) x))
1. Assign x = 2
2. Recurse the expression (add (let x 3 (let x 4 x)) x)
  1. Recurse the expression (let x 3 (let x 4 x))
    1. Assign x = 3
    2. Recurse the expression (let x 4 x)
      1. Assign x = 4
      2. Return 4
    3. Return 4
  2. Return 4 + 2
3. Return 6

The important thing to ensure in the implementation is that when the context changes (i.e. you encounter a left bracket), the state of any variables within preceding let expressions are assigned. For example, before executing the add statement in the above, x is first assigned the value of 2. It’s also important that the current context is remembered such that upon returning to the source context, it can be restored.

Improving the Solution

There are a number of areas wherein the solution could be improved; it was by no means the fastest JavaScript solution on LeetCode (only outperforming approximately 80% of other JavaScript solutions in terms of time complexity). There are two primary areas where performance could be improved:

Assign variables more efficiently when executing the interpreted statement – as explained above, when executing an expression within a let expression, the ‘context’ must be set such that any variables used within the sub-expression are available. The inefficiency in the solution is that all variables are assigned (and sometimes multiple times) every time a sub-expression is reached, even if they haven’t changed.
When executing a sub-expression, the state of any variables must be sent to that sub-expression as the context, however, upon returning to the initial expression, the context must be ‘reset’. As JavaScript passes objects by reference, we cannot pass the same object around as assigning a value to an existing property would overwrite the original value, or context. JavaScript does not have a straightforward way of cloning an object (especially deep cloning an object) – the approach taken was to convert the object to a string and then back into an object… obviously inefficient.

But what if we’re working with a complex language – something whereby we need a robust approach to parsing the grammar. This is where ANTLR can be used.

ANTLR Grammar Parsing

What is ANTLR?

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files. It’s widely used to build languages, tools, and frameworks. From a grammar, ANTLR generates a parser that can build and walk parse trees.
antlr.org

In the context of this problem, ANTLR is used to parse the input statement into a structure that software can easier interact with than just a string of characters; the structure is a graph known as a Parse Tree. Graphs form the solution to a number of problems on my blog; understanding graphs is vital for any Software Engineer.

There are three stages to execute an interpreted statement, they are:

Parse the syntax;
Validate the semantics, and;
Execute the statement.

Parsing the syntax turns the ‘program’ into a navigable set of ‘tokens’ in the form of a parse tree. This enables the semantics of the statement to be validated (i.e. are we adding to a variable that does not exist). Given both the syntax and semantics validate successfully, the parse tree can be executed.

The parse tree for the example statement above resembles the following:

You can see from this tree structure, the walking algorithm follows a depth first approach, flowing from the left most leaf node to the right.

So how does ANTLR generate this tree structure? The answer is through a grammar definition, executed against some input program – the grammar to parse the LISP expression syntax as outlined in the LeetCode problem description can be found below.

grammar Expr;		
prog:	expr;
expr: '(' 'let' ' ' (VAR ' ' (expr|VAR|INT))+ ' ' (expr|VAR|INT) ')'
	| '(' ('add'|'mult') ' ' (expr|VAR|INT) ' ' (expr|VAR|INT) ')'
        ;
VAR: [a-zA-Z0-9]+ ;
INT: [0-9]+ ;

The grammar is relatively straightforward to understand – it follows a similar syntax itself to regular expressions but is known as Backus–Naur form. The program is made up of an expr, for which there are structurally two types of expression, a let and an arithmetic expression (add or mult).

Interestingly, in order to parse the grammar file provided, ANTLR itself will utilise a defined grammar to parse the input grammar.

You can therefore see how defining a grammar with a tool such as ANTLR provides for a much more robust environment when compared with the initial solution explained earlier in this post.

Conclusion

Often, your implementation won’t be impacted by the chosen programming language. Certainly, whilst the computational complexity of a solution may be the same in two ‘languages’, the time complexity may differ wildly; something you may want to consider given your non-functional requirements. Solving this problem did however introduce a language / run-time limitation in JavaScript’s support for regular expressions which is not as advanced as say Java’s.

Finally, learn. The complexity (effort, cost, computational, etc.) of a solution should always be proportional to the problem – but that’s not to say you shouldn’t be curious as to how else a problem can be solved. Understanding how tools such as ANTLR work give you, a Software Engineer, another tool in your kit – opening up alternative ways to solving problems in future.

LeetCode – Max Points on a Line

Introduction

Max Points on a Line is the third least accepted (against submitted solutions) problem on LeetCode; it is also the second hardest problem in the ‘hard’ category. At first glance, the problem seems simple – from a set of points, determine the longest straight line that can be formed from those points and return the number of points in the subset. However, as the saying goes, if you assume it makes an ass out of u and me.

This is my write-up detailing the approach I took to solving the problem and the issues I encountered along the way. Lessons learnt can always be taken from one context and applied to another – hopefully you find this useful.

Approaches

I explored two options when it comes to solving this problem, both explained below. An ‘agricultural’ approach which I quickly abandoned, and a ‘formulaic’ approach that did work but encountered some interesting issues.

Whilst I won’t be posting my solution in this article, you can find my implementation on GitHub (linked in the Social Banner above).

Agricultural

Having forgotten some high school mathematics – my brain instantly went trying to understand how an algorithm based on scanning and or recursion / iteration could solve the problem. Could I determine the ‘step’ between two points and then for each remaining point, see if I continue the step pattern whether or not the point in question would be reached.

The algorithm quickly became complex – you would obviously need to check the step in both positive and negative directions (reducing performance), you may want to start finding the point in the line that would be closest to the point your analysing to slightly improve the performance, etc.

Good problem solving doesn’t mean coming to the correct solution first time – it’s also recognising when an approach isn’t working and looking for another way. y = mx + b.

Formulaic

The straight line equation in its well known form will determine y for a points associated x where the slope (m) and y-intercept (b) of the line are known.

My initial approach was to take two points in an existing line (any two points can be the start of a line) and determine their slope m = ( (y1 – y2) / (x1 – x2) ). The remaining variable to determine was the y-intercept which with the formula rearranged stated b = y – mx. I knew the slope from the previous calculation and I know x and y from an existing point so I could determine b, the y-intercept. The final step was to determine if for the given x of the point being analysed, given the y-intercept and the defined slope, did the resulting y == y of the point being analysed.

The approach worked – I went from passing 50% of the test cases using the agricultural approach to passing approx. 90%. But what was causing the final 10% to fail? Precision.

In day-to-day life as a Software Engineer, precision doesn’t usually impact you unless you’re working on financial systems, sensor based system, and other similarly complex, numerically focussed applications.

My solution was based on JavaScript which by default represents all floating point numbers as 64-bit floats. This means a number with a fractional component can be represented up to 17 decimal places, and even then may not be accurate. Both of these limitations caused issues.

The following test case enabled me to identify the issue:

[[0,0],[94911151,94911150],[94911152,94911151]]

Whilst the final two points are very close to each other, when on a slope with 0,0, all three do not form a straight line. If we try to determine the slope of the line containing the first two points, we would calculate -94911150/-94911151 – the answer according to the JavaScript engine is 0.9999999894638303. To determine the slope between the first and third point, we would do -94911151/-94911152 – the answer according to the JavaScript engine is 0.9999999894638303. Brilliant! All three points are on a straight line! But they’re not…

Use a high precision calculator and you’ll see that -94911150/-94911151 = 0.999999989463830230022181482131641201991112719726684170124541003617161907 and -94911151/-94911152 = 0.999999989463830341033053734296682016882483946670460811601991723796588202.

You can see around digits 16/17, the answer begins to vary… the slope is not the same. However, due to JavaScripts (although this problem isn’t limited to native JavaScript) precision limit and accuracy issues (i.e. rounding) mean that for this problem, I am unable to correctly to determine the slope. So, how do you solve the problem?

Initially I knew I had a precision error, but I wasn’t sure whether it was determining the slope, y-intercept, and y value for the cross-comparison. My first thought was do I need to determine the y-intercept and the y value? Surely if two points are on the same slope as two other points on the line, we’re good. Having removed the other elements, I noticed the precision issue above.

My immediate thought was that by using 0,0 as a comparison point, we were working with an answer for the slope that was extremely precise. If another point was used, the result may not require the same precision. If points 2 and 3 form the input were used to determine the slope, the result would be 1 (-1/-1). As 1 != 0.9999999894638303 – we pass the test case. This solution worked for all test cases and the ultimate acceptance of the solution. However, it’s obvious to see for test cases that result in a similar precision no matter what points are selected, the solution would encounter problems.

Analysis

LeetCode provides some great solution analytics – at the time of solving the problem, my solution execution time was 79% faster than other JavaScript solutions. My assumption is that that these solutions may have taken more agricultural approaches.

There are a few key takeaways from working through this solution:

All programming languages have limitations (of which some can be solved with third-party libraries – in this case, several BigNumber libraries). Whilst they may not impact you day-to-day, if you’re encountering ‘strange’ issues, they’re good to have in the back of your mind. Worse, often the runtime won’t throw any errors to tell you it has automatically rounded a result or is not accurate past a certain point.
Formal Specification techniques are used in critical systems to at runtime ensure code is behaving according to some mathematical constraints. Whilst formal specifications may not be necessary for all projects, there’s nothing wrong with having code to check the validity of results. For example, if adding two decimals both with two digits following the decimal point, the result won’t have more than two digits following the decimal point. If it does, you know something has gone wrong.
Understand what the minimum required steps are to solve the problem – my initial formulaic approach trying to understand the y value was adding complexity to the solution and making it harder to debug. I only needed to calculate the slope to solve this problem.
Another bug I encountered was due to not fully understanding the specification – my assumption (the second time I assumed wrong in solving this problem!) was that the points had to be consecutive according to the step function. This wasn’t the case and was not clearly stipulated in the problem description.

Conslusion

This problem was great fun to solve. Day-to-day in the workplace, my clients are using software in ways were these sorts of problems do not manifest themselves. But they exists – it’s important as a Software Engineer to be aware of them and to challenge yourself in ways that you may not be challenged in your day job (not that it lacks its own unique set of challenges!).

Apart from the usual cliches such as testing is only as good as the test cases and not fully understanding requirements is the downfall of many IT projects, I will take two ideas from this exercise:

Testing can be built into your code – if the result of performing an action is that n number of things should be in a list (without knowing the contents), add some code to check that n things are in the list. If not, an issue must have occurred (either data or code)
Understand what the minimum required steps are to solve the problem – initially you may just want to solve a problem, with the result not being the most elegant solution. Even if you don’t encounter issues, refactoring is important. It results in less instructions, potentially leading to less bugs in future; for those bugs that do occur, the solution will be easier to debug.