Signs, Sigils, and Syntax

Why do we call programming languages “languages”? What the missing literary traditions of programming lacked was detail about syntax and style. As a consequence, I think some significant linguistic questions were left hanging, unresolved.

Aside from semantics, which I will not discuss here, my view is that there are two polar concepts that form the core of what a language is:

Syntax: In a general sense, syntax refers to the rules that govern the structure of statements in a language. Statements are formed according to a grammar. Lexical categories (parts of speech such as verbs and nouns) are the elementary units of syntax, which is usually described as a set of rules for forming phrase structures out of these parts. Syntax is pure structure: Chomsky famously demonstrated the gap between syntax and semantics by showing that a sentence such as “Colorless green ideas sleep furiously” could be grammatically correct while having no meaning. It's no coincidence that this formalization of grammatical categories occurred at the same time as the emergence of the first high-level programming languages.
Metaphor: In many Western schools, young people are taught that language expresses literal meaning in statements, and that there exist particular 'poetic devices', similes and metaphors, which convey more abstract and intuitive meaning through comparison. But when we look more directly at the way language operates, we find that the use of metaphor is so deeply embedded that even supposedly literal meaning hinges upon it. Our thought process has become so 'languaged', as it were, that we tend not to notice the way we layer meaning by reference to other things. Metaphors are the recursive building blocks of language, and can usually be traced to physical origins, perhaps because language evolved to reproduce communication derived from sensory experience. Metaphors can be placed along a continuum that runs from concrete to abstract: it seems that almost all adjectives in the English language can be traced to an earlier, more concrete usage. More than just a useful device for expressing comparison, metaphor turns out to be a primary aspect of linguistic evolution, because the root of all meaning is our basic experience of physical forces in the world.

Where programs deviate from written texts is in the way that they express meaning. The semantics of a program relate to how it runs on a machine, how it operates through time (sequential instructions) and space (memory). Syntactically, the lexical placeholders in a program can be exchanged and rearranged in any way that is grammatically correct, but most significant developments in the history of programming languages have moved away from direct contact with the mechanical foundations of computer hardware, towards more abstract and expressive models that reflect human thought and communication. While it's old hat to say "the machine doesn't care" and will accept any syntactically valid program, in actuality, nobody but the most pathologically incompetent or misanthropic programmer really thinks that way. Many programmers believe that good code is that which concisely and effectively expresses intent.

The following two blocks of pseudo-code have identical syntax and might do exactly the same thing in their technical operation. Which of these is more expressive? Which communicates more information about what it does?

if (SIGNAL_1123 equals STATE_AA) then EXEC_CMD86(OBJECT_X334)

if (TRAFFIC_LIGHT_SIGNAL equals RED) then STOP(CAR)

Given the same underlying definitions for the names, both statements would compile to an identical program.
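As a hedged illustration in Ruby, the two handlers below have the same structure and return the same results; only the names differ. Every constant and method name here is hypothetical, chosen to mirror the pseudo-code rather than any real system.

```ruby
# Hypothetical names throughout, chosen to mirror the pseudo-code above.
STATE_AA = :red   # opaque name for the state being tested
RED = :red        # descriptive name for the same state

# Opaque version: structurally valid, but the names tell the reader nothing.
def exec_cmd86(object_x334)
  "#{object_x334} stopped"
end

def handle_1123(signal_1123, object_x334)
  exec_cmd86(object_x334) if signal_1123 == STATE_AA
end

# Descriptive version: same structure, but the intent is readable.
def stop(car)
  "#{car} stopped"
end

def handle_traffic_light(traffic_light_signal, car)
  stop(car) if traffic_light_signal == RED
end
```

Calling either handler with :red and the same second argument returns the same string, and both return nil for any other signal. The interpreter has no preference between them; the difference exists only for the reader.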

So code is written for people to understand as much as it is written for machines, and we cannot ignore that this process of understanding involves the aesthetics of how a particular piece of code sends signals to the mind of a reader. Syntax in programming languages is about more than just conforming to a phrase structure – it operates visually, providing information from its overall shape and flow.

The visual nature of syntax is interesting because it has had a significant influence on the design of programming languages. Perl and Ruby, in particular, borrow the concept of sigils, symbols that carry meaning in their own right, as in runic and hieroglyphic writing. Both use these lexical markers to economize the communication of information about a variable: in Perl, the sigil marks the kind of value (scalar, array, or hash), while in Ruby, variables can be prefixed by a sigil that defines their scope:

@obj    # instance scope
@@obj   # class scope
$obj    # global scope
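A minimal sketch of the three sigils at work may help. The Lamp class and all of its names are hypothetical, invented only to show how each prefix changes where a variable lives.

```ruby
# Hypothetical Lamp class, invented purely to show the three sigils in use.
$lamps_created = 0            # global scope: visible everywhere in the program

class Lamp
  @@default_state = :off      # class scope: shared by every Lamp instance

  attr_reader :state          # exposes the instance variable @state for reading

  def initialize
    @state = @@default_state  # instance scope: belongs to this object alone
    $lamps_created += 1
  end

  def switch_on
    @state = :on
  end

  def self.default_state
    @@default_state
  end
end

first = Lamp.new
second = Lamp.new
first.switch_on               # changes only first's @state
```

After this runs, first.state is :on while second.state is still :off, Lamp.default_state is unchanged, and $lamps_created has counted both objects. A reader can tell the scope of each variable from its first character, without looking up a declaration.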

Sigils are one of the reasons why Perl has often been described as “executable line noise”. It’s not immediately obvious that this comment has as much to do with typography as it does with syntax. Certain lexical choices are embedded in Perl's design specifically to express programs using a kind of visual shorthand that may be anathema to the users of many other languages. In general, Perl's grammar is based on the same structures as these other languages, but the influence of sigils tends to lead to Perl programs having a qualitatively different style and structure.

The apotheosis of this style is found in the APL programming language, which was designed as a symbolic notation for array and vector algebra. This unique and fascinating language has had considerable success in heavily mathematical domains. APL eschews the Algol block structure, and deploys a huge array of very specific operators, each represented by a custom glyph in the language's alphabet. Because these glyphs do not exist on a standard ASCII keyboard, APL programmers traditionally had to work with a custom keyboard designed specifically for the language. When it comes to working with numbers and algebraic expressions, this symbolic approach works very well, but it would fare poorly in heavily text-based environments, such as web applications, where programs mostly push data around rather than numerically manipulate it, and thus need to take on a much more descriptive and narrative form.

We can contrast calls of “executable line noise” with another cliché, that “Lisp has no syntax”. This claim is made largely because the textual structure of Lisp programs is an exact serialization of the nodes of the program’s parse tree. Yet the textual format of Lisp, without sigils or the restrictions of a larger grammar, requires that programmers themselves build up the syntactic forms and idioms to represent a particular domain. Many more recent (and so-called ‘higher level’) languages provide these syntactic forms and idioms as pre-packaged constructs. These languages are more restrictive than Lisp, yet their lexical forms are usually oriented towards English-speaking contexts and standard conventions from mathematical writing, which feed directly into an intuitive reading of the code.
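To make the parse-tree point concrete, here is a minimal sketch in Ruby rather than Lisp. The nested array stands in for the Lisp form (* (+ 1 2) 3), and evaluate is a hypothetical helper written for this illustration; it assumes the first element of each array names a binary method that Ruby's integers respond to.

```ruby
# The Lisp text (* (+ 1 2) 3) written directly as the tree it denotes.
tree = [:*, [:+, 1, 2], 3]

# A deliberately tiny evaluator: the first element names the operation,
# the remaining elements are operands, which may themselves be trees.
def evaluate(node)
  return node unless node.is_a?(Array)   # a bare number evaluates to itself

  operator, *operands = node
  operands.map { |operand| evaluate(operand) }.reduce(operator)
end

result = evaluate(tree)   # (1 + 2) * 3
```

There is no separate parsing step here: the written form and the tree are the same shape, which is the sense in which Lisp text is said to be its own parse tree.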

Conventions from arithmetic and algebra are an area where the lexical forms of imperative languages are quite happily in tune with what readers expect. Nobody would question the result of evaluating (4 + (2 * 2)). Written in prefix notation, the same expression is + 4 * 2 2, which is free of ambiguity even without parentheses. It is certainly simpler, but also more abstract and harder for many people to grasp at a glance, because we have learned arithmetic through counting and combining. "Two then three then four...", "One plus one...", "Two times two...". In speech, we emphasize the verb, and in mathematics the operators have clearly defined associative and distributive laws, information that is signaled by parentheses and by each operator acting on its left and right operands. Yet some people insist that one notation or the other is the only correct one. To my mind, these views are too narrow. When considering the associative law in the context of programming languages, we shouldn't forget the influence of mathematical literature or English verb order.
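The prefix form can be checked with a short sketch in Ruby. The eval_prefix helper is hypothetical and assumes every operator takes exactly two operands, which is the property that lets the notation do without parentheses.

```ruby
# Evaluate a list of prefix-notation tokens such as ["+", "4", "*", "2", "2"].
# Assumes each operator takes exactly two operands. Consumes the token list.
def eval_prefix(tokens)
  token = tokens.shift
  case token
  when "+", "*"
    left = eval_prefix(tokens)    # the first operand follows the operator
    right = eval_prefix(tokens)   # the second follows whatever the first consumed
    token == "+" ? left + right : left * right
  else
    Integer(token)                # anything else must be a whole number
  end
end

result = eval_prefix("+ 4 * 2 2".split)
```

Here result is 8, the same value as (4 + (2 * 2)). The grouping that the infix form marks with parentheses is recovered purely from the order of the tokens.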

There is a tension between syntax that reflects actual program structure, and syntax that reflects typographic symbols that communicate meaning and semantics. So it's not literally true that Lisp has no syntax, just that Lisp gravitates towards program structure at the most extreme end of this continuum.

We can look at a hypothetical function describing a simple desk lamp in a fantasy Lisp dialect, assuming that we have actions power-source-on and power-source-off that send messages to the power supply:

(def lamp-switch (input)
  (if (= input :on) (power-source-on) (power-source-off)))

Object-oriented languages have been successful because they provide language-level support for modeling entities to which we can ascribe a metaphorical significance. In essence, the Kingdom of Nouns is a very powerful concept. Here, we see the same state machine with two transitions and two possible states, with the standard OO syntax giving us scaffolding to describe this as a discrete object that has internal state:

class LampSwitch {
	private PowerSource power = new PowerSource();
	public void switchOn() { power.on(); }
	public void switchOff() { power.off(); }
}

This is the simplest thing that could possibly work in Java. It seems logical to organize the code around the syntax of representing the real-world thing as the LampSwitch object, but there is a subtle catch here. All of a sudden, perhaps without realising it, we become concerned more with details about the architecture of our code to support the language's idea of an object model, rather than our program's unique model. We are forced to infer what the designer of the code intends from what the language demands. The syntax forces an impedance mismatch. The visibility declarations and return type identifiers (public and void) force us to think inside the language, instead of inside the program.

Nowhere is this more obvious than with Java's idea of primitive types. In the 1990s, perhaps few programmers would have accepted a garbage-collected OO language with C-like syntax if it didn't have "close to the metal" primitive representations of integers, doubles, and floats, along with built-in arrays. Yet so many of the other features introduced into the Java type system, especially generics, deliberately push programmers towards thinking of every reference in their program as a full-blooded object. It's not so simple to introduce such major changes to a language when backwards compatibility and previous habits linger (often written in textbooks, and enshrined in course materials). The grammar of Java thus becomes a strange amalgam of historical detritus and cultural forces. I'm not trying to bash Java here, just illustrating another twisted dichotomy of why programming languages are so hard to understand. Syntax, if anything, is schizophrenic and elusive: it's hard to pin down.

There are issues of style at play here too. Lisp stylists believe that hyphenation is more in line with English readability than underscores or camel case naming. Object zealots believe that method names should provide an expressive message about an atomic action that the program performs. As in any organic discipline, these are better understood as rules of thumb than as exact instructions. Programmers who have mastered the skill and have an understanding of code style know where and how to bend these rules, and when to break them, in the service of producing a more complete and cohesive creation.

The hardest thing to explain is why these qualities mean that programming is both a form of literature, and a form of architecture, yet also completely unrelated to either. It doesn't help that this architectural comparison is one of the most popular forms of misunderstanding in existence. To be fairly educated in such matters requires exposure to the dirty little secrets of this discipline:

Fortunately architecture benefits from its history. Students have to learn plenty of materials science and mechanics, but it is still taught as an Arts subject. The aesthetical judgement - the ability to juxtapose the elements of a problem and composition and see beauty or its absence - of the students is developed. This aesthetical sensibility is shared by great programmers (who use words like elegance to describe it). So I argue that progress in this area will most likely be made by looking at the cognitive state of the practitioners, not the shelfware they execute.

Some writers assert that programming and computer science strive for an eternal present and reject history, but I don't think this is quite correct. More likely, I think, is that computer science doesn’t yet know what it actually is, nor what it should be. This is why software needs philosophers. The connections to literature and the semiotic relationships that underpin programming languages are neither widely nor well understood. We have yet to adequately grapple with the realization that our formal models of syntax are vastly insufficient to describe the overall aesthetic and symbolic creativity of programming languages.