
A quick presentation I gave to my CS 239 class (Spring 2014) on my early thinking around deriving programmer intent from source control and program evolution.

Published

07 Apr 2014

Tags

The relational model for data is ubiquitous. That's in part due to SQL's declarative approach to manipulating and exploring data stored as relations. Unfortunately, SQL has its warts. In particular, schema changes expressed in the data definition subset of the language (DDL) [1] make it awkward to write idempotent migrations. Enough so that the responsibility is frequently delegated to the application layer, where more expressive languages can be employed. In this, the first of two posts proposing improvements to SQL, I'll lay out an alternate semantics for SQL DDL that embraces schema change and expands the expressive power of DDL's declarative core.

A Common Activity

To illustrate how schema changes break the initially declarative semantics of DDL, let's look at an example:

create table foo (
  bar int,
  baz text
);

All tables start this way. The only piece of syntax that might otherwise alert a new user to the fact that this is not an entirely descriptive declarative language is create. The definition is very much "what" is desired and not "how" to get it. This breaks down when anything about the table needs to change:

create table if not exists foo (
  bar int,
  baz text
);

alter table foo add column bak text;

Altogether, this will ensure the proper end state whether the target database is at the initial state without the table foo or at the second state where foo lacks the column bak. In this case it's easy to understand the final state of the table because the example is very simple, but it has acquired an imperative pall with the inclusion of the first alter. As the schema definition grows more complex through many drop, add, and type cast changes, the final state of the table becomes less clear:

create table if not exists foo (
  bar int,
  baz text
);

alter table foo add column bak text;
alter table foo drop column bak;
alter table foo add column baks text;

It would be better to simply add columns to the original table definition, and then the shape of the resulting table would be immediately clear at a glance.

Differential Semantics

In our toy example the desired table definition included a new column bak. An entirely descriptive update to the table declaration would look like this (Note that the original alter statement is absent):

create table foo (
  bar int,
  baz text,
  bak text
);

Unfortunately the SQL runtime considers the syntax in isolation and makes no attempt to reconcile it with its internal representation. That makes perfect sense, because a user is permitted to run small, ad hoc snippets in addition to full schema migration scripts. That is, the RDBMS can't know where this declaration is coming from nor why it's being run, so it's unsafe to assume it should do any reconciliation. In contrast, a well-outfitted user can provide exactly that information.

@@ -1,4 +1,5 @@
 create table foo (
   bar int,
-  baz text
+  baz text,
+  bak text
 );

Looking at the diff, it's clear that the intention is to add the column bak to the table. What's required, then, is to assign some semantics to this diff. With that established, a simple pre-processor could map this differential to the corresponding DDL: namely, the original alter statement.
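To make that mapping concrete, here is a minimal sketch in Haskell of the core of such a pre-processor. It assumes the old and new versions of the table definition have already been parsed into column lists; the Column type and alters function are illustrative names for this post, not part of any existing tool.

import Data.List ((\\))

-- A column parsed out of a create table statement.
data Column = Column { colName :: String, colType :: String }
  deriving (Eq, Show)

-- Columns that appear only in the new definition become "add column";
-- columns that appear only in the old definition become "drop column".
alters :: String -> [Column] -> [Column] -> [String]
alters table old new =
     [ "alter table " ++ table ++ " add column " ++ colName c ++ " " ++ colType c ++ ";"
     | c <- new \\ old ]
  ++ [ "alter table " ++ table ++ " drop column " ++ colName c ++ ";"
     | c <- old \\ new ]

main :: IO ()
main = mapM_ putStrLn $
  alters "foo"
    [Column "bar" "int", Column "baz" "text"]
    [Column "bar" "int", Column "baz" "text", Column "bak" "text"]
-- prints: alter table foo add column bak text;

A real pre-processor would also have to parse the two versions of the DDL (or consume the diff directly) and handle type changes, but the differential semantics reduces to this comparison of the before and after declarations.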

The key insight here is that we can permit schema migrations while retaining an entirely descriptive declarative syntax by appealing to the differential information available via source control tools.

Value Proposition

The basic value proposition is reduced cognitive overhead when maintaining schemas using SQL DDL. In addition, DDL's syntax is reduced by about half because alters and drops [2] can simply go away, which should make it easier to learn [3].

This could also be pushed up the stack to migration tools by an enterprising library or framework author. For example, Rails generates and maintains a db/schema.rb file that is supposed to represent the state of the schema for the associated migrations. A similar technique could be applied there to divine the appropriate alterations when a change to that file is made, in place of using migrations for schema changes.

Finally, by associating meaning with syntactic change, the user can more safely understand and execute post-commit reverts to schema changes. That is, instead of manually defining the necessary steps to "undo" some previous schema change, the source control system can provide exactly the information that is necessary.
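Using the illustrative alters sketch from above, a revert is just the same comparison run in the opposite direction; the commit that added bak is undone by swapping the roles of the old and new definitions:

-- Reverting the change that added bak: old and new swap roles.
revertFoo :: [String]
revertFoo = alters "foo"
  [Column "bar" "int", Column "baz" "text", Column "bak" "text"]
  [Column "bar" "int", Column "baz" "text"]
-- ==> ["alter table foo drop column bak;"]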

Pitfalls

Obviously, not every migration is just about the schema. Frequently the data has to be altered to conform to the target schema. This is actually an area of active research in the Database Systems community [4].

Conclusion

For the interested reader, I started work on a preprocessor implemented in Haskell. Unfortunately, since I don't have any plans to pursue this further, I haven't been working on it. Also, for comparison, I've included two very simple sets of denotational semantics in the footnotes: one to represent the current implementation and one to represent the differential semantics [5]. They highlight the symmetry of this new approach to the language when compared with the current implementation.

This technique can be extended to other languages that manage system state declaratively, like configuration management DSLs or even HTML. In the case of configuration management, though, understanding the mapping between syntax and state is quite complex because system components frequently generate artifacts that are not explicitly declared.

Broadly, the idea of differential semantics is to gather more information about intent from readily available sources so that language runtimes (declarative or otherwise) can make more informed decisions about user intent. The results need not be confined to accurate interpretation of the desired system state.

In the next post we'll look at how a type system applied to SQL might provide some useful safety properties during schema migration and beyond.

Footnotes

  1. http://en.wikipedia.org/wiki/Data_definition_language
  2. In our example a drop would be accomplished by removing the table definition completely.
  3. The mapping presumes feature parity in the create with the alter statements, but in my study of the standard and Postgres' implementation this appears to be the case.
  4. There's a lot of interesting work and tooling around preventing issues resulting from schema migrations: schema evolution.
  5. A denotational semantics for both the current DDL semantics and the proposed semantics. Note that in the proposed section the "differential" semantics eval function is parameterized by the state of the syntax.


Published

04 Nov 2013

Tags

The 26th is the first day of instruction in the first academic year of my PhD in Computer Science at UCLA. I have a two year old daughter, an incredibly supportive wife, and I just turned 30.

Unsurprisingly, I've been asked many times why I want to get a PhD and I rarely get a chance to explain my thinking fully. This post is both a detailed account of my reasons and a counterpoint to the popular opinion that there isn't much value in higher education in Computer Science.

Why Bother?

Any confusion or surprise over my decision to go back to school can generally be reduced to a simple idea: the success I would otherwise have during the years spent getting my PhD is of greater absolute value than the education.

This comes in many reasonable forms. Why are you leaving such a high paying job? Will you be able to speak at conferences? When will you find time to contribute to open source? Couldn't you just work on those ideas in your free time? These questions are asked directly and without implication.

In some cases there is a subtle suggestion that academia and higher education are losing their value. Why would someone pursue a PhD when there are people without any higher education earning comparable salaries and working on important projects? I never hear this asked explicitly, but it comes through in many conversations and the idea that education isn't important is pervasive in the software "meritocracy".

In each case, I'm consistently left wanting more time to describe my thinking and the personal experiences that landed me in a PhD program.

Career Path

When I got my first legitimate management position, as the director of engineering at a consultancy/incubator, it quickly became clear that management wasn't for me. I enjoyed seeing my teammates succeed. I enjoyed building and refining Process. I enjoyed winning business. I enjoyed the subtleties of communicating complex technical ideas to people with all types of backgrounds and experience. By my own estimation I was even pretty good at all those things.

The thing that got to me was the amount of time I spent staring at my email client. After a year I missed the technical aspects of the job, and I started to think about my long-term career options. At the time I saw three paths that didn't route back through school.

  1. Avoid titles and stick with engineering. This was, at least initially, the path I took. I left for Adobe and a position working full time on jQuery Mobile.
  2. Grab the title and climb the ladder. Management positions pay a lot of money, and there's a lot of room for advancement. Not all require you to hand over your keyboard.
  3. Employee Number One or Co-Founder. First employee/co-founder positions are available for experienced "generalists" as far as I can tell. This is purely based on personal observations and conversations with friends who are better informed on the topic.

Of course, there are a lot of subtle variants to each of these, but this was my assessment. I can say with confidence that my move to Adobe put me in an ideal engineering position. Working on an OSS project full time gave me the freedom to pursue long term solutions to difficult problems and I still had access to the good parts of a large corporate support infrastructure.

Unfortunately, the problems I gravitate towards are not normally assigned to engineers, and attempts to marry my interests with my domain of expertise [1,2] have met with understandably tepid responses. The ideas are hard to follow in short presentations, most people don't have time to read marathon blog posts, and the underlying work isn't funded by my employer. All of that makes perfect sense, but it means I can't focus on the problems that I'm interested in.

What Interests Me

The things that keep me up past midnight working and learning are technical: problems both large and small that remain unsolved, or problems where the solution seems unsatisfactory to me. A short and incomplete list, in no particular order:

  • Descriptive declarative languages that aren't
  • Correctness in dynamic languages
  • System configuration and state management
  • Adoption, understanding, and value of advanced type systems
  • Direct applications of mathematics to programming languages
  • Semantics based approaches to hot code loading
  • Preventing language bugs/quirks during language creation

A few of these might align with a job posting somewhere but most don't. More importantly, I'm not as interested in the implementation as I am in conceiving a solution and understanding its value. I'm sure that will sound like laziness to some, but the part I enjoy the most is thinking through the idea. Most of the time an implementation is an important part of that, and I enjoy hacking as much as anyone, but sometimes it's not. Sometimes a more abstract representation of the problem is the best way to predict the relative value of a solution before committing time and effort to an implementation. Again, this isn't the type of work that engineers get funded to do.

On Academia

Perception of academia varies depending on the field and an individual's cultural disposition. In the software community there is a certain sense of respect for the research and science on which our careers are built. People have even sort of deified greats like Alan Kay, Alonzo Church, and Alan Turing.

More and more though, the word "academic" is being used as a pejorative to mean irrelevant, unimportant, or a wasted effort. It's easy to wave this away as an issue of sloppy semantics, but I think it highlights the suffering perception of academics. In my experience working on research projects and doing a lot of reading over the last few years, much of the stigma appears to result from two issues: the first is what people think research should be for, and the second is the occasionally impenetrable nature of formalism.

Computer Science research is frequently accompanied by proof of concept software. It might be poorly constructed, hard to get working, or even hard to find. In turn, that can lead to a poor opinion of the research and researchers, but the implementations are rarely the primary contribution of the work. The goal of the researcher is not to provide implementations or information directly to industry, but rather to produce a solution to an outstanding problem. The work of translating that solution into something "well built" or even concrete is frequently left to the reader.

Unfortunately, reading Computer Science papers can be hard work. The deluge of notation can be discouraging. It can even seem like unnecessary ceremony, but a lot of ground has to be covered in the limited space provided for conference publications. Moreover, formalism and logic are the most important tools we have when streamlining and finding consensus on a good solution. In a perfect world all the necessary context would be easily accessible, and the formal tools used to establish properties of a solution would be easy to understand. Sadly that's not the case, but it doesn't diminish the value of the "encoded" contribution.

It's because of these things and not in spite of them that I'm going back to school. I want to work in an environment that explicitly promotes a focus on the fundamentals of a problem and requires that the utmost care be taken when presenting a solution.

The Fourth Path

I guess you could say I'm taking The Fourth Path™.

While I may be new to the academic environment, I have made an effort to ensure that my impressions are not naive. I have been working on research projects with folks at UCLA since last year, and I have been attending regular reading groups. I also know what problems I want to work on, one of which I'm pursuing with a strong blessing from my adviser.

Even if this doesn't turn out as planned, at least I know that the "what" and "how" of my work will fit and that's why I'm getting a PhD.


Acknowledgments

Thanks to @keyist and @melliebe for their notes and revisions.

Special thanks to Professor Philip M. Johnson who has helped me continuously since I graduated from the University of Hawaii more than seven years ago. His advice has been extremely important in my long and winding path back to academia.

Footnotes

  1. Presentation: Faster JavaScript through Category Theory
  2. Presentation: Math Envy and CoffeeScript's Foibles
  3. "Less" is not, "not at all"

I also have a short list of things I'm not interested in. I intended to include this in the main body of the post but it seemed like a distraction.

As a web developer and someone who's worked almost exclusively on the client side for the last few years, I have a fairly long list of things that do not interest me or are an active source of frustration. Nearly all of them would be impossible to describe as "general" problems.

  • Esoteric browser bugs (yes they still exist in abundance)
  • The culture/necessity of JavaScript micro-libraries
  • Effort required to provide broad access to web based content
  • Dynamic programming languages as an industry default
  • Non-transferable esoterica (e.g., how iptables works on CentOS, Chef's attribute resolution order)
  • Large JavaScript projects

I'm not above working on, or around, these problems. For a web developer many of these are simply facts of life. I'm just not content to let them bother me indefinitely, which means contributing to solutions or moving on.


Published

16 Sep 2013

Tags