Turning automatic code generation upside down

Much ink has been spilled on the Next Big Thing in software development. One of these things has always been “automatic code generation” from high-level models (e.g., from state machines). But even though many tools on the market today support code generation, their acceptance has grown rather slowly. Many factors contribute to this, but one of the main reasons is that the generated code simply has too many shortcomings, which too often require manual “massaging”. And editing the generated code by hand breaks its connection with the original model.

"Round-Trip Engineering"

The tool industry’s answer has been “round-trip engineering”, which is the idea of feeding the changes in the code back to the model. Unfortunately, “round-trip engineering” simply does not work well enough in practice. This should not be so surprising, considering that no other code generation in software history has ever worked that way. You don’t edit by hand the binary machine code generated by an assembler. You don’t edit by hand the assembly code generated by a high-level language compiler. This would be ridiculous. So why do modeling tools assume that the generated code will be edited manually?

The Beaten Path to Code Generation

Well, the modeling tools have to assume this, because the generated code is hard to use “as-is” without modifications.

First, the generated code might be simply incomplete, such as skeleton code with “TODO” comments generated from class diagrams. I’m not a fan of this, because I think that in the long run such code generation is outright counterproductive.

Second, most code generating tools impose a specific physical design (by physical design I mean the partitioning of the code into directories and files, such as header files and implementation files). For example, for generation of C/C++ code (which dominates real-time embedded programming), the beaten path is to generate .h and .cpp files for every class. But what if I want to put a class declaration at file scope in a .cpp file and not generate the .h file at all? Actually, I often want to do this to achieve even better encapsulation. A typical tool would not allow me to do this.
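
To make the encapsulation point concrete, here is a minimal sketch of a class declared at file scope in a .cpp file. The file name, the Blinky class, and the two free functions are hypothetical names invented for this illustration; they do not come from any particular tool.

```cpp
// blinky.cpp (hypothetical file name). No blinky.h declares the class.

// The only interface other modules would see. In a real project these
// two prototypes could live in a small header; the class itself would not.
void Blinky_toggle();
bool Blinky_isOn();

namespace { // unnamed namespace: internal linkage, invisible to other files

// The complete class declaration stays private to this translation unit.
class Blinky {
public:
    void toggle() { on_ = !on_; }
    bool isOn() const { return on_; }

private:
    bool on_ = false;
};

Blinky l_blinky; // the single instance, also local to this file

} // unnamed namespace

// Thin wrappers that expose behavior without exposing the class layout.
void Blinky_toggle() { l_blinky.toggle(); }
bool Blinky_isOn() { return l_blinky.isOn(); }
```

Because no header publishes the class, its data members and methods can change without forcing any other file to recompile. This is the kind of layout that a generator fixed on one .h and one .cpp file per class cannot produce.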

And finally, all too often the automatically generated code is hard to integrate with other code not created by the tool. For example, a class definition might rely on many included header files. But while most tools recognize that and allow inserting some custom code at the beginning of the file, they don’t allow inserting code at an arbitrary place in the file.

The Freedom of Physical Design

But, how about a tool that actually allows you to do your own physical design? How about making the physical design an integral part of the modeling process, just like the logical design? How about turning the whole code generation process upside down?

A tool like this would allow you to create and name directories and files instead of the tool imposing them on you. Obviously, this is still manual coding. But the twist here is that in this code you can “ask” the tool to synthesize parts of the code based on the model. (The “requests” are special tags that you mix into your code.) For example, you can “ask” the tool to generate a class declaration in one place, a class method definition in another, and a state machine definition in yet another place in your code.

This “inversion” of code generation responsibilities solves most of the problems with the integration between the generated code and other code. You can simply “ask” the tool to generate as much or as little code as you see fit. The tool helps, where it can add value, but otherwise you can keep it out of your way.
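
The tag mechanism can be sketched as a toy expander. Everything here is an assumption made for illustration: the `$gen(...)` tag syntax, the `expandTags` function, and the idea of representing the model as a simple map from element names to ready-made code. A real tool would synthesize the code from the model rather than look it up.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for a model: element name -> code to emit for it.
using Model = std::map<std::string, std::string>;

// Copies hand-written source line by line. A line that consists of the
// hypothetical tag "$gen(<element>)" is replaced by the code the model
// holds for that element; every other line passes through unchanged.
std::string expandTags(const std::string& source, const Model& model) {
    const std::string open = "$gen(";
    std::istringstream in(source);
    std::ostringstream out;
    std::string line;

    while (std::getline(in, line)) {
        const bool isTag = line.size() > open.size()
                           && line.compare(0, open.size(), open) == 0
                           && line.back() == ')';
        if (!isTag) {
            out << line << '\n'; // hand-written code stays as it is
            continue;
        }
        // Extract the element name between "$gen(" and ")".
        const std::string name =
            line.substr(open.size(), line.size() - open.size() - 1);
        const auto it = model.find(name);
        if (it != model.end()) {
            out << it->second << '\n';
        } else {
            out << "// unknown model element: " << name << '\n';
        }
    }
    return out.str();
}
```

With a model that maps "Blinky::decl" to a class declaration, a hand-written file containing an #include line, then `$gen(Blinky::decl)`, then more hand-written code comes out with only the tag line replaced. The developer decides where generated code goes and how much of it there is.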

The idea of “inverting” the code generation is so simple that I would be surprised if it were not already implemented in some tools. One example I have is the freeware QM model-based design tool from Quantum Leaps. If you know of any other tool that works that way, I would be very interested to hear about it.


12 Responses

  1. Following these arguments, why do automatic code generators give us C/C++ and not raw machine code? If I’m not supposed to change anything in the generated code, then it really doesn’t make any sense to have it handed in a high level language. There are two reasons why anyone would want it like that, either 1) they expect to make manual changes to the code somehow, and/or 2) they don’t trust the tool and want to verify that it does what it is supposed to do.

    1. I think that when the automatic code generation technology matures and gains more widespread acceptance, the tools could indeed generate machine code directly.

      But, these are early days yet and model-to-code generation follows pretty much exactly the same trajectory as all other code generation technologies in the past. For example, most early C compilers generated assembly code (in fact, many embedded compilers still do). The early C++ compilers were all based on “cfront”, which compiled C++ to C. And so on.

      Such a gradual, stepwise approach has many obvious advantages. It allows people to get used to the “new” by seeing how it turns into the old and familiar first. It allows leveraging the existing tools. A young technology cannot cover all the bases at once. For example, by generating portable C/C++ the code generator can address many more processor types than by generating specific machine code.

      The intermediate step of C/C++ also allows the developers to use the existing debuggers. This is not quite ideal, because debugging at the C/C++ level is below the model level. But in practice the inconvenience depends very strongly on the type of the generated code. For example, if such code uses compressed hexadecimal state-tables to represent state machines (e.g., IAR visualSTATE), you obviously have no chance of bridging the semantic levels.

      But if the code is designed upfront to be human-readable, you can quite easily see the model structure from the code. This is exactly the approach taken in the QM tool, which is based on the QP framework. QP was originally designed for manual coding without “big tools” (see my first book “Practical Statecharts in C/C++”, published in 2002). But it turns out that QP also makes an excellent target for automatic code generation. On top of this, QM adds special comments to the generated code, which cross-reference the code snippets to the model. While debugging the application, you can simply copy the closest such comment to the Clipboard and paste it into QM. QM will then immediately locate the corresponding model element, open the diagram and highlight the class method, state, transition, guard, or whatever that is. With this simple method you almost debug at the model level.

    2. Because C/C++ compilers have literally hundreds of man-years of design for optimizing into machine code for many different platforms. If you have code generators doing it, then they have to (1) understand how to optimize, and (2) understand how to generate many different types of assembly code.

      I do, however, agree that C/C++ can be clumsy sometimes as an intermediate language; the right way to do this is to have the code generator bypass C/C++ and then emit something like asm.js or LLVM IR; the back-end of the compiler can handle the optimization and assembly-code generation step.

  2. Automatic code generation is already used in the C/C++ preprocessor (#define, #include, …) and has been used in Lisp for decades. Most scripting languages enable meta or macro programming. QM is definitely a tool on the right track. One day C/C++ source will be just another option beside bytecode, assembly, executables and others.

    In the future I see the complete program as an abstract syntax that is displayed as a state machine, C source, graphs or whatever representation is best suited for the current task.

    As long as the involved tools preserve the hierarchy of abstraction of the program one can create state machine templates, HTML handlers/generators, DSL , documentation etc.

  3. Hi Miro,
    Discovered this post a bit late, but anyway…

    First: As I’m the product manager of the IAR visualSTATE product I would like to add a bit of information to the paragraph about our code generation. The user has the option to choose between table-based code generation (which is obviously a bit ‘difficult’ to decipher…) and what we call “readable code”, which is a straight translation of the state charts into switch and if-statements. What is a bit surprising is that a large majority of our users stick to the table-based code…
    When asked about this they often answer that they don’t care about the generated code because they never have a need to change the code. Further, the code size needed for the ‘driver’ code to interpret the tables is well below 1k if compiled with a modern compiler. Typical numbers are more in the 400-600 bytes range and can be as low as <300 bytes under certain circumstances.

    Second: The question about generating C/C++ or directly to assembly language has a quite obvious answer, at least for me: Who would ever want to support a modeling tool with code generation capabilities if the generated code had to be optimized for speed or size for several target CPUs? (Or even just one?)
    I actually came across a customer a few years ago that had their own tooling for translating UML to assembly language for an 8-bit controller; but I see that as a rare exception and the company was not too happy about the situation due to the lock-in effects on hardware choices, the maintenance costs etc.

  4. Nice article Miro! What is missing though, and something I’m curious about, is why code generation is a good idea. What problem does it solve?

    Is it because we want more productivity? I don’t have much experience using the tools myself, but in the cases I have seen I have often reacted to the amount of code produced. It might be correct when generated, but as a programmer it is my responsibility to make sure it works after the modification is done. If the root cause is in the generated code it can be really hard to track it down. Typically code is read more than once, so in that sense code generators generate more work.

    Readable code is a much better approach in that sense. As is to continue with solutions that have worked: Higher abstractions: High-level languages (a relative term), libraries (with good APIs), and DSLs.

    If it is the modelling itself which is interesting there is another problem. In science models are simplified views (false by definition) that are used to enable reasoning. So a good model (simple!) must leave out a great deal of things that must be added by other means (or always be the same). I find simple UML diagrams useful to certain aspects of a design, but complete models that can generate code are quite often confusing, even misleading (a small class looks big because of its many methods). As an example of things left out see the comments: State machines will express the required behaviour, but there are also requirements regarding code size.

    Maybe there are other reasons. The thing is while this article discusses “How to make it work?” I feel that “Why should we do it?” is not properly answered.

    1. The short answer to your question is that the purpose of code generation is to avoid repeating the same effort twice, first when you design a model and then when you code it. So, in general terms, code generation is simply adhering to the DRY principle (Don’t Repeat Yourself!). I hope I don’t need to explain why the DRY principle is an essential good practice for developing software.

      To embrace code generation, you first need to be convinced of the value of modeling, such as using (hierarchical) state machines to represent the behavior of the system components. From my experience, designing state machines graphically delivers better state machines quicker than coding them directly in C or C++.

      At the same time, I agree with you that not all aspects of the system are equally well suited for graphical modeling and often the code itself is the best solution to the problem.

      But this is exactly my point in this post. I specifically claim that an effective modeling and code generation tool must necessarily allow easy integration with hand-written code. The whole idea of “turning the code generation upside down” is about a tool that can be asked to help with generating code, when such help is actually productive, but otherwise the tool can be easily kept out of the developer’s way.

      1. DRY is a valid argument, but I was thinking more in economic terms. But that was all nicely covered in “Economics 101: UML in Embedded Systems”. Didn’t read it before I posted my comment, just now.

        There are many ways to model and to generate code, so that was the main reason for my question. Reading the other post I think that the main reason why the tools have not taken off is their general nature, trying to solve all aspects of modelling. A common denominator for tools that are popular and that I like is that they don’t do everything. They do few things, but do them well.

        1. I don’t know about the QM tool, but when I’ve worked with code generated from graphical models (Simulink and TargetLink), a major source of frustration was how bad the diffs looked.
          Someone would add a condition to an if statement, move the little boxes a little to make room for the new one, and comparing the versions of files would become a mess of box coordinates changes.
          At the time, the build time was also painfully long (more than one hour, the year was 2008).

          1. I absolutely agree that the unnecessary churn in the generated code is very frustrating. Therefore, reducing the code churn is one of the main priorities of the code generator in QM. The tool uses several strategies to avoid re-generating code that hasn’t changed. For example, the tool first generates each file in memory and overwrites the file only if it is actually different. Also, cosmetic changes in the diagram layout never cause changes in the code, so no files are re-generated. Finally, the latest QM 3.1.0 changed the way it generates comments in the generated code that provide links between the model and the code. These changes in QM 3.1.0 were specifically made to reduce the code churn even further. Specifically, adding, removing, or re-ordering model elements (such as states in a state machine) does not change the generated comments.

  5. Hello,

    Interesting topic for me!

    I work for a company where we develop tools to generate device drivers for embedded platforms. A few comments from what I have seen..

    If I may be allowed to divide the solutions into neat categories, with no gray lines…

    The two approaches – a “wizard” generating code vs “code synthesis” – are quite distinct with their own pros/cons. A C source getting compiled to machine code is “synthesis”, an Eclipse plugin generating some headers/sources with comments is the “wizard”.

    The wizards can be developed and deployed very quickly, and provide attractive “bang-for-the-buck”. Code synthesis on the other hand needs many more design cycles and language design; but once done can provide an order of magnitude difference in efficiency.

    The essential difference between both is that the synthesis approach changes the level of abstraction – the programmer (and tester) doesn’t bother about the generated code anymore to a large extent. What percentage of C programmers would have felt the need to view the generated machine code, let alone edit it in a hex-editor? The tool would provide a defined (and restrictive) interface to “tune” the generated code, for instance we can think of the -O optimization flags of gcc. With great power comes less responsibility, in this case..

    IMHO, your post is about two implementations of the *wizard* approach. One – walk the user through a set of questions at start and then generate _some_big_thing_ that the user starts to hack up, or Two – present the user with (nearly) a blank slate and the user invokes one of many commands that generates _some_small_thing_, at the point “where the cursor is”.

    There are sufficient instances where we need the users to follow uniform file/class/function naming procedures, where the first approach would fit better. I mean, the fact that the generated code is organized as .h and .cpp is a *feature*, not a bug.

    As far as the second approach is concerned,
    a. Most editors have commands/macros that can generate a part of code – from something as simple as a for-loop or a function/class prototype with doxygen-style comment header; to something more complex like generating setters/getters from class prototype. Or auto-insert HTML tags when I’m typing raw HTML.
    b. If we allow a liberal interpretation of “code generation”, a php file that generates a HTML page on the fly, may be using several distinct php snippets scattered inline with HTML.
    c. Even pre-processor macros or their safer equivalents, inline functions, can be viewed as an application of the second approach?

    So I feel both approaches are indeed being used 🙂
