
Make wrong look wrong

It was a long time ago, September 1983, when I started my first real job. That was at Oranim, a large bread factory in Israel, where around 100,000 loaves of bread were baked every night. In six gigantic ovens, roughly the size of ... well, aircraft carriers.

When I walked into the baking room for the first time, I could hardly believe the mess. The oven walls were yellowing, the machines were rusty, and there was oil or grease everywhere.

"Does it always look like this here?" I asked.

"How do you mean? What are you talking about?" the boss shot back. "We've only just finished cleaning. It hasn't been this clean in weeks."

Oh dear!

I cleaned the baking room every morning for several months before I finally understood what he had meant. In a bakery, clean means that no dough sticks to the machines. Clean means no dough in the garbage can and no dough on the floor.

But clean does not mean that the ovens are painted a beautiful white. You paint an oven once a decade, not every day. And clean does not mean no grease, either: a number of machines had to be oiled or greased regularly, and a thin film of fresh grease there is more likely a sign that the machine has just been cleaned.

Cleanliness in a bakery, in other words, is its own concept. An outsider cannot simply walk in and judge whether the place is clean or not. It would never occur to an outsider to look inside a dough mixer to see whether it had been scraped clean. An outsider would fuss endlessly about the discolored tiles on the ovens, because the huge tiles are the first thing that catches the eye. To a baker, though, nothing is less important than whether the outside of his oven is changing color. The bread still tastes just as good.

After two months in the bakery, I had learned to "see" cleanliness.

It is no different with program code.

When you are just starting out as a programmer, or when you try to read code in a programming language that is new to you, everything looks equally incomprehensible. Until you know the language yourself, you cannot even spot obvious syntax errors.

In the first learning phase you will already recognize what we call coding style. For example, you notice bits of code that don't conform to the indentation standard and you notice unusual capitalization.

At that point you typically realize: "Damn it, we urgently need uniform coding conventions around here." So you spend the next day writing them down for your team, the next week arguing about the One True Brace Style, and the next three weeks converting old code to the One True Brace Style, until your boss catches you wasting time on something that will never make money, and you decide it isn't such a bad compromise to re-style code only when you happen to have it open for other reasons anyway. About half of the code now follows the One True Brace Style, and... pretty soon the whole thing is forgotten, and you can devote yourself to the next irrelevant (in the money-making sense) idea, such as replacing one string class with another.

But with increasing coding experience you will learn to recognize different things. Things that can be perfectly correct, even according to the coding conventions, but which still worry you.

An example (in C):

char* dest, src;

That is valid C code, it may conform to your coding conventions, and it may even be exactly what was intended. But with a little experience in C you will notice that dest is declared as a char pointer, while src is declared as just a char. Even if that is exactly what was intended, it probably isn't what was wanted. Something about this code is wrong.

Somewhat more subtle:

if (i != 0)
    foo(i);

Here the code is one hundred percent correct, and it complies with most coding conventions. Nothing is actually wrong, except that the single statement after the condition is not wrapped in curly braces. That nags at you, because the question comes up: uh, what if someone inserts a line like this:

if (i != 0)
    bar(i);
    foo(i);

... and forgets to add the braces? Just like that, foo(i) has dropped out of the condition. From then on, the sight of statements outside curly braces gives you a whiff of uncleanliness and makes you uncomfortable.

Well. So far I have described three levels of programming achievement:

  1. You can't tell clean from unclean.
  2. You have a vague idea of what cleanliness could be, mostly at the level of coding conventions.
  3. You start to sense uncleanliness beneath the surface, and it bothers you enough to reach into the code and fix it.

There is an even higher level, though, and that is what I really want to talk about:

  4. You consciously construct your code in such a way that your nose for uncleanliness tends to make it correct.

And that is the real art: producing robust code by inventing conventions whose only purpose is to make errors jump out at you on the screen.

In what follows I would like to show you a small example, and then give a general rule for inventing such robustness conventions. In the end this will lead to a defense of a certain variant of Hungarian notation - even if it is not the variant that makes your toenails curl - and to a criticism of exceptions in certain environments - even if you will probably rarely work in such an environment.

But if you are absolutely convinced that Hungarian notation is the devil's work and that exceptions are the best thing since the invention of white chocolate, and you don't want to hear any other opinion, then head over to Rory and read his excellent comics instead - you probably won't miss anything here anyway. Besides, I am about to show you real code examples, which will probably put you to sleep before you even get a chance to get angry. Exactly: the best plan is to lull you until you are good and drowsy and can no longer defend yourself, and then slip in Hungarian = good and exceptions = bad.

An example

Well. So here's the example. Let's say you are building some kind of web application, since that seems to be what the kids are into these days.

There is a security vulnerability called cross-site scripting, also known as XSS. I will spare you the details; all you need to know is that a web application must not echo back, unchecked, the strings a user has typed into a form.

For example, a web form asks "What's your name?" and has an input field. Submitting the form leads to the next page, which says "Hello, Elmer!", assuming the user's name is Elmer. Well, that's a security hole, because instead of "Elmer" the user could type some wild HTML or JavaScript, and that code could do nasty things, and it would look as if you did them; for example, the code could read cookies you have stored and forward them to Dr. Evil's evil website.

Let's write that down in pseudocode. Let's assume that

s = Request("name")

reads the input (a POST variable) from the HTML form. If you ever write

Write "Hello, " & Request("name")

your web server is already vulnerable to XSS attacks; nothing more is needed.

Instead, you have to encode strings before you write them back out as HTML. Encoding means replacing " with &quot;, replacing > with &gt;, and so on. So the line

Write "Hello, " & Encode(Request("name"))

is absolutely safe.

In other words: every string that originates with the user is potentially dangerous, and no potentially dangerous string may be written out without being encoded.
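To make the encoding step concrete, here is a minimal sketch in Python rather than in the article's pseudocode; html.escape stands in for the Encode function, and the user_input value is invented:

import html

# What "Elmer" might really type into the name field.
user_input = '<script>alert("gotcha")</script>'

unsafe_output = "Hello, " + user_input              # echoing this verbatim is the XSS hole
safe_output = "Hello, " + html.escape(user_input)   # the encoding step described above

print(safe_output)   # Hello, &lt;script&gt;alert(&quot;gotcha&quot;)&lt;/script&gt;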

Let us now try to invent a coding convention that ensures that a mistake of this kind immediately looks wrong. If wrong code at least looks wrong, there is a fair chance of catching it while typing it or proofreading it.

Solution - 1st attempt

One solution is to encode every string the moment it comes in from the user:

s = Encode(Request("name"))

Our rule then says: whenever you see a naked Request(), something must be wrong.

You train your eyes to spot lone Request() calls, because they break the rule.

This works, in the sense that following the rule effectively protects against XSS problems, but it has a weakness. Maybe you want to store the user's input in a database somewhere, and it makes no sense for it to sit there HTML-encoded, because it is not necessarily destined for an HTML page; it might go, say, to a credit card application that cannot do anything with HTML. Most web applications are therefore built on the convention that strings are not encoded internally, but only at the last moment, just before they are written into an HTML page, and that is probably the better design.

We really do need to be able to carry strings around in an unsafe form for a while.
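Here is a small sketch, again in Python with invented values, of why the first attempt backfires: the encoded value is right for HTML but wrong for every other consumer of the data:

import html

name = "O'Brien & Sons"        # what the user actually typed
stored = html.escape(name)     # encoded immediately (attempt 1), then written to the database

# Inside an HTML page the stored value renders fine, but a credit card
# processor or a CSV export now receives the mangled text instead of the name.
print(stored)                  # O&#x27;Brien &amp; Sons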

Ok, one more try.

Solution - 2nd attempt

How about a rule that says every string must be encoded at the moment it is written out?

s = Request("name")

// much later

Write Encode(s)

Now, whenever you see a naked Write without an Encode, you know something is wrong.

Well, it doesn't quite work, because sometimes you have little pieces of HTML lying around in your code that must not be encoded:

If mode = "linebreak" Then prefix = "<br>"

// much later

Write prefix

That, in turn, looks wrong according to our coding rule, which says every string must be encoded before output, so it would have to be:

Write Encode(prefix)

But then the "<br>", which is supposed to start a new line in HTML, gets encoded to &lt;br&gt; and shows up on the page as a literal <br>. So that isn't right either.

So: sometimes a string cannot be encoded when it is read in, and sometimes it cannot be encoded when it is written out, and neither the first nor the second proposal works. And without any rule at all, we risk this:

s = Request("name")

// pages later
name = s

// pages later, store it in the database column "name"
recordset("name") = name

// days later

aName = recordset("name")

// pages or even months later

Write aName

Did we remember to encode that string somewhere along the way? There is no single place you can check - no such place exists. If you have a lot of source code, it turns into detective work to trace the origin of every string that gets written out as HTML and to make sure it has been encoded.

The right solution

So let me suggest a convention that does work. It has exactly one rule: all strings that come from the user are stored in variables whose names start with "us" (for unsafe string), and all strings that have been HTML-encoded, or that come from a source known to be safe, are stored in variables whose names start with "s" (for safe string).

Let me write the code from above down again, changing nothing but the variable names to match the new convention:

us = Request("name")

// pages later
usName = us

// pages later, store it in the database column "usName"
recordset("usName") = usName

// days later
sName = Encode(recordset("usName"))

// pages or even months later
Write sName

I want you to notice one thing about the new rule: if you make a mistake with an unsafe string, you can always see it within a single line of code, as long as the coding rule is followed. So:

s = Request("name")

is wrong by definition, because you can see that the result of Request() is being assigned to a variable whose name begins with "s". That is against the rule. The result of Request() is inherently unsafe and must therefore be assigned to a variable whose name begins with "us".

us = Request("name")

is always OK.

usName = us

is always OK.

sName = us

is of course wrong.

s = Encode(us)

is of course correct.

Write usName

is of course wrong.

Write sName is OK, as is Write Encode(usName).

Each line of code can stand on its own, and if every single line is correct, then the code as a whole is correct. With this coding rule your eye gradually learns to see every Write usXXX and to recognize it as wrong, and you immediately know how to fix it, too. Sure, at first it is hard to see the wrong code, but do it for three weeks and your eyes will be trained, just like the worker in the bread factory who learned to glance into the giant baking hall and instantly mutter: "Uh... been a while since anyone cleaned in here, huh? What a pigsty!"

We can extend the rule above and rename the functions Request and Encode - or wrap them - so that they are called UsRequest and SFromUs, just like the variables. The code then looks like this:

us = UsRequest("name")

usName = us

recordset("usName") = usName

sName = SFromUs(recordset("usName"))

Write sName

Do you see what I did? You can now check for errors by looking at whether both sides of an assignment start with the same prefix:

us = UsRequest("name")    // ok, "us" on both sides
s = UsRequest("name")     // wrong
usName = us               // ok
sName = us                // wrong, of course
sName = SFromUs(us)       // correct, of course

Watch out, I can even go one step further, by renaming Write to WriteS:

us = UsRequest("name")

usName = us

recordset("usName") = usName

sName = SFromUs(recordset("usName"))

WriteS sName
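As a hedged illustration, here is roughly what those wrappers could look like in Python; the form dict stands in for a real web request, and the function names just mirror the article's UsRequest, SFromUs, and WriteS:

import html

form = {"name": '<script>alert("xss")</script>'}   # simulated, unsafe user input

def us_request(field):
    # Returns an UNSAFE string straight from the user (result prefix: us).
    return form[field]

def s_from_us(us_text):
    # Turns an unsafe string into a safe, HTML-encoded one (result prefix: s).
    return html.escape(us_text)

def write_s(s_text):
    # Writes a string that must already be safe (argument prefix: s).
    print(s_text)

usName = us_request("name")    # ok: "us" on both sides
sName = s_from_us(usName)      # ok: an s... variable gets the result of s_from_us
write_s(sName)                 # ok: write_s only ever receives s... variables
# write_s(usName)              # this line would look wrong at a glance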

Making wrong look wrong is great. It is not necessarily the best solution to every security problem, and it does not catch every bug, because you may not end up looking at every line of code after all. But it is better than nothing, and I would much rather have a coding rule that makes errors visible. It gives you incremental code improvement for free, because every time a programmer's eyes skim over a line, its correctness gets checked a little and errors get prevented.

A general rule

Our method of making wrong look wrong depends on getting the right things close together on the screen. Take a string: to get the code right, I need to know, everywhere that string appears, whether its content is safe or not. I don't want to have to hunt for that information, scroll around, or open another file. I have to be able to see it right there, in place, and that implies a naming convention for the variables.

There are many other examples of improving code by moving things closer together. Most coding conventions contain rules like these:

  • Keep functions short.
  • Declare variables as close as possible to the place where they are used.
  • Don't use macros to build your own private programming language.
  • Don't use GOTOs.
  • Don't put a closing brace more than one screen away from its matching opening brace.

What all these rules have in common is that they try to bring the relevant information about what a line of code does as close together as possible. This increases the chance that your eyeballs can see what's going on.

In general, I have to admit that I am somewhat afraid of language features that hide things. Look at the following code:

i = j * 5;

... and in C you know reliably that j is being multiplied by 5 and the result is stored in i.

If you see the same line in C++, you know nothing at first. Absolutely nothing. The only way to find out is to look up what types i and j are, which may be declared somewhere else entirely. j might be of a type that has operator* overloaded so that something insanely clever happens when you "multiply". And i might be of a type so overloaded that the types don't match and an automatic type conversion ends up getting called. To find out, you not only have to determine the types of the variables, you also have to examine the code that implements those types, and God help you if there is inheritance involved, because now you have to walk the whole class hierarchy to find out where that code actually lives, and if there is polymorphism somewhere you have a real problem, because it is not enough to know what types i and j are declared as, you have to know what types they happen to be right now, which means you have to examine an unbounded amount of code and, thanks to the halting problem, you can never be sure you have seen everything (phew!).

So when you see i = j * 5 in C++, you are completely on your own, and that, I would argue, hurts your ability to find problems just by reading the code.
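The same hiding effect exists in any language with operator overloading; here is a small illustration in Python rather than C++, with a class invented purely for this purpose:

class SneakyNumber:
    def __init__(self, value):
        self.value = value

    def __mul__(self, other):
        # Looks like multiplication at the call site, but can do anything at all.
        print("multiplying... or am I?")
        return SneakyNumber(self.value + other)   # deliberately not a product

j = SneakyNumber(10)
i = j * 5          # reads like arithmetic, but prints a message and adds instead
print(i.value)     # 15, not 50 - you cannot tell without reading the class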

None of this was intended, of course not. Operator overloading was supposed to give the super smart watertight abstractions. Eureka: j is of some Unicode string type, and obviously the supremely smart abstraction for converting Traditional Chinese into Standard Chinese is to multiply the Unicode string by an integer, right?

The problem, of course, is that there is no such thing as a watertight abstraction. I have already written about this at length in The Law of Leaky Abstractions, so I won't repeat myself here.

Scott Meyers has made a whole career out of showing, at least for C++, how abstractions fail and pee on your leg. (By the way, the third edition of Scott's Effective C++ has just come out [May 2005, translator's note] - go get it!)

Well.

I digress too much. It is better to summarize the story so far:

Look for code conventions that make what is wrong look wrong. Bring related information together visibly in one area on the screen - then you will be able to see and correct certain problems with your code immediately.

I'm Hungarian

So now to the infamous Hungarian notation.

It was invented by Charles Simonyi, a programmer at Microsoft. One of the major projects Simonyi worked on at Microsoft was Word; more precisely, he had led the project to build the world's first WYSIWYG word processor, a program called Bravo, at Xerox PARC.

In a WYSIWYG word processor the working window can scroll, so every coordinate has to be interpreted either relative to the window or relative to the page, and it is vital to keep the two apart. That, I suspect, was one of the many good reasons that led Simonyi to a style later called Hungarian notation. It looked like Hungarian, and Simonyi was from Hungary: hence the name. In Simonyi's version of Hungarian notation, every variable name carries a lowercase prefix that indicates the kind of thing the variable contains. For example, if a variable is named rwCol, the prefix is rw.

I deliberately said the kind of thing. Simonyi unfortunately used the word type in his paper

... and generations of programmers got it wrong. If you study the paper carefully, you will see that Simonyi was aiming at the same sort of naming convention I used above, where "us" meant unsafe string and "s" meant safe string. Both are of type string. The compiler will not notice anything if you assign one to the other, and Intellisense will not help you either. But they are semantically different; they have to be interpreted differently, and some kind of conversion function is needed to turn one into the other. Otherwise you get runtime errors... if you are lucky.

Simonyi's original concept was known internally as Apps Hungarian, because it was used in the Applications Division, i.e. for Word and Excel. In the Excel source code there are tons of rw and col, and when you see them you immediately think of rows and columns. Yes, both are integers, but it never makes sense to assign one to the other. In Word, I am told, there are plenty of xl and xw, where xl means "horizontal coordinate relative to the layout" and xw means "horizontal coordinate relative to the window". Both integers. Not interchangeable. In both applications there is a lot of cb, meaning "count of bytes". Yes, another integer, but just by seeing the name you know a great deal more: it is a byte count, the length of a buffer. And when an xl = cb crosses your field of vision, the alarm bells go off, because that is obviously wrong: even though both are integers, it is complete nonsense to assign a byte count to a horizontal coordinate.
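To make that concrete, here is an invented little example of those kind-prefixes (I have never seen the Word or Excel source either; the variable names merely follow the conventions described above, and the values are made up):

xlLeftMargin = 120     # horizontal coordinate relative to the layout
xwCursor = 480         # horizontal coordinate relative to the window
cbBuffer = 4096        # count of bytes in a buffer
rwCurrent = 7          # row index
colCurrent = 3         # column index

xwCursor = xlLeftMargin        # suspicious: xw and xl need a conversion in between
xlLeftMargin = cbBuffer        # obviously wrong: a byte count is not a coordinate
rwCurrent = colCurrent         # obviously wrong: rows are not columns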

In Apps Hungarian, prefixes are used for functions just as they are for variables. To be honest, I have never seen any Word source code, but I will bet you a hundred to one that there is a function there called YlFromYw that converts vertical window coordinates to vertical layout coordinates. Apps Hungarian requires the notation TypeFromType instead of the more traditional TypeToType, so that every function name begins with the kind

of thing it returns - just as I did earlier when I renamed Encode to SFromUs. In fact, Apps Hungarian would have demanded that name; there is no other choice. That is a good thing, because it is one less thing you have to memorize, and you no longer have to puzzle over what exactly "encode" means: you have something much more precise in hand.

Apps Hungarian was extremely valuable, especially in the days of C programming, when the compiler's type checking left a lot to be desired.

Then someone somewhere made a fatal mistake ...

... and evil took over the Hungarian notation.

Nobody seems to know exactly how or why it happened, but it appears that the documentation writers on the Windows team inadvertently invented what came to be known as Systems Hungarian.

Somebody somewhere read Simonyi's paper, and wherever Simonyi had written "type", this reader thought of type in the sense of class, in the sense of the type checking the compiler does. But that is not what Simonyi meant; he had carefully and precisely described what he meant by the word. In vain. The damage was done.

Apps-Hungarian had very useful prefixes like "ix" for index of an array, "c" for counter (count), "d" for the difference of numbers (for example "dx" for width), and so on.

Systems Hungarian has far less useful prefixes: "l" for long, "ul" for unsigned long, or "dw" for double word, which is nothing other than... uh... an unsigned long. In Systems Hungarian the prefix tells you nothing beyond the data type of the variable. It was a subtle but complete misinterpretation of what Simonyi intended and practiced, which just goes to show that if you write convoluted academic prose, you will not be understood; your ideas will be misinterpreted, and the misinterpreted ideas will then be ridiculed even though they were never yours. In Systems Hungarian you get lots of dwXyz, meaning "double word Xyz", and frankly the fact that a variable contains a double word tells you precious little, actually nothing useful. No wonder people eventually rebelled against Systems Hungarian. Yet Systems Hungarian was proclaimed far and wide; it is the standard throughout the Windows programming documentation, it was spread by books like Charles Petzold's Programming Windows, the bible of Windows programming, and it quickly became the dominant form of "Hungarian", even inside Microsoft itself, where few programmers outside the Word and Excel teams understood the mistake that had been made.

Then came the great uprising. Programmers who had never understood Hungarian notation in the first place gradually noticed that the misunderstood subset they were using was annoying and fairly useless. Well, that is not entirely true either; it still has properties that let you spot some bugs more quickly. At the very least, with Systems Hungarian you know the type of a variable right where it is used. But it is nowhere near as useful as Apps Hungarian.

The Great Rebellion reached its climax with the first release of .NET. Microsoft finally went around saying: "Hungarian notation is not recommended." There was much rejoicing. I don't think they even bothered to say why; they just went through the naming-guidelines section and wrote "Do not use Hungarian notation" at the top of every article. By then Hungarian notation was so unpopular that nobody dared protest, and everyone except the Excel and Word teams was relieved to be rid of the annoying naming convention. After all, we have strong type checking these days, and we have Intellisense too, right?

Nonetheless, Apps Hungarian remains tremendously valuable. It keeps related information close together in the code, which makes the code easier to read, write, debug, and change. And most importantly, it makes wrong look wrong.

Exceptions

Before I finish, I promised to take another swing at exceptions. The last time I did that, I got into a lot of trouble. I had written offhandedly on the Joel on Software homepage that I don't like exceptions because they are really hidden GOTOs, which, I reasoned, is even worse than GOTOs you can see. And of course it rained down on me. The only person who rushed to my defense was Raymond Chen, who is, by the way, the best programmer in the world, so that has to say something, right?

So here is the deal with exceptions, in the context of this article. Your eyes learn to see wrong, as long as there is something to see, and that keeps bugs out. If you want really robust code through code inspection, you need coding rules that keep related information together. In other words, the more information about what the code is doing you can see on the spot, the better a job you will do of finding mistakes. Look at this code:

DoSomething()
Cleanup()

Do your eyes tell you what is wrong here? We clean up every time, right? But the possibility that DoSomething() raises an exception means that Cleanup() may never be called at all. That would be easy enough to fix with finally or the like, but that is not my point. My point is: if you want to know whether Cleanup() really gets called, you have to go through the entire call tree of DoSomething() to see whether anything anywhere can throw an exception. That may be acceptable, and checked exceptions make it easier, but the crux of the matter is that exceptions destroy this closeness of information. You have to look somewhere else to answer the question of whether the code does what it is supposed to do, and that means you cannot use your eyes' built-in ability to see errors. There is nothing there to see.
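A minimal sketch of the fix, with DoSomething and Cleanup replaced by invented Python stand-ins:

def do_something():
    raise RuntimeError("something went wrong")   # simulate a thrown exception

def cleanup():
    print("cleaning up")

try:
    do_something()
finally:
    cleanup()    # runs whether or not an exception was thrown
# (The RuntimeError still propagates afterwards; the point is that cleanup ran.)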

Sure, for a little script that gathers a bit of data and displays it once a day, exceptions are great. Nothing is nicer than being able to ignore everything that can go wrong, so I wrap the whole program in one big try...catch and have it send me mail when anything happens. Exceptions are great for quick-and-dirty programming, for scripts, for uncritical code on which no lives depend. But if you are writing an operating system, or software that controls a nuclear power plant, or a high-speed circular saw used in open-heart surgery, exceptions are extremely dangerous.
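For what it is worth, that quick-and-dirty pattern might look like this in Python; gather_and_display is invented, and mailing the traceback to yourself (via smtplib, cron, or whatever) is left as the habit implies:

import traceback

def gather_and_display():
    # ... fetch a handful of data and print a little report ...
    raise ValueError("the data source moved again")   # pretend something broke

try:
    gather_and_display()
except Exception:
    # Catch everything, capture the details, and get on with your day.
    print("script failed:\n" + traceback.format_exc())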

I know, many people will think I am a lame programmer for not understanding exceptions, for not wanting to see how they could make my life so much better if only I opened my heart to them. Wasted effort. Truly reliable code is written only with simple techniques that take typical human frailty into account, not with complicated techniques full of hidden side effects and leaky abstractions that presume an infallible programmer.

For further reading

 

If exceptions still make you euphoric, read Raymond Chen's essay Cleaner, more elegant, and harder to recognize. "It is extremely difficult to tell the difference between good and bad exception-based code... exceptions are too hard and I'm not smart enough to handle them."
Raymond's rant against death by macros, A rant against flow control macros, is about another example where you search for needed information in vain and the code therefore becomes unmaintainable. "When you see code that uses these macros, you have to dig through header files to figure out what they actually do."