20 May 2014 @ 03:30 pm
Announcing the release of "AWK Essential Training" at lynda.com  
Last year fellow Portland writer and Analog Mafia member Mark Niemann-Ross asked me to write and record a course for his employer, video training company lynda.com, on the AWK programming language. I recorded it in April, and the finished course, AWK Essential Training, is now available to all lynda.com members. If you aren't a member, you can watch the first six videos in the course for free at http://www.lynda.com/Linux-tutorials/AWK-Essential-Training/162719-2.html.

Topics covered in the course include:
  • What is AWK?
  • Writing an AWK program
  • Working with records, fields, patterns, and actions
  • Specifying field and record separators with variables
  • Using built-in and user-defined variables
  • Building control structures
  • Formatting output
  • Manipulating string data with functions
  • Scripting with AWK

"So what is this AWK thing," you might ask, "and why on Earth should I care about it?" AWK is a tool and programming language for manipulating text files. For example, if you have a file of names and addresses and want to find out how many of them are from each US state, you can do that with just a few lines of AWK code.

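As a rough sketch of that state-counting example: the helper below assumes a tab-separated file with the state as the last field on every line. The function name `count_by_state`, the file name `addresses.tsv`, and the layout are all made up for illustration and are not taken from the course.

```shell
# Hypothetical helper: count how many records come from each state.
# Assumes tab-separated input with the state in the last field.
count_by_state() {
    awk -F'\t' '
        { count[$NF]++ }    # tally the last field of every record
        END { for (s in count) print s "\t" count[s] }
    ' "$@" | sort           # for-in order is unspecified, so sort the result
}

# Usage: count_by_state addresses.tsv
```
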
AWK is older and more limited than similar but more modern tools like Perl and Python, but its simplicity makes it easier to learn. Also, AWK is preinstalled on most UNIX-based systems, including Linux and Mac OS X, so if you use any of these machines AWK is right there whenever you need it. It's also available for Windows.

I actually love AWK and use it just about every day, so I'm very pleased to have this opportunity to help people learn about its capabilities. AWK Essential Training went live yesterday and has already been seen by 303 viewers in 47 countries.

My experience with lynda.com so far has been both fun and profitable, and I look forward to recording more courses for them in the future. If you are interested in doing something like this yourself, please contact Mark Niemann-Ross at mnr@lynda.com. He is especially interested in finding authors who are women or people of color. If you have expertise in any technical or business field, have good English writing and speaking skills, and enjoy helping people learn how to do things, I encourage you to give it a try.
kalimac on May 21st, 2014 02:28 am (UTC)
"if you have a file of names and addresses and want to find out how many of them are from each US state"

Is that text files, as opposed to databases? If so, can it reliably distinguish, say, Washington D.C. from the state of Washington, or Arkansas City, Kansas, from the state of Arkansas, without being explicitly warned about each particular example?
David D. Levine (davidlevine) on May 21st, 2014 03:00 am (UTC)
Assuming you have the addresses in a text file with some recognizable format (e.g. tab-separated) and the state name is in a consistent place (e.g. in a field by itself or always at the end of the city field), awk can easily distinguish between Washington, DC and the state of Washington. That's one of its strengths.
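A minimal sketch of that idea, assuming a hypothetical layout of name, city, and state separated by tabs (the function name and field positions are illustrative only):

```shell
# Hypothetical layout: name <TAB> city <TAB> state, one record per line.
# Print only the records whose state field is exactly "Washington".
in_washington_state() {
    awk -F'\t' '$3 == "Washington"' "$@"   # exact comparison on field 3 only
}

# Usage: in_washington_state addresses.tsv
```

Because the comparison is an exact match on a single field, a record with Washington in the city field and District of Columbia in the state field is not selected, and neither is Arkansas City, Kansas when the same test is run for Arkansas.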
eub on May 21st, 2014 07:35 am (UTC)
Hm, what would be an example of a system that has an issue with distinguishing these? I ask because I must have a mental blind spot, which is not good even if I don't intend to go there. Some type of search model where substring containment is the only query available?
kalimac on May 21st, 2014 08:01 am (UTC)
Let's specify, first, that the state names are in full, not abbreviations. Then, if this is a text file and not a database file, how is it going to know which part of the string to search in?

And, if as DDL specified it's tab-separated, and if AWK can actually be directed how many tabs over to count - which itself would be impressive for a program to do in a text file - can we be certain that the file is correctly formatted? If this is a name-and-address file, for instance, some street addresses will have one line and some will have two. My experience in manipulating such files is that you cannot count on the one-line addresses always having a blank tab in the appropriate place. In which case the search function will diligently look in the city field when it thinks it's looking in the state field, or vice versa.
eub on May 21st, 2014 08:49 am (UTC)
OK, I see what you mean. Yeah, no large text file ever comes in the door with 100% well-formatted data.

If the line has extra / missing delimiter characters that push the state into the city position or vice versa, awk is going to follow that as written unless you tell it how not to. I can't think of unusual facilities awk has for this, but maybe David can...

If someone faces messy data and delimiter trouble, I would actually warn about one thing with awk, which is it's not so flexible at escaping. If I have unwisely fixed on comma as my field separator, and only later realized that my "name" field sometimes should include a comma -- awk doesn't to my knowledge let that name be entered by a means like "Oz\, the great and powerful". You see workarounds like this. Guess I shouldn't have picked comma, but it's a tricky game to choose some character that will never ever legitimately occur in your data.

Over time, as more wack input needs to be dealt with, the gravitational pull is towards an approach where you slurp in the whole record or window of records, and parse out the pieces yourself, instead of using built-in parsing logic. Then you can curse the heavens and add in "does it look like a spurious newline got into my one-line address", "is there a number in my state and a two-char string in my ZIP code", etc. etc.
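Those sanity checks might look something like the sketch below. It assumes a hypothetical tab-separated layout of name, city, state, ZIP, with full state names and five-digit ZIP codes. The function name and messages are invented for illustration.

```shell
# Hypothetical layout: name <TAB> city <TAB> state <TAB> zip.
# Print suspicious records, prefixed with the line number and a reason.
flag_suspect_records() {
    awk -F'\t' '
        # A digit in the state field suggests the ZIP landed there.
        $3 ~ /[0-9]/ { print NR ": digit in state field: " $0; next }
        # Anything other than exactly five digits is not a plain ZIP code.
        $4 !~ /^[0-9][0-9][0-9][0-9][0-9]$/ { print NR ": ZIP is not five digits: " $0 }
    ' "$@"
}

# Usage: flag_suspect_records addresses.tsv
```

The five-digit pattern is spelled out rather than written as `[0-9]{5}` because not every awk implementation supports interval expressions.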

awk is great for one-liners, and I use it every day. I have actually written a multi-thousand line awk script, and it worked fine, but there were extenuating circumstances...
David D. Levine (davidlevine) on May 21st, 2014 04:25 pm (UTC)
The course focuses on one-liners, which is how I personally use awk most often, but also shows some scripts with up to a couple dozen lines. If I found myself writing an awk script over 100 lines I would probably consider shifting to a different tool.

And, yes, awk doesn't deal well with escaped delimiters. In the case of the escaped comma (starring Sherlock Hemlock) I'd use sed to change unescaped commas to tabs and escaped commas to commas, then feed the result to awk -F'\t'.
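One way that sed step could be sketched, assuming the byte \001 never appears in the data so it can serve as a temporary placeholder (the placeholder trick and the function name are assumptions for illustration, not the only way to do it):

```shell
# Hypothetical helper: turn unescaped commas into tabs and "\," into a plain comma.
# Assumes the byte \001 never occurs in the input; it is used as a placeholder.
split_escaped_commas() {
    tab=$(printf '\t')
    ph=$(printf '\001')
    sed -e 's/\\,/'"$ph"'/g' \
        -e 's/,/'"$tab"'/g' \
        -e 's/'"$ph"'/,/g' "$@"
}

# Usage: print the first field of each record
# split_escaped_commas names.csv | awk -F'\t' '{ print $1 }'
```

The escaped commas are hidden first so that the second substitution only sees the real delimiters, and they are restored as plain commas at the end.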
David D. Levine (davidlevine) on May 21st, 2014 04:19 pm (UTC)
Awk excels at "counting the tabs" -- that's the heart of what it does. And there's actually an example in the course of how to deal with inconsistently-formatted data. If the data is REALLY inconsistent, of course, there's little any software can do without substantial additional work.
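As a small illustration of counting the tabs (not the example from the course), the sketch below uses awk's built-in `NF` variable to report any record whose field count differs from what is expected. The function name and its argument convention are hypothetical.

```shell
# Hypothetical helper: report records with an unexpected number of fields.
# Usage: check_field_count EXPECTED_COUNT [file...]
check_field_count() {
    n_fields=$1
    shift
    awk -F'\t' -v want="$n_fields" '
        NF != want { print NR ": expected " want " fields, found " NF }
    ' "$@"
}

# Usage: check_field_count 4 addresses.tsv
```

Running a check like this first shows which lines need attention before any per-field logic is trusted.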
kalimac on May 21st, 2014 02:33 am (UTC)
Sorry, that was curiosity burbling. I bet I could find out by watching all the videos. It's very cool that you did this, and such versatility ... first Dr. Talon, now this!
scarlettina on May 21st, 2014 02:14 pm (UTC)
It looks and sounds great, David!
David D. Levine (davidlevine) on May 21st, 2014 04:16 pm (UTC)
I'm very happy with how it came out. Every stumble and hesitation has been cleanly elided, and they've added video highlighting to make it clear what I'm talking about.
Karen (klwilliams) on May 21st, 2014 04:22 pm (UTC)
I was taking a Coursera course on the programming language R and gave up on it because it was vague and encouraged us to learn the "hacker way" -- by ourselves, essentially. So I found lynda.com, which has a much better course.