Perl

From Citizendium
Revision as of 16:14, 8 March 2024 by John Leach (talk | contribs)
Jump to navigation Jump to search
This article is developing and not approved.
Main Article
Discussion
Related Articles  [?]
Bibliography  [?]
External Links  [?]
Citable Version  [?]
Code [?]
Addendum [?]
 
This editable Main Article is under development and subject to a disclaimer.

Perl is a dynamic, interpreted programming language created by Larry Wall and first released in 1987. Wall combined features of a variety of other languages, including C, Unix shell scripting, Lisp, awk, sed, and Unix tools such as the grep family, into a succinct language for system administration. Perl evolved into a flexible and powerful scripting language and garnered a substantial following with professional support. Perl interpreters now exist for most operating systems, and programs can usually be moved between different operating systems without needing to be changed.

One of Perl's advantages is its excellent string processing abilities. Perl's powerful regular expression engine has become an unofficial benchmark against which other programming languages' engines are measured. Due to its excellent support for strings and the large amount of publicly available modules, Perl has been widely used as a "glue" language between different kinds of technologies such as database access and web programming. Many system scripts for Linux distributions are written in Perl, and some commercial Unix systems install Perl by default. Perl is currently in version 5, a mature version which allows the creation, export, and import of objects and methods, and has an extensive public library of well-maintained modules and packages (CPAN).

Perl won many supporters due to its approach of leaving much choice to the programmer and not requiring anything that is not absolutely necessary. As an example, Perl does not require declaration of variable types prior to use. Instead, the interpreter decides the type of a variable based on how it is being used (and is generally quite successful at doing so). Such loose typing is nowadays called Duck typing ("if it walks like a duck, and quacks like a duck, it must be a duck"). Perl's motto became "there's more than one way to do it", and this policy had an important influence on the newer Ruby programming language.

Examples

For short programs, a Perl script can be invoked directly from the command line, using the '-e' option:

Code

Result

$ perl -e 'print "Hello, world!\n"' Hello, world
$ perl -e '$g='Hello'; $g=~s/e/a/; print "$g, world!\n"' Hallo, world
$ perl -e 'print grep(s/^\d+(.*)/$1 /, sort(split(/ /,"8hacker, 4Perl 1Just 2another")));' Just another Perl hacker,

The second statement in the second example shows an aspect that is often discouraged for the sake of clarity, the option to write very compact (terse) code. In the middle statement, the 'e' in $g is replaced with an 'a'.

In usenet days it was customary to sign one's posting in a Perl thread by a one-liner that produced the string "Just another Perl hacker," (JAPH), the master of which was Randal L. Schwartz, author of several Perl books. The third line is one of his simpler examples, but already too involved to explain in the context of an introductory article. You can see the analysis here. Both examples 2 and 3 contain regular expression matches, which are introduced further down. Real world Perl programs are usually stored in files and passed as parameters to the Perl interpreter. Some mechanisms of this form of invocation depend largely on the host's operating system, see a separate example for more details.

Syntax highlights

It is impossible to give a complete overview of Perl's syntax here. The "Camel" book [1] has over 1000 pages in its 3rd edition. But a few highlights may give some idea of the character of Perl to the interested reader.

Variables

Data types

  • Scalars are the fundamental data type. A scalar stores a single, simple value, usually a string or a number, or a reference to another variable. A scalar is prepended by '$', e.g. $var = 1;
  • Arrays are ordered lists of scalars, where each element can be accessed by an index (integer). An array is prepended by a '@', e.g. @list. All indexing in Perl starts with 0, i.e. $list[0] (scalar) is the first element of @list.
  • Hashes are unordered sets of key/value pairs, where the value (scalar) is accessed using a key (string). A hash is prepended by '%', e.g. %colors. A value is addressed by using the key in braces, e.g to assign a value to %colors for the key 'ball':
    $colors{'ball'} = 'green';
  • Globs (or 'typeglobs') are symbol tables. They associate a reference to another variable with a global name. A glob is prepended by '*', e.g. *colors. The most common use of a glob is as a filehandle, such as open *FILE, $filename. Other uses include aliasing global variables, *colour = \$color.

All variables in Perl are of these four data types. Other data types are abstractions such as filehandles, subroutines, symbol table entries, etc. Perl keeps variables of each type separately, so that it is always clear which value you want to access. Example: @color, $color, %color, *color are different variables, so $color, $color[2] or $color{'ball'} hold different values.

Scope

Perl has two scopes which determine the visibility and accessibility of variables, 'global' and 'lexical'. In addition there are special modifications of variables within these scopes: 'local', 'our' and 'state'.


Global

A 'global' variable is associated with a 'package' (what other languages would call a 'namespace'). It is visible to all parts of the program. The full name of a global variable contains it's package and name like so:

   $Some::Package::foo = 42;       # a variable named "foo" in the package "Some::Package"

Within it's own namespace, the package part can be dropped.

   package Some::Package;
   $foo = 42;                      # equivalent to $Some::Package::foo = 42

If no package is declared the namespace is 'main'. Unless declared otherwise, all variables and subroutines are global. This is unfortunate as best practice is to use lexical variables whenever possible. File-scoped lexicals can replace most uses of global variables.


Lexical

Lexical scope is determined by the surrounding block. It is designed to encapsulate variable usage so only the very narrow portion of the code which needs access to that variable can access it. This makes code much easier to read and understand by reducing the code which could possibly effect a variable down to the block in which it is declared.

Any block will do: the block of a map, grep or sort routine; the braces around an if or else block or while loop. The braces containing a subroutine's code encompass a lexical scope.

   sub foo {           # beginning of a lexical scope
       my $var = 42;       # declaring a lexical variable $var
       print $var;         # prints 42
   }                   # end of a lexical scope
   
   print $var;         # $var is now undefined

There are exceptions to this rule, mostly for convenience. For example, the conditional of an 'if' or 'while' is considered to be part of the following block.

It is wholly apart from namespaces, a lexical variable has no namespace associated with it. A lexical scope can contain multiple namespaces and vice-versa. They also nest, the inner scope can see the outer scope's variables. For example...

   {                               # begin outer scope
       my $outer = "outer";            # declare lexical $outer in the outer scope
       
       {                               # begin inner scope
           my $inner = "inner";            # declare lexical $inner in the inner scope
           
           print $inner;                   # prints "inner"
           print $outer;                   # prints "outer"
       }                               # end inner scope
       
       print $inner;                   # $inner is now undefined
       print $outer;                   # prints "outer"
   }                               # end outer scope
   
   print $outer;                   # $outer is now undefined

Ultimately, there is an implicit lexical scope around the entire file. Any lexicals declared outside any enclosing braces are known as "file-scoped lexicals" and can be seen by the entire file from that point down. File-scoped lexicals often replace globals.

Variables declared lexically are automatically cleaned up when the scope is left if there are no more references to that variable.

Lexical variables cannot normally be accessed by code outside it's scope, although there are tricks to get at them so lexical variables should not be considered secure.

Lexical scope was introduced in Perl 5. It is best practice to declare your variables as lexical in the narrowest possible scope, rather than at the beginning of a routine as in C.

Operators

The list of operators is too numerous to fully reproduce here. Besides the common assignment operator '=' and the arithmetic operators "+ - / *", Perl has a large amount of complex operators, some of which may only work in conjunction with certain data types or constructs. As an example this piece of code:

open FILE, "/etc/passwd";  # open file
@line=<FILE>;              # read the complete file in one gulp
close FILE;                # finished

In this example a file is opened, the '<' and '>' around the file handle 'FILE' in the second line are one operator, telling Perl to read the file line by line and assigning each line to successive elements of the array @line. Thus, $line[0] will contain the complete first line of the file, $line[1] the second, etc.

Statements and Declarations

Statements are the parts of a script that are executed during runtime. They can be assignments, built-in functions, calls to subroutines, control structures, etc. Statements may be enclosed in blocks delimited by "curly" braces: '{' '}'. A block may stand by itself, but usually it is dependent on some controlling expression, such as an "if" statement, a loop, or the eval function. A block defines a new scope within a surrounding scope.

Perl statements are usually terminated by a semicolon, though there are other implicit means in which Perl knows a statement ends, such as the closing of a block.

The following statements are syntactically correct:


{ $a = 1 }       # legal because closing brace follows
$a = 1, $b = 2;
$a = 1; $b = 2;

Declarations are syntactically similar to statements but are evaluated during compile time, i.e. when the interpreter reads the script and creates its internal representation.

Unlike many programming languages, Perl does not require a variable to be declared. It will "spring into existence" (as a global variable) the first time it is used in a statement at runtime. But you can enforce the explicit declaration of variables by the "use strict" pragma (a compiler directive); any undeclared variable (in the scope of the pragma) will then produce a compile time error:


## preventing the accidental invocation of a variable is considered "good practise"

$server = 'Citizendium';  ## ok in non-strict surrounding scope
{                         ## start of new scope
  use strict 'vars';      ## pragma
  $page = 'Perl';         ## this will generate an error
  $main::page = 'Perl';   ## this is a legal global variable
  my $var = 1;            ## 'my' declaration makes it legal
}

A global variable must now be declared by using its full package name explicitly. But the main beneficiary is the "my" operator, and thus cleaner programming. By the gentle force of laziness a great number of local variables get declared instead of sloppily created globals!

Subroutines

Subroutines in Perl are declared by the reserved word 'sub', an (optional) name, and a body enclosed by braces. A named subroutine is global in its namespace. Unnamed ("anonymous") subroutines can obviously not be called elsewhere, but they have benefits for special and rather complicated purposes. The calling statement may pass parameters to the subroutine and will receive a return value. If the subroutine was declared before its first use, it can be called by its name, just like a builtin function, otherwise it has to be called with a prepended '&'. Examples:


sub tst1 { print $_,"\n"; }  ## subroutine declared before the call

tst1('tst1');                ## output 'tst1'
tst2('tst2');                ## error, not declared
&tst2('tst2');               ## output 'tst2'

my $var = sub {'hello'};     ## anonymous subroutine

sub tst2 { print $_,"\n"; }  ## subroutine declared after the call

Perl allows to declare details of function arguments in subroutine prototypes, e.g. whether the subroutine expects a scalar or a reference. But the name seems to be badly chosen because they don't work as compile-time type checking of function arguments, as programmers may expect[1].

Regular expressions

Regular expressions are a well-defined syntax by which a specialized program (the so-called regular expression engine) within Perl is directed to process text strings and produce certain side effects based on the findings. This process is usually called pattern matching. Regular Expression engines were built into many of Unix' standard programs from the beginning, but especially in the early days they differed slightly in their features, or sometimes even not so slightly.

When Larry Wall created Perl, he unified these flavors and added several "shorthand" patterns that made it easier to define and apply them. In later versions it became possible to comment patterns inline, to choose "non-greedy" match strategies, and Unicode and Posix were integrated into the engine. Perl itself (as opposed to the Regular Expression engine) provides several standard variables which will hold certain parts of the latest pattern match, such as $& for the match itself, $` for the part ("left" of) before the match, $' the part after it, $1, $2, ... for cached parts of a match, etc. These, together with some enhancements for readability, make it very easy and efficient to manipulate strings. As a result, for many years Perl's Regular Expression engine was considered the most advanced to be found in any programming language.

In Perl, the application of regular expressions can be very casual. The match operator '=~' also allows to match a variable "in place", i.e. it can be used as an assignment operator. In the above code snippet "$g =~ s/e/a/;", the contents of variable $g gets substituted (the 's' operator) by the result of the expression on the right hand side, applied on its original value. If $g contains 'geek', the first 'e' will be replaced by an 'a'. If you want to replace all 'e', the statement must be written as $g =~ s/e/a/g; (appended 'g' for "globally"). Here is a more explicit example, for editing convenience it is called as a script file 'tst':

# file 'tst'
$string  = 'Hello, world';
$pattern = '[ae]';           ## [] defines a class of characters, 'e' OR 'a' will match
$string  =~ m/$pattern/;     ## 'm' for 'match only', nothing is assigned to $string
print "match: \'$&\' before: \'$`\' after: \'$'\' \n";

Run as

$ perl tst

it will produce

match: 'e' before: 'H' after: 'llo, world'

For those interested in the use of Regular Expressions, [6] is highly recommended.

Namespace and scope

By declaring a package, a separate symbol table is created for all globals (variables, subroutines, etc.). This namespace continues until a new package is declared or the file ends. All symbols which are created outside of an explicit package automatically belong to default namespace 'main'. A variable or subroutine of a different namespace can be addressed by the '<namespace>::' qualifier. A scope is delimited by curly braces. Inside a scope a symbol from outside of this scope may be overlaid by "local" or lexical variables. As soon as the program leaves this scope, the previous scope is valid again.
An example:

## program tst
$var = 'global';                      ## global: 'main' assumend
{
  package test;
  $var         = 'package global';
  local $local = 'local';
  {                            ## new scope
    my $var = 'private var';   ## will not exist outside
    print "scope: $var \n";
    {
      print "subscope: $var local: $local \n";
    }
  }
  print "test: $test::var main: $main::var var: $var local: $local \n";
}
print "test: $test::var main: $main::var var: $var local: $local \n";

Running this:

$ perl tst

Produces:

scope: private var
subscope: private var local: local
test: package global main: global var: package global local: local
test: package global main: global var: global local:

Modules, objects

Although the package declaration itself goes back to the earlier days of Perl, many of the features introduced in version 5 gave it a whole new significance. It had become much easier to isolate parts of a program from one another and even assign them object character by "blessing" them into the package, automatically changing the package into a class. All subroutines inside this package become "methods", even their call behavior changes significantly. Of course, purists are not entirely happy with the way Perl has implemented some of the classic features, but looking at the large number of modules implementing Perl's flavor of object oriented programming, it can't be so wrong.


Notes and references

  1. raised by Tom Christiansen, discussed here: Perl's Sins)