Implementing SIP Telephony in Python
Author’s Remarks: This document is still “work in progress”. Please revisit this site later to see the completed text. I plan to provide this as a free document accompanying my open-source software of the 39 Peers project.
Copyright: All material in this document is © 2007-2008, Kundan Singh. See next page for details.
All material in this document is © 2007-2008, Kundan Singh. You need explicit written permission from Kundan Singh <mailto:email@example.com> to copy or reproduce (full or part of) the content of this book.
Some text from IETF RFCs and Internet-Drafts is reproduced in this document to explain or assist in their implementation, in accordance with the copyright notice in those RFCs and Internet-Drafts. The copyright notice of those RFCs and Internet-Drafts is as follows:
Copyright (C) The Internet Society (2002). All Rights Reserved.
This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it
or assist in its implementation may be prepared, copied, published
and distributed, in whole or in part, without restriction of any
kind, provided that the above copyright notice and this paragraph are
included on all such copies and derivative works. However, this
document itself may not be modified in any way, such as by removing
the copyright notice or references to the Internet Society or other
Internet organizations, except as needed for the purpose of
developing Internet standards in which case the procedures for
copyrights defined in the Internet Standards process must be
followed, or as required to translate it into languages other than
English.
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
This document and the information contained herein is provided on an
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
In this part you will get a step-by-step implementation guide to various protocols such as SIP, SDP and RTP. It covers essential protocol suites described in RFC3261, RFC4566, RFC3264, RFC2617 and other related RFCs.
We will start with parsing and formatting of SIP addresses, then describe the parsing and formatting of SIP messages and their components. Then we will build a SIP stack API with other control functions for transactions, dialogs, etc. Then we add digest authentication and other security mechanisms to our implementation. At the end of this part, you will understand how to implement the basic SIP protocol suite without worrying about client- or server-specific components such as media or proxy. A major part of a SIP telephony implementation deals with parsing, formatting and the various state machines for transactions and dialogs, all of which appear in both client and server implementations.
Implementing URI as per RFC 2396 and SIP address
A number of aspects in SIP and related protocols use various forms of addresses. The URI or Uniform Resource Identifier is one such class of address and is defined in RFC 2396. Some example URIs are shown below:
Besides a URI, a SIP implementation also needs to deal with SIP addresses. A SIP address contains a user’s display name as well as a URI as shown below. Naturally, a SIP address is a superset of a URI as far as the information it carries is concerned:
Dealing with addresses requires three functions: parsing, formatting and accessing the fields. We create a new module named rfc2396 to implement these functions.
Let’s assume a URI class that represents a URI. We would want objects of the URI type to interoperate with strings, such that an object can be parsed from a string or formatted into a string. We would also want to access the properties of the object such as scheme, user, password, host and port.
The object should expose the headers and parameters of the URI as well. Finally, we would want an equality test operation so that two URIs can be compared. Note that a URI comparison uses case-insensitive values for certain fields.
Let’s start by defining the URI class.
To construct this object from a string, we need to parse the string. It is possible to build a regular expression that captures most (but not all) forms of URI representations. In the simplest form the regular expression should be able to extract the scheme, user, password, host, port, parameters and headers from the string. Based on the allowed values for various parts, we can construct the regular expression as follows.
Once a string is parsed using this syntax, we can extract the various groups into appropriate properties such as scheme, user, password, host, port, params and headers so that these properties are available as object properties to the programmer.
Now that the parsing is completed, we need to take care of the boundary cases. For example, if the string represents a “tel:” URI then the actual telephone number will be in the host property whereas semantically it makes sense to put the phone number in the user property.
If the port number is empty or missing, then the port property should be set to None instead of an empty string. If the port number is valid then the port property should be a number instead of a string representing the number.
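The parsing steps and boundary cases described so far can be sketched as follows. This is a minimal illustration, not the author's actual expression: the pattern `_URI_SYNTAX` and the function name `parse_uri` are assumptions, and the pattern is deliberately simplified (for instance, it does not handle IPv6 host literals).

```python
import re

# Simplified pattern (an assumption; it does not cover the full RFC 2396 grammar).
_URI_SYNTAX = re.compile(
    r'^(?P<scheme>[a-zA-Z][a-zA-Z0-9+\-.]*):'   # scheme
    r'(?:(?P<user>[^:@;?]+)'                    # user
    r'(?::(?P<password>[^:@;?]+))?@)?'          # optional password, then "@"
    r'(?P<host>[^;?:]*)(?::(?P<port>\d+))?'     # host and optional port
    r'(?:;(?P<params>[^?]*))?'                  # parameters after ";"
    r'(?:\?(?P<headers>.*))?$'                  # headers after "?"
)

def parse_uri(value):
    """Return a dict with scheme, user, password, host, port, params, headers."""
    m = _URI_SYNTAX.match(value)
    if not m:
        raise ValueError('invalid URI: %r' % value)
    parts = m.groupdict()
    # Boundary case: for "tel:" the number is matched as host; move it to user.
    if parts['scheme'].lower() == 'tel' and parts['user'] is None:
        parts['user'], parts['host'] = parts['host'], None
    # Boundary case: port is an int when present, None otherwise.
    parts['port'] = int(parts['port']) if parts['port'] else None
    return parts
```

For example, `parse_uri('tel:+1-212-555-1234')` yields a user of `'+1-212-555-1234'`, a host of `None` and a port of `None`.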
Instead of storing the parameters and headers as single string variables named param and header, it is more convenient to create an associative array for param, indexed by parameter name, and an array for header holding the list of header values. This allows us to access the “transport” parameter as u.param[‘transport’], and the fourth header as u.header[3] in the zero-based array.
Extracting headers string into an array is easy by splitting across “&” to get individual headers of the array.
To extract the parameters string into an associative array or dict, we need to first split across “;” then for each such string take the left side of “=” as parameter name and right side as parameter value. Note that to allow “=” in the parameter value, we cannot use split on individual strings; instead we use partition. Once we have split values, we can construct the dict using array of (name, value) tuples.
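A minimal sketch of these two conversions is shown below. The function names `parse_params` and `parse_headers` are illustrative assumptions; a parameter without a value is assumed to map to an empty string.

```python
def parse_params(params):
    """Turn 'a=1;b;c=x=y' into {'a': '1', 'b': '', 'c': 'x=y'}."""
    # partition() splits on the first "=" only, so "=" may appear in the value;
    # the [0::2] slice keeps (name, value) and drops the separator.
    return dict(item.partition('=')[0::2]
                for item in (params or '').split(';') if item)

def parse_headers(headers):
    """Turn 'a=b&c=d' into ['a=b', 'c=d']."""
    return [h for h in (headers or '').split('&') if h]
```

Both helpers accept `None` or an empty string and return an empty container, which matches the case where the URI has no parameters or headers.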
Once we are done with the parsing of the string into individual properties of this object, we construct the full constructor function by adding error checking, including the case where the supplied value is empty and an empty URI object must be constructed.
To format the URI object as a string, we just create a string placing the individual properties appropriately. We implement the __repr__ instead of __str__ so that the implementation will be available to hash indexing as well. The tricky part is to construct the parameters string from the param property and the headers string from the header property. Luckily the language feature facilitates such conversion easily as shown below.
We follow a convention to implement the dup function for simple data objects, similar to the clone function in Java, which allows duplicating an object instance. Instead of deep copying individual properties, it is easier to just convert the object to a string and back to an object for duplication.
Comparing two URI objects requires us to implement two functions: __hash__ and __cmp__. This gives a complete order to the URI objects and hence these objects can be used as an index in a table. To return the hash code for the object, we can convert it to a lower-case string and return the hash of that string. Similarly, to compare two URI objects, we can convert them to lower-case strings and compare them. Although only certain fields in a URI specification are case-insensitive, in our basic implementation we assume all fields are case-insensitive. TODO: this may lead to some interoperability problems.
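The lower-case comparison idea can be sketched as below. The sketch assumes Python 3, where `__cmp__` is no longer consulted, so it uses `__eq__` together with `__hash__`; the class name `CaseInsensitiveURI` is a hypothetical stand-in that only stores the string form.

```python
class CaseInsensitiveURI:
    """Hypothetical stand-in for the URI class; keeps only the string form."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return self.text

    def __hash__(self):
        # Hash the lower-case string so equal objects hash alike.
        return hash(repr(self).lower())

    def __eq__(self, other):
        # Compare lower-case string forms; this treats every field as
        # case-insensitive, which is looser than the URI rules require.
        return repr(self).lower() == str(other).lower()
```

With these two methods, `CaseInsensitiveURI('sip:Kundan@Example.NET')` and `CaseInsensitiveURI('SIP:kundan@example.net')` compare equal and select the same dictionary entry. Ordering comparisons would additionally need `__lt__` or `functools.total_ordering`.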
As a convenient method, we can provide a read-only property to extract the (host, port) tuple from the URI. This method allows us to keep host-port as a separate data type without having to look inside the tuple. For example, u.hostPort allows access to the tuple via the hostPort property.
Finally, the last property of interest is the secure property. Several URI schemes such as “sips” and “https” refer to the secure version of a URI scheme. Having this property allows the application to set or inspect the security level without having to know the various schemes that apply to secure URIs. The following implementation has the limitation that it works only for “sips” and “https”, but it can easily be extended to support other protocols. Unlike other read-write properties, the secure property is unique in that once set to True it cannot be reset to False. Because of this, it is desirable to handle this property with separate processing instead of just storing it as a flag.
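One possible shape for such a property is sketched here. The class name `SchemeHolder` is hypothetical and carries only the scheme; the choice to silently ignore an attempt to set the property to False is an assumption about how "cannot be reset" is enforced.

```python
class SchemeHolder:
    """Hypothetical stand-in for the URI class; keeps only the scheme."""

    def __init__(self, scheme):
        self.scheme = scheme

    @property
    def secure(self):
        # Only "sips" and "https" are recognised as secure schemes.
        return self.scheme in ('sips', 'https')

    @secure.setter
    def secure(self, value):
        # Upgrade "sip"/"http" to the secure variant; never downgrade.
        if value and self.scheme in ('sip', 'http'):
            self.scheme += 's'
```

Because the state lives in the scheme itself rather than in a separate flag, the string form of the URI and the secure property cannot disagree.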
Now that we have implemented the URI class we can do some basic testing.
As mentioned before a SIP address contains a display-name and a URI. Please refer to RFC 3261 for details on the specification. It is used in various places in a SIP message, e.g., To, From, Contact headers.
We implement the SIP address using the Address class. What makes the implementation challenging is the presence of zero or more white spaces within the SIP address, optional quotes around the display name, and the optional display-name property. Note that the parser should understand the following forms:
Likewise, there are three regular expressions to parse the address as shown below. The parsing routine can match against any of these to identify the string as a valid SIP address. The first one has no quotes around the display name, the second one has quotes around the display name and the third one has an empty display name.
Let’s define a method called parse to parse an address from string. This method matches the value against each of the above regular expressions, and if a match is found, then it extracts the display-name and URI after stripping the white-spaces.
The above method needs to be modified to accommodate certain conditions. First, SIP defines a special SIP address of value “*” which can be present only in the Contact header. Secondly, a SIP message parsing routine will need to know the parts that follow the SIP address in various headers, e.g., the header parameters of To after the SIP address of the To header has been parsed. Thus, we return the number of characters parsed in this routine, so that the caller can continue beyond the SIP address.
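The matching logic, including the wildcard case and the returned character count, might look like the following sketch. The three patterns and the name `parse_address` are assumptions written for illustration; the function returns plain strings rather than a URI object to stay self-contained.

```python
import re

# Illustrative patterns (assumptions): unquoted name, quoted name, bare URI.
_ADDRESS_SYNTAX = [
    re.compile(r'^(?P<name>[a-zA-Z0-9\-._+~ \t]*)<(?P<uri>[^>]+)>'),
    re.compile(r'^"(?P<name>[^"]+)"[ \t]*<(?P<uri>[^>]+)>'),
    re.compile(r'^[ \t]*(?P<name>)(?P<uri>[^;]+)'),
]

def parse_address(value):
    """Return (display_name, uri_text, number_of_characters_consumed)."""
    if value.startswith('*'):
        # Special wildcard address, allowed only in the Contact header.
        return None, '*', 1
    for syntax in _ADDRESS_SYNTAX:
        m = syntax.match(value)
        if m:
            return m.group('name').strip(), m.group('uri').strip(), m.end()
    raise ValueError('invalid SIP address: %r' % value)
```

The caller can slice the input at the returned count to obtain the remaining header parameters, for example `;tag=1` after a To address.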
Once we have the regular expression to parse, the constructor becomes straightforward.
Note the two special properties: wildcard and mustQuote. The wildcard property is used to indicate that the address represented by this object is a special “*” address, and the mustQuote property controls whether the string representation must have quoted URI even if the display name is absent.
Constructing the string representation is straightforward, as it puts the display-name and URI appropriately in the resulting string. The URI itself is converted to a string using its __repr__ method.
Similar to the URI class, the Address class also has the dup method to clone the object.
In a real implementation of a client, it is sometimes necessary to extract the display part of the address. The specification says that the display name is optional. When it is absent, the implementation uses the user part of the URI as the display text. It is therefore handy to provide a read-only displayable property that extracts the displayable user name from the address using some built-in criteria. The following property definition uses the first 25 characters of the display-name, user part or host part, whichever is present first in that order.
Now that we are done with the basic implementation of the Address class, we can perform some simple tests.
Sometimes the processing depends on the type of address, whether the address is an IPv4 or IPv6 address. A function such as the following provides a ready-to-use utility for such checks. A simple technique is to invoke the inet_aton function on the data to know if the data is valid IPv4 or not. Instead of doing a socket call one could alternatively parse the data into individual numeric values and check the values.
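A minimal sketch of the socket-based check follows; the function name `is_ipv4` is an assumption. Note that `socket.inet_aton` also accepts abbreviated forms with fewer than three dots, so this check is more permissive than strict dotted-quad validation.

```python
import socket

def is_ipv4(data):
    """Return True if data is accepted by inet_aton as an IPv4 address."""
    try:
        socket.inet_aton(data)
        return True
    except (OSError, TypeError):
        # OSError: not a valid address string; TypeError: not a string at all.
        return False
```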
To test the function you can invoke the following:
A similar test can be done for multicast addresses as follows. A multicast address, for our implementation, is an IPv4 address in the range 224.0.0.0/4, i.e., one for which the four most significant bits are 1110.
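The check can be sketched by converting the address to a 32-bit integer and masking the top four bits, which are 1110 for the IPv4 multicast range 224.0.0.0/4. The function name `is_multicast` is an assumption.

```python
import socket
import struct

def is_multicast(data):
    """Return True if data is an IPv4 address in 224.0.0.0/4."""
    try:
        (value,) = struct.unpack('>I', socket.inet_aton(data))
    except (OSError, TypeError):
        return False  # not an IPv4 address at all
    # Keep the four most significant bits and compare against 1110.
    return (value & 0xF0000000) == 0xE0000000
```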
The test can be done as follows:
Now that we have looked at various aspects of parsing and formatting SIP addresses and URIs, we can move on to the actual SIP message parsing and formatting and eventually implementation of a complete SIP stack in the next chapter.
Implementing core SIP as per RFC 3261
We have already seen how to implement the addressing module. This chapter describes the implementation of the SIP module named rfc3261. We continue with the parsing and formatting methods from the addressing module to the SIP message structure. After describing the parsing and formatting, we move on to building a SIP stack.
An example SIP message is shown below.
The first line is a request or response line, which is followed by header lines, and finally the message body. You can identify the SIP addresses and URIs in various parts of this message such as the request-URI on the first line and the values of the To, From and Contact headers.
To encapsulate a SIP message we define a class Message. To encapsulate individual header we define a class Header. We take the bottom-up approach of first parsing and formatting a Header and then defining various methods in a Message.
As with addresses, we would want dynamic attributes in the Message object to represent the various headers. Similarly, the Header can have dynamic attributes for the parameters. Some desired operations are shown below.
From the structure point of view, there are four types of SIP headers: standard, address-based, comma-included and unstructured. Most SIP headers are defined to be standard headers that have a value and zero or more parameters separated by a semi-colon. The address-based headers have a value consisting of a SIP address, but the parameters are similar to those of the standard headers. The difference arises because the value of an address-based header can internally have “;” whereas that character is forbidden in the value of a standard header. For example, URI parameters are also separated by “;” within the value of an address-based header. A comma-included header is one that can have a “,” in the value of the header. Normally a standard or address-based header can have multiple header values in the same header line, where the values are separated by a comma “,”. However, for a comma-included header such as WWW-Authenticate, there can be only one value per header line, and the intermediate commas “,” are part of the value. The comma-included headers exist only for interoperability with existing HTTP authentication headers, which are comma-included. The unstructured headers have one value per header line and the value is treated as an opaque string without any internal structure. An example is the Call-ID header.
The specification defines parsing rules for various headers, which allow us to classify them among these categories as follows. Any header name that is not covered in the following three categories is assumed to be a standard header.
An extension to SIP can define new header names in these categories. By default we assume standard header if not found in the list above.
From RFC3261 p.32 – SIP provides a mechanism to represent common header field names in an abbreviated form. A compact form MAY be substituted for the longer form of a header field name at any time without changing the semantics of the message. A header field name MAY appear in both long and short forms within the same message. Implementations MUST accept both the long and short forms of each header name.
Besides these categories needed for our implementation, the specification also defines the short-form of header names as follows.
In the above listing, we have used the lower-case header names so that comparison can be done consistently. For the purpose of canonical representation and formatting of headers, as well as comparison of two header names, the standard defines canonical representation of various header names. In particular, the header names in canonical form have one or more words joined together by a dash ‘-‘ with the first letter of each word capitalized. The following statement can convert a lower-case header name to its canonical form.
There are three exceptions to this rule as shown below.
To facilitate canonicalization of header names, we define a function that first converts the name to lower case and then canonicalizes it keeping the exceptions and short-forms in mind.
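A sketch of such a function is given below. The compact forms are those listed in RFC 3261; the three exceptions are assumed here to be Call-ID, CSeq and WWW-Authenticate, since those are names for which plain capitalization gives the wrong spelling. The names `_COMPACT`, `_EXCEPTIONS` and `canon` are illustrative.

```python
# Compact (single-letter) header names defined in RFC 3261.
_COMPACT = {'i': 'call-id', 'm': 'contact', 'e': 'content-encoding',
            'l': 'content-length', 'c': 'content-type', 'f': 'from',
            's': 'subject', 'k': 'supported', 't': 'to', 'v': 'via'}

# Assumed exceptions to the "capitalize each word" rule.
_EXCEPTIONS = {'call-id': 'Call-ID', 'cseq': 'CSeq',
               'www-authenticate': 'WWW-Authenticate'}

def canon(name):
    """Return the canonical form of a header name given in any case."""
    s = name.lower()
    s = _COMPACT.get(s, s)                 # expand a short form if present
    if s in _EXCEPTIONS:
        return _EXCEPTIONS[s]
    # General rule: capitalize each dash-separated word.
    return '-'.join(word.capitalize() for word in s.split('-'))
```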
The method can be tested to produce canonical representations of various header names, existing or future.
Another utility functionality we need is to quote and unquote a string. The parameter values in a header can be quoted, whereas we store unquoted value in our object. The following functions allow us to quote or unquote a string as applicable.
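A minimal pair of helpers might look like this. The sketch assumes that quoting simply means wrapping in double quotes and does not handle backslash escapes inside the string.

```python
def _is_quoted(s):
    return len(s) >= 2 and s.startswith('"') and s.endswith('"')

def quote(s):
    """Wrap s in double quotes unless it is already quoted."""
    return s if _is_quoted(s) else '"' + s + '"'

def unquote(s):
    """Strip one pair of surrounding double quotes, if present."""
    return s[1:-1] if _is_quoted(s) else s
```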
A SIP header contains a header name and a header value. There can be any number of header attributes. The attribute name need not be known in advance. We define our class such that name and value properties refer to the header name and value, whereas the header object itself can be used as an associative array to extract the attribute value indexed by the attribute name, e.g., h[“tag”] or h.tag.
To parse a header value, we define a method that takes the header name and based on the type it invokes different parsing logic.
For an address-based header, the returned value is an object of type Address which stores the address part of the value. The rest of the string is parsed as a sequence of header parameters separated by semi-colons “;”, which are stored as attributes of the local Header object. The Address is set to always use quotes for the URI while formatting. This is important because missing quotes would cause the URI parameters to be treated as header parameters after formatting. Note that the parameter name is considered to be case-insensitive.
For a standard header, the returned value is the string up to the semi-colon “;” if any, otherwise the whole value string. If the parameters are present indicated by the semi-colon “;” then they are parsed into this Header object as before. TODO: we need to check if the parameter name and/or value are tokens.
A comma included header is usually of the form “value param1=value1, param2=value2,…” For programming convenience, we return the value part as the value of the header and store the individual param-value pairs in the local Header object as an associative array.
After the parsing is completed, we may want to inspect some of the unstructured header values and store the values in a more structured form. For instance, the CSeq header value has two parts: the number and the method name.
For programming convenience, we can store the individual parts separately as follows. We also canonicalize the value so as to remove any extra spaces between the number and the method name, if needed.
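The CSeq handling can be sketched as a small helper; the name `parse_cseq` and the returned tuple layout are assumptions.

```python
def parse_cseq(value):
    """Return (number, method, canonical_value) for a CSeq header value."""
    # split() without arguments collapses any run of white-space.
    number, method = value.split()
    number = int(number)
    return number, method, '%d %s' % (number, method)
```

A value that does not consist of exactly two parts, or whose first part is not a number, raises `ValueError`.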
Now that we have completed the parsing step, we can create the constructor which takes an optional string for the value. The constructor removes any extra surrounding white-spaces from the value before parsing it. The header name is converted to lower-case if applicable.
Formatting a Header object has two semantics: either you can format only the value or the complete header line. We implement two different methods, __str__ and __repr__, to achieve these functions. Correspondingly, the object is converted to a string differently in different contexts.
The value is formatted as follows. If the header type is comma-included or unstructured, then the value property is the actual value string representation which can be returned, otherwise the parameters (or rest) need to be appended to the value. When appending the parameters, all indices from the local associative array are used except for the pre-defined indices name, value and _viauri.
The __repr__ method just returns the header “name: value” where value is formatted using the __str__ method.
Besides the parsing and formatting methods, there are other utility methods needed for a Header object. A dup method is used to clone the object by formatting and parsing back into a new object.
The parameter access can be done either by container syntax (such as h[“tag”]) or attribute access syntax (h.tag). This gives more flexibility to the application developer. As noted earlier, we store the parameters in the local __dict__ property which readily allows attribute access syntax. To add the container access syntax we add the following methods.
The Via header is unique, in the sense that even though it is a standard header there is a lot of structure inside the value part of the header.
The viaUri property represents a URI object derived from the Via Header object such that the URI represents the address to which we need to send a response. RFC3261 specifies the process to derive such a URI. First we separate the header value “SIP/2.0/UDP pc33.home.com” into the first and second parts. The first part gives us the type: udp, tcp or tls.
The second part can be used to construct a new URI object with no “user” part, a default “transport” parameter derived from the type and the “scheme” of “sip:” The URI gets stored internally.
A default port number of 5060 is assumed if missing in the URI.
If the rport parameter is present in the header and carries a value, the URI port is changed to that value. An rport parameter without a value leaves the port unchanged.
If the type is not a reliable transport type, and maddr parameter is present then the URI host is changed to maddr value, otherwise if received parameter is present then URI host is changed to received parameter value.
We implement this function using the viaUri property as follows.
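The derivation steps can be sketched as a standalone function that returns the host, port and transport instead of a URI object. The name `via_target`, the plain dict used for header parameters, and the set of transports treated as reliable (tcp, sctp, tls) are assumptions; IPv6 literals in the sent-by part are not handled.

```python
def via_target(value, params):
    """Return (host, port, transport) of the response target for a Via header.

    value  -- the header value, e.g. 'SIP/2.0/UDP pc33.home.com:5062'
    params -- dict of Via parameters such as rport, maddr, received
    """
    proto, addr = value.split()
    transport = proto.split('/')[2].lower()        # udp, tcp or tls
    host, _, port = addr.partition(':')
    port = int(port) if port else 5060             # default SIP port
    rport = params.get('rport')
    if rport and rport.isdigit():
        port = int(rport)                          # only if rport has a value
    if transport not in ('tcp', 'sctp', 'tls'):    # unreliable transport
        if 'maddr' in params:
            host = params['maddr']
        elif 'received' in params:
            host = params['received']
    return host, port, transport
```

For example, `via_target('SIP/2.0/UDP pc33.home.com', {})` gives `('pc33.home.com', 5060, 'udp')`.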
Before continuing it may be worthwhile to test our function for correctness.
From RFC3261 p29 – HTTP/1.1 also specifies that multiple header fields of the same field name whose value is a comma-separated list can be combined into one header field. That applies to SIP as well, but the specific rule is different because of the different grammars. Specifically, any SIP header whose grammar is of the form
header = "header-name" HCOLON header-value *(COMMA header-value)
allows for combining header fields of the same name into a comma-separated list.
Each header field consists of a field name followed by a colon (":") and the field value. The formal grammar for a message-header allows for an arbitrary amount of whitespace on either side of the colon.
The SIP message parser needs to first divide the headers portion into individual header lines, extract the header name and value from each header line and finally invoke the Header constructor to construct the individual header object. If a single header line contains multiple comma “,” separated header values, then those individual header values need to be constructed independently.
We define a class method to perform this task. The method takes a string and returns a tuple with two values: the first is the header name, and the second is an array of Header objects.
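The splitting step might be sketched as follows, returning value strings rather than Header objects so the example stays self-contained. The function name `split_header_line` and the contents of `_NO_SPLIT` (an illustrative subset of comma-included and unstructured header names) are assumptions. The naive split on “,” would also break a value that carries a comma inside a quoted string or a URI.

```python
# Illustrative subset of header names whose value must not be split on commas.
_NO_SPLIT = {'www-authenticate', 'authorization', 'proxy-authenticate',
             'proxy-authorization', 'call-id', 'cseq', 'date', 'subject'}

def split_header_line(line):
    """Return (lower-case header name, list of individual value strings)."""
    name, colon, value = line.partition(':')
    if not colon:
        raise ValueError('missing ":" in header line: %r' % line)
    # White-space is allowed on either side of the colon.
    name, value = name.strip().lower(), value.strip()
    if name in _NO_SPLIT:
        return name, [value]
    return name, [v.strip() for v in value.split(',') if v.strip()]
```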
This can be tested as follows.
Now that we have implemented the class, we can do basic testing as follows:
A Message object is a representation of a SIP message. Unlike other SIP stacks that define various individual message types, separate classes for the first line, etc., in Python we use dynamic properties to easily implement those features. In particular, the same class implements both the request and the response. The attributes such as method or response are valid for a request or a response, respectively.
Even though we would like to have the attribute syntax for accessing a header from a message, certain header names cannot be a Python attribute name. For example, “Content-Length” with an embedded dash cannot be an attribute. Thus, we implement both attribute and container access for the headers in a message. The header names are case-insensitive. Accessing a header that doesn’t exist in the message gives None instead of raising an exception. This creates cleaner source code, instead of having to catch exceptions everywhere. The following definition allows us to create a generic object that can hold name-value pairs and allow access using both attribute access and container access syntax.
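Such a generic object can be sketched as below. The class name `Bag` is hypothetical; names are stored in lower case in the instance dictionary so that both access styles are case-insensitive.

```python
class Bag:
    """Hypothetical name-value holder with attribute and container access."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; a missing name
        # yields None instead of raising AttributeError.
        return self.__dict__.get(name.lower())

    def __setattr__(self, name, value):
        self.__dict__[name.lower()] = value

    def __getitem__(self, name):
        return self.__dict__.get(name.lower())

    def __setitem__(self, name, value):
        self.__dict__[name.lower()] = value

    def __contains__(self, name):
        return name.lower() in self.__dict__
```

With this definition, `m['Content-Length'] = '0'` followed by `m['content-length']` returns `'0'`, and `m.To` on an empty object returns `None`.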
There are certain pre-defined attributes: method, uri, response, responsetext, protocol and body.
A SIP header can be either a single-instance header or a multiple-instance header. There are only a few single-instance headers. By default, a header is treated as multiple-instance. The difference between single- and multiple-instance headers is that we can expose a single Header as the value of a single-instance header, whereas we expose a list of Header objects as the value of a multiple-instance header.
From RFC3261 p.27 – The start-line, each message-header line, and the empty line MUST be terminated by a carriage-return line-feed sequence (CRLF). Note that the empty line MUST be present even if the message-body is not.
Parsing a SIP message is an important method in the Message class. The first step in parsing a SIP message is splitting the string across “\r\n\r\n” so that the second part becomes the message body text, and the first part contains the first line as well as the headers.
From RFC3261 p.28 – SIP requests are distinguished by having a Request-Line for a start-line. A Request-Line contains a method name, a Request-URI, and the protocol version separated by a single space (SP) character.
Request-Line = Method SP Request-URI SP SIP-Version CRLF
SIP responses are distinguished from requests by having a Status-Line as their start-line. A Status-Line consists of the protocol version followed by a numeric Status-Code and its associated textual phrase, with each element separated by a single SP character.
Status-Line = SIP-Version SP Status-Code SP Reason-Phrase CRLF
After splitting the message string into two parts, the first part is further split into the first line and the headers text.
The first line can be either a request line or a response line. This can be identified by splitting the first line into three parts across a white-space character and checking whether the second part is an integer. If it is an integer (i.e., a response code), then the first line is a response line, otherwise it is a request line. The properties response (of type int), responsetext and protocol are set for a response line and the properties method, uri (of type URI) and protocol are set for a request line.
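The request/response distinction can be sketched as below. The function name `parse_first_line` is an assumption, and the uri is left as a plain string here rather than being converted to a URI object.

```python
def parse_first_line(line):
    """Classify the start-line and return its parts as a dict."""
    # At most two splits, so a reason phrase such as "Not Found" stays whole.
    first, second, third = line.split(' ', 2)
    if second.isdigit():
        # Status-Line = SIP-Version SP Status-Code SP Reason-Phrase
        return {'protocol': first, 'response': int(second),
                'responsetext': third}
    # Request-Line = Method SP Request-URI SP SIP-Version
    return {'method': first, 'uri': second, 'protocol': third}
```

A line with fewer than three parts raises `ValueError` from the tuple unpacking.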
SIP header fields are similar to HTTP header fields in both syntax and semantics. In particular, SIP header fields follow the HTTP definitions of syntax for the message-header and the rules for extending header fields over multiple lines. However, the latter is specified in HTTP with implicit whitespace and folding. This specification conforms to RFC 2234 and uses only explicit whitespace and folding as an integral part of the grammar.
After the first line is parsed, the headers text is split into individual header lines. Note that a header line that starts with a white-space character is a continuation of the previous header line. Let’s not worry about the continuation line for now.
To parse the header line we use the createHeaders class method in the Header class, which returns a tuple with two elements: the header name and the array of Header objects indicating individual header values. The header values are stored in the local object indexed by the header name, with value as either a single Header object or a list of Header objects. Any error while parsing the header line is ignored and we continue to the next header line.
From RFC3261 p.33 – Requests, including new requests defined in extensions to this specification, MAY contain message bodies unless otherwise noted. The interpretation of the body depends on the request method.
For response messages, the request method and the response status code determine the type and interpretation of any message body. All responses MAY include a body.
The body length in bytes is provided by the Content-Length header field.
Once we have parsed the headers text into individual header elements, we extract the message body. SIP defines a Content-Length of 0 if that header is missing. Once the body is stored in the body property, we validate the body length and throw an exception if there is a mismatch.
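The splitting and body-length validation steps can be sketched together. The function name `split_message` is an assumption, and the sketch assumes the message is a string in which every character occupies one byte (for example ASCII), so that `len(body)` equals the byte length; continuation lines are not handled.

```python
def split_message(data):
    """Return (first_line, header_lines, body) after validating body length."""
    head, _, body = data.partition('\r\n\r\n')
    first_line, _, headers_text = head.partition('\r\n')
    header_lines = headers_text.split('\r\n') if headers_text else []
    length = 0                      # a missing Content-Length is treated as 0
    for line in header_lines:
        name, _, value = line.partition(':')
        if name.strip().lower() in ('content-length', 'l'):
            length = int(value.strip())
    if length != len(body):
        raise ValueError('Content-Length %d does not match body length %d'
                         % (length, len(body)))
    return first_line, header_lines, body
```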
As the last step in the parsing process, we check if the mandatory headers are present or not.
There are a number of boundary conditions that we need to handle, but haven’t implemented so far. Examples are: (1) if the message doesn’t contain the “\r\n\r\n” sequence then the message body should be assumed to be empty, (2) the parser should accept a message even if there are no headers, because the application can add headers later, (3) it should throw an error if the first line has fewer than three parts, (4) it should validate the syntax of the protocol property, (5) the first header should not start with a white-space, (6) the method and response properties should be validated, (7) the syntax of the top-most Via header and fields such as ttl, maddr, received and branch should be validated.
Once we have implemented the parsing method, we can build the constructor that takes the optional message string to parse.
The formatting of the SIP message is simpler than parsing. The idea is to construct the first line followed by individual header lines and finally append the message body. The Message object allows iteration over the associative array index, where the iteration walks over all the Header objects.
Cloning a message is similar to earlier data structures – format to string and parse the string back into another Message object.
As mentioned earlier the attribute and container access can be used to refer to or add a particular header. However, there are some additional convenient methods that we would want to implement to access the headers. The iteration over the object should return each header in turn. This is implemented by flattening the headers into a single list, and returning the iterator on that list.
The method first returns the first occurrence of a particular Header object from the header name. If the header doesn’t exist then it returns None. This method can be used when a header object is needed in a singular context, irrespective of whether the header is a single or multiple instance header.
The method all returns a list of all the Header objects for the given header name. Even if the header type is single-instance, it returns a list containing a single element. This method is useful when accessing the header object in a list context irrespective of whether the header is a single- or multiple-instance header. The method is further extended to accept a list of header names and return all the Header objects associated with all those names. Thus, h.all(“To”,”From”,”CSeq”,”Call-ID”) will return a list of all those mandatory headers. If no such headers are found, then it returns an empty list, instead of None. Thus the return value can always be evaluated in a list context.
The method insert can be used to insert a particular Header in a Message. The application doesn’t have to worry about whether it is a single or multiple instance header and how many occurrences exist in the message. An optional flag allows appending the header instead of inserting at the beginning. TODO: we should not insert multiple instance header if the header name indicates a single instance header type.
We implement a read-write property named body, which refers to the message body. When the body is explicitly set, we also update the Content-Length header value so that the message’s content length remains consistent.
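A read-write property of this kind might be written as below. The sketch keeps headers in a plain dictionary of strings, which is a simplification of the real Message class, and it assumes that a text body is encoded as UTF-8 when computing the length.

```python
class Message:
    def __init__(self):
        self.headers = {}   # simplified: header name -> string value
        self._body = None

    @property
    def body(self):
        return self._body

    @body.setter
    def body(self, value):
        self._body = value
        # Content-Length counts bytes, so encode text before measuring it.
        data = value.encode("utf-8") if isinstance(value, str) else (value or b"")
        self.headers["Content-Length"] = str(len(data))
```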
We implement various read-only properties, is1xx, is2xx, etc., to indicate whether a Message object represents a response of that particular response class. Finally, an isfinal property indicates whether the message is a final response or not. Python allows us to dynamically create methods and properties as follows.
Instead of having the application create the Message object and populate the fields, it would be better to define the factory methods to create different types of messages. We implement two class methods, createRequest and createResponse, that can be used by the application to create a request or response Message, respectively, by supplying appropriate parameters. The use of these methods ensures that the created object will be valid. For example, the uri property is actually a URI for a request, and the protocol property actually stores “SIP/2.0”.
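One possible way to generate these properties dynamically is shown here. It is a sketch, not necessarily the author's code: it assumes the Message stores the numeric status code in a response attribute, which is None for a request.

```python
class Message:
    def __init__(self, response=None):
        # Assumed attribute: integer status code, or None for a request.
        self.response = response


def _response_class_property(digit):
    """Build a read-only property that is True for responses in digit-xx."""
    def getter(self):
        return self.response is not None and self.response // 100 == digit
    return property(getter)


# Attach is1xx ... is6xx to the class at import time.
for _digit in range(1, 7):
    setattr(Message, f"is{_digit}xx", _response_class_property(_digit))

# A final response is any response with a status code of 200 or above.
Message.isfinal = property(
    lambda self: self.response is not None and self.response >= 200
)
```

The helper function is used instead of a lambda inside the loop so that each property captures its own digit.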
Before defining those methods, let’s define a populateMessage method that updates the Message object with the supplied headers and message body content. If no message body is specified, then it resets the Content-Length header value.
Now to create a request, we create the Message object, and populate the method, uri, protocol properties. Then the optional headers and message body are populated. Finally, the CSeq header, if present, is updated with the correct method name. This allows us to create a new request from the headers of an existing request, and let this method update the headers accordingly.
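The two steps, populating a message and then fixing up the CSeq method, can be sketched as follows. For brevity the message is a plain dictionary and header values are strings; the function names populate_message and create_request, and the address in the usage line, are assumptions of this example.

```python
def populate_message(message, headers=None, content=None):
    """Copy the supplied headers and body into the message."""
    message["headers"].update(headers or {})
    message["body"] = content
    # Without a body the Content-Length is reset to zero.
    length = len(content.encode("utf-8")) if content else 0
    message["headers"]["Content-Length"] = str(length)
    return message


def create_request(method, uri, headers=None, content=None):
    """Create a request and make its CSeq header carry the right method."""
    message = {"method": method, "uri": uri, "protocol": "SIP/2.0",
               "headers": {}}
    populate_message(message, headers, content)
    cseq = message["headers"].get("CSeq")
    if cseq is not None:
        number = cseq.split()[0]           # keep the sequence number
        message["headers"]["CSeq"] = f"{number} {method}"
    return message


# Reusing the headers of an earlier INVITE to build a BYE.
bye = create_request("BYE", "sip:bob@example.net", {"CSeq": "3 INVITE"})
```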
A response can be created by supplying various properties. Optionally, the original request can be supplied as well. If the original request is provided, then the response uses the To, From, CSeq, Call-ID and Via headers from the original request. As per RFC3261, if the response code is 100, then the Timestamp header is also copied from the original request. If optional headers are provided, then those are used to override the previously assigned headers if needed.
The From field of the response MUST equal the From header field of the request. The Call-ID header field of the response MUST equal the Call-ID header field of the request. The CSeq header field of the response MUST equal the CSeq field of the request. The Via header field values in the response MUST equal the Via header field values in the request and MUST maintain the same ordering.
If a request contained a To tag in the request, the To header field in the response MUST equal that of the request. However, if the To header field in the request did not contain a tag, the URI in the To header field in the response MUST equal the URI in the To header field.
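The header copying rules for a response can be illustrated with a small function. As in the other sketches, the message is modelled as a dictionary and the name create_response is an assumption; the sketch does not add a To tag, which the user agent layer is expected to do.

```python
def create_response(code, reason, request=None, headers=None):
    """Create a response, copying the mandatory headers from the request."""
    response = {"response": code, "responsetext": reason,
                "protocol": "SIP/2.0", "headers": {}}
    if request is not None:
        copied = ["To", "From", "CSeq", "Call-ID", "Via"]
        if code == 100:
            copied.append("Timestamp")   # only copied for 100 responses
        for name in copied:
            if name in request["headers"]:
                value = request["headers"][name]
                # Copy lists (e.g. several Via values) so the order is kept
                # but the list object is not shared with the request.
                response["headers"][name] = (
                    list(value) if isinstance(value, list) else value)
    # Explicitly supplied headers override the copied ones.
    response["headers"].update(headers or {})
    return response
```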
At this point, we have seen how to parse and format a SIP message and how to define various data structures for easy access of the properties in a message or its header. In the next section, we detail the implementation of a SIP stack, including various layers as per the specification.
Although there is no good definition of a SIP stack, we use the following block diagram to decompose the core SIP implementation components which we refer to as SIP stack. As shown in the diagram, a SIP stack consists of these components: UserAgent/Dialog, Transaction, Message and Transport. The Transport, Transaction and UserAgent/Dialog layers are defined in RFC3261.
From RFC3261 p.18 – SIP is structured as a layered protocol, which means that its behavior is described in terms of a set of fairly independent processing stages with only a loose coupling between each stage.
Not every element specified by the protocol contains every layer. Furthermore, the elements specified by SIP are logical elements, not physical ones. A physical realization can choose to act as different logical elements, perhaps even on a transaction-by-transaction basis. The lowest layer of SIP is its syntax and encoding. Its encoding is specified using an augmented Backus-Naur Form grammar (BNF).
The second layer is the transport layer. It defines how a client sends requests and receives responses and how a server receives requests and sends responses over the network. All SIP elements contain a transport layer.
The Message layer defines the SIP message parsing and formatting as per the specification, and as we saw in the previous section. Although the specification keeps the transport layer above the syntax and encoding layer, we keep the implementation of the syntax and encoding layer (in the form of Message layer) above the transport layer. This is needed for implementations which treat the transport as the socket layer of the operating system with some methods to perform SIP related functions. This also helps us in moving the transport layer to an external entity such that the SIP implementation becomes independent of the actual transport layer.
The third layer is the transaction layer. Transactions are a fundamental component of SIP. A transaction is a request sent by a client transaction (using the transport layer) to a server transaction, along with all responses to that request sent from the server transaction back to the client. The transaction layer handles application-layer retransmissions, matching of responses to requests, and application-layer timeouts. Any task that a user agent client (UAC) accomplishes takes place using a series of transactions. User agents contain a transaction layer, as do stateful proxies.
The layer above the transaction layer is called the transaction user (TU). Each of the SIP entities, except the stateless proxy, is a transaction user. When a TU wishes to send a request, it creates a client transaction instance and passes it the request along with the destination IP address, port, and transport to which to send the request. A TU that creates a client transaction can also cancel it. When a client cancels a transaction, it requests that the server stop further processing, revert to the state that existed before the transaction was initiated, and generate a specific error response to that transaction.
Certain other requests are sent within a dialog. A dialog is a peer-to-peer SIP relationship between two user agents that persists for some time. The dialog facilitates sequencing of messages and proper routing of requests between the user agents.
A SIP client can be built on top of the UserAgent/Dialog layer whereas a SIP server can be built at various layers depending on the features in the server – load-balancing server, transaction stateless proxy, transaction stateful proxy, registrar, call stateful server.
The Stack layer represents the general processing module that needs to interact with all the other layers. We describe the individual layers in detail below.
For the purpose of this implementation, we assume that the actual transport is outside our module, rfc3261. This allows us to implement the core SIP functions without worrying about the network transport layer. As a side-effect, we do not have to worry about the process model, whether it is event-based or thread-pool, because the transport layer usually controls the messaging and architecture. This step is tricky and it is important that you pay attention to the details here to understand the implementation.
The Stack layer is the main interface for our SIP implementation. In a SIP implementation, typically we listen on a transport address for incoming packets. When a packet is received, it is parsed and, depending on the message, it gets delivered to either the transaction, user-agent or dialog layer. The Stack module receives a message from the external transport and delivers it appropriately. The individual modules such as transaction, user-agent and dialog can have their own timers. When these timers expire, the module takes certain actions, such as retransmitting a response or a request. We again use the Stack module to deliver messages to be sent to the transport. When the application wants to send a request or a response, such as SIP registration or call answer, it uses the Stack module to initiate the outbound request or response processing. Eventually, the Stack layer delivers it back to the external transport layer for actual transport of the message. Thus, the Stack layer is the sole interface in and out of our SIP implementation.
The application may listen on multiple transport addresses, e.g., one for UDP, and one for TCP, or multiple UDP ports. We simplify our design by assuming that each instance of the Stack object is associated with a single instance of the listening transport. The Stack needs some information about the associated transport for processing various SIP functions. This information can be encapsulated in an object and supplied to the Stack on construction. An example of such information object is defined below.
We assume that the external function getlocalsock returns the (ip, port) tuple for the locally bound socket sock. The actual details of this data object are not important, but the data object should hold these properties: host as the dotted local IP address, port as the listening port number, type as one of “udp”, “tcp”, “tls” or “sctp” to indicate the transport type, secure as a Boolean indicating whether the transport is secure, and reliable and congestionControlled as Booleans indicating whether the transport is reliable and congestion controlled, respectively.
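A transport information object with these properties might be defined as follows. This is a sketch under stated assumptions: getlocalsock is implemented here with the standard getsockname call, and only plain UDP and TCP sockets are distinguished, so the “tls” and “sctp” types are not covered.

```python
import socket


def getlocalsock(sock):
    """Assumed helper: return the (ip, port) the socket is bound to."""
    return sock.getsockname()[:2]


class TransportInfo:
    def __init__(self, sock, secure=False):
        self.host, self.port = getlocalsock(sock)
        is_stream = (sock.type == socket.SOCK_STREAM)
        self.type = "tcp" if is_stream else "udp"
        self.secure = secure
        # TCP is reliable and congestion controlled, UDP is neither.
        self.reliable = self.congestionControlled = is_stream
```

An application would typically bind a socket first and then construct TransportInfo(sock) to hand over to the Stack.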
Such a data object is supplied to the constructor of the Stack object. The application also needs to install itself as a listener of the events from the Stack object, so that it can know about incoming call or successful call events, etc.
Here, the transport argument is of type TransportInfo or something similar, and the app argument is a reference to the application. The Stack object invokes various methods on the app object. In particular the app object must implement several interface methods: send, sending, createServer, receivedRequest, receivedResponse, cancelled, dialogCreated, authenticate and createTimer. All these interface methods take the last argument as a reference to the Stack object that is calling the method. This allows the application to use multiple stacks, e.g., one for UDP and another for TCP transport.
To send some data on the transport to some destination, the Stack object calls the app.send method with first parameter as the data string to be sent and the second parameter as a host-port tuple, e.g., (‘22.214.171.124’, 5060). Thus the application must implement the following method.
When the Stack receives an incoming request and needs to create a UAS (user agent server), it invokes the app.createServer method with first argument as the request Message and second as the URI from the request line. The application must implement the method and return either a valid UserAgent object if it knows how to handle this incoming request, else None if it does not know how to handle this incoming request. For example, a client implementation will typically return None for a REGISTER request.
The Stack invokes receivedRequest and receivedResponse methods on app to indicate incoming request or response associated with a UserAgent. The first argument is the UserAgent object reference, and the second is the Message representing the request or response.
The Stack invokes the app.sending method to indicate that a message is about to be sent on a UserAgent. This method gets invoked before doing any DNS resolution for destination address, whereas the app.send gets invoked to actually send a formatted message string to the destination address. The sending method gives an opportunity to the application to inspect and modify the Message if needed before it is sent out.
If an incoming request, typically INVITE, is cancelled by the remote endpoint and the Stack receives a CANCEL request, then it invokes app.cancelled instead of app.receivedRequest. The second argument is the original request Message which was cancelled. This allows the Stack to handle the CANCEL internally, while still informing the application about the cancellation of the original request.
Sometimes the Stack needs to convert an existing UAS or UAC to a SIP dialog, e.g., while sending or receiving 2xx response to INVITE it creates a Dialog out of UserAgent. The application might have stored a reference to the original UserAgent object. The stack invokes the app.dialogCreated method to inform the application that a new dialog has been created from the previous UAC or UAS, and the application can then update its reference.
In our implementation the Dialog class is derived from the UserAgent class so that it can reuse a number of member variables and certain methods.
When the Stack receives a 401 or 407 response for an outbound request, it tries to authenticate and retry the request. To do so, it needs the credentials from the application. It invokes the app.authenticate method to get the credentials from the application. The application should populate the authentication username and password properties in the second argument, which is an object. If the credentials are known and populated in the object, the application returns True, otherwise it returns False.
Finally, the last interface method is for creating a timer. Since the core SIP implementation is independent of the thread or event model, whereas the timers in the SIP state machines require the knowledge of the model, we tried to remove this dependency by using the interface method, app.createTimer.
This method must return a valid application defined Timer object. An example Timer class is pseudo-coded below. The constructor takes an object, on which the timer callback timedout is invoked. The Timer object provides start([delay]) and stop() methods to control the timer. The delay property stores the delay supplied in the last call to start so that subsequent calls without an argument can reuse the previous value. The delay is supplied in milliseconds.
While going through this section you might have felt that the interface is very complex. Please trust me on this – given the requirement of keeping the thread-event model outside the SIP implementation, this is a very clean and well documented interface. There is some complexity because of the added flexibility requirement. Keeping the thread-event model outside the SIP implementation allows us to implement and experiment with various threading models for performance evaluation. Moreover, my source code has an example client which implements these interface methods along with the Timer class.
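One way to realise such a Timer in a threaded application is sketched here using threading.Timer from the standard library. Passing the Timer itself as the argument of timedout is an assumption of this example; an event-driven application would implement the same start and stop interface on top of its own event loop instead.

```python
import threading


class Timer:
    def __init__(self, app):
        self.app = app        # object whose timedout callback is invoked
        self.delay = 0        # last delay supplied to start, in milliseconds
        self._timer = None

    def start(self, delay=None):
        """(Re)start the timer; reuse the previous delay if none is given."""
        if delay is not None:
            self.delay = delay
        self.stop()
        # threading.Timer expects seconds, the interface uses milliseconds.
        self._timer = threading.Timer(self.delay / 1000.0,
                                      self.app.timedout, args=(self,))
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        """Cancel the timer if it is running."""
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```

Note that with this approach the callback runs in a separate thread, so the application has to serialise access to the stack itself.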
Now that we have described the application interface from Stack to the application, let’s look at the methods exposed by the Stack which can be invoked by the application. A simple application will typically need to create a Stack object and invoke the received method whenever any data is available on the associated transport for this stack. Occasionally the application may need to access the URI representing the listening point for this Stack, so that it can construct other addresses, e.g., Contact header. Note that you must create a new URI if needed instead of modifying the uri property.
Let’s define the Stack class. We maintain several properties in the Stack. Each stack has a list of SIP methods, serverMethods, that are supported on the incoming direction. The tag property stores a unique tag value that gets added in various To and From headers. The stack also maintains two tables: one for all the Transactions and other for all the Dialogs, which have been created.
As mentioned before, the constructor takes a reference to the application to invoke the callbacks and a reference to the transport information object.
The destructor is analogous: it cleans up all the references to dialogs and transactions. We keep an internal property named closing to know whether the stack is being closed. Certain operations are not done on a stack that is closing, e.g., a new send request should be ignored.
The uri property represents the local listening transport. The URI scheme is “sips:” for secure transport and “sip:” for non-secure. The host and port portions are derived from the original transport information supplied in the constructor.
The stack also provides an internal method to create a new Call-ID based on some random number and local transport host name. Usually the application doesn’t need to use this method. This is used by other modules in the stack implementation to create a new call identifier.
Similarly, the stack has an internal method to create a new Header object representing the Via header. In particular it uses the transport information to populate the SIP transport type, listening host name and port number. The rport header parameter is put without any value.
From RFC3261 p.39 – When the UAC creates a request, it MUST insert a Via into that request. The protocol name and protocol version in the header field MUST be SIP and 2.0, respectively.
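The uri property and the two internal helpers could be sketched as below. The names newCallId and createVia are assumptions of this example, and the results are returned as plain strings, whereas the text describes URI and Header objects.

```python
import random
from types import SimpleNamespace


class Stack:
    def __init__(self, app, transport):
        self.app, self.transport = app, transport

    @property
    def uri(self):
        # Secure transports use the "sips" scheme, all others use "sip".
        scheme = "sips" if self.transport.secure else "sip"
        return f"{scheme}:{self.transport.host}:{self.transport.port}"

    @property
    def newCallId(self):
        # A random number plus the local host keeps the identifier unique.
        return f"{random.randint(0, 2**31 - 1)}@{self.transport.host}"

    def createVia(self):
        t = self.transport
        # The rport parameter is added without a value.
        return f"Via: SIP/2.0/{t.type.upper()} {t.host}:{t.port};rport"


# 192.0.2.1 is a documentation address used only for illustration.
transport = SimpleNamespace(host="192.0.2.1", port=5060, type="udp",
                            secure=False)
stack = Stack(app=None, transport=transport)
```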
To send a message to the transport via this Stack, the other modules invoke the send method with the data. The destination address is supplied if available, but may be derived from the message itself if needed. If the destination address is supplied, it must be either a URI or a host-port tuple. If it is a URI, then the host-port tuple is derived from the host and port portion of the URI. If port number is missing, then default SIP port number is used. If the data to be sent is a Message object, then it is formatted into a string. Once we have the data string and destination host-port tuple, we can invoke the application’s send method to actually send the data using the associated transport. TODO: why is transport supplied?
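The destination handling described above can be sketched as follows. The function names are hypothetical, a URI is represented by any object with host and port attributes, and str() is assumed to format a Message object into its wire representation.

```python
from types import SimpleNamespace

DEFAULT_SIP_PORT = 5060


def resolve_destination(dest):
    """Turn a URI-like object or a (host, port) tuple into a tuple."""
    if isinstance(dest, tuple):
        return dest
    # URI-like object: fall back to the default port if none is present.
    return (dest.host, dest.port or DEFAULT_SIP_PORT)


def send(app, stack, data, dest):
    """Format the data if needed and hand it to the application's send."""
    host_port = resolve_destination(dest)
    text = data if isinstance(data, str) else str(data)
    app.send(text, host_port, stack)


# A recording application object, used here only to show the call.
sent = []
app = SimpleNamespace(send=lambda data, addr, stack: sent.append((data, addr)))
send(app, None, "OPTIONS sip:example.net SIP/2.0\r\n\r\n",
     SimpleNamespace(host="example.net", port=None))
```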
We need to do some additional processing on the SIP message based on the specification before it is sent out.
From RFC3261 p.143 – A client that sends a request to a multicast address MUST add the "maddr" parameter to its Via header field value containing the destination multicast address, and for IPv4, SHOULD add the "ttl" parameter with a value of 1.
Secondly, if a response needs to be sent and no destination address is supplied, then we use the viaUri property of the top-most Via header of the response, to decide where to send the response. Please see the description on how viaUri property is generated earlier in this chapter.
When the application receives a message on the transport, it must invoke the received method on the stack to supply the received message to the stack. The source address in the form of host-port tuple is also supplied by the application. This is the only method that the application needs to invoke on the incoming direction of a message.
In this method the stack parses the received data string into a SIP message. The message is then handed over to different methods for handling a request or a response. TODO: we need to send a 400 response if there is parsing error for a non-ACK request, and response can be sent.
For a received request, there are some additional checks that need to be done. In particular, it needs to check if a Via header exists, and if the top-most Via header is different from the source address, then it needs to update the top-most Via header with the received and rport attributes correctly.
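A sketch of this Via update is given below. It follows the rules of RFC 3261 (add received when the sent-by host differs from the source address) and RFC 3581 (fill in rport and always add received when rport is present). The function name and the representation of the Via parameters as a dictionary are assumptions of this example.

```python
def update_top_via(params, sent_by_host, source):
    """Update the parameters of the top-most Via of a received request.

    params       -- dict of Via parameters, e.g. {"branch": ..., "rport": None}
    sent_by_host -- host part of the Via sent-by field
    source       -- (host, port) tuple the packet was received from
    """
    host, port = source
    if "rport" in params:
        # RFC 3581: fill in the source port and always record the source IP.
        params["rport"] = port
        params["received"] = host
    elif sent_by_host != host:
        # RFC 3261: record the source IP if it differs from sent-by.
        params["received"] = host
    return params
```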
When an incoming request is received, we check if a matching transaction exists or not for this request. This is done by invoking the findTransaction method with the transaction identifier derived from the branch parameter of the top-most Via header, and optionally the request method. Usually the transaction identifier for CANCEL and ACK are different than the original INVITE. Hence we need the request method to distinguish between the two cases.
One special case is when the request method is ACK and the branch parameter is “0”. Some existing implementations, such as the “iptel.org” service, always put a branch parameter of “0” in the ACK. Thus, if earlier transactions have already received such an ACK, a new ACK request will match a previous transaction’s ACK. Either we need to fix the code to handle end-to-end ACK correctly in findTransaction, or we can do a work-around of not matching an ACK with branch as “0” to a transaction.
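The lookup, including the work-around for a branch of “0”, could be sketched as follows. The way the transaction identifier is built here (the branch alone, with the method appended for ACK and CANCEL) is an assumption made for illustration; the text only requires that ACK and CANCEL map to identifiers different from that of the original INVITE.

```python
def transaction_id(branch, method):
    """Build a transaction identifier from the Via branch and the method."""
    if method in ("ACK", "CANCEL"):
        return f"{branch}|{method}"
    return branch


def find_transaction(transactions, branch, method):
    """Look up a transaction in a dict keyed by transaction identifier."""
    if method == "ACK" and branch == "0":
        # Work-around: never match an ACK that carries branch "0".
        return None
    return transactions.get(transaction_id(branch, method))
```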
If a transaction is found, the request is delivered to the transaction object for further processing, and the stack module doesn’t need to care about this request anymore.
If no matching transaction is found, then a new server transaction needs to be created to handle the request. The creation of server transaction can be done by the dialog layer if the request is associated with an existing dialog, or by the UserAgent itself. If a new server transaction cannot be created for some reason, it sends a “404 Not Found” response to the source via the transport layer if the request is not ACK.
If the request is a CANCEL request, then the original INVITE transaction is searched for the branch parameter. If an original transaction is found, then the new server transaction is created out of the user of this original transaction object, which could be a UserAgent or Dialog object associated with that original transaction. If no original transaction is found, then appropriate response is returned via the transport layer.
If the request is not CANCEL and a tag parameter is present in the To header indicating that this request belongs to an existing dialog, then we search for an existing matching dialog. If a matching dialog is not found, then “481 Dialog does not exist” response is returned using the transport layer for non-ACK requests. For an ACK request if a matching dialog is not found, then we try to locate the original INVITE transaction. If a transaction is found, then the new request is delivered to that original transaction, otherwise we ignore the ACK request. No server transaction is created for an ACK request. TODO: check if this is the right processing in this case. If a matching dialog is found, then the new server transaction is created using that dialog object.
If the request is not CANCEL and there is no tag parameter in the To header, which means this is an out-of-dialog request, then the processing is as follows. The stack invokes the application’s callback to create a new UAS, i.e., UserAgent object in server mode. If the application accepts the request and creates a UserAgent object then the new server transaction is created out of this UserAgent object, otherwise a “405 Method not allowed” response is returned for non-ACK requests via the transport layer.
The Stack should respond to an out-of-dialog OPTIONS request even if the application doesn’t want to create a UAS. We do this by creating a “200 OK” response with the Allow header containing the list of supported methods, but no message body – since the stack doesn’t know about the session description.
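The decision logic for a request that matches no existing transaction can be summarised in one function. This is a simplified sketch: the lookups are passed in as callables, the function returns either the owner of the new server transaction or the response to send, and the special handling of an in-dialog ACK (locating the original INVITE transaction) is reduced to ignoring the request. The 481 reason phrase for an unmatched CANCEL is an assumption, since the text only calls for an appropriate response.

```python
def route_request(method, has_to_tag, find_invite_owner=lambda: None,
                  find_dialog=lambda: None, create_server=lambda: None):
    """Return (owner, response) for a request without a matching transaction.

    owner    -- UserAgent or Dialog that will own the new server transaction
    response -- (code, reason) to send statelessly, or None
    """
    if method == "CANCEL":
        owner = find_invite_owner()
        if owner is None:
            return None, (481, "Original transaction does not exist")
        return owner, None
    if has_to_tag:                       # in-dialog request
        dialog = find_dialog()
        if dialog is not None:
            return dialog, None
        if method == "ACK":
            return None, None            # simplified: ignore the ACK
        return None, (481, "Dialog does not exist")
    ua = create_server()                 # out-of-dialog request
    if ua is not None:
        return ua, None
    if method == "OPTIONS":
        return None, (200, "OK")         # sent with an Allow header
    if method == "ACK":
        return None, None                # never respond to an ACK
    return None, (405, "Method not allowed")
```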
If the incoming message is parsed into a response, the processing is as follows. If the Via header is missing, it generates an error. Otherwise it extracts the branch parameter from the top-most Via header, and the method attribute from the CSeq header. These properties are used to create a transaction identifier to match against all existing transactions.
If a matching transaction is found for the response, then the response is handed over to the transaction object for further processing.
If no matching transaction is found for the response, then the processing depends on the response and original request type. If the response is a 2xx-class response of an INVITE request, then we try to find a matching dialog for the response. If a matching dialog is found, the response is handed over to the Dialog object for further processing. If no matching dialog is found, it generates an error. Similarly, for all other responses it generates an error if no matching transaction is found.
As mentioned before, the Stack object maintains a table of active Dialog and Transaction objects. The findDialog method locates an existing dialog either by a dialog identifier string or using a Message object. The Dialog.extractId method is used to extract a dialog identifier string from a Message object. As we will see later, a dialog identifier consists of the Call-ID, local-tag and remote-tag properties. The method returns None if a dialog is not found.
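A sketch of the dialog lookup is shown below. The message is modelled as a dictionary with call_id, to_tag, from_tag and method keys, and the identifier is the three values joined with a separator; both are assumptions of this example. It also assumes the message was received rather than sent, so that the To tag is the local tag for a request and the From tag is the local tag for a response.

```python
def extract_dialog_id(message):
    """Build 'call-id|local-tag|remote-tag' from a received message."""
    to_tag, from_tag = message["to_tag"], message["from_tag"]
    if message.get("method"):
        # Received request: the peer wrote From, so our tag is in To.
        local_tag, remote_tag = to_tag, from_tag
    else:
        # Received response: we wrote From in the original request.
        local_tag, remote_tag = from_tag, to_tag
    return f"{message['call_id']}|{local_tag}|{remote_tag}"


def find_dialog(dialogs, arg):
    """Find a dialog by identifier string or by message; None if absent."""
    dialog_id = arg if isinstance(arg, str) else extract_dialog_id(arg)
    return dialogs.get(dialog_id)
```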
The findTransaction method can be used to locate an existing Transaction object given the transaction identifier string. The method returns None, if a transaction is not found.
The findOtherTransaction method returns another transaction other than the specified original transaction orig that matches the given request r of type Message. Although the implementation described below iterates through all transactions to find a match, a more efficient hash table can be built for such operation. The Transaction.equals method is invoked to compare the request against a transaction such that it is different from the original transaction. The method returns None if no other transaction is found. The method is useful in request merging and loop-detection logic in the UAS implementation.
To finish up the implementation of the Stack class, we define a bunch of wrapper methods to shield the callback invocation between the rest of the SIP implementation and the application. The rest of the SIP layers invoke these wrapper methods on the Stack, which in turn invokes the application callback. These wrapper methods are typically invoked by UAS/UAC or dialog layer. This allows us to be consistent across these callbacks by supplying the Stack reference as the last parameter in all the application callbacks, as we had discussed earlier.
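A few of these wrappers are sketched below to show the pattern. Only the convention of passing the Stack as the last argument is taken from the text; the exact parameter names are assumptions, and the remaining callbacks (sending, dialogCreated, authenticate) would follow the same shape.

```python
class Stack:
    def __init__(self, app, transport=None):
        self.app, self.transport = app, transport

    # Each wrapper forwards to the application callback of the same name
    # and appends a reference to this Stack as the last argument.
    def createServer(self, request, uri):
        return self.app.createServer(request, uri, self)

    def receivedRequest(self, ua, request):
        self.app.receivedRequest(ua, request, self)

    def receivedResponse(self, ua, response):
        self.app.receivedResponse(ua, response, self)

    def cancelled(self, ua, request):
        self.app.cancelled(ua, request, self)

    def createTimer(self, obj):
        return self.app.createTimer(obj, self)
```

Because the other layers only ever call the Stack, an application that runs several stacks can tell from the last argument which transport an event belongs to.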
The Stack layer we described controls the main core logic of the SIP implementation as well as the interface between the SIP implementation and the application. The other layers such as transaction, UAC/UAS and dialog are very precisely described in RFC3261, hence should be easier to implement compared to the Stack class. We describe the implementation of the other layers next.
From RFC3261 p.34 – A user agent represents an end system. It contains a user agent client (UAC), which generates requests, and a user agent server (UAS), which responds to them. A UAC is capable of generating a request based on some external stimulus (the user clicking a button, or a signal on a PSTN line) and processing a response. A UAS is capable of receiving a request and generating a response based on user input, external stimulus, the result of a program execution, or some other mechanism.
We implement UAC and UAS using a single class UserAgent. The object behaves differently depending on whether it is a client (UAC) or a server (UAS). The property server identifies whether it is a server (True) or a client (False). A UAC or a UAS can create a dialog on certain conditions, such as when a 2xx-class response to an INVITE UAC is received or when a 2xx class response to an INVITE UAS is sent.
UAC and UAS procedures depend strongly on two factors. First, based on whether the request or response is inside or outside of a dialog, and second, based on the method of a request.
The procedure to send a request or response or process an incoming request or response in a user agent is slightly different than that in a dialog context. But there are a number of properties that are common between a user agent and a dialog. Instead of defining separate independent classes for implementing a user agent and a dialog, we derive the Dialog class from the UserAgent class. Most of the properties defined here are reused in the derived class Dialog.
The constructor for a UAS takes the original incoming request Message, whereas the constructor for a UAC does not. The first argument is a reference to the associated Stack object so that various functions on the stack can be performed. The second argument is the optional original request needed for constructing a UAS. The last argument in the constructor defines whether this is a UAS or UAC.
Each user agent stores a reference to the last Transaction associated with this user agent. It also stores a reference to the cancel request Message if it was sent or needs to be sent. We will describe these properties later when they are used.
The callId property refers to the unique Call-ID for this user agent or dialog. It is either extracted from the existing request Message for a UAS, or created randomly on the stack context for a UAC.
Each user agent or dialog has localParty and remoteParty properties that refer to the SIP Address of the local entity or the remote entity. These addresses are put in the From and To headers of the generated request or extracted from the To and From headers of the received request, respectively.
The To and From headers also need a tag parameter. The user agent and dialog objects store the unique localTag and remoteTag properties. The local tag is derived from the unique tag associated with the stack context, but with additional randomness to it. This allows the tag parameter to uniquely identify the stack but also be different for different dialogs or user agents.
The subject property is used for the Subject header in the SIP request within this user agent or dialog.
From RFC3261 – A dialog contains certain pieces of state needed for further message transmissions within the dialog. This state consists of the dialog ID, a local sequence number (used to order requests from the UA to its peer), a remote sequence number (used to order requests from its peer to the UA), a local URI, a remote URI, remote target, a boolean flag called "secure", and a route set, which is an ordered list of URIs. The route set is the list of servers that need to be traversed to send a request to the peer.
The secure property indicates whether this user agent or dialog is operating on a secure connection using the “sips:” URI scheme or not.
The outgoing SIP requests should have a Max-Forwards header to limit the number of SIP hops to traverse among intermediate proxies. Similarly, the request can have a Route header to pre-determine the SIP hops to traverse for a request. These headers are generated using the maxForwards and routeSet properties. The default value of Max-Forwards header is 70. The routeSet is either derived using the Record-Route header as described later or pre-set by the application, e.g., for setting the outbound proxy.
The exact host-port to be used for sending a request or response is derived from the DNS lookup for outgoing requests, and for response from various header fields of incoming requests. The remoteCandidates property stores the list of potential DNS entries to try for sending an initial request. The localTarget and the remoteTarget properties store the local and remote addresses to which a request or response will be sent in a user agent or dialog.
The local sequence number is incremented for subsequent requests in a dialog. The remote sequence number is used to detect whether an incoming request in a dialog is obsolete and should be ignored or not. These pieces of state are stored in the localSeq and remoteSeq properties, respectively.
Certain outgoing requests or responses need to have a Contact header that represents the local user’s contact, so that subsequent incoming requests in this dialog can be sent to that contact. For example, the UAS can return a Contact header in the 2xx-class response to an incoming INVITE request. The remote party should then use the address specified in the Contact header to send future requests such as BYE within this dialog, provided the constraints of the route set allow it. We define the contact property to store this local SIP address, such that the user part of the address is derived from the localParty address whereas the host and port parts are derived from the local listening point in the stack context.
For example, if the local party is “sip:firstname.lastname@example.org” and the stack’s listening address is “sip:126.96.36.199:5080” then the contact property represents the address “sip:email@example.com:5080”.
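The derivation of the contact address can be sketched with simple string handling. The function name make_contact and the addresses in the usage line are made up for this example, and a real implementation would operate on parsed URI objects rather than strings.

```python
def make_contact(local_party, listening_uri):
    """Combine the user of local_party with the stack's host and port.

    local_party   -- e.g. "sip:alice@example.com"
    listening_uri -- e.g. "sip:192.0.2.10:5080"
    """
    scheme, hostport = listening_uri.split(":", 1)
    # Drop the scheme of the local party, then keep what precedes the "@".
    user = local_party.split(":", 1)[1].split("@", 1)[0]
    return f"{scheme}:{user}@{hostport}"


contact = make_contact("sip:alice@example.com", "sip:192.0.2.10:5080")
```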
Besides the above properties we also need two additional properties specific to our implementation. In particular, the autoack property indicates whether the implementation should automatically send an ACK for a 2xx-class response received for an INVITE request, or whether it should let the application send the ACK explicitly. If the implementation sends the ACK automatically, then it performs some functions of the application as per RFC 3261, because the specification defines that the ACK for a 2xx-class response to INVITE should be sent end-to-end by the application. However, in practice the application may not want to deal with the specifics of the SIP implementation. By default we let the SIP implementation send the ACK automatically, but if the application does want to be in control of sending the ACK, e.g., to change the message body in the ACK, then it can set the autoack property to False.
Finally, the auth property stores the various authentication contexts such as user credentials and other properties. The authentication context is used for authenticating an outgoing request that has been challenged by the remote party.
A string representation of the object just displays the Call-ID property of the object and the name of the class, whether it is a Dialog or a UserAgent.
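The properties listed so far, together with the string representation, might be laid out as in the following sketch. The property names are taken from the text; the constructor signature, the initial values other than 70 and True, and the exact output format of the string are assumptions made for illustration.

```python
class UserAgent:
    """Sketch of the per-call state shared by a UAC/UAS and a dialog."""

    def __init__(self, call_id=None):
        self.callId = call_id
        self.maxForwards = 70          # default Max-Forwards value
        self.routeSet = []             # from Record-Route or pre-set by the application
        self.remoteCandidates = None   # DNS results to try for an initial request
        self.localTarget = self.remoteTarget = None
        self.localSeq, self.remoteSeq = 0, 0
        self.autoack = True            # send ACK for 2xx to INVITE automatically
        self.auth = {}                 # authentication context

    def __repr__(self):
        # Class name plus Call-ID; the exact format is an assumption.
        return f"<{self.__class__.__name__} call-id={self.callId}>"


class Dialog(UserAgent):
    """A dialog reuses the same properties, so only the class name differs."""
```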
The application can create a new out-of-dialog request using the createRequest method on a newly created UserAgent object. The method sets the UserAgent object as a UAC.
From RFC3261 p.35 – Examples of requests sent outside of a dialog include an INVITE to establish a session and an OPTIONS to query for capabilities.
The method first checks whether the needed properties such as remoteParty and localParty are set correctly or not. It is an error if the remote party address is unknown when creating a request. If the local party address is unknown, the implementation uses the anonymous address.
The From header field allows for a display name. A UAC SHOULD use the display name "Anonymous", along with a syntactically correct, but otherwise meaningless URI (like sip:thisis@anonymous.invalid), if the identity of the client is to remain hidden.
(UAC) The initial Request-URI of the message SHOULD be set to the value of the URI in the To field.
One notable exception is the REGISTER method; behavior for setting the Request-URI of REGISTER is given in Section 10. (In Section 10) The "userinfo" and "@" components of the SIP URI MUST NOT be present.
The To header field first and foremost specifies the desired "logical" recipient of the request, or the address-of-record of the user or resource that is the target of this request. This may or may not be the ultimate recipient of the request.
(UAC) A request outside of a dialog MUST NOT contain a To tag; the tag in the To field of a request identifies the peer of the dialog. Since no dialog is established, no tag is present.
(Dialog) The URI in the To field of the request MUST be set to the remote URI from the dialog state.
The From header field indicates the logical identity of the initiator of the request, possibly the user's address-of-record. Like the To header field, it contains a URI and optionally a display name. It is used by SIP elements to determine which processing rules to apply to a request (for example, automatic call rejection). As such, it is very important that the From URI not contain IP addresses or the FQDN of the host on which the UA is running, since these are not logical names.
(UAC) The From field MUST contain a new "tag" parameter, chosen by the UAC.
(Dialog) The From URI of the request MUST be set to the local URI from the dialog state. The tag in the From header field of the request MUST be set to the local tag of the dialog ID. If the value of the remote or local tags is null, the tag parameter MUST be omitted from the To or From header fields, respectively.
The CSeq header field serves as a way to identify and order transactions. It consists of a sequence number and a method. The method MUST match that of the request. For non-REGISTER requests outside of a dialog, the sequence number value is arbitrary. The sequence number value MUST be expressible as a 32-bit unsigned integer and MUST be less than 2**31. As long as it follows the above guidelines, a client may use any mechanism it would like to select CSeq header field values.
The Call-ID header field acts as a unique identifier to group together a series of messages. It MUST be the same for all requests and responses sent by either UA in a dialog.
(Dialog) The Call-ID of the request MUST be set to the Call-ID of the dialog.
The Max-Forwards header field serves to limit the number of hops a request can transit on the way to its destination. It consists of an integer that is decremented by one at each hop. If the Max-Forwards value reaches 0 before the request reaches its destination, it will be rejected with a 483(Too Many Hops) error response.
A UAC MUST insert a Max-Forwards header field into each request it originates with a value that SHOULD be 70. This number was chosen to be sufficiently large to guarantee that a request would not be dropped in any SIP network when there were no loops, but not so large as to consume proxy resources when a loop does occur. Lower values should be used with caution and only in networks where topologies are known by the UA.
When the UAC creates a request, it MUST insert a Via into that request. The protocol name and protocol version in the header field MUST be SIP and 2.0, respectively. The Via header field value MUST contain a branch parameter. This parameter is used to identify the transaction created by that request. This parameter is used by both the client and the server.
(UAC) The Contact header field provides a SIP or SIPS URI that can be used to contact that specific instance of the UA for subsequent requests. The scope of the Contact is global. That is, the Contact header field value contains the URI at which the UA would like to receive requests, and this URI MUST be valid even if used in subsequent requests outside of any dialogs. If the Request-URI or top Route header field value contains a SIPS URI, the Contact header field MUST contain a SIPS URI as well.
(Dialog) A UAC SHOULD include a Contact header field in any target refresh requests within a dialog, and unless there is a need to change it, the URI SHOULD be the same as used in previous requests within the dialog. If the "secure" flag is true, that URI MUST be a SIPS URI.
A valid SIP request formulated by a UAC MUST, at a minimum, contain the following header fields: To, From, CSeq, Call-ID, Max-Forwards, and Via; all of these header fields are mandatory in all SIP requests.
In some special circumstances, the presence of a pre-existing route set can affect the Request-URI of the message. A pre-existing route set is an ordered set of URIs that identify a chain of servers, to which a UAC will send outgoing requests that are outside of a dialog. Commonly, they are configured on the UA by a user or service provider manually, or through some other non-SIP mechanism. When a provider wishes to configure a UA with an outbound proxy, it is RECOMMENDED that this be done by providing it with a pre-existing route set with a single URI, that of the outbound proxy.
When a pre-existing route set is present, the procedures for populating the Request-URI and Route header field detailed in Section 12.2.1.1 MUST be followed (even though there is no dialog), using the desired Request-URI as the remote target URI.
If the UAC supports extensions to SIP that can be applied by the server to the response, the UAC SHOULD include a Supported header field in the request listing the option tags (Section 19.2) for those extensions.
If the UAC wishes to insist that a UAS understand an extension that the UAC will apply to the request in order to process the request, it MUST insert a Require header field into the request listing the option tag for that extension. If the UAC wishes to apply an extension to the request and insist that any proxies that are traversed understand that extension, it MUST insert a Proxy-Require header field into the request listing the option tag for that extension.
SIP requests MAY contain a MIME-encoded message-body. Regardless of the type of body that a request contains, certain header fields must be formulated to characterize the contents of the body.
To: The To header field contains the address of record whose registration is to be created, queried, or modified. The To header field and the Request-URI field typically differ, as the former contains a user name. This address-of-record MUST be a SIP URI or SIPS URI.
From: The From header field contains the address-of-record of the person responsible for the registration. The value is the same as the To header field unless the request is a third-party registration.
Except as noted, the construction of the REGISTER request and the behavior of clients sending a REGISTER request is identical to the general UAC behavior.
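The rules quoted above for an out-of-dialog request can be condensed into a short sketch. It is not the project's createRequest method: the function name create_request, the dictionary representation of headers, and the placeholder anonymous URI are assumptions. The Via header is left out on the assumption that it is added when the request is handed to a transaction, since each attempt needs a fresh branch parameter.

```python
import random
import uuid


def create_request(method, remote_uri, local_uri=None, max_forwards=70):
    """Build the mandatory headers of a request sent outside of a dialog."""
    if not remote_uri:
        raise ValueError("remote party address is required to create a request")
    if local_uri:
        from_value = f"<{local_uri}>"
    else:
        # Placeholder anonymous identity used when the local party is unknown.
        from_value = '"Anonymous" <sip:anonymous@anonymous.invalid>'
    request_uri = remote_uri
    if method == "REGISTER":
        # For REGISTER the userinfo and "@" must not appear in the Request-URI.
        scheme, _, rest = remote_uri.partition(":")
        request_uri = f"{scheme}:{rest.split('@', 1)[-1]}"
    headers = {
        "To": f"<{remote_uri}>",                           # no tag outside a dialog
        "From": f"{from_value};tag={uuid.uuid4().hex[:10]}",
        "CSeq": f"{random.randrange(0, 2**31)} {method}",  # must be less than 2**31
        "Call-ID": uuid.uuid4().hex,
        "Max-Forwards": str(max_forwards),
    }
    return method, request_uri, headers


method, uri, headers = create_request("INVITE", "sip:bob@example.com",
                                      "sip:alice@example.net")
```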
From RFC3261 p.41 – The destination for the request is then computed. Unless there is local policy specifying otherwise, the destination MUST be determined by applying the DNS procedures described in [4] (RFC 3263) as follows. If the first element in the route set indicated a strict router (resulting in forming the request as described in Section 12.2.1.1), the procedures MUST be applied to the Request-URI of the request. Otherwise, the procedures are applied to the first Route header field value in the request (if one exists), or to the request's Request-URI if there is no Route header field present. These procedures yield an ordered set of address, port, and transports to attempt. Independent of which URI is used as input to the procedures of [4], if the Request-URI specifies a SIPS resource, the UAC MUST follow the procedures of [4] as if the input URI were a SIPS URI.
Local policy MAY specify an alternate set of destinations to attempt. If the Request-URI contains a SIPS URI, any alternate destinations MUST be contacted with TLS. Beyond that, there are no restrictions on the alternate destinations if the request contains no Route header field. This provides a simple alternative to a pre-existing route set as a way to specify an outbound proxy. However, that approach for configuring an outbound proxy is NOT RECOMMENDED; a pre-existing route set with a single URI SHOULD be used instead. If the request contains a Route header field, the request SHOULD be sent to the locations derived from its topmost value, but MAY be sent to any server that the UA is certain will honor the Route and Request-URI policies specified in this document (as opposed to those in RFC 2543). In particular, a UAC configured with an outbound proxy SHOULD attempt to send the request to the location indicated in the first Route header field value instead of adopting the policy of sending all messages to the outbound proxy.
The UAC SHOULD follow the procedures defined in [4] for stateful elements, trying each address until a server is contacted. Each try constitutes a new transaction, and therefore each carries a different topmost Via header field value with a new branch parameter. Furthermore, the transport value in the Via header field is set to whatever transport was determined for the target server.
From RFC3261 p.42 – In some cases, the response returned by the transaction layer will not be a SIP message, but rather a transaction layer error. When a timeout error is received from the transaction layer, it MUST be treated as if a 408 (Request Timeout) status code has been received. If a fatal transport error is reported by the transport layer (generally, due to fatal ICMP errors in UDP or connection failures in TCP), the condition MUST be treated as a 503 (Service Unavailable) status code.
When the transaction times out we try the next candidate address.
The following method is invoked to try the next candidate address from the DNS result for the destination.
A transport error in sending a request is treated as “503 Service unavailable”.
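The retry behavior can be illustrated with the sketch below. The class RequestSender, its method names, and the send callback are hypothetical; only remoteCandidates comes from the text. Falling back to 408 once the candidate list is exhausted follows the RFC excerpt above and is assumed to match the implementation.

```python
import uuid


class RequestSender:
    """Tries each DNS candidate in turn, one new transaction per attempt."""

    def __init__(self, candidates, send):
        self.remoteCandidates = list(candidates)
        self.send = send              # callback: send(host, port, branch)
        self.final_status = None

    def retry_next_candidate(self):
        """Send to the next candidate; return False when none are left."""
        if not self.remoteCandidates:
            return False
        host, port = self.remoteCandidates.pop(0)
        # Each attempt is a new transaction, so it gets a new branch parameter.
        branch = "z9hG4bK" + uuid.uuid4().hex
        self.send(host, port, branch)
        return True

    def on_timeout(self):
        """On transaction timeout try the next address, else report 408."""
        if not self.retry_next_candidate():
            self.final_status = 408

    def on_transport_error(self):
        """A transport error is treated as 503 Service Unavailable."""
        self.final_status = 503


sent = []
sender = RequestSender([("192.0.2.1", 5060), ("192.0.2.2", 5060)],
                       lambda host, port, branch: sent.append((host, port, branch)))
sender.retry_next_candidate()   # first attempt
sender.on_timeout()             # falls through to the second candidate
```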
From RFC3261 p.42 – Responses are first processed by the transport layer and then passed up to the transaction layer. The transaction layer performs its processing and then passes the response up to the TU. The majority of response processing in the TU is method specific. However, there are some general behaviors independent of the method.
If more than one Via header field value is present in a response, the UAC SHOULD discard the message.
If a 401 (Unauthorized) or 407 (Proxy Authentication Required) response is received, the UAC SHOULD follow the authorization procedures of Section 22.2 and Section 22.3 to retry the request with credentials.
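As a rough illustration of these two general checks, the following sketch decides what to do with a response before any method-specific processing. The function name classify_response and its string results are hypothetical.

```python
def classify_response(status, via_values, has_credentials):
    """Return the action a UAC takes for a response before method-specific handling."""
    if len(via_values) > 1:
        return "discard"                    # more than one Via: misrouted response
    if status in (401, 407) and has_credentials:
        return "retry-with-credentials"     # resend with Authorization information
    return "deliver"                        # hand the response to the application


action = classify_response(407, ["SIP/2.0/UDP 192.0.2.1;branch=z9hG4bK1"], True)
```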
The global method canCreateDialog determines whether a dialog can be created out of the given response for the given original request. The current implementation creates a dialog for a 2xx-class response for the original INVITE or SUBSCRIBE requests.
Dialogs are created through the generation of non-failure responses to requests with specific methods. Within this specification, only 2xx and 101-199 responses with a To tag, where the request was INVITE, will establish a dialog. A dialog established by a non-final response to a request is in the "early" state and it is called an early dialog.
In our implementation we do not support early dialogs.
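Given that early dialogs are not supported, the check reduces to a one-line predicate. The sketch below takes the method name and status code directly, which is an assumed simplification of passing the request and response messages.

```python
def can_create_dialog(request_method, response_status):
    """True if a 2xx-class response to INVITE or SUBSCRIBE can establish a dialog."""
    return 200 <= response_status < 300 and request_method in ("INVITE", "SUBSCRIBE")
```

A 180 (Ringing) response to INVITE therefore does not create a dialog in this scheme, even though RFC 3261 would allow an early dialog there.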
From RFC3261 p.46 – When a request outside of a dialog is processed by a UAS, there is a set of processing rules that are followed, independent of the method.
Note that request processing is atomic. If a request is accepted, all state changes associated with it MUST be performed. If it is rejected, all state changes MUST NOT be performed.
UASs SHOULD process the requests in the order of the steps that follow in this section (that is, starting with authentication, then inspecting the method, the header fields, and so on throughout the remainder of this section).
Once a request is authenticated (or authentication is skipped), the UAS MUST inspect the method of the request. If the UAS recognizes but does not support the method of a request, it MUST generate a 405 (Method Not Allowed) response. Procedures for generating responses are described in Section 8.2.6. The UAS MUST also add an Allow header field to the 405 (Method Not Allowed) response. The Allow header field MUST list the set of methods supported by the UAS generating the message. The Allow header field is presented in Section 20.5.
If the method is one supported by the server, processing continues.
However, the Request-URI identifies the UAS that is to process the request. If the Request-URI uses a scheme not supported by the UAS, it SHOULD reject the request with a 416 (Unsupported URI Scheme) response.
If the request has no tag in the To header field, the UAS core MUST check the request against ongoing transactions. If the From tag, Call-ID, and CSeq exactly match those associated with an ongoing transaction, but the request does not match that transaction, the UAS core SHOULD generate a 482 (Loop Detected) response and pass it to the server transaction.
Assuming the UAS decides that it is the proper element to process the request, it examines the Require header field, if present.
The Require header field is used by a UAC to tell a UAS about SIP extensions that the UAC expects the UAS to support in order to process the request properly. If a UAS does not understand an option-tag listed in a Require header field, it MUST respond by generating a response with status code 420 (Bad Extension). The UAS MUST add an Unsupported header field, and list in it those options it does not understand amongst those in the Require header field of the request. Note that Require and Proxy-Require MUST NOT be used in a SIP CANCEL request, or in an ACK request sent for a non-2xx response. These header fields MUST be ignored if they are present in these requests.
An ACK request for a 2xx response MUST contain only those Require and Proxy-Require values that were present in the initial request.
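The method, URI scheme, and Require checks can be sketched in order as below. The lists of supported methods, schemes, and extensions are assumptions chosen for the example, as is the function name check_request. The ACK case for non-2xx responses is left out of the sketch.

```python
SUPPORTED_METHODS = ("INVITE", "ACK", "BYE", "CANCEL", "OPTIONS")  # assumed set
SUPPORTED_SCHEMES = ("sip", "sips")
SUPPORTED_EXTENSIONS = ()                                          # none in this sketch


def check_request(method, request_uri, require=()):
    """Return (status, extra_headers) for a rejection, or (None, {}) to continue."""
    if method not in SUPPORTED_METHODS:
        return 405, {"Allow": ", ".join(SUPPORTED_METHODS)}
    scheme = request_uri.split(":", 1)[0].lower()
    if scheme not in SUPPORTED_SCHEMES:
        return 416, {}
    if method != "CANCEL":  # Require is ignored in a CANCEL request
        unsupported = [tag for tag in require if tag not in SUPPORTED_EXTENSIONS]
        if unsupported:
            return 420, {"Unsupported": ", ".join(unsupported)}
    return None, {}


result = check_request("INVITE", "sip:bob@example.com", ["100rel"])
```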
The CANCEL method requests that the TU at the server side cancel a pending transaction. The TU determines the transaction to be cancelled by taking the CANCEL request, and then assuming that the request method is anything but CANCEL or ACK and applying the transaction matching procedures of Section 17.2.3. The matching transaction is the one to be cancelled.
The processing of a CANCEL request at a server depends on the type of server. A stateless proxy will forward it, a stateful proxy might respond to it and generate some CANCEL requests of its own, and a UAS will respond to it. See Section 16.10 for proxy treatment of CANCEL.
A UAS first processes the CANCEL request according to the general UAS processing described in Section 8.2. However, since CANCEL requests are hop-by-hop and cannot be resubmitted, they cannot be challenged by the server in order to get proper credentials in an Authorization header field. Note also that CANCEL requests do not contain a Require header field.
If the UAS did not find a matching transaction for the CANCEL according to the procedure above, it SHOULD respond to the CANCEL with a 481 (Call Leg/Transaction Does Not Exist). If the transaction for the original request still exists, the behavior of the UAS on receiving a CANCEL request depends on whether it has already sent a final response for the original request. If it has, the CANCEL request has no effect on the processing of the original request, no effect on any session state, and no effect on the responses generated for the original request. If the UAS has not issued a final response for the original request, its behavior depends on the method of the original request. If the original request was an INVITE, the UAS SHOULD immediately respond to the INVITE with a 487 (Request Terminated). A CANCEL request has no impact on the processing of transactions with any other method defined in this specification.
Regardless of the method of the original request, as long as the CANCEL matched an existing transaction, the UAS answers the CANCEL request itself with a 200 (OK) response. This response is constructed following the procedures described in Section 8.2.6 noting that the To tag of the response to the CANCEL and the To tag in the response to the original request SHOULD be the same. The response to CANCEL is passed to the server transaction for transmission.
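The decision table for an incoming CANCEL is small enough to sketch directly. The function name handle_cancel and the dictionary describing the matched transaction are hypothetical; the status codes follow the RFC excerpt above.

```python
def handle_cancel(original):
    """Decide the responses for a CANCEL.

    original is None when no transaction matched, otherwise a dict with the
    keys "method" and "final_sent" describing the matched server transaction.
    Returns (status for the CANCEL, status for the original request or None).
    """
    if original is None:
        return 481, None                  # no matching transaction
    if original["method"] == "INVITE" and not original["final_sent"]:
        return 200, 487                   # terminate the pending INVITE
    return 200, None                      # CANCEL has no effect on the original


statuses = handle_cancel({"method": "INVITE", "final_sent": False})
```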
Finally, the request is delivered to the application for further processing after the UAS procedures are applied.
From RFC3261 p.49 – When a UAS wishes to construct a response to a request, it follows the general procedures detailed in the following subsections. Additional behaviors specific to the response code in question, which are not detailed in this section, may also be required.
Once all procedures associated with the creation of a response have been completed, the UAS hands the response back to the server transaction from which it received the request.
When a UAS responds to a request with a response that establishes a dialog (such as a 2xx to INVITE), the UAS MUST copy all Record-Route header field values from the request into the response (including the URIs, URI parameters, and any Record-Route header field parameters, whether they are known or unknown to the UAS) and MUST maintain the order of those values.
The UAS MUST add a Contact header field to the response. The Contact header field contains an address where the UAS would like to be contacted for subsequent requests in the dialog (which includes the ACK for a 2xx response in the case of an INVITE). Generally, the host portion of this URI is the IP address or FQDN of the host. The URI provided in the Contact header field MUST be a SIP or SIPS URI. If the request that initiated the dialog contained a SIPS URI in the Request-URI or in the top Record-Route header field value, if there was any, or the Contact header field if there was no Record-Route header field, the Contact header field in the response MUST be a SIPS URI. The URI SHOULD have global scope (that is, the same URI can be used in messages outside this dialog). The same way, the scope of the URI in the Contact header field of the INVITE is not limited to this dialog either. It can therefore be used in messages to the UAC even outside this dialog.
The UAS then constructs the state of the dialog. This state MUST be maintained for the duration of the dialog.
Additionally, the UAS MUST add a tag to the To header field in the response (with the exception of the 100 (Trying) response, in which a tag MAY be present). This serves to identify the UAS that is responding, possibly resulting in a component of a dialog ID. The same tag MUST be used for all responses to that request, both final and provisional (again excepting the 100 (Trying)).
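The To tag rule can be sketched as follows, using a plain dictionary of header strings. The function name create_response is hypothetical. Copying From, Call-ID, CSeq, and Via from the request follows the general response rules of RFC 3261 rather than the excerpt above.

```python
def create_response(request_headers, status, local_tag):
    """Copy the request's identifying headers and add the local tag to To."""
    headers = {name: request_headers[name]
               for name in ("From", "Call-ID", "CSeq", "Via")
               if name in request_headers}
    to_value = request_headers["To"]
    # A 100 (Trying) response may omit the tag; all other responses carry it.
    if status != 100 and ";tag=" not in to_value:
        to_value = f"{to_value};tag={local_tag}"
    headers["To"] = to_value
    return status, headers


request_headers = {
    "To": "<sip:bob@example.com>",
    "From": "<sip:alice@example.net>;tag=1928",
    "Call-ID": "a84b4c76",
    "CSeq": "314159 INVITE",
    "Via": "SIP/2.0/UDP 192.0.2.1;branch=z9hG4bK776",
}
status, response_headers = create_response(request_headers, 180, "a6c85cf")
```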
From RFC3261 p.53 – The CANCEL request, as the name implies, is used to cancel a previous request sent by a client. Specifically, it asks the UAS to cease processing the request and to generate an error response to that request. CANCEL has no effect on a request to which a UAS has already given a final response. Because of this, it is most useful to CANCEL requests to which it can take a server long time to respond. For this reason, CANCEL is best for INVITE requests, which can take a long time to generate a response. In that usage, a UAS that receives a CANCEL request for an INVITE, but has not yet sent a final response, would "stop ringing", and then respond to the INVITE with a specific error response (a 487).
The following procedures are used to construct a CANCEL request. The Request-URI, Call-ID, To, the numeric part of CSeq, and From header fields in the CANCEL request MUST be identical to those in the request being cancelled, including tags. A CANCEL constructed by a client MUST have only a single Via header field value matching the top Via value in the request being cancelled. Using the same values for these header fields allows the CANCEL to be matched with the request it cancels (Section 9.2 indicates how such matching occurs). However, the method part of the CSeq header field MUST have a value of CANCEL. This allows it to be identified and processed as a transaction in its own right (See Section 17).
If the request being cancelled contains a Route header field, the CANCEL request MUST include that Route header field's values. The CANCEL request MUST NOT contain any Require or Proxy-Require header fields.
Once the CANCEL is constructed, the client SHOULD check whether it has received any response (provisional or final) for the request being cancelled (herein referred to as the "original request").
If no provisional response has been received, the CANCEL request MUST NOT be sent; rather, the client MUST wait for the arrival of a provisional response before sending the request. If the original request has generated a final response, the CANCEL SHOULD NOT be sent, as it is an effective no-op, since CANCEL has no effect on requests that have already generated a final response. When the client decides to send the CANCEL, it creates a client transaction for the CANCEL and passes it the CANCEL request along with the destination address, port, and transport. The destination address, port, and transport for the CANCEL MUST be identical to those used to send the original request.
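A sketch of building the CANCEL and of the send condition is shown below. Headers are plain strings in a dictionary, with Via and Route as lists; this representation and the function names create_cancel and may_send_cancel are assumptions.

```python
def create_cancel(request_uri, headers):
    """Build a CANCEL that matches the request being cancelled."""
    cancel = {name: headers[name] for name in ("To", "From", "Call-ID")}
    sequence = headers["CSeq"].split()[0]
    cancel["CSeq"] = f"{sequence} CANCEL"        # same number, method CANCEL
    cancel["Via"] = headers["Via"][:1]           # only the top Via value
    cancel["Max-Forwards"] = headers.get("Max-Forwards", "70")
    if "Route" in headers:
        cancel["Route"] = list(headers["Route"])
    # Require and Proxy-Require are deliberately never copied.
    return "CANCEL", request_uri, cancel


def may_send_cancel(provisional_received, final_received):
    """CANCEL is sent only after a provisional and before a final response."""
    return provisional_received and not final_received


invite_headers = {
    "To": "<sip:bob@example.com>",
    "From": "<sip:alice@example.net>;tag=1928",
    "Call-ID": "a84b4c76",
    "CSeq": "314159 INVITE",
    "Via": ["SIP/2.0/UDP 192.0.2.1;branch=z9hG4bK776"],
    "Route": ["<sip:proxy.example.net;lr>"],
    "Require": "100rel",
}
method, uri, cancel_headers = create_cancel("sip:bob@example.com", invite_headers)
```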
From RFC3261 p.196 – When the originating UAC receives the 401 (Unauthorized), it SHOULD, if it is able, re-originate the request with the proper credentials. The UAC may require input from the originating user before proceeding. Once authentication credentials have been supplied (either directly by the user, or discovered in an internal keyring), UAs SHOULD cache the credentials for a given value of the To header field and "realm" and attempt to re-use these values on the next request for that destination. UAs MAY cache credentials in any way they would like.
When a request receives a 401 or 407 response in a UAC, we invoke the authenticate method. If the application has supplied the local user’s credentials, then we use that to resend the request in a new transaction, in the same UAC. If the request was resent, then it returns True, otherwise it returns False.
Once credentials have been located, any UA that wishes to authenticate itself with a UAS or registrar -- usually, but not necessarily, after receiving a 401 (Unauthorized) response -- MAY do so by including an Authorization header field with the request. The Authorization field value consists of credentials containing the authentication information of the UA for the realm of the resource being requested as well as parameters required in support of authentication and replay protection.
When a UAC resubmits a request with its credentials after receiving a 401 (Unauthorized) or 407 (Proxy Authentication Required) response, it MUST increment the CSeq header field value as it would normally when sending an updated request.
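The following sketch shows the shape of such an authenticate step using HTTP Digest without the qop parameter. It is an illustration under stated assumptions: the dictionary layout of the request and challenge, the resend callback, and the function names are hypothetical, and a real implementation would also handle qop, opaque, and stale nonces.

```python
import hashlib


def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce):
    """Digest response without qop: MD5(HA1:nonce:HA2)."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{nonce}:{ha2}")


def authenticate(request, status, challenge, credentials, resend):
    """Resend the request with credentials; return True if it was resent."""
    if status not in (401, 407) or not credentials:
        return False
    username, password = credentials
    realm, nonce, uri = challenge["realm"], challenge["nonce"], request["uri"]
    response = digest_response(username, realm, password,
                               request["method"], uri, nonce)
    name = "Authorization" if status == 401 else "Proxy-Authorization"
    request["headers"][name] = (
        f'Digest username="{username}", realm="{realm}", nonce="{nonce}", '
        f'uri="{uri}", response="{response}"'
    )
    request["cseq"] += 1      # a resubmitted request carries an incremented CSeq
    resend(request)           # goes out in a new transaction
    return True


resent = []
register = {"method": "REGISTER", "uri": "sip:example.com", "cseq": 1, "headers": {}}
challenge = {"realm": "example.com", "nonce": "abc123"}
ok = authenticate(register, 401, challenge, ("alice", "secret"), resent.append)
```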
From RFC3261 p.69 – A key concept for a user agent is that of a dialog. A dialog represents a peer-to-peer SIP relationship between two user agents that persists for some time. The dialog facilitates sequencing of messages between the user agents and proper routing of requests between both of them. The dialog represents a context in which to interpret SIP messages.
Since a number of properties are shared between the UAC/UAS and the dialog context, we derive the Dialog class from the UserAgent class.
The route set MUST be set to the list of URIs in the Record-Route header field from the request, taken in order and preserving all URI parameters. If no Record-Route header field is present in the request, the route set MUST be set to the empty set. This route set, even if empty, overrides any pre-existing route set for future requests in this dialog. The remote target MUST be set to the URI from the Contact header field of the request.
If the request arrived over TLS, and the Request-URI contained a SIPS URI, the "secure" flag is set to TRUE.
The remote sequence number MUST be set to the value of the sequence number in the CSeq header field of the request. The local sequence number MUST be empty. The call identifier component of the dialog ID MUST be set to the value of the Call-ID in the request. The local tag component of the dialog ID MUST be set to the tag in the To field in the response to the request (which always includes a tag), and the remote tag component of the dialog ID MUST be set to the tag from the From field in the request. A UAS MUST be prepared to receive a request without a tag in the From field, in which case the tag is considered to have a value of null.
The remote URI MUST be set to the URI in the From field, and the local URI MUST be set to the URI in the To field.
When a UAC sends a request that can establish a dialog (such as an INVITE) it MUST provide a SIP or SIPS URI with global scope (i.e., the same SIP URI can be used in messages outside this dialog) in the Contact header field of the request. If the request has a Request-URI or a topmost Route header field value with a SIPS URI, the Contact header field MUST contain a SIPS URI. When a UAC receives a response that establishes a dialog, it constructs the state of the dialog. This state MUST be maintained for the duration of the dialog.
If the request was sent over TLS, and the Request-URI contained a SIPS URI, the "secure" flag is set to TRUE.
The route set MUST be set to the list of URIs in the Record-Route header field from the response, taken in reverse order and preserving all URI parameters. If no Record-Route header field is present in the response, the route set MUST be set to the empty set. This route set, even if empty, overrides any pre-existing route set for future requests in this dialog. The remote target MUST be set to the URI from the Contact header field of the response.
The local sequence number MUST be set to the value of the sequence number in the CSeq header field of the request. The remote sequence number MUST be empty (it is established when the remote UA sends a request within the dialog). The call identifier component of the dialog ID MUST be set to the value of the Call-ID in the request. The local tag component of the dialog ID MUST be set to the tag in the From field in the request, and the remote tag component of the dialog ID MUST be set to the tag in the To field of the response. A UAC MUST be prepared to receive a response without a tag in the To field, in which case the tag is considered to have a value of null.
The remote URI MUST be set to the URI in the To field, and the local URI MUST be set to the URI in the From field.
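The UAS and UAC rules mirror each other, which the sketch below makes explicit. Messages are represented as simple dictionaries with hypothetical keys, and the function name create_dialog_state is an assumption; the state keys reuse the property names introduced earlier.

```python
def create_dialog_state(request, response, server):
    """Derive dialog state from the original request and its response."""
    state = {"callId": request["call_id"]}
    if server:  # UAS: route set in order, taken from the request
        state.update(
            routeSet=list(request.get("record_route", [])),
            remoteTarget=request["contact"],
            remoteSeq=request["cseq"], localSeq=None,
            localTag=response["to_tag"], remoteTag=request.get("from_tag"),
            localParty=request["to_uri"], remoteParty=request["from_uri"],
        )
    else:       # UAC: route set in reverse order, taken from the response
        state.update(
            routeSet=list(reversed(response.get("record_route", []))),
            remoteTarget=response["contact"],
            localSeq=request["cseq"], remoteSeq=None,
            localTag=request["from_tag"], remoteTag=response.get("to_tag"),
            localParty=request["from_uri"], remoteParty=request["to_uri"],
        )
    return state


invite = {"call_id": "a84b4c76", "cseq": 1, "from_uri": "sip:alice@example.net",
          "from_tag": "1928", "to_uri": "sip:bob@example.com",
          "contact": "sip:alice@192.0.2.1",
          "record_route": ["sip:p1.example.net;lr", "sip:p2.example.com;lr"]}
ok200 = {"to_tag": "a6c85cf", "contact": "sip:bob@192.0.2.4",
         "record_route": ["sip:p1.example.net;lr", "sip:p2.example.com;lr"]}
uac = create_dialog_state(invite, ok200, server=False)
uas = create_dialog_state(invite, ok200, server=True)
```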
A dialog ID is also associated with all responses and with any request that contains a tag in the To field. The rules for computing the dialog ID of a message depend on whether the SIP element is a UAC or UAS. For a UAC, the Call-ID value of the dialog ID is set to the Call-ID of the message, the remote tag is set to the tag in the To field of the message, and the local tag is set to the tag in the From field of the message (these rules apply to both requests and responses). As one would expect for a UAS, the Call-ID value of the dialog ID is set to the Call-ID of the message, the remote tag is set to the tag in the From field of the message, and the local tag is set to the tag in the To field of the message.
The method extractId extracts the dialog identifier string from a given incoming request or response Message.
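A sketch of this extraction, assuming the identifier is the Call-ID, local tag, and remote tag joined by a separator (the "|" separator and the flat argument list are assumptions). An incoming request is processed as a UAS, so the To tag is local; an incoming response is processed as a UAC, so the From tag is local.

```python
def extract_id(call_id, from_tag, to_tag, is_request):
    """Return the dialog identifier for an incoming request or response."""
    if is_request:
        local_tag, remote_tag = to_tag, from_tag   # we are the UAS
    else:
        local_tag, remote_tag = from_tag, to_tag   # we are the UAC
    return f"{call_id}|{local_tag}|{remote_tag}"


# Our INVITE used From tag "a"; the peer answered with To tag "b".
from_response = extract_id("c1", from_tag="a", to_tag="b", is_request=False)
# A later BYE from the peer carries its own tag "b" in From and ours in To.
from_request = extract_id("c1", from_tag="b", to_tag="a", is_request=True)
```

Both messages map to the same identifier, which is what lets the stack find the dialog for either kind of incoming message.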
The constructor takes the original request Message, the server flag indicating whether this is a UAS or UAC, and the original transaction reference to create the Dialog object out of an existing UAS or UAC.
Destroying an existing dialog is done by invoking the close method. It removes the dialog object from the table of dialogs maintained in the stack context. TODO: we should set the stack property to None, but that causes a problem in the receivedResponse method if the stack is None.
A dialog is identified at each UA with a dialog ID, which consists of a Call-ID value, a local tag and a remote tag. The dialog ID at each UA involved in the dialog is not the same. Specifically, the local tag at one UA is identical to the remote tag at the peer UA. The tags are opaque tokens that facilitate the generation of unique dialog IDs.
The id property refers to the dialog identifier string, which is constructed from the Call-ID, the local tag and the remote tag parameters.
From RFC3261 p.73 – A request within a dialog is constructed by using many of the components of the state stored as part of the dialog. The tag in the To header field of the request MUST be set to the remote tag of the dialog ID.
If the route set is empty, the UAC MUST place the remote target URI into the Request-URI. The UAC MUST NOT add a Route header field to the request.
If the route set is not empty, and its first URI does not contain the lr parameter, the UAC MUST place the first URI from the route set into the Request-URI, stripping any parameters that are not allowed in a Request-URI. The UAC MUST add a Route header field containing the remainder of the route set values in order, including all parameters. The UAC MUST then place the remote target URI into the Route header field as the last value.
If the route set is not empty, and the first URI in the route set contains the lr parameter (see Section 19.1.1), the UAC MUST place the remote target URI into the Request-URI and MUST include a Route header field containing the route set values in order, including all parameters.
Once the request has been constructed, the address of the server is computed and the request is sent, using the same procedures for requests outside of a dialog.
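The three route set cases can be sketched with URIs held as plain strings. The function names has_lr and route_request are hypothetical, and stripping of parameters not allowed in a Request-URI is omitted from this illustration.

```python
def has_lr(uri):
    """True if the URI carries the lr (loose routing) parameter."""
    params = uri.split("?", 1)[0].split(";")[1:]
    return any(param.split("=", 1)[0].lower() == "lr" for param in params)


def route_request(route_set, remote_target):
    """Return (Request-URI, Route header values) for a request in a dialog."""
    if not route_set:
        return remote_target, []                    # no Route header at all
    if has_lr(route_set[0]):
        return remote_target, list(route_set)       # loose routing
    # Strict routing: first hop becomes the Request-URI, target goes last.
    return route_set[0], list(route_set[1:]) + [remote_target]


target = "sip:bob@192.0.2.4"
loose = route_request(["sip:p1.example.com;lr", "sip:p2.example.com;lr"], target)
strict = route_request(["sip:p1.example.com", "sip:p2.example.com;lr"], target)
```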
To send a new response in this dialog for the first pending server transaction the application invokes the sendResponse method. The first argument can be either the response status code or a well formatted response Message.
From RFC3261 p.122 – SIP is a transactional protocol: interactions between components take place in a series of independent message exchanges. Specifically, a SIP transaction consists of a single request and any responses to that request, which include zero or more provisional responses and one or more final responses.
Transactions have a client side and a server side. The client side is known as a client transaction and the server side as a server transaction. The client transaction sends the request, and the server transaction sends the response. The client and server transactions are logical functions that are embedded in any number of elements. Specifically, they exist within user agents and stateful proxy servers.
The purpose of the client transaction is to receive a request from the element in which the client is embedded (call this element the "Transaction User" or TU; it can be a UA or a stateful proxy), and reliably deliver the request to a server transaction. The client transaction is also responsible for receiving responses and delivering them to the TU, filtering out any response retransmissions or disallowed responses (such as a response to ACK).
Similarly, the purpose of the server transaction is to receive requests from the transport layer and deliver them to the TU. The server transaction filters any request retransmissions from the network. The server transaction accepts responses from the TU and delivers them to the transport layer for transmission over the network.
We define a class Transaction to represent a SIP transaction. This is an abstract class. The actual implementations of client and server transactions are done in the ClientTransaction and ServerTransaction classes, respectively. These classes are used for non-INVITE transactions. SIP defines different processing for INVITE and non-INVITE transactions. The INVITE transactions are implemented using the InviteClientTransaction and InviteServerTransaction classes. The transaction user (or TU) in our implementation is the UserAgent object (or the derived Dialog object).
Let’s start by defining the transaction object properties. A transaction is identified by an identifier, id. The transaction identifier is usually derived from the branch parameter of the top-most Via header. Thus, we store the branch property as well. Each transaction has an original SIP request from which the transaction was created. The associated transport information and the remote host-port tuple give information about where to send a request or response in a transaction. We store the tag supplied by the TU for a server transaction. The server Boolean flag indicates whether this is a client (False) or server (True) transaction. The transaction module may need to use the functions from the Stack object referred to by the stack property. A reference to the transaction user (TU) is stored in the app property. Finally, the transaction has a collection of active timers as well as timer duration values for the different types of timers defined in the specification.
When a transaction is closed, we stop all the timers and remove this transaction instance from the collection of transactions maintained by the Stack. As described earlier, the stack object maintains a table of all the transactions indexed by the transaction identifier string.
Note that we couldn’t use the destructor method, because as long as a reference to this transaction is stored in the transactions table, the destructor will not get invoked. Hence we need an explicit close method to destroy the transaction. The close method gets invoked whenever the transaction state is changed to “terminating”. Thus we define another property, state, to maintain the transaction state and explicitly invoke close when the state changes to “terminating”.
From RFC3261 – The client transaction MUST be destroyed the instant it enters the "Terminated" state. This is actually necessary to guarantee correct operation.
Once the transaction is in the "Terminated" state, it MUST be destroyed immediately. As with client transactions, this is needed to ensure reliability of the 2xx responses to INVITE.
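A minimal sketch of the state property and the explicit close method described above might look as follows. The Stack class here is a stand-in that models only the transactions table, and the timer objects are assumed to offer a stop() method; both are assumptions for illustration.

```python
class Stack:
    """Stand-in for the SIP stack: only the transactions table is modelled."""

    def __init__(self):
        self.transactions = {}


class Transaction:
    def __init__(self, stack, id):
        self.stack = stack
        self.id = id
        self.timers = {}                 # name -> timer object with a stop() method
        self._state = None
        stack.transactions[id] = self    # this reference keeps the object alive

    @property
    def state(self):
        return self._state

    @state.setter
    def state(self, value):
        self._state = value
        if value == "terminating":
            self.close()                 # destroy as soon as the state is reached

    def close(self):
        """Stop all timers and drop the stack's reference to this transaction."""
        for timer in self.timers.values():
            timer.stop()
        self.timers.clear()
        self.stack.transactions.pop(self.id, None)
```

Routing the cleanup through the property setter means that no state machine code has to remember to call close itself.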
SIP defines four headers (To, From, CSeq and Call-ID) as transaction identifying headers. The values of these header fields remain the same within a transaction, although the header parameters may change; e.g., the tag parameter gets added to the To header. We define a read-only property, headers, which gives a list of these four headers.
SIP imposes certain restrictions on the creation of the branch parameter. In particular, an RFC3261-compliant implementation must start the branch parameter with “z9hG4bK” to distinguish it from older RFC2543 implementations. In practice, an implementation must choose the branch parameter carefully, so that it can be used to match a transaction, i.e., act as a transaction identifier. Most of the implementations that I have seen use some combination of transaction headers to create the branch parameter.
From RFC3261 p.29 – The Via header field value MUST contain a branch parameter. This parameter is used to identify the transaction created by that request. This parameter is used by both the client and the server.
The branch ID inserted by an element compliant with this specification MUST always begin with the characters "z9hG4bK". These 7 characters are used as a magic cookie (7 is deemed sufficient to ensure that an older RFC 2543 implementation would not pick such a value), so that servers receiving the request can determine that the branch ID was constructed in the fashion described by this specification (that is, globally unique). Beyond this requirement, the precise format of the branch token is implementation-defined.
The function createBranch defined below uses the information from the transaction identifying headers, along with the server flag. The server flag is needed so that a client transaction doesn’t interfere with a server transaction while searching for a transaction in the transactions table. Note that we only use the header value of To and From without the parameters, and the number field from CSeq header without the method name. This is important so that a response with tag parameter in the To header gets matched with the original transaction that didn’t have the tag parameter in the To header. Secondly, the branch parameter for the CANCEL and ACK request remains the same as that of the original INVITE if we don’t include the method of CSeq header in computing the branch. Finally we use a one-way hash such as MD5 and modified Base64 encoding to construct a random-looking branch parameter from the assembled data. The modified Base64 encoding is needed because certain characters in the original Base64 grammar are not allowed in the branch grammar by the specification.
For added flexibility, we allow overloaded method invocation where the first argument can be either a Message object or a list of the individual fields needed for computing the branch.
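The branch computation described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact listing: the name create_branch, the '|' field separator and the replacement of '=' padding with '.' are choices made here, and only the list-of-fields form of the argument is handled (not the Message form).

```python
import hashlib
from base64 import urlsafe_b64encode

MAGIC_COOKIE = "z9hG4bK"


def create_branch(fields, server):
    """Derive a branch value from (To, From, Call-ID, CSeq number) and the server flag.

    `fields` holds the To and From values without parameters, the Call-ID,
    and the CSeq number without the method name.
    """
    to, from_, call_id, cseq = fields
    data = "|".join([str(to).lower(), str(from_).lower(),
                     str(call_id), str(cseq), str(server)])
    digest = hashlib.md5(data.encode("utf-8")).digest()
    # URL-safe Base64 avoids '/', and the '=' padding is replaced with '.',
    # since neither '/' nor '=' is allowed in a SIP token.
    encoded = urlsafe_b64encode(digest).decode("ascii").replace("=", ".")
    return MAGIC_COOKIE + encoded


fields = ("sip:bob@example.com", "sip:alice@example.com", "a84b4c76e66710", 314159)
client_branch = create_branch(fields, server=False)
server_branch = create_branch(fields, server=True)
```

The same fields always give the same branch, so a CANCEL or ACK built from the same headers maps to the INVITE's branch, while the server flag keeps client and server entries apart.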
From RFC3261 – The branch parameter value MUST be unique across space and time for all requests sent by the UA. The exceptions to this rule are CANCEL and ACK for non-2xx responses. A CANCEL request will have the same value of the branch parameter as the request it cancels. An ACK for a non-2xx response will also have the same branch ID as the INVITE whose response it acknowledges.
The uniqueness property of the branch parameter allows us to use it as the transaction identifier, with a couple of exceptions: if the method is either CANCEL or ACK, then even though the branch parameter is the same as that of the original INVITE request, the transaction identifier should be different. We define the createId method to construct such a transaction identifier, which appends the method name if the method is ACK or CANCEL. The transaction identifier is used as a key in our lookup table of transactions.
When the TU wishes to create a new server transaction, it invokes the createServer factory method, supplying the incoming request Message, the associated transport information on which the request was received, and the application tag parameter to use in the transaction response. The method hides the implementation details of what type of object is used for the particular transaction, e.g., whether INVITE or non-INVITE transactions use separate implementations.
For a server transaction, certain transaction properties are derived from the incoming message, e.g., the branch and remote address properties are extracted from the top-most Via header. If a branch parameter is missing in the request, probably because the sender implements an older version of the specification, then the branch parameter is constructed from the request message as described earlier.
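A minimal sketch of such an identifier function is shown below. The name create_id and the '|' separator are assumptions; the only behaviour taken from the text is that the method name is appended for ACK and CANCEL.

```python
def create_id(branch, method):
    """Return the lookup key for a transaction.

    ACK and CANCEL share the INVITE's branch, so the method name is appended
    for them to keep the keys distinct. The '|' separator is an assumption.
    """
    if method in ("ACK", "CANCEL"):
        return branch + "|" + method
    return branch
```

With this scheme an INVITE, its CANCEL and its ACK occupy three separate slots in the transactions table even though they carry the same branch.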
Finally, the transaction identifier is created, the transaction is stored in the transactions table, the transaction state machine is started, and the transaction object is returned as the newly created server transaction.
From RFC3261 – The TU communicates with the client transaction through a simple interface. When the TU wishes to initiate a new transaction, it creates a client transaction and passes it the SIP request to send and an IP address, port, and transport to which to send it. The client transaction begins execution of its state machine. Valid responses are passed up to the TU from the client transaction.
There are two types of client transaction state machines, depending on the method of the request passed by the TU. One handles client transactions for INVITE requests. This type of machine is referred to as an INVITE client transaction. Another type handles client transactions for all requests except INVITE and ACK. This is referred to as a non-INVITE client transaction.
When the TU wishes to create a client transaction for sending out a new request, it uses the createClient method and supplies the request Message, the associated transport that will be used to send the request, and the remote host-port tuple to which the request should be sent. As with the creation of a server transaction, this method also hides the implementation details of the type of transaction object created. The rest of the processing is very similar to the previous method.
The equals method on the transaction is used by the Stack.findOtherTransaction method to check whether a request r matches an existing transaction t1, such that transaction t1 is different from original transaction t2, but has the same direction (client or server) as the original transaction t2. When an incoming request matches another transaction (t1) even though the request is part of another original transaction (t2), we have a request merging situation, hence the request should get rejected.
To create an ACK request in a client transaction, we use the original request URI of the transaction with the transaction identifying headers. The method returns None for a server transaction.
To create a CANCEL request in a client transaction, we again use the original request URI of the transaction with the transaction identifying headers. Additionally, for a CANCEL request the Route header is copied from the original request if needed, and only one Via header, i.e., the top-most one, is kept in the request. The method returns None for a server transaction.
To create a response in a server transaction, we use the original request and add the response status code (response) and reason phrase (responsetext). If the response is not “100 Trying” then we also add the tag parameter in the To header if one is missing. TODO: this should be moved to UAS? The method returns None for a client transaction.
In the transaction state machine, several timers exist. The startTimer method is used to start a new named timer for the given timeout duration. If the named timer doesn’t already exist in this transaction, we invoke the application callback createTimer to create it.
When the timer expires, we invoke the timeout handler method which is implemented for individual transaction state machines. The timedout method is invoked by the actual timer implementation of the application, the one that was returned in createTimer.
When cleaning up a transaction, we may need to stop all the timers associated with this transaction. The stopTimers method can be used for that purpose.
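The interplay of startTimer, timedout and stopTimers can be sketched as below. The ManualTimer and App classes are hypothetical stand-ins for the application's timer implementation, and the createTimer(owner) signature is an assumption; a real application would back the timer with its own event loop.

```python
class ManualTimer:
    """Hypothetical application timer; calling fire() simulates its expiry."""

    def __init__(self, owner):
        self.owner = owner
        self.delay = None
        self.running = False

    def start(self, delay):
        self.delay, self.running = delay, True

    def stop(self):
        self.running = False

    def fire(self):
        if self.running:
            self.running = False
            self.owner.timedout(self)


class App:
    def createTimer(self, owner):
        return ManualTimer(owner)


class Transaction:
    def __init__(self, app):
        self.app = app
        self.timers = {}      # timer name -> timer object
        self.expired = []     # record of (name, delay), for illustration only

    def startTimer(self, name, timeout):
        timer = self.timers.get(name)
        if timer is None:     # create the timer on first use
            timer = self.timers[name] = self.app.createTimer(self)
        timer.start(timeout)

    def timedout(self, timer):
        """Called by the application's timer when it expires."""
        for name, candidate in list(self.timers.items()):
            if candidate is timer:
                del self.timers[name]
                self.timeout(name, timer.delay)
                break

    def timeout(self, name, delay):
        # Individual state machines would override this handler.
        self.expired.append((name, delay))

    def stopTimers(self):
        for timer in self.timers.values():
            timer.stop()
        self.timers.clear()
```

Keeping the timers in a dictionary keyed by name lets the transaction translate an expiry back into the timer name that the state machine understands.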
RFC3261 defines several timers, which we abstract in our implementation of the Timer class. In particular, the named values T1, T2 and T4 configure the timeout values of all the other timers, timers A to K. The default values of T1, T2 and T4 are 500, 4000 and 5000 milliseconds, respectively. If different values are needed, the transaction can create the Timer object with those values during construction.
Timer A’s value is initially the same as T1, and gets updated every time the timer expires.
Timer B’s value is 64×T1.
TODO: why no timer C?
Timer D’s value is similar to timer B’s, except that it is never less than 32 seconds.
Timer I’s value is the same as T4.
Finally we turn these derived timer values into read-only properties, such that the initial values of timers A, E and G are all the same, timers B, F, H and J are all the same, and timers I and K are the same.
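A minimal sketch of such a Timer class follows, with all durations in milliseconds. Timer D is taken here as the larger of 64×T1 and 32 seconds, following the "at least 32 seconds" wording of RFC 3261; the use of class-level aliases for the shared values is a choice made for this sketch.

```python
class Timer:
    """Timer durations in milliseconds, derived from T1, T2 and T4."""

    def __init__(self, T1=500, T2=4000, T4=5000):
        self.T1, self.T2, self.T4 = T1, T2, T4

    @property
    def A(self):
        return self.T1                     # initial value; doubles on each expiry

    @property
    def B(self):
        return 64 * self.T1

    @property
    def D(self):
        return max(64 * self.T1, 32000)    # never less than 32 seconds

    @property
    def I(self):
        return self.T4

    # Timers that share an initial value are aliases of A, B and I.
    E = G = A
    F = H = J = B
    K = I
```

Since only getters are defined, assigning to any of these properties raises AttributeError, which keeps the derived values consistent with T1, T2 and T4.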
From RFC3261 p.125 – The INVITE transaction consists of a three-way handshake. The client transaction sends an INVITE, the server transaction sends responses, and the client transaction sends an ACK. For unreliable transports (such as UDP), the client transaction retransmits requests at an interval that starts at T1 seconds and doubles after every retransmission. T1 is an estimate of the round-trip time (RTT), and it defaults to 500 ms. Nearly all of the transaction timers described here scale with T1, and changing T1 adjusts their values. The request is not retransmitted over reliable transports. After receiving a 1xx response, any retransmissions cease altogether, and the client waits for further responses. The server transaction can send additional 1xx responses, which are not transmitted reliably by the server transaction. Eventually, the server transaction decides to send a final response. For unreliable transports, that response is retransmitted periodically, and for reliable transports, it is sent once. For each final response that is received at the client transaction, the client transaction sends an ACK, the purpose of which is to quench retransmissions of the response.
The initial state, "calling", MUST be entered when the TU initiates a new client transaction with an INVITE request. The client transaction MUST pass the request to the transport layer for transmission (see Section 18). If an unreliable transport is being used, the client transaction MUST start timer A with a value of T1. If a reliable transport is being used, the client transaction SHOULD NOT start timer A (Timer A controls request retransmissions). For any transport, the client transaction MUST start timer B with a value of 64*T1 seconds (Timer B controls transaction timeouts).
If the client transaction receives a provisional response while in the "Calling" state, it transitions to the "Proceeding" state. In the "Proceeding" state, the client transaction SHOULD NOT retransmit the request any longer. Furthermore, the provisional response MUST be passed to the TU. Any further provisional responses MUST be passed up to the TU while in the "Proceeding" state.
When in either the "Calling" or "Proceeding" states, reception of a 2xx response MUST cause the client transaction to enter the "Terminated" state, and the response MUST be passed up to the TU. The handling of this response depends on whether the TU is a proxy core or a UAC core. A UAC core will handle generation of the ACK for this response, while a proxy core will always forward the 200 (OK) upstream. The differing treatment of 200 (OK) between proxy and UAC is the reason that handling of it does not take place in the transaction layer.
When in either the "Calling" or "Proceeding" states, reception of a response with status code from 300-699 MUST cause the client transaction to transition to "Completed". The client transaction MUST pass the received response up to the TU, and the client transaction MUST generate an ACK request, even if the transport is reliable (guidelines for constructing the ACK from the response are given in Section 17.1.1.3) and then pass the ACK to the transport layer for transmission. The ACK MUST be sent to the same address, port, and transport to which the original request was sent. The client transaction SHOULD start timer D when it enters the "Completed" state, with a value of at least 32 seconds for unreliable transports, and a value of zero seconds for reliable transports. Timer D reflects the amount of time that the server transaction can remain in the "Completed" state when unreliable transports are used. This is equal to Timer H in the INVITE server transaction, whose default is 64*T1. However, the client transaction does not know the value of T1 in use by the server transaction, so an absolute minimum of 32s is used instead of basing Timer D on T1.
Any retransmissions of the final response that are received while in the "Completed" state MUST cause the ACK to be re-passed to the transport layer for retransmission, but the newly received response MUST NOT be passed up to the TU.
When timer A fires, the client transaction MUST retransmit the request by passing it to the transport layer, and MUST reset the timer with a value of 2*T1. The formal definition of retransmit within the context of the transaction layer is to take the message previously sent to the transport layer and pass it to the transport layer once more.
When timer A fires 2*T1 seconds later, the request MUST be retransmitted again (assuming the client transaction is still in this state). This process MUST continue so that the request is retransmitted with intervals that double after each transmission. These retransmissions SHOULD only be done while the client transaction is in the "calling" state.
If the client transaction is still in the "Calling" state when timer B fires, the client transaction SHOULD inform the TU that a timeout has occurred. The client transaction MUST NOT generate an ACK. The value of 64*T1 is equal to the amount of time required to send seven requests in the case of an unreliable transport.
If timer D fires while the client transaction is in the "Completed" state, the client transaction MUST move to the terminated state.
Any transport error causes the state machine to move to the “terminated” state and updates the TU with the error message.
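The interaction between timer A and timer B in the "Calling" state can be checked with a short calculation. The sketch below lists the instants at which an INVITE would be sent over an unreliable transport if no response ever arrives; the function name is hypothetical and times are in seconds.

```python
def invite_retransmit_times(T1=0.5):
    """Return the send times (seconds after the first INVITE) over UDP.

    Timer A starts at T1 and doubles after each firing; timer B ends the
    attempt at 64*T1, assuming no response ever arrives.
    """
    timer_b = 64 * T1
    times, now, interval = [], 0.0, T1
    while now < timer_b:
        times.append(now)     # the request is (re)sent at this instant
        now += interval       # timer A fires `interval` seconds later
        interval *= 2         # and is reset to twice its previous value
    return times
```

With the default T1 of 0.5 seconds this yields seven transmissions, the last at 31.5 seconds, which agrees with the statement that 64*T1 is the time required to send seven requests.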
The ACK request constructed by the client transaction MUST contain values for the Call-ID, From, and Request-URI that are equal to the values of those header fields in the request passed to the transport by the client transaction (call this the "original request").
The To header field in the ACK MUST equal the To header field in the response being acknowledged, and therefore will usually differ from the To header field in the original request by the addition of the tag parameter.
The ACK MUST contain a single Via header field, and this MUST be equal to the top Via header field of the original request.
The CSeq header field in the ACK MUST contain the same value for the sequence number as was present in the original request, but the method parameter MUST be equal to "ACK".
If the INVITE request whose response is being acknowledged had Route header fields, those header fields MUST appear in the ACK. This is to ensure that the ACK can be routed properly through any downstream stateless proxies.
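The ACK construction rules above can be sketched with messages modelled as plain dictionaries. This representation and the name create_ack are assumptions made for illustration; the implementation in this chapter works on Message objects instead.

```python
def create_ack(original, response):
    """Build the ACK for a non-2xx final response to an INVITE.

    Messages are modelled as plain dicts; header values that can repeat
    (Via, Route) are lists, and CSeq is a (number, method) tuple.
    """
    ack = {
        "method": "ACK",
        "uri": original["uri"],                  # same Request-URI
        "Call-ID": original["Call-ID"],
        "From": original["From"],
        "To": response["To"],                    # carries the tag from the response
        "Via": [original["Via"][0]],             # only the top Via
        "CSeq": (original["CSeq"][0], "ACK"),    # same number, method ACK
    }
    if original.get("Route"):
        ack["Route"] = list(original["Route"])   # keep Route for stateless proxies
    return ack


invite = {
    "method": "INVITE",
    "uri": "sip:bob@example.com",
    "Call-ID": "a84b4c76e66710",
    "From": "<sip:alice@example.com>;tag=1928301774",
    "To": "<sip:bob@example.com>",
    "Via": ["SIP/2.0/UDP pc33.example.com;branch=z9hG4bK776asdhds"],
    "CSeq": (314159, "INVITE"),
    "Route": ["<sip:p1.example.com;lr>"],
}
busy = {"status": 486, "To": "<sip:bob@example.com>;tag=a6c85cf"}
ack = create_ack(invite, busy)
```

Note that only the To header is taken from the response; everything else comes from the original request.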
When a server transaction is constructed for a request, it enters the "Proceeding" state. The server transaction MUST generate a 100 (Trying) response unless it knows that the TU will generate a provisional or final response within 200 ms, in which case it MAY generate a 100 (Trying) response. This provisional response is needed to quench request retransmissions rapidly in order to avoid network congestion. The request MUST be passed to the TU.
Furthermore, while in the "Completed" state, if a request retransmission is received, the server SHOULD pass the response to the transport for retransmission.
If an ACK is received while the server transaction is in the "Completed" state, the server transaction MUST transition to the "Confirmed" state. As Timer G is ignored in this state, any retransmissions of the response will cease.
The purpose of the "Confirmed" state is to absorb any additional ACK messages that arrive, triggered from retransmissions of the final response. When this state is entered, timer I is set to fire in T4 seconds for unreliable transports, and zero seconds for reliable transports.
If timer G fires, the response is passed to the transport layer once more for retransmission, and timer G is set to fire in MIN(2*T1, T2) seconds. From then on, when timer G fires, the response is passed to the transport again for transmission, and timer G is reset with a value that doubles, unless that value exceeds T2, in which case it is reset with the value of T2.
If timer H fires while in the "Completed" state, it implies that the ACK was never received. In this case, the server transaction MUST transition to the "Terminated" state, and MUST indicate to the TU that a transaction failure has occurred.
Once timer I fires, the server MUST transition to the "Terminated" state.
As with the client transaction, any transport error is treated as error and propagated to the TU.
The TU passes any number of provisional responses to the server transaction. So long as the server transaction is in the "Proceeding" state, each of these MUST be passed to the transport layer for transmission. They are not sent reliably by the transaction layer (they are not retransmitted by it) and do not cause a change in the state of the server transaction. If a request retransmission is received while in the "Proceeding" state, the most recent provisional response that was received from the TU MUST be passed to the transport layer for retransmission.
If, while in the "Proceeding" state, the TU passes a 2xx response to the server transaction, the server transaction MUST pass this response to the transport layer for transmission. It is not retransmitted by the server transaction; retransmissions of 2xx responses are handled by the TU. The server transaction MUST then transition to the "Terminated" state.
While in the "Proceeding" state, if the TU passes a response with status code from 300 to 699 to the server transaction, the response MUST be passed to the transport layer for transmission, and the state machine MUST enter the "Completed" state. For unreliable transports, timer G is set to fire in T1 seconds, and is not set to fire for reliable transports.
When the "Completed" state is entered, timer H MUST be set to fire in 64*T1 seconds for all transports. Timer H determines when the server transaction abandons retransmitting the response. Its value is chosen to equal Timer B, the amount of time a client transaction will continue to retry sending a request.
From RFC3261 p.130 – Non-INVITE transactions do not make use of ACK. They are simple request-response interactions. For unreliable transports, requests are retransmitted at an interval which starts at T1 and doubles until it hits T2. If a provisional response is received, retransmissions continue for unreliable transports, but at an interval of T2. The server transaction retransmits the last response it sent, which can be a provisional or final response, only when a retransmission of the request is received. This is why request retransmissions need to continue even after a provisional response; they are to ensure reliable delivery of the final response. Unlike an INVITE transaction, a non-INVITE transaction has no special handling for the 2xx response. The result is that only a single 2xx response to a non-INVITE is ever delivered to a UAC.
The "Trying" state is entered when the TU initiates a new client transaction with a request. When entering this state, the client transaction SHOULD set timer F to fire in 64*T1 seconds. The request MUST be passed to the transport layer for transmission. If an unreliable transport is in use, the client transaction MUST set timer E to fire in T1 seconds.
If a provisional response is received while in the "Trying" state, the response MUST be passed to the TU, and then the client transaction SHOULD move to the "Proceeding" state.
If a final response (status codes 200-699) is received while in the "Trying" state, the response MUST be passed to the TU, and the client transaction MUST transition to the "Completed" state.
If a final response (status codes 200-699) is received while in the "Proceeding" state, the response MUST be passed to the TU, and the client transaction MUST transition to the "Completed" state.
Once the client transaction enters the "Completed" state, it MUST set Timer K to fire in T4 seconds for unreliable transports, and zero seconds for reliable transports.
If timer E fires while still in this state, the timer is reset, but this time with a value of MIN(2*T1, T2). When the timer fires again, it is reset to a MIN(4*T1, T2). This process continues so that retransmissions occur with an exponentially increasing interval that caps at T2. The default value of T2 is 4s, and it represents the amount of time a non-INVITE server transaction will take to respond to a request, if it does not respond immediately. For the default values of T1 and T2, this results in intervals of 500 ms, 1 s, 2 s, 4 s, 4 s, 4 s, etc.
If Timer E fires while in the "Proceeding" state, the request MUST be passed to the transport layer for retransmission, and Timer E MUST be reset with a value of T2 seconds.
If Timer F fires while the client transaction is still in the "Trying" state, the client transaction SHOULD inform the TU about the timeout, and then it SHOULD enter the "Terminated" state.
If timer F fires while in the "Proceeding" state, the TU MUST be informed of a timeout, and the client transaction MUST transition to the terminated state.
If Timer K fires while in this (“completed”) state, the client transaction MUST transition to the "Terminated" state.
The client transaction SHOULD inform the TU that a transport failure has occurred, and the client transaction SHOULD transition directly to the "Terminated" state.
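The timer E schedule for the "Trying" state can be reproduced with a few lines. This is a small illustrative calculation with a hypothetical function name; values are in milliseconds.

```python
def timer_e_intervals(count, T1=500, T2=4000):
    """Return the first `count` timer E values (ms) while in the "Trying" state."""
    intervals, value = [], T1
    for _ in range(count):
        intervals.append(value)
        value = min(2 * value, T2)   # double on each firing, capped at T2
    return intervals
```

For the default T1 and T2, the first six values are 500, 1000, 2000, 4000, 4000 and 4000 ms, matching the intervals listed in the RFC text.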
From RFC3261 p.137 – The state machine is initialized in the "Trying" state and is passed a request other than INVITE or ACK when initialized. This request is passed up to the TU.
If a retransmission of the request is received while in the "Proceeding" state, the most recently sent provisional response MUST be passed to the transport layer for retransmission.
While in the "Completed" state, the server transaction MUST pass the final response to the transport layer for retransmission whenever a retransmission of the request is received.
Once in the "Trying" state, any further request retransmissions are discarded.
The server transaction remains in this state until Timer J fires, at which point it MUST transition to the "Terminated" state.
As with the client transaction, a transport error is propagated up to the TU and the state transitions to “terminated”.
While in the "Trying" state, if the TU passes a provisional response to the server transaction, the server transaction MUST enter the "Proceeding" state. The response MUST be passed to the transport layer for transmission. Any further provisional responses that are received from the TU while in the "Proceeding" state MUST be passed to the transport layer for transmission.
If the TU passes a final response (status codes 200-699) to the server while in the "Proceeding" state, the transaction MUST enter the "Completed" state, and the response MUST be passed to the transport layer for transmission.
Any other final responses passed by the TU to the server transaction MUST be discarded while in the "Completed" state.
When the server transaction enters the "Completed" state, it MUST set Timer J to fire in 64*T1 seconds for unreliable transports, and zero seconds for reliable transports.
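The non-INVITE server transaction rules quoted above fit in a compact state machine. The sketch below is an illustration only: the send callable stands in for the transport layer, timers are not actually scheduled, and the method names receivedRequest, sendResponse and timeout are assumptions rather than the exact interface used in this chapter.

```python
class NonInviteServerTransaction:
    """State machine sketch; `send` stands in for the transport layer."""

    def __init__(self, send, reliable=False, T1=500):
        self.send = send
        self.state = "trying"
        self.last_response = None
        # Timer J (ms): 64*T1 for unreliable transports, zero for reliable ones.
        self.timer_j = 0 if reliable else 64 * T1

    def receivedRequest(self):
        """Handle a retransmission of the original request."""
        if self.state in ("proceeding", "completed"):
            self.send(self.last_response)   # resend the latest response
        # In "trying" the retransmission is simply discarded.

    def sendResponse(self, status):
        """Handle a response passed down by the TU."""
        if self.state in ("completed", "terminated"):
            return                          # further responses are discarded
        self.last_response = status
        self.send(status)
        if 100 <= status < 200:
            self.state = "proceeding"
        else:
            self.state = "completed"        # a real implementation starts timer J here

    def timeout(self, name):
        if name == "J" and self.state == "completed":
            self.state = "terminated"
```

A short walk-through: a retransmitted request in "trying" produces nothing, a 100 moves the machine to "proceeding", a 200 moves it to "completed", and from then on only request retransmissions trigger a resend of the final response until timer J fires.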
Implementing offer-answer and SDP as per RFC 3264, RFC 4566
The Session Description Protocol (SDP) is specified in RFC 4566 and defines the format for describing the session parameters in a SIP session. In particular, the SIP INVITE request and the 2xx-class response to the INVITE request can contain the message body in SDP format. The SDP data advertises the media types, list of codecs and transport addresses for the sender. Secondly, RFC 3264 defines how a SIP user agent can offer and answer the session negotiation parameters with the help of SDP. In particular, it adds additional constraints on the base SDP for usage in a SIP telephony environment.
In this chapter we implement the modules named rfc4566 and rfc3264, which provide these session description and negotiation functions for SIP telephony.
From RFC4566 p.3 – When initiating multimedia teleconferences, voice-over-IP calls, streaming video, or other sessions, there is a requirement to convey media details, transport addresses, and other session description metadata to the participants.
SDP provides a standard representation for such information, irrespective of how that information is transported. SDP is purely a format for session description -- it does not incorporate a transport protocol, and it is intended to use different transport protocols as appropriate, including the Session Announcement Protocol, Session Initiation Protocol, Real Time Streaming Protocol, electronic mail using the MIME extensions, and the Hypertext Transport Protocol.
SDP is intended to be general purpose so that it can be used in a wide range of network environments and applications. However, it is not intended to support negotiation of session content or media encodings: this is viewed as outside the scope of session description.
SDP is also used in conjunction with other protocols such as Session Announcement Protocol (SAP) and Real Time Streaming Protocol (RTSP), but those are beyond the scope of current discussion.
From RFC4566 p.7 – An SDP session description is entirely textual using the ISO 10646 character set in UTF-8 encoding. SDP field names and attribute names use only the US-ASCII subset of UTF-8, but textual fields and attribute values MAY use the full ISO 10646 character set. Field and attribute values that use the full UTF-8 character set are never directly compared, hence there is no requirement for UTF-8 normalisation. The textual form, as opposed to a binary encoding such as ASN.1 or XDR, was chosen to enhance portability, to enable a variety of transports to be used, and to allow flexible, text-based toolkits to be used to generate and process session descriptions.
Before we jump into the implementation, let’s understand the basic usage of the module rfc4566. We will define a class named SDP to represent an SDP packet. SDP is a text-based protocol. An example SDP description from RFC4566 p.10 is shown below:
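The RFC 4566 example is reproduced here inside a short Python sketch that also splits it into type/value pairs. The constant EXAMPLE holds the RFC text; the helper name split_lines is hypothetical and is not part of the rfc4566 module described in this chapter.

```python
EXAMPLE = """\
v=0
o=jdoe 2890844526 2890842807 IN IP4 10.47.16.5
s=SDP Seminar
i=A Seminar on the session description protocol
u=http://www.example.com/seminars/sdp.pdf
e=j.doe@example.com (Jane Doe)
c=IN IP4 224.2.17.12/127
t=2873397496 2873404696
a=recvonly
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 99
a=rtpmap:99 h263-1998/90000
"""


def split_lines(text):
    """Split SDP text into (type, value) pairs, one per "<type>=<value>" line."""
    pairs = []
    for line in text.splitlines():
        if line:
            key, _, value = line.partition("=")
            pairs.append((key, value))
    return pairs


pairs = split_lines(EXAMPLE)
```

Every line has a single-character type before the first "=", and the same type (for example "m" or "a") may appear more than once.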
To implement the SDP class we first need to define how we intend to use the class. An object with dynamic properties that can be accessed using either attribute or container syntax forms a good programming interface.
We define the attrs class that implements such an attribute plus container interface for accessing the various headers in the SDP. Unlike the attribute access on a regular Python object, an attrs object returns None for a missing element instead of throwing an error. This helps the programmer in writing clean source code.
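A minimal sketch of such a class is given below. It illustrates the attribute-plus-container idea under the assumption that values are simply kept in the instance dictionary; it is not necessarily identical to the listing in the rfc4566 module.

```python
class attrs:
    """Object whose items can be read as attributes or by key; missing ones give None."""

    def __init__(self, **kwargs):
        for name, value in kwargs.items():
            self[name] = value

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so return None
        # instead of raising AttributeError.
        return self.__dict__.get(name)

    def __getitem__(self, name):
        return self.__dict__.get(name)

    def __setitem__(self, name, value):
        self.__dict__[name] = value

    def __contains__(self, name):
        return name in self.__dict__


s = attrs(v="0", s="SDP Seminar")
s["t"] = "0 0"      # container-style assignment
s.i = "info"        # attribute-style assignment
```

With this interface a caller can write a test such as `if s.c:` without first checking whether the connection line exists.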
Then we derive the SDP class from this attrs class to extend the additional specific attributes such as connection line.
Certain attributes such as “t=”, “r=”, etc. can appear multiple times in SDP and need to be identified separately as done by the _multiple property of the SDP class.
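One way to treat repeatable line types is to collect them into lists while keeping single-valued types as plain strings, as sketched below. The set of repeatable types used here (t, r, a, m, b) is an assumption beyond the "t=" and "r=" named in the text, and the names MULTIPLE and collect are hypothetical.

```python
# Assumption: these line types may repeat and are therefore kept as lists.
MULTIPLE = frozenset("tramb")


def collect(pairs):
    """Fold (type, value) pairs into a dict, using lists for repeatable types."""
    result = {}
    for key, value in pairs:
        if key in MULTIPLE:
            result.setdefault(key, []).append(value)
        else:
            result[key] = value
    return result


parsed = collect([("v", "0"), ("t", "0 0"), ("t", "1 2"), ("a", "recvonly"), ("s", "x")])
```

A repeatable type always maps to a list, even when it appears only once, so callers do not need to distinguish the two cases.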
From RFC4566 p.8 – Some lines in each description are REQUIRED and some are OPTIONAL, but all MUST appear in exactly the order given here (the fixed order greatly enhances error detection and allows for a simple parser).
Before defining the parsing of the full SDP data, let’s define the individual specific headers that can be represented using more than just a string.
From RFC4566 p.11 – Origin (o=)
o=<username> <sess-id> <sess-version> <nettype> <addrtype> <unicast-address>
The "o=" field gives the originator of the session (her username and the address of the user's host) plus a session identifier and version number:
<username> is the user's login on the originating host, or it is "-" if the originating host does not support the concept of user IDs. The <username> MUST NOT contain spaces.
<sess-id> is a numeric string such that the tuple of <username>, <sess-id>, <nettype>, <addrtype>, and <unicast-address> forms a globally unique identifier for the session. The method of <sess-id> allocation is up to the creating tool, but it has been suggested that a Network Time Protocol (NTP) format timestamp be used to ensure uniqueness.
<sess-version> is a version number for this session description. Its usage is up to the creating tool, so long as <sess-version> is increased when a modification is made to the session data. Again, it is RECOMMENDED that an NTP format timestamp is used.
<nettype> is a text string giving the type of network. Initially "IN" is defined to have the meaning "Internet", but other values MAY be registered in the future.
<addrtype> is a text string giving the type of the address that follows. Initially "IP4" and "IP6" are defined, but other values MAY be registered in the future.
<unicast-address> is the address of the machine from which the session was created. For an address type of IP4, this is either the fully qualified domain name of the machine or the dotted-decimal representation of the IP version 4 address of the machine. For an address type of IP6, this is either the fully qualified domain name of the machine or the compressed textual representation of the IP version 6 address of the machine. For both IP4 and IP6, the fully qualified domain name is the form that SHOULD be given unless this is unavailable, in which case the globally unique address MAY be substituted. A local IP address MUST NOT be used in any context where the SDP description might leave the scope in which the address is meaningful (for example, a local address MUST NOT be included in an application-level referral that might leave the scope).
In general, the "o=" field serves as a globally unique identifier for this version of this session description, and the subfields excepting the version taken together identify the session irrespective of any modifications.
Let’s define the originator class to represent the “o=” line and derive it from the attrs class so that it can also have dynamic attributes. The individual properties such as username (str), sessionid (long), version (long), nettype (str), addrtype (str), address (str) are as described above. There are two methods of importance: the constructor __init__, which is used to parse the SDP line, and the string representation method __repr__, which is used to format the SDP line.
If a value is supplied in the constructor it parses the SDP line into individual properties by splitting the value across white-space.
Otherwise, if the value is not supplied in the constructor, it assumes default values for the properties. For example, the address defaults to the local hostname or IP address, username is ‘-’, sessionid and version are derived from the local time so that they are monotonically increasing, and nettype and addrtype take the defaults ‘IN’ and ‘IP4’.
Converting an object of type originator into a string is straightforward – just join all the properties in the right order using white-space.
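The parse, default and format steps for the “o=” line can be sketched as below. This is a simplified illustration, not the module's listing: it derives from object instead of attrs to stay self-contained, and the choice of socket.gethostname() and time.time() for the defaults is an assumption.

```python
import socket
import time


class originator(object):
    """Sketch of the "o=" line: parse in __init__, format in __repr__."""

    def __init__(self, value=None):
        if value:
            # Six white-space separated fields, in the order given by RFC 4566.
            (self.username, sessionid, version,
             self.nettype, self.addrtype, self.address) = value.split()
            self.sessionid = int(sessionid)
            self.version = int(version)
        else:
            # Defaults: no user name, time-derived identifiers, local host name.
            self.username = '-'
            self.sessionid = self.version = int(time.time())
            self.nettype, self.addrtype = 'IN', 'IP4'
            self.address = socket.gethostname()

    def __repr__(self):
        fields = (self.username, self.sessionid, self.version,
                  self.nettype, self.addrtype, self.address)
        return ' '.join(str(field) for field in fields)


o = originator('jdoe 2890844526 2890842807 IN IP4 10.47.16.5')
o.version += 1                      # numeric, so a modified session can bump it
print(repr(o))
print(repr(originator()))           # defaults, different on every host and run
```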
From RFC4566 p.14 – Connection Data ("c=")
c=<nettype> <addrtype> <connection-address>
The "c=" field contains connection data.
A session description MUST contain either at least one "c=" field in each media description or a single "c=" field at the session level. It MAY contain a single session-level "c=" field and additional "c=" field(s) per media description, in which case the per-media values override the session-level settings for the respective media.
The first sub-field ("<nettype>") is the network type, which is a text string giving the type of network. Initially, "IN" is defined to have the meaning "Internet", but other values MAY be registered in the future.
The second sub-field ("<addrtype>") is the address type. This allows SDP to be used for sessions that are not IP based. This memo only defines IP4 and IP6, but other values MAY be registered in the future.
The third sub-field ("<connection-address>") is the connection address. OPTIONAL sub-fields MAY be added after the connection address depending on the value of the <addrtype> field.
Sessions using an IPv4 multicast connection address MUST also have a time to live (TTL) value present in addition to the multicast address. The TTL and the address together define the scope with which multicast packets sent in this conference will be sent. TTL values MUST be in the range 0-255. Although the TTL MUST be specified, its use to scope multicast traffic is deprecated; applications SHOULD use an administratively scoped address instead.
The TTL for the session is appended to the address using a slash as a separator. An example is:
c=IN IP4 224.2.36.42/127
Multiple addresses or "c=" lines MAY be specified on a per-media basis only if they provide multicast addresses for different layers in a hierarchical or layered encoding scheme. They MUST NOT be specified for a session-level "c=" field. The slash notation for multiple addresses described above MUST NOT be used for IP unicast addresses.
The connection class derives from attrs and is used to represent the connection data described before. The individual properties are nettype (str), addrtype (str), address (str) and optionally ttl (int) and count (int). The constructor takes an optional string value. If the value is supplied, it is parsed into the individual properties. Alternatively, the application can construct the object by supplying the individual properties as attribute-value pairs.
As mentioned, the connection object can be created in two ways as shown below. The first option parses the value, whereas the second option takes the values of the individual properties. Certain properties have default values when created using the second option, e.g., addrtype is “IP4” and nettype is “IN”.
To format a connection object into a string, the following method joins the properties using spaces and, for the optional ttl and count, slashes.
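A minimal sketch of such a connection class is given here. It is an illustration under stated assumptions: it derives from object rather than attrs, it interprets the slash notation in the IP4 form only (address/ttl/count), and the placeholder default address is hypothetical.

```python
class connection(object):
    """Sketch of the "c=" line with optional /ttl and /count suffixes."""

    def __init__(self, value=None, **kwargs):
        self.ttl = self.count = None
        if value:
            # Option 1: parse "<nettype> <addrtype> <address>[/<ttl>[/<count>]]".
            self.nettype, self.addrtype, rest = value.split(' ', 2)
            parts = rest.split('/')
            self.address = parts[0]
            if len(parts) > 1:
                self.ttl = int(parts[1])
            if len(parts) > 2:
                self.count = int(parts[2])
        else:
            # Option 2: take individual properties, with defaults.
            self.nettype = kwargs.get('nettype', 'IN')
            self.addrtype = kwargs.get('addrtype', 'IP4')
            self.address = kwargs.get('address', '0.0.0.0')  # placeholder default (assumption)
            self.ttl = kwargs.get('ttl')
            self.count = kwargs.get('count')

    def __repr__(self):
        text = '%s %s %s' % (self.nettype, self.addrtype, self.address)
        if self.ttl is not None:
            text += '/%d' % self.ttl
        if self.count is not None:
            text += '/%d' % self.count
        return text


c1 = connection('IN IP4 224.2.36.42/127')   # parsed from a string
c2 = connection(address='192.0.2.10')       # built from properties
print(repr(c1))
print(repr(c2))
```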
From RFC4566 p.22 – Media Descriptions ("m=")
m=<media> <port> <proto> <fmt> ...
A session description may contain a number of media descriptions. Each media description starts with an "m=" field and is terminated by either the next "m=" field or by the end of the session description. A media field has several sub-fields:
<media> is the media type. Currently defined media are "audio", "video", "text", "application", and "message", although this list may be extended in the future.
<port> is the transport port to which the media stream is sent. The meaning of the transport port depends on the network being used as specified in the relevant "c=" field, and on the transport protocol defined in the <proto> sub-field of the media field. Other ports used by the media application (such as the RTP Control Protocol (RTCP) port ) MAY be derived algorithmically from the base media port or MAY be specified in a separate attribute (for example, "a=rtcp:").
If non-contiguous ports are used or if they don't follow the parity rule of even RTP ports and odd RTCP ports, the "a=rtcp:" attribute MUST be used. Applications that are requested to send media to a <port> that is odd and where the "a=rtcp:" is present MUST NOT subtract 1 from the RTP port: that is, they MUST send the RTP to the port indicated in <port> and send the RTCP to the port indicated in the "a=rtcp" attribute.
<proto> is the transport protocol. The meaning of the transport protocol is dependent on the address type field in the relevant "c=" field. Thus a "c=" field of IP4 indicates that the transport protocol runs over IP4.
RTP/AVP: denotes RTP used under the RTP Profile for Audio and Video Conferences with Minimal Control running over UDP.
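The main reason to specify the transport protocol in addition to the media format is that the same standard media formats may be carried over different transport protocols even when the network protocol is the same.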
<fmt> is a media format description. The fourth and any subsequent sub-fields describe the format of the media. The interpretation of the media format depends on the value of the <proto> sub-field.
If the <proto> sub-field is "RTP/AVP" or "RTP/SAVP" the <fmt> sub-fields contain RTP payload type numbers. When a list of payload type numbers is given, this implies that all of these payload formats MAY be used in the session, but the first of these formats SHOULD be used as the default format for the session. For dynamic payload type assignments the "a=rtpmap:" attribute SHOULD be used to map from an RTP payload type number to a media encoding name that identifies the payload format. The "a=fmtp:" attribute MAY be used to specify format parameters.
The media class, derived from the attrs class, is used to represent the media description line and all the subsequent SDP lines that are attached to this media description line. The properties such as media (str), port (int), proto (str) and fmt (list) are defined as described above. The constructor, similar to the connection object, takes an optional value string. If the value is supplied, it gets parsed into individual properties; otherwise the named parameters in the argument list are used to populate the individual properties.
There are two ways to create a media object as shown below. In the first option the supplied value string is parsed, and in the second option the parameters populate the properties of the object. In the second option certain parameters take the default values, e.g., default values for port and proto are 0 and ‘RTP/AVP’ respectively.
Since the media object also stores the media description specific attributes, the formatting is slightly more complicated to generate multiple SDP lines. Secondly, the format description attributes are stored differently than the other attributes.
To format a media object we first print the media description (“m=”) SDP line using the media, port, proto and fmt properties. Only the payload type (pt) property is used from individual elements in the fmt format list.
Then it prints out the additional headers such as “i=”, “c=”, “b=”, “k=” and various “a=” SDP lines that are associated with this media description object. If the header is a multiple instance header then it can occur multiple times, and the value is assumed to be a list.
Finally, the “a=rtpmap:” attributes are appended from the fmt format list. Because of the ordering restrictions on the headers, this should appear at the end. The formatted string is then returned as the formatted media description which contains the value of the “m=” line followed by name and value of all the other SDP lines that are associated with this “m=” line. Note that the header name, “m”, and equals character, “=”, are not present in the returned string representing the value of this media object.
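The construction and formatting of a media object described above can be sketched as follows. This is a simplified illustration, not the module's listing: the payload helper class and the headers list of (name, value) pairs are hypothetical names, and the sketch does not handle the “<port>/<number of ports>” form.

```python
class payload(object):
    """One entry of the fmt list (hypothetical helper for this sketch)."""

    def __init__(self, pt, name=None, rate=None):
        self.pt, self.name, self.rate = pt, name, rate


class media(object):
    def __init__(self, value=None, **kwargs):
        self.headers = []   # other lines of this media section, e.g. ('a', 'sendrecv')
        if value:
            self.media, port, self.proto, rest = value.split(' ', 3)
            self.port = int(port)
            self.fmt = [payload(pt) for pt in rest.split()]
        else:
            self.media = kwargs.get('media')
            self.port = int(kwargs.get('port', 0))
            self.proto = kwargs.get('proto', 'RTP/AVP')
            self.fmt = kwargs.get('fmt', [])

    def __repr__(self):
        # Value of the "m=" line: only the payload type of each format is used.
        text = '%s %d %s %s' % (self.media, self.port, self.proto,
                                ' '.join(str(f.pt) for f in self.fmt))
        # Then the other lines attached to this media description.
        for name, value in self.headers:
            text += '\r\n%s=%s' % (name, value)
        # Finally the rtpmap attributes generated from the format list.
        for f in self.fmt:
            if f.name:
                text += '\r\na=rtpmap:%s %s/%s' % (f.pt, f.name, f.rate)
        return text


parsed = media('video 51372 RTP/AVP 31 32')
audio = media(media='audio', port=49170,
              fmt=[payload(0, 'PCMU', 8000), payload(8, 'PCMA', 8000)])
audio.headers.append(('a', 'sendrecv'))
print('m=' + repr(audio))
```

As in the text, the returned string starts with the value of the “m=” line, so the caller prepends “m=” when writing the full description.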
Now that we have defined the basic components, let’s define the internal parsing routine for the SDP class.
From RFC4566 p.8 – An SDP session description consists of a number of lines of text of the form:
<type>=<value>
where <type> MUST be exactly one case-significant character and <value> is structured text whose format depends on <type>. In general, <value> is either a number of fields delimited by a single space character or a free format string, and is case-significant unless a specific field defines otherwise. Whitespace MUST NOT be used on either side of the "=" sign.
An SDP session description consists of a session-level section followed by zero or more media-level sections. The session-level part starts with a "v=" line and continues to the first media-level section. Each media-level section starts with an "m=" line and continues to the next media-level section or end of the whole session description. In general, session-level values are the default for all media unless overridden by an equivalent media-level value.
The connection ("c=") and attribute ("a=") information in the session-level section applies to all the media of that session unless overridden by connection information or an attribute of the same name in the media description.
The following method takes the text string to parse into this SDP object. First we split the string into individual lines. Care must be taken to treat “\n” the same as “\r\n” for interoperability with implementations that generate “\n” as line termination instead of “\r\n”. Since various attributes can be either global session attributes or media-specific attributes, depending on whether they appear before or after the first “m=” line, we need to keep a state variable, g, to indicate whether we are parsing the global session context or the local media description context.
Each line is then split into the header name and value. Note that we use the partition method instead of the split method, because the line needs to be divided only once, at the first “=”, rather than tokenized at every “=”. The strtok and strtok_r functions in the C programming language are roughly comparable to the split method of Python in this respect, and should be used with similar care.
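The difference between partition and split, together with the tolerant line splitting, can be illustrated with a small sketch. The function name split_lines is hypothetical and is used only for this example.

```python
def split_lines(text):
    """Yield (name, value) for every non-empty line, accepting \\n or \\r\\n."""
    for line in text.replace('\r\n', '\n').split('\n'):
        if not line:
            continue                      # e.g. the empty piece after the final CRLF
        name, sep, value = line.partition('=')
        yield name, value


line = 'a=fmtp:96 profile-level-id=42e016'
print(line.partition('='))   # ('a', '=', 'fmtp:96 profile-level-id=42e016')
print(line.split('='))       # ['a', 'fmtp:96 profile-level-id', '42e016']

pairs = list(split_lines('v=0\r\ns=-\na=fmtp:96 profile-level-id=42e016\r\n'))
print(pairs)
```

With split the attribute value is cut at its own “=” character, whereas partition keeps the value intact.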
If the header name is recognized to be implemented by the specific classes we defined earlier, then we create those specific objects such as originator, connection and media, to parse the header value.
Since there can be multiple instances of the “m=” line in the SDP data, the property m is defined as a list. Each element in the list is of type media object. Since the attributes can be either in the global session context or in the local media description context, we also identify the context for an attribute. In particular, if property m doesn’t exist then we are in the global context, otherwise we are in the media context.
At this point the obj variable points to the appropriate context, either the global SDP object or the local media object, to which the new header needs to be added.
Adding the new header in the global context is straightforward: if the header is a multiple-instance header, create a list if needed and append the value to it; otherwise set the value of the header name property in the SDP object. When accessing the property, a multiple-instance header returns a list of string values whereas a single-instance header returns the string value, e.g., SDP.a is a list whereas SDP.s is a single string value.
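The context selection and the handling of multiple-instance headers can be sketched with plain dictionaries. This is an illustration only: the set of repeatable header names in MULTIPLE is an assumption for this sketch, and the “o=”, “c=” and “m=” values are kept as strings instead of being parsed into their specific classes.

```python
MULTIPLE = frozenset('tramb')   # assumed set of headers that may repeat


def parse(text):
    session = {'m': []}
    for line in text.replace('\r\n', '\n').split('\n'):
        name, sep, value = line.partition('=')
        if not sep:
            continue                         # skip empty or malformed lines
        if name == 'm':
            session['m'].append({'m': value})   # start a new media context
            continue
        # Global context until the first "m=" line, media context afterwards.
        obj = session['m'][-1] if session['m'] else session
        if name in MULTIPLE:
            obj.setdefault(name, []).append(value)
        else:
            obj[name] = value
    return session


sdp = parse('v=0\r\ns=-\r\nt=0 0\r\na=recvonly\r\n'
            'm=audio 49170 RTP/AVP 0\r\na=ptime:20\r\na=sendrecv\r\n')
print(sdp['s'], sdp['a'], sdp['m'][0]['a'])
```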
Adding a new header line in the media context is also similar, with one exception. If the header represents a “a=rtpmap:” line, then that needs to be parsed into the format fmt list of the media object.
From RFC4566 p.25 – a=rtpmap:<payload type> <encoding name>/<clock rate> [/<encoding parameters>]
This attribute maps from an RTP payload type number (as used in an "m=" line) to an encoding name denoting the payload format to be used. It also provides information on the clock rate and encoding parameters. It is a media-level attribute that is not dependent on charset.
Although an RTP profile may make static assignments of payload type numbers to payload formats, it is more common for that assignment to be done dynamically using "a=rtpmap:" attributes. As an example of a static payload type, consider u-law PCM coded single-channel audio sampled at 8 kHz. This is completely defined in the RTP Audio/Video profile as payload type 0, so there is no need for an "a=rtpmap:" attribute, and the media for such a stream sent to UDP port 49232 can be specified as:
m=audio 49232 RTP/AVP 0
An example of a dynamic payload type is 16-bit linear encoded stereo audio sampled at 16 kHz. If we wish to use the dynamic RTP/AVP payload type 98 for this stream, additional information is required to decode it:
m=audio 49232 RTP/AVP 98
a=rtpmap:98 L16/16000/2
Up to one rtpmap attribute can be defined for each media format specified. Thus, we might have the following:
RTP profiles that specify the use of dynamic payload types MUST define the set of valid encoding names and/or a means to register encoding names if that profile is to be used with SDP.
For audio streams, <encoding parameters> indicates the number of audio channels. This parameter is OPTIONAL and may be omitted if the number of channels is one, provided that no additional parameters are needed.
For video streams, no encoding parameters are currently specified.
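Parsing the value of an “a=rtpmap:” line into the pieces needed for the fmt list can be sketched as below. The function name parse_rtpmap and the dictionary result are hypothetical choices made for this example.

```python
def parse_rtpmap(value):
    """Parse an attribute value such as 'rtpmap:98 L16/16000/2'.

    Returns None when the attribute is not an rtpmap attribute.
    """
    prefix, sep, rest = value.partition(':')
    if prefix != 'rtpmap' or not sep:
        return None
    pt, sep, encoding = rest.partition(' ')
    name, sep, tail = encoding.partition('/')
    rate, sep, params = tail.partition('/')
    return {'pt': pt,
            'name': name,
            'rate': int(rate) if rate else None,
            'params': params or None}       # e.g. number of audio channels


print(parse_rtpmap('rtpmap:98 L16/16000/2'))
print(parse_rtpmap('rtpmap:0 PCMU/8000'))
print(parse_rtpmap('ptime:20'))
```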
Formatting SDP data is relatively easy. The order of the headers is important. A multiple-instance header is stored as a list and may result in multiple SDP lines. The method to format an SDP is written below.
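A minimal sketch of order-preserving formatting for the session-level part is shown here. It assumes the headers are held in a plain dictionary, and it simplifies the time description by writing all “t=” lines before all “r=” lines, which is only exact when there is a single “t=” line.

```python
SESSION_ORDER = 'vosiuepcbtrzka'   # session-level order from RFC 4566


def format_session(headers):
    lines = []
    for name in SESSION_ORDER:
        value = headers.get(name)
        if value is None:
            continue
        # A multiple-instance header is a list and yields one line per element.
        values = value if isinstance(value, list) else [value]
        lines.extend('%s=%s' % (name, v) for v in values)
    return '\r\n'.join(lines) + '\r\n'


text = format_session({'a': ['sendrecv', 'ptime:20'], 's': '-', 't': ['0 0'],
                       'v': '0', 'c': 'IN IP4 192.0.2.10'})
print(text)
```

The insertion order of the dictionary does not matter, because the output order is driven by SESSION_ORDER.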
Once we have finished the implementation of the SDP class, we can test the parsing and formatting function as follows:
Now that we have described the implementation of SDP, let’s move on to using it in SIP telephony. As mentioned before, RFC3264 defines the offer-answer model, which is used in the SIP session negotiation between two parties.
From RFC3264 p.1 – This document defines a mechanism by which two entities can make use of the Session Description Protocol (SDP) to arrive at a common view of a multimedia session between them. In the model, one participant offers the other a description of the desired session from their perspective, and the other participant answers with the desired session from their perspective. This offer/answer model is most useful in unicast sessions where information from both participants is needed for the complete view of the session. The offer/answer model is used by protocols like the Session Initiation Protocol (SIP).
The means by which the offers and answers are conveyed are outside the scope of this document. The offer/answer model defined here is the mandatory baseline mechanism used by the Session Initiation Protocol (SIP).
We implement the offer-answer model in our module named rfc3264.
Before implementing the module, let’s list the expected behavior of the module. The module should define two methods, createOffer and createAnswer, to create the session description for an offer or an answer respectively. We reuse the SDP and media definitions from the previous module rfc4566.
Media can be described using the media object. The following code defines two media objects, one for audio and the other for video. The audio has two formats, PCMU and PCMA, whereas video has one format, H.261.
Now the application can create a new offer using these media descriptions as follows.
To test if the offer contains a valid SDP object, you can print the offer.
When the offer is received by the answerer, it can use the following code to generate the answer SDP. Suppose that the answerer wants to support PCMU and GSM audio but no video.
Now suppose that the offerer wants to change the offer, e.g., using SIP re-INVITE, by removing video from the offer. It should reuse the previous offer as follows.
Thus, the offer can be created either from empty state or from a previous offer, whereas an answer is always created from a previous offer.
To start the implementation, please note that we need to use the definitions from the rfc4566 module. Although RFC 3264 uses the old specification of SDP in RFC 2327, we use the new specification of SDP in RFC 4566.
We also define a module-level flag to enable or disable the trace, which helps us in debugging the module. The default is to disable the trace, but a program may enable it by setting the flag to True.
From RFC3264 p.4 – Media Stream: From RTSP, a media stream is a single media instance, e.g., an audio stream or a video stream as well as a single whiteboard or shared application group. In SDP, a media stream is described by an "m=" line and its associated attributes.
We use the media class defined in SDP to represent the media stream.
The offer (and answer) MUST be a valid SDP message, as defined by RFC 2327, with one exception. RFC 2327 mandates that either an e or a p line is present in the SDP message. This specification relaxes that constraint; an SDP formulated for an offer/answer application MAY omit both the e and p lines. The numeric value of the session id and version in the o line MUST be representable with a 64 bit signed integer. The initial value of the version MUST be less than (2**62)-1, to avoid rollovers. Although the SDP specification allows for multiple session descriptions to be concatenated together into a large SDP message, an SDP message used in the offer/answer model MUST contain exactly one session description.
The SDP "s=" line conveys the subject of the session, which is reasonably defined for multicast, but ill defined for unicast. For unicast sessions, it is RECOMMENDED that it consist of a single space character (0x20) or a dash (-).
Unfortunately, SDP does not allow the "s=" line to be empty.
The SDP "t=" line conveys the time of the session. Generally, streams for unicast sessions are created and destroyed through external signaling means, such as SIP. In that case, the "t=" line SHOULD have a value of "0 0".
The offer will contain zero or more media streams (each media stream is described by an "m=" line and its associated attributes). Zero media streams implies that the offerer wishes to communicate, but that the streams for the session will be added at a later time through a modified offer. The streams MAY be for a mix of unicast and multicast; the latter obviously implies a multicast address in the relevant "c=" line(s).
Construction of each offered stream depends on whether the stream is multicast or unicast.
We simplify our implementation to support only unicast addresses and do not worry about the various other headers. The following implementation just matches the media description lines and the format lists from the offer against the supplied locally supported media streams.
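Generating a unicast offer under the rules quoted above (“s=” set to a dash, “t=” set to “0 0”, one “m=” line per stream) can be sketched as follows. This is a text-level illustration and not the createOffer method of the module; the function name, its arguments and the tuple layout of the streams are assumptions made for the example.

```python
def create_offer(streams, address, sessionid, version):
    """streams: list of (media, port, [(pt, name, rate), ...]) tuples."""
    lines = ['v=0',
             'o=- %d %d IN IP4 %s' % (sessionid, version, address),
             's=-',                        # recommended subject for unicast sessions
             'c=IN IP4 %s' % address,
             't=0 0']                      # session times are handled by signaling
    for media, port, formats in streams:
        payload_types = ' '.join(str(pt) for pt, name, rate in formats)
        lines.append('m=%s %d RTP/AVP %s' % (media, port, payload_types))
        for pt, name, rate in formats:
            lines.append('a=rtpmap:%d %s/%d' % (pt, name, rate))
    return '\r\n'.join(lines) + '\r\n'


audio = ('audio', 49170, [(0, 'PCMU', 8000), (8, 'PCMA', 8000)])
video = ('video', 51372, [(31, 'H261', 90000)])
offer = create_offer([audio, video], '192.0.2.10', sessionid=1, version=1)
print(offer)
```

A modified offer would reuse the same sessionid and pass an incremented version.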
From RFC3264 p.9 – The answer to an offered session description is based on the offered session description. If the answer is different from the offer in any way (different IP addresses, ports, etc.), the origin line MUST be different in the answer, since the answer is generated by a different entity. In that case, the version number in the "o=" line of the answer is unrelated to the version number in the o line of the offer.
The "t=" line in the answer MUST equal that of the offer. The time of the session cannot be negotiated.
For each "m=" line in the offer, there MUST be a corresponding "m=" line in the answer. The answer MUST contain exactly the same number of "m=" lines as the offer. This allows for streams to be matched up based on their order. This implies that if the offer contained zero "m=" lines, the answer MUST contain zero "m=" lines.
If a stream is offered with a unicast address, the answer for that stream MUST contain a unicast address. The media type of the stream in the answer MUST match that of the offer.
If a stream is offered as sendonly, the corresponding stream MUST be marked as recvonly or inactive in the answer. If a media stream is listed as recvonly in the offer, the answer MUST be marked as sendonly or inactive in the answer. If an offered media stream is listed as sendrecv (or if there is no direction attribute at the media or session level, in which case the stream is sendrecv by default), the corresponding stream in the answer MAY be marked as sendonly, recvonly, sendrecv, or inactive. If an offered media stream is listed as inactive, it MUST be marked as inactive in the answer.
For streams marked as recvonly in the answer, the "m=" line MUST contain at least one media format the answerer is willing to receive with from amongst those listed in the offer. The stream MAY indicate additional media formats, not listed in the corresponding stream in the offer, that the answerer is willing to receive. For streams marked as sendonly in the answer, the "m=" line MUST contain at least one media format the answerer is willing to send from amongst those listed in the offer. For streams marked as sendrecv in the answer, the "m=" line MUST contain at least one codec the answerer is willing to both send and receive, from amongst those listed in the offer. The stream MAY indicate additional media formats, not listed in the corresponding stream in the offer, that the answerer is willing to send or receive (of course, it will not be able to send them at this time, since it was not listed in the offer). For streams marked as inactive in the answer, the list of media formats is constructed based on the offer. If the offer was sendonly, the list is constructed as if the answer were recvonly. Similarly, if the offer was recvonly, the list is constructed as if the answer were sendonly, and if the offer was sendrecv, the list is constructed as if the answer were sendrecv. If the offer was inactive, the list is constructed as if the offer were actually sendrecv and the answer were sendrecv.
The connection address and port in the answer indicate the address where the answerer wishes to receive media (in the case of RTP, RTCP will be received on the port which is one higher unless there is an explicit indication otherwise). This address and port MUST be present even for sendonly streams; in the case of RTP, the port one higher is still used to receive RTCP.
In the case of RTP, if a particular codec was referenced with a specific payload type number in the offer, that same payload type number SHOULD be used for that codec in the answer. Even if the same payload type number is used, the answer MUST contain rtpmap attributes to define the payload type mappings for dynamic payload types, and SHOULD contain mappings for static payload types. The media formats in the "m=" line MUST be listed in order of preference, with the first format listed being preferred. In this case, preferred means that the offerer SHOULD use the format with the highest preference from the answer.
Although the answerer MAY list the formats in their desired order of preference, it is RECOMMENDED that unless there is a specific reason, the answerer list formats in the same relative order they were present in the offer. In other words, if a stream in the offer lists audio codecs 8, 22 and 48, in that order, and the answerer only supports codecs 8 and 48, it is RECOMMENDED that, if the answerer has no reason to change it, the ordering of codecs in the answer be 8, 48, and not 48, 8. This helps assure that the same codec is used in both directions.
The interpretation of fmtp parameters in an offer depends on the parameters. In many cases, those parameters describe specific configurations of the media format, and should therefore be processed as the media format value itself would be. This means that the same fmtp parameters with the same values MUST be present in the answer if the media format they describe is present in the answer. Other fmtp parameters are more like parameters, for which it is perfectly acceptable for each agent to use different values. In that case, the answer MAY contain fmtp parameters, and those MAY have the same values as those in the offer, or they MAY be different. SDP extensions that define new parameters SHOULD specify the proper interpretation in offer/answer.
The answerer MAY include a non-zero ptime attribute for any media stream; this indicates the packetization interval that the answerer would like to receive. There is no requirement that the packetization interval be the same in each direction for a particular stream.
The answerer MAY include a bandwidth attribute for any media stream; this indicates the bandwidth that the answerer would like the offerer to use when sending media. The value of zero is allowed, interpreted as described in Section 5.
If the answerer has no media formats in common for a particular offered stream, the answerer MUST reject that media stream by setting the port to zero.
An offered stream MAY be rejected in the answer, for any reason. If a stream is rejected, the offerer and answerer MUST NOT generate media (or RTCP packets) for that stream. To reject an offered stream, the port number in the corresponding stream in the answer MUST be set to zero. Any media formats listed are ignored. At least one MUST be present, as specified by SDP.
If there are no media formats in common for all streams, the entire offered session is rejected.
Once the answerer has sent the answer, it MUST be prepared to receive media for any recvonly streams described by that answer. It MUST be prepared to send and receive media for any sendrecv streams in the answer, and it MAY send media immediately. The answerer MUST be prepared to receive media for recvonly or sendrecv streams using any media formats listed for those streams in the answer, and it MAY send media immediately. When sending media, it SHOULD use a packetization interval equal to the value of the ptime attribute in the offer, if any was present. It SHOULD send media using a bandwidth no higher than the value of the bandwidth attribute in the offer, if any was present. The answerer MUST send using a media format in the offer that is also listed in the answer, and SHOULD send using the most preferred media format in the offer that is also listed in the answer. In the case of RTP, it MUST use the payload type numbers from the offer, even if they differ from those in the answer.
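Some of the answer rules quoted above can be sketched in a few lines: one answered stream per offered stream, common formats kept in the offer's order with the offer's payload type numbers, and rejection by setting the port to zero. The data layout (dictionaries and tuples) and the function names are assumptions for this illustration, and direction attributes are not handled.

```python
def answer_stream(offered, supported, port):
    """offered: dict with 'media', 'port', 'proto', 'fmt' as [(pt, name, rate)].
    supported: list of (name, rate) the answerer supports for this media type.
    """
    wanted = set((name.lower(), rate) for name, rate in supported)
    # Keep the relative order and payload type numbers used in the offer.
    common = [f for f in offered['fmt'] if (f[1].lower(), f[2]) in wanted]
    if not common:
        # Reject: port zero, but at least one format must still be listed.
        return dict(offered, port=0, fmt=offered['fmt'][:1])
    return dict(offered, port=port, fmt=common)


def create_answer(offer, supported, ports):
    """Return exactly one answered stream per offered stream, in the same order."""
    return [answer_stream(m, supported.get(m['media'], []), ports.get(m['media'], 0))
            for m in offer]


offer = [
    {'media': 'audio', 'port': 49170, 'proto': 'RTP/AVP',
     'fmt': [(0, 'PCMU', 8000), (8, 'PCMA', 8000)]},
    {'media': 'video', 'port': 51372, 'proto': 'RTP/AVP',
     'fmt': [(31, 'H261', 90000)]},
]
# The answerer supports GSM and PCMU audio, and no video.
answer = create_answer(offer, {'audio': [('GSM', 8000), ('PCMU', 8000)]},
                       {'audio': 20000})
for stream in answer:
    print(stream['media'], stream['port'], stream['fmt'])
```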
In this chapter we have implemented the session description protocol and offer-answer model that are needed for SIP telephony. Next we describe the basic and digest authentication.
Implementing RFC 2617 for Basic and Digest authentication
SIP uses the authentication mechanism defined for HTTP. In particular, the digest authentication defined in RFC2617 provides a challenge-response authentication that does not send the password in clear text.
We implement the authentication module named rfc2617. We would like to support both the “Basic” and “Digest” authentication methods defined in RFC2617. Although SIP does not allow “Basic” authentication because it sends the password in the clear, we do implement the mechanism as it can work well with underlying transport security between the client and the server.
Before we implement the module, let’s discuss the expected usage of the module. When a client (UAC) sends a SIP request to the server (UAS), the server may challenge the request by responding with a 401 or 407 response. The server puts the WWW-Authenticate or Proxy-Authenticate header in the response. Let’s assume that the server invokes the createAuthenticate method to create the header value.
When the client wants to re-send the request with the authorization credentials, it puts the Authorization or Proxy-Authorization header in the new request which supplies the credentials. It invokes the createAuthorization method to create the header value.
Let’s now focus on implementing these two public methods in our module.
From RFC2617 p.3 – HTTP provides a simple challenge-response authentication mechanism that MAY be used by a server to challenge a client request and by a client to provide authentication information. It uses an extensible, case-insensitive token to identify the authentication scheme, followed by a comma-separated list of attribute-value pairs which carry the parameters necessary for achieving authentication via that scheme.
auth-scheme = token
auth-param = token "=" ( token | quoted-string )
Let’s define the quote and unquote internal methods that can quote or unquote a string if needed.
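The two helpers can be sketched as follows. This is a minimal illustration which assumes that the string contains no embedded double quotes, so no escaping is performed.

```python
def quote(s):
    """Wrap s in double quotes unless it is already quoted."""
    if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
        return s
    return '"' + s + '"'


def unquote(s):
    """Remove one pair of surrounding double quotes, if present."""
    if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
        return s[1:-1]
    return s


print(quote('example.com'), quote('"example.com"'), unquote('"example.com"'))
```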
The method takes the authMethod argument, which is either “Basic” or “Digest” (case-insensitive), followed by a set of named parameters. Possible parameter names are realm, domain, qop, nonce, opaque, stale and algorithm. Usually the realm is mandatory for “Basic” authentication, and realm and domain for “Digest”. Other parameters, if needed but not specified, take the default values.
The header value for “Basic” authentication is straightforward: it just puts the realm as a quoted string in the authentication parameters.
The “Digest” authentication creates the list of authentication parameters from the supplied values or the defaults, such that the parameters are put in the order specified below. I have seen some implementations that do not interoperate if the order of the parameters is not the same as what is presented in the specification. Only the stale and algorithm parameters are unquoted; the others are quoted strings.
The method gives an error if the authMethod is unsupported.
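The behavior described above could be sketched as follows in Python 3. The default values, the nonce generation using the secrets module, the ValueError exception type, and the choice to emit parameters in the same order in which the names are listed above are all assumptions of this sketch, not details confirmed by the text.

```python
import secrets

# Assumed output order: the order in which the names are listed in the text.
_ORDER = ('realm', 'domain', 'qop', 'nonce', 'opaque', 'stale', 'algorithm')
_UNQUOTED = ('stale', 'algorithm')   # emitted without quotes


def _quote(s):
    """Wrap s in double quotes, unless it is already a quoted string."""
    if len(s) >= 2 and s[0] == '"' and s[-1] == '"':
        return s
    return '"' + s + '"'


def createAuthenticate(authMethod='Digest', **kwargs):
    """Build the value of a WWW-Authenticate or Proxy-Authenticate header."""
    scheme = authMethod.lower()
    if scheme == 'basic':
        return 'Basic realm=' + _quote(kwargs.get('realm', ''))
    if scheme == 'digest':
        # Hypothetical defaults, used when a parameter is not supplied.
        defaults = dict(realm='', domain='', qop='auth',
                        nonce=secrets.token_hex(16), opaque='',
                        stale='FALSE', algorithm='MD5')
        pairs = [(name, kwargs.get(name, defaults[name])) for name in _ORDER]
        # Any extra auth-param supplied by the caller goes at the end.
        pairs += [(k, v) for k, v in kwargs.items() if k not in _ORDER]
        return 'Digest ' + ', '.join(
            '%s=%s' % (k, v if k in _UNQUOTED else _quote(v))
            for k, v in pairs)
    raise ValueError('unsupported authMethod %r' % authMethod)
```

For example, createAuthenticate('Basic', realm='iptel.org') returns the string Basic realm="iptel.org".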
From RFC2617 p.3 – The 401 (Unauthorized) response message is used by an origin server to challenge the authorization of a user agent. This response MUST include a WWW-Authenticate header field containing at least one challenge applicable to the requested resource. The 407 (Proxy Authentication Required) response message is used by a proxy to challenge the authorization of a client and MUST include a Proxy-Authenticate header field containing at least one challenge applicable to the proxy for the requested resource.
challenge = auth-scheme 1*SP 1#auth-param
A user agent that wishes to authenticate itself with an origin server--usually, but not necessarily, after receiving a 401 (Unauthorized)--MAY do so by including an Authorization header field with the request. A client that wishes to authenticate itself with a proxy--usually, but not necessarily, after receiving a 407 (Proxy Authentication Required)--MAY do so by including a Proxy-Authorization header field with the request. Both the Authorization field value and the Proxy-Authorization field value consist of credentials containing the authentication information of the client for the realm of the resource being requested. The user agent MUST choose to use one of the challenges with the strongest auth-scheme it understands and request credentials from the user based upon that challenge.
credentials = auth-scheme #auth-param
The following method builds the Authorization header value for the specified challenge. The challenge argument must be a string representing the WWW-Authenticate (or Proxy-Authenticate) header value. The method parses it to identify the various authentication parameters. The other arguments are as follows: the username and password parameters supply the credentials for authentication, the uri, method and entityBody parameters supply those properties of the request which are used in building the digest credentials, and finally the context argument is used to store the state for digest authorization, such as cnonce and nonceCount, if available.
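As a rough illustration of the parsing step, the following Python 3 sketch splits a challenge string into the authentication scheme and a dictionary of parameters. The function name parseChallenge and the regular expression are assumptions of this sketch, not part of the module described here.

```python
import re

# name "=" ( quoted-string | token ); parameters are separated by commas.
_PARAM = re.compile(r'([\w-]+)\s*=\s*("(?:[^"\\]|\\.)*"|[^\s,]+)')


def parseChallenge(challenge):
    """Return (scheme, params) for a WWW-Authenticate style header value."""
    scheme, _, rest = challenge.strip().partition(' ')
    params = {}
    for name, value in _PARAM.findall(rest):
        if value.startswith('"'):
            value = value[1:-1]          # drop the surrounding quotes
        params[name.lower()] = value     # parameter names are case-insensitive
    return scheme, params
```

The scheme returned here keeps its original case, so a caller would compare scheme.lower() against 'basic' or 'digest' before delegating to the corresponding method. Escaped characters inside a quoted string are left as they are in this sketch.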
From RFC2617 p.5 – The "basic" authentication scheme is based on the model that the client must authenticate itself with a user-ID and a password for each realm. The realm value should be considered an opaque string which can only be compared for equality with other realms on that server. The server will service the request only if it can validate the user-ID and password for the protection space of the Request-URI. There are no optional authentication parameters. For Basic, the framework above is utilized as follows:
challenge = "Basic" realm
credentials = "Basic" basic-credentials
Upon receipt of an unauthorized request for a URI within the protection space, the origin server MAY respond with a challenge like the following:
WWW-Authenticate: Basic realm="WallyWorld"
where "WallyWorld" is the string assigned by the server to identify the protection space of the Request-URI. A proxy may respond with the same challenge using the Proxy-Authenticate header field.
We delegate this function to the basic method defined later.
From RFC2617 p.6 – Like Basic Access Authentication, the Digest scheme is based on a simple challenge-response paradigm. The Digest scheme challenges using a nonce value. A valid response contains a checksum (by default, the MD5 checksum) of the username, the password, the given nonce value, the HTTP method, and the requested URI. In this way, the password is never sent in the clear. Just as with the Basic scheme, the username and password must be prearranged in some fashion not addressed by this document.
If a server receives a request for an access-protected object, and an acceptable Authorization header is not sent, the server responds with a "401 Unauthorized" status code, and a WWW-Authenticate header as per the framework defined above, which for the digest scheme is utilized as follows:
challenge = "Digest" digest-challenge
digest-challenge = 1#( realm | [ domain ] | nonce |
[ opaque ] |[ stale ] | [ algorithm ] |
[ qop-options ] | [auth-param] )
domain = "domain" "=" <"> URI ( 1*SP URI ) <">
URI = absoluteURI | abs_path
nonce = "nonce" "=" nonce-value
nonce-value = quoted-string
opaque = "opaque" "=" quoted-string
stale = "stale" "=" ( "true" | "false" )
algorithm = "algorithm" "=" ( "MD5" | "MD5-sess" | token )
qop-options = "qop" "=" <"> 1#qop-value <">
qop-value = "auth" | "auth-int" | token
The client is expected to retry the request, passing an Authorization header line, which is defined according to the framework above, utilized as follows.
credentials = "Digest" digest-response
digest-response = 1#( username | realm | nonce |
digest-uri | response | [ algorithm ] |
[cnonce] | [opaque] | [message-qop] |
[nonce-count] | [auth-param] )
username = "username" "=" username-value
username-value = quoted-string
digest-uri = "uri" "=" digest-uri-value
digest-uri-value = request-uri ; As specified by HTTP/1.1
message-qop = "qop" "=" qop-value
cnonce = "cnonce" "=" cnonce-value
cnonce-value = nonce-value
nonce-count = "nc" "=" nc-value
nc-value = 8LHEX
response = "response" "=" request-digest
In this document the string obtained by applying the digest algorithm to the data "data" with secret "secret" will be denoted by KD(secret, data), and the string obtained by applying the checksum algorithm to the data "data" will be denoted H(data). The notation unq(X) means the value of the quoted-string X without the surrounding quotes.
For the "MD5" and "MD5-sess" algorithms
H(data) = MD5(data)
KD(secret, data) = H(concat(secret, ":", data))
The first time the client requests the document, no Authorization header is sent, so the server responds with:
HTTP/1.1 401 Unauthorized
The client may prompt the user for the username and password, after which it will respond with a new request, including the following Authorization header:
Authorization: Digest username="Mufasa",
We define the digest method to create such a digest response.
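The following Python 3 sketch shows one way the request-digest could be computed from the H, KD, A1 and A2 definitions of RFC 2617 quoted in this section. The function name digestResponse and its argument list are assumptions of this sketch, and it covers only the "MD5" algorithm; "MD5-sess" is left out.

```python
import hashlib


def H(data):
    """H(data) = MD5(data), as a lowercase hexadecimal string."""
    return hashlib.md5(data.encode('utf-8')).hexdigest()


def KD(secret, data):
    """KD(secret, data) = H(concat(secret, ":", data))."""
    return H(secret + ':' + data)


def digestResponse(username, realm, password, method, uri, nonce,
                   qop=None, nc=None, cnonce=None, entityBody=''):
    """Compute the request-digest for algorithm MD5 (hypothetical helper)."""
    A1 = '%s:%s:%s' % (username, realm, password)
    if qop == 'auth-int':
        A2 = '%s:%s:%s' % (method, uri, H(entityBody))
    else:
        A2 = '%s:%s' % (method, uri)
    if qop in ('auth', 'auth-int'):
        data = ':'.join((nonce, nc, cnonce, qop, H(A2)))
    else:
        # No qop directive: the RFC 2069 compatible construction.
        data = nonce + ':' + H(A2)
    return KD(H(A1), data)
```

With the values of the RFC 2617 example (username "Mufasa", realm "testrealm@host.com", password "Circle Of Life", method GET, URI /dir/index.html, nonce "dcd98b7102dd2f0e8b11d0f600bfb0c093", qop auth, nc 00000001 and cnonce "0a4f113b"), this function returns 6629fae49393a05397450978507c4ef1, which is the response value given in that example.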
If the "algorithm" directive's value is "MD5" or is unspecified, then A1 is:
A1 = unq(username-value) ":" unq(realm-value) ":" passwd
passwd = < user's password >
If the "algorithm" directive's value is "MD5-sess", then A1 is calculated only once - on the first request by the client following receipt of a WWW-Authenticate challenge from the server. It uses the server nonce from that challenge, and the first client nonce value to construct A1 as follows:
A1 = H( unq(username-value) ":" unq(realm-value)
":" passwd )
":" unq(nonce-value) ":" unq(cnonce-value)
If the "qop" directive's value is "auth" or is unspecified, then A2 is:
A2 = Method ":" digest-uri-value
If the "qop" value is "auth-int", then A2 is:
A2 = Method ":" digest-uri-value ":" H(entity-body)
If the "qop" value is "auth" or "auth-int":
request-digest = <"> < KD ( H(A1), unq(nonce-value) ":" nc-value ":" unq(cnonce-value) ":" unq(qop-value) ":" H(A2) ) <">
If the "qop" directive is not present (this construction is for compatibility with RFC 2069):
request-digest = <"> < KD ( H(A1), unq(nonce-value) ":" H(A2) ) > <">
If the user agent wishes to send the userid "Aladdin" and password "open sesame", it would use the following header field:
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
To receive authorization, the client sends the userid and password, separated by a single colon (":") character, within a base64 encoded string in the credentials.
basic-credentials = base64-user-pass
base64-user-pass = <base64 encoding of user-pass, except not limited to 76 char/line>
user-pass = userid ":" password
userid = *<TEXT excluding ":">
password = *TEXT
Userids might be case sensitive.
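The basic-credentials rule above can be sketched in a few lines of Python 3. The function name basicCredentials and the choice of UTF-8 for encoding the user-pass string are assumptions of this sketch.

```python
import base64


def basicCredentials(username, password):
    """Return the Authorization header value for the Basic scheme."""
    userpass = ('%s:%s' % (username, password)).encode('utf-8')
    return 'Basic ' + base64.b64encode(userpass).decode('ascii')
```

Calling basicCredentials('Aladdin', 'open sesame') returns Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==, which matches the example header field quoted from the RFC.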
The authentication module forms an integral part of any SIP implementation, on both the client and the server side. Next we explore the client-specific extensions for SIP telephony.
This part extends the basic implementation to support various client-side features such as media transport, traversal of network address translators (NAT) and firewalls, instant messaging and presence, contact list management, and audio-video tools.
Implementing RTP/RTCP as per RFC 3550, RFC 3551
The Real-time Transport Protocol (RTP) defines a standardized packet format for delivering audio and video over the Internet. It is used alongside several Internet protocols, such as RTSP for streaming and SIP for multimedia sessions.
From RFC3550 p.1 – This memorandum describes RTP, the real-time transport protocol. RTP provides end-to-end network transport functions suitable for applications transmitting real-time data, such as audio, video or simulation data, over multicast or unicast network services. RTP does not address resource reservation and does not guarantee quality-of-service for real-time services. The data transport is augmented by a control protocol (RTCP) to allow monitoring of the data delivery in a manner scalable to large multicast networks, and to provide minimal control and identification functionality. RTP and RTCP are designed to be independent of the underlying transport and network layers. The protocol supports the use of RTP-level translators and mixers.
Most of the text in this memorandum is identical to RFC 1889 which it obsoletes. There are no changes in the packet formats on the wire, only changes to the rules and algorithms governing how the protocol is used. The biggest change is an enhancement to the scalable timer algorithm for calculating when to send RTCP packets in order to minimize transmission in excess of the intended rate when many participants join a session simultaneously.
The RTP specification prepends a header, typically 12 bytes long, to the audio or video payload. This RTP header provides synchronization, timing and sequencing information. RTP works in conjunction with another protocol, the Real-time Transport Control Protocol (RTCP), which provides quality feedback and synchronization information. The base specification works for both unicast and multicast applications. Implementing base RTP is straightforward; implementing RTCP is more involved. Unlike other standards such as SIP, the RFCs specify RTP and RTCP in very low-level detail, including source code in the C programming language. This helps us in those parts where we can readily port the C source code to Python for our implementation.
In this chapter we will implement RTP and RTCP as per RFC 3550. We will also implement the audio video profile as defined in RFC 3551. Let’s create two new modules named rfc3550 and rfc3551 to implement these two specifications.
At a high level there are four parts in the rfc3550 module: (1) the RTP and RTCP classes define the packet formats for RTP and RTCP, respectively, including parsing and formatting, (2) the Session class defines the control behavior of an RTP session, (3) the Source class represents a member or source in a session, and (4) the Network class abstracts the network behavior, such as a pair of sockets, and hence allows us to keep the network transport outside our implementation.
In our module we will use a number of existing libraries such as struct for binary packet format, random for random number generation, math for various math operations, time for getting the current time, and socket for getting the IP address and performing network transport.
Let’s also define a convenience flag to enable or disable the trace in our module.
The packet formats for RTP and RTCP are binary. Let’s define a convenience function to print data in binary form, to help us debug our module. Let’s assume the binstr function converts the supplied string into its binary representation with up to 32 bits per line. The specification also aligns the various headers on 32-bit boundaries.
We implement this function in two steps. First we define a method called binary, which converts the supplied data into a list of strings, where each string is the binary representation of a fixed number of consecutive bytes, as controlled by the size argument. For example, calling binary(data, size=4) returns a list containing the binary representations of all the 32-bit words in the data.
In the second step we define a method to convert this list into a single multi-line string for printing.
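The two steps could look like the following sketch. It is written for Python 3, where the packet data is a bytes object rather than a string; the exact signatures are assumptions based on the description above.

```python
def binary(data, size=4):
    """Return a list of bit strings, one for each size-byte chunk of data."""
    return [''.join(format(byte, '08b') for byte in data[i:i + size])
            for i in range(0, len(data), size)]


def binstr(data, size=4):
    """Return the bit strings of data joined into one multi-line string."""
    return '\n'.join(binary(data, size))
```

For instance, binary(b'\x80\x01', size=4) returns ['1000000000000001']. The last chunk is simply shorter when the data length is not a multiple of size.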
From RFC3550 p.8 – RTP packet: A data packet consisting of the fixed RTP header, a possibly empty list of contributing sources (see below), and the payload data. Some underlying protocols may require an encapsulation of the RTP packet to be defined. Typically one packet of the underlying protocol contains a single RTP packet, but several RTP packets MAY be contained if permitted by the encapsulation method.
Let’s assume that the RTP class represents an RTP packet. There are two important functions: parsing and formatting. An RTP object can be constructed either from individual elements of the RTP header or from the received data. In the latter case, it parses the data into the RTP object. The formatting function can be implemented using the __repr__ method to get the string (binary) representation of the object. This allows us to use the object in string context, where it automatically gets the binary formatted value of the packet.
The following example shows how to construct an RTP packet by specifying the individual header fields using named parameters. The extn argument supplies the length as well as the value in a tuple, and the payload argument supplies the value in binary form.
By printing the hexadecimal representation of the packet, we can confirm that our packet was well formed. In particular, you can check the various headers, the extension field, payload and the final padding byte.
To further verify the functions, let’s construct another RTP packet using the value of the first packet.
We can print the individual header fields of the packet to verify that the values are the ones set in the original packet.
The following example demonstrates the binary representation of the RTP packet.
Let’s define the RTP class. As mentioned above, the constructor takes an overloaded set of arguments: either the value argument is supplied, containing the binary packet, or the individual header fields are supplied. These individual header fields are stored as properties of the RTP object. The pt property, or payload type, is an integer in the range 0-127. The seq property is a two-byte integer representing the sequence number. The ts property is a four-byte integer representing the timestamp. The ssrc property is a four-byte integer representing the synchronization source identifier. The csrcs property is a list of four-byte integers representing the contributing source identifiers, if any. The marker property is a Boolean indicating whether the marker bit is set. The optional extn property is a tuple, with the first element indicating the length and the second element holding the actual binary data of the extension. The payload property holds the actual binary payload data of this packet.
If the value argument is supplied, we parse the value into the various header field properties as follows. The data must be at least 12 bytes long, which is the minimum header size; otherwise an error is raised. The RTP version number must be 2; otherwise an error is raised.
The first 12 bytes are unpacked into the initial mandatory headers.
This is followed by an optional list of CSRCs.
If an extension is present it is parsed into the extn property.
Finally, the payload is stored in the payload property. Note that if padding is present, the padding bytes are not included in the payload.
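The parsing steps above can be sketched as a standalone Python 3 function that follows the RTP header layout of RFC 3550. The function name parseRTP and the dictionary it returns are assumptions of this sketch, not the RTP class itself. The extension is returned here as a pair of the 16-bit profile-defined identifier and the extension data, which is also an assumption and may differ from the tuple layout used by the class.

```python
import struct


def parseRTP(value):
    """Parse a binary RTP packet (bytes) into a dict of header fields."""
    if len(value) < 12:
        raise ValueError('RTP packet must be at least 12 bytes')
    if value[0] >> 6 != 2:
        raise ValueError('RTP version must be 2')
    # First 12 bytes: V/P/X/CC, M/PT, sequence number, timestamp, SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack('!BBHII', value[:12])
    padding, extension, cc = bool(b0 & 0x20), bool(b0 & 0x10), b0 & 0x0F
    offset = 12
    # Optional list of CC contributing sources, four bytes each.
    csrcs = list(struct.unpack('!%dI' % cc, value[offset:offset + 4 * cc]))
    offset += 4 * cc
    extn = None
    if extension:
        # Extension header: 16-bit identifier, 16-bit length in 32-bit words.
        ident, words = struct.unpack('!HH', value[offset:offset + 4])
        extn = (ident, value[offset + 4:offset + 4 + 4 * words])
        offset += 4 + 4 * words
    # The last padding byte holds the number of padding bytes to ignore.
    end = len(value) - value[-1] if padding else len(value)
    return dict(pt=b1 & 0x7F, marker=bool(b1 & 0x80), seq=seq, ts=ts,
                ssrc=ssrc, csrcs=csrcs, extn=extn, payload=value[offset:end])
```

A truncated CSRC list or extension surfaces as a struct.error in this sketch; a fuller implementation would probably check the lengths explicitly.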
Formatting the RTP object into binary format can be done in a single Python statement as shown below. This example shows the power of the programming language for this kind of implementation.
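For comparison, here is a more verbose Python 3 sketch of the same formatting step, spread over several statements for readability. The function name formatRTP and its arguments are assumptions of this sketch; it takes the extension as a pair of 16-bit identifier and data, assumes the extension data is a multiple of four bytes, and does not generate padding.

```python
import struct


def formatRTP(pt=0, seq=0, ts=0, ssrc=0, csrcs=(), marker=False,
              extn=None, payload=b''):
    """Format RTP header fields and payload into a binary packet (bytes)."""
    # Version 2, no padding, X bit if an extension is given, CSRC count.
    b0 = 0x80 | (0x10 if extn else 0) | (len(csrcs) & 0x0F)
    b1 = (0x80 if marker else 0) | (pt & 0x7F)
    packet = struct.pack('!BBHII', b0, b1, seq & 0xFFFF,
                         ts & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    packet += b''.join(struct.pack('!I', csrc) for csrc in csrcs)
    if extn:
        ident, data = extn
        # The extension length is counted in 32-bit words.
        packet += struct.pack('!HH', ident, len(data) // 4) + data
    return packet + payload
```

The network byte order prefix ('!') in the struct format strings takes care of the big-endian layout that RTP requires.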
RTCP packet: A control packet consisting of a fixed header part similar to that of RTP data packets, followed by structured elements that vary depending upon the RTCP packet type. Typically, multiple RTCP packets are sent together as a compound RTCP packet in a single packet of the underlying protocol; this is enabled by the length field in the fixed header of each RTCP packet.
Let’s assume that the RTCP class implements a compound RTCP packet. For representing an individual packet or a sub-packet we assume the nested class RTCP.packet. As with an RTP packet we would like to be able to create an RTCP packet using individual header components. The following example creates a new sender report packet.
Similarly, we can create the receiver report with two report elements as follows:
As you can see, the RTCP.packet class can be used for many purposes. It defines dynamic attributes as well as container syntax for the properties, similar to the SDP class implemented in an earlier chapter.
The RTCP SDES packet can be created as follows. Each item is a tuple, with a list of attributes such as CNAME, NAME, PHONE, etc.
An RTCP BYE packet can be created as follows.
The compound RTCP packet, with list semantics, can be created from these individual packets by supplying the list of individual packets.
For parsing an RTCP packet, you can construct the object using a single binary string argument. For example, we create p2 by formatting and parsing back the original compound packet p1.
Let’s walk through some more functions of the RTCP class. If you know the number of individual packets in the compound packet, you can use the list semantics on the compound packet to extract an individual packet. For example,
We can also access the various properties of the objects or sub-objects, either with attribute or with container access. Some examples follow, the results of which you can compare with the original values we set in our exercise.
Let’s implement the RTCP class as a sub-class of list, so that it inherits the list semantics and syntax for representing the compound packet.
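To illustrate the idea of a list sub-class, here is a much reduced Python 3 sketch that only splits a compound packet into its individual packets, using the length field of each fixed header. The class name CompoundRTCP and the representation of each element as a (packet type, raw bytes) pair are assumptions of this sketch; it does not decode the packet contents the way the RTCP class is meant to.

```python
import struct


class CompoundRTCP(list):
    """A list of (packet type, raw bytes) pairs from one compound packet."""

    def __init__(self, value=b''):
        super().__init__()
        while len(value) >= 4:
            # Fixed header: V/P/count, packet type, length in words minus one.
            b0, pt, length = struct.unpack('!BBH', value[:4])
            if b0 >> 6 != 2:
                raise ValueError('RTCP version must be 2')
            size = 4 * (length + 1)
            if size > len(value):
                raise ValueError('truncated RTCP packet')
            self.append((pt, value[:size]))
            value = value[size:]

    def __bytes__(self):
        """Format the compound packet by concatenating its packets."""
        return b''.join(raw for pt, raw in self)
```

Because the class inherits from list, len(p), p[0] and iteration work on the compound packet without any extra code, which is the benefit of the list-based design.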