Module Kwrefer

module Kwrefer: sig .. end

Parse and Generate Extended Refer Databases

Refer is an excellent low-noise, easy to edit, flat-file data format for non-recursive key-value data with repeating and optional fields. See Refer Data Format for more information.

Refer databases (collections of files of refer records) can be validated against a schema.

Parsed refer records are represented by this module as association lists (alists). The order of the fields in the record is preserved.

A refer database on an input channel is best processed with Kwrefer.fold. It can also be processed as a stream of char (using Stream.of_channel or Stream.of_string) with Kwrefer.records, which converts the char stream to a stream of refer records. In addition to List.assoc and friends, Kwlist.asplit and Kwlist.coalesce may be useful.

You can convert an alist to a string in refer format with Kwrefer.assemble; to write a list of alists out as a refer database, pass the result of assemble for each record to print_endline.
Author(s): Keith Waclena


exception Syntax of string option * string
Exception raised when syntax errors are encountered in parsing: Syntax (location, explanation)
exception Err of int * string
Exception raised when syntax errors are encountered in parsing: Err (ln, line)
exception Invalid_record of string
Exception raised when a parsed record doesn't match the semantics expected by some utility function (e.g. doesn't match a schema).

Parsing Refer Records and Converting to Internal Form


val fold : ?err:(string -> int -> 'a -> string list -> 'a) ->
(int -> 'a -> (string * string) list -> 'a) -> 'a -> Kwchan.src -> 'a
fold ?err f init src: fold function f over the refer records in src.

The err function's first parameter is the invalid line of text from the input src; the final parameter is the list of valid lines accumulated so far (this parameter is probably not of interest). Typically, you would report the first parameter as context for an error message.
Raises Err upon syntax errors

err : function f badline ln acc lines to call on syntax errors; the default is to raise Kwrefer.Err
val withoutln : ('a -> 'b -> 'c) -> 'a -> 'd -> 'b -> 'c
withoutln f: convert a function suitable for Kw.foldl into one suitable for Kwrefer.fold.
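The documented type of withoutln pins down its behaviour: the result takes one extra argument in the middle (presumably the line number) and ignores it. The following is a minimal sketch derived only from that type, not the library's source; count_step and count_with_ln are hypothetical names used for illustration.

```ocaml
(* Hypothetical re-implementation, derived only from the documented type
   ('a -> 'b -> 'c) -> 'a -> 'd -> 'b -> 'c. *)
let withoutln f acc _ln x = f acc x

(* A two-argument step function that counts records. *)
let count_step n (_record : (string * string) list) = n + 1

(* The adapted function accepts, and ignores, an extra middle argument. *)
let count_with_ln : int -> int -> (string * string) list -> int =
  withoutln count_step
```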
val records : ?loc:string -> char Stream.t -> (int * (string * string) list) Stream.t
records ?loc stream: convert a stream of characters to a stream of pairs of (line number, parsed record (alist)).

The line number represents the beginning of the record in the stream of characters.

loc : location (typically filename) to be helpfully added to exceptions
val parse : ?loc:string -> string -> (string * string) list
parse ?loc str: parse a single refer record (a string) to an alist.

This function is relatively inefficient; Kwrefer.fold or Kwrefer.records will be much faster for parsing entire files.
Returns the parsed record (alist)

loc : location (typically filename) to be helpfully added to exceptions

Converting Alists to Strings in Refer Format


val assemble : (string * string) list -> string
assemble alist: format a refer record (a string) from an alist
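As a rough illustration of what formatting an alist as a refer record could look like, here is a minimal sketch. It assumes one "%key value" line per field, as in the schema example elsewhere in this document; the exact output of Kwrefer.assemble (trailing newline, handling of multi-line values) is not specified here, and assemble_sketch is a hypothetical name.

```ocaml
(* Hypothetical sketch: one "%key value" line per field, in alist order.
   The real Kwrefer.assemble may differ in details such as trailing
   newlines or continuation lines. *)
let assemble_sketch (alist : (string * string) list) : string =
  alist
  |> List.map (fun (k, v) -> Printf.sprintf "%%%s %s" k v)
  |> String.concat "\n"
```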

Extracting Data from Alists


val keys : ('a * 'b) list -> 'a list
keys alist: get all (unique) keys in record (alist)
val keycounts : ('a * 'b) list -> ('a * int) list
keycounts alist: return counts of keys in a parsed record.
Returns an alist of (key,repeat-count)
val getall : 'a -> ('a * 'b) list -> 'b list
getall key alist: return list of all values in alist corresponding to key
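The three extraction functions above can be approximated with the standard library alone. This is a sketch of the described behaviour, not the library's implementation; the _sketch names are hypothetical, and returning keys in order of first occurrence is an assumption.

```ocaml
(* All values for [key], in record order. *)
let getall_sketch key alist =
  List.filter_map (fun (k, v) -> if k = key then Some v else None) alist

(* Distinct keys; first-occurrence ordering is an assumption. *)
let keys_sketch alist =
  List.fold_left
    (fun seen (k, _) -> if List.mem k seen then seen else k :: seen)
    [] alist
  |> List.rev

(* Pair each distinct key with the number of times it occurs. *)
let keycounts_sketch alist =
  List.map
    (fun k -> (k, List.length (getall_sketch k alist)))
    (keys_sketch alist)
```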

Schemas and Validation

A refer database is a collection of refer records, which can be represented externally as one or more refer files. These databases can be validated against a schema. Schemas associate sets of properties with fields.

A schema can be represented as a refer record; the compilation functions below will parse such a record as a schema (performing validation checks on the schema itself) and return the internal representation as type Kwrefer.schema. You can of course build compiled schemas from ad hoc strings or alists.

The validation functions below can then validate refer records against a compiled schema, returning a list of errors (if any).

In addition to a simple schema, applied to each record of a database, we support multischemas that can be applied to refer databases containing different types of records that are distinguishable by some field (called the schema key field or skey). This is not a normal KEY field because its values aren't unique across the database.

A multischema is itself a refer database consisting of multiple schemas, each identified by a true KEY field whose value is one of the skey values.

A keyed database is validated against a multischema by using each record's skey to lookup the appropriate schema to be used to validate that record.
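The lookup step just described can be sketched as follows. The representation of a multischema as an association list from skey values to schemas, and the names choice and select_schema, are assumptions made for illustration; the library's multi type is abstract.

```ocaml
(* Outcome of choosing a schema for one record. *)
type 'schema choice =
  | Found of 'schema          (* schema registered under the record's skey value *)
  | No_skey                   (* the record has no skey field at all *)
  | Unknown_skey of string    (* skey value not present in the multischema *)

(* [multi] maps skey values to schemas; [record] is a parsed alist. *)
let select_schema ~skey multi record =
  match List.assoc_opt skey record with
  | None -> No_skey
  | Some v -> (
      match List.assoc_opt v multi with
      | Some schema -> Found schema
      | None -> Unknown_skey v)
```

The Unknown_skey case corresponds to the situation reported by the Skey validation error.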

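Fields of schema records can't contain arbitrary values; they can only contain the property names described under Schema Properties below: OPT, REQ, UNIQ, REP, KEY, ENUM (followed by a comma-separated list of allowed values), and COMMENT (followed by free text extending to the end of the line).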

For example:
    %R author                           COMMENT schema key
    %A KEY REQ                          COMMENT author's name
    %D OPT UNIQ                         COMMENT author's dates: born-died
    %B OPT UNIQ ENUM complete,select    COMMENT state of bibliography
   

Validation occurs as follows. A simple schema or multi-schema must be compiled. If compiled from a file, a simple schema must be the only record in the file; each record of a multi-schema must have an skey field as named by the ~skey parameter of the compilation function; the skey fields of the multi-schema are KEYS and must have unique values.

If validating with a simple schema, the schema is applied to each record of the database; if a multi-schema, then the schema record that matches the skey field of the database record is used.

Most properties concern the presence or absence of fields, and their number, without regard to their values.

The fields of the record must accord with the properties of the corresponding field in the schema record. Each REQ field must occur in every database record; each UNIQ field can only occur once per record. OPT fields may be absent, and REP fields may occur multiple times.
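The presence and count rules can be sketched independently of the library. The types below (presence, arity, problem) and the triple-based schema representation are hypothetical stand-ins for the compiled schema, chosen only to keep the example self-contained.

```ocaml
type presence = Req | Opt
type arity = Uniq | Rep
type problem = Missing of string | Repeat of string

(* Number of occurrences of [field] in a parsed record (alist). *)
let count field record =
  List.length (List.filter (fun (k, _) -> k = field) record)

(* [schema] is a list of (field, presence, arity) triples. *)
let check_presence schema record =
  List.concat
    (List.map
       (fun (field, presence, arity) ->
         let n = count field record in
         (if presence = Req && n = 0 then [ Missing field ] else [])
         @ (if arity = Uniq && n > 1 then [ Repeat field ] else []))
       schema)
```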

Fields with an ENUM property place a constraint on the field value (if the field is present): the field value must be one of the possible values enumerated in the schema. For example, if the schema contains ENUM foo,bar,baz for a field A, then each value of each A field must be one of the simple strings foo, bar, or baz. Any other value is illegal. UNIQ and REQ properties control the presence of the fields as usual. ENUM values can't contain spaces or commas.
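A sketch of the ENUM check, assuming the enumeration is written as a comma-separated string as in the schema example. The names parse_enum and enum_violations are hypothetical.

```ocaml
(* Split an enumeration such as "foo,bar,baz" into its allowed values.
   This is safe because ENUM values cannot contain spaces or commas. *)
let parse_enum spec = String.split_on_char ',' spec

(* Values of [field] in [record] that are not among [allowed]. *)
let enum_violations ~field ~allowed record =
  List.filter_map
    (fun (k, v) ->
      if k = field && not (List.mem v allowed) then Some v else None)
    record
```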

Fields with a KEY property also place a constraint on the field value. The values of KEY fields must be unique across all records in the database. It is common for KEY fields to also be declared REQ and UNIQ, but this isn't mandatory. A KEY field that's not UNIQ allows a record to have multiple keys. A KEY field that's not REQ allows "anonymous" records.
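Because KEY uniqueness spans the whole database, checking it requires state carried from record to record. A minimal sketch, using a set of seen values as the accumulator; StringSet and check_keys are hypothetical names, and the library's own accumulator has a different type.

```ocaml
module StringSet = Set.Make (String)

(* Fold step: record every value of [keyfield], collecting values that
   were already seen earlier in the database as duplicates. *)
let check_keys ~keyfield (seen, dups) record =
  List.fold_left
    (fun (seen, dups) (k, v) ->
      if k <> keyfield then (seen, dups)
      else if StringSet.mem v seen then (seen, v :: dups)
      else (StringSet.add v seen, dups))
    (seen, dups) record
```

Folding check_keys over all records of a database yields both the set of key values and the list of duplicated ones.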

Any fields not mentioned in the schema are ignored, unless ~strict validation is being used, in which case any unmentioned fields generate Illegal errors.

Schema Properties


type prop = 
| OPT (*
optional field (mutually exclusive with REQ)
*)
| REQ (*
required field (mutually exclusive with OPT)
*)
| UNIQ (*
unique field (mutually exclusive with REP)
*)
| REP (*
repeatable field (mutually exclusive with UNIQ)
*)
| KEY (*
key field; values must be unique across entire database
*)
| ENUM (*
field value is restricted to enumeration
*)
| COMMENT (*
comment extending to EOL
*)
The type of schema property names.
type pprop 
val prop_of_string : string -> prop
val string_of_prop : prop -> string
module PS: Kwset.S  with type elt = pprop
Sets of properties.
module SS: Kwset.S  with type elt = string
Sets of strings (used for field names).
module SM: Kwmap.S  with type key = string
Maps with string keys.
module PM: Kwmap.S  with type key = pprop
Maps with property keys.
type validation_error = 
| Key of (string option * int * string * string) (*
KEY field has non-unique value
*)
| Illegal of (string option * int * string) (*
field not allowed by strict application of schema
*)
| Missing of (string option * int * string) (*
REQ field missing
*)
| Repeat of (string option * int * string) (*
UNIQ field repeated
*)
| Enum of (string option * int * string) (*
field value not in ENUM
*)
| Skey of (string option * int * string * string) (*
schema key value doesn't exist in multischema
*)
Errors that can occur when validating refer records.

string option's are ?loc's (e.g. filenames). int's are line numbers.


val string_of_validation_error : validation_error -> string
type schemaerror = 
| REQOPT of (int * string) (*
schema field has conflicting properties (REQ and OPT)
*)
| UNIQREP of (int * string) (*
schema field has conflicting properties (UNIQ and REP)
*)
| BADPROP of (int * string) (*
schema field has invalid property
*)
| TOOMANY of int (*
simple schema has too many records (> 1)
*)
| INVALID of validation_error list (*
multi-schema contains invalid schema record(s)
*)
| MANYSKEYS of int (*
schema has too many schema key fields
*)
Errors that can occur when compiling schemas.

int's are line numbers; strings are field names.

val string_of_schemaerror : string -> schemaerror -> string
string_of_schemaerror loc error: convert a schemaerror to a string, assuming loc is the location (e.g. filename).
type schema 
The type of simple schemas.
type multi 
The type of multi-schemas.
type schemas = 
| Simple of schema
| Multi of multi
| Bad of schemaerror list
The type of a compiled schema or multi-schema.
val string_of_schemas : schemas -> string
val getref : ?def:'a list ->
prop * prop -> 'b -> ('b * 'a list) list -> 'a list

Compiling Schemas in Refer Representation to Internal Form


val compile_stream : ?loc:string -> ?skey:SM.key -> char Stream.t -> schemas
compile_stream ?loc ?skey stream: compile the schema on stream to internal form.
loc : location (typically filename)
skey : the name of the schema key field (if any); required for multi-schema; not allowed for simple schema
val compile_channel : ?loc:string ->
?skey:SM.key -> Pervasives.in_channel -> schemas
compile_channel ?loc ?skey channel: compile the schema file open on channel to internal form.
loc : location (typically filename)
skey : the name of the schema key field (if any); required for multi-schema; not allowed for simple schema
val compile_file : ?skey:SM.key -> string -> schemas
compile_file ?skey file: compile the schema file to internal form.
skey : the name of the schema key field (if any); required for multi-schema; not allowed for simple schema
val compile_string : ?loc:string -> ?skey:SM.key -> string -> schemas
compile_string ?loc ?skey string: compile the schema in string to internal form.
loc : location (typically filename)
skey : the name of the schema key field (if any); required for multi-schema; not allowed for simple schema

Validating Refer Databases


val validate : ?loc:string ->
?strict:bool ->
schemas ->
SS.t SM.t * validation_error list ->
int ->
(SM.key * SS.elt) list ->
SS.t SM.t * validation_error list
validate ?loc ?strict cschema (keys,errs) ln alist: validate the refer record in alist

This is the lowest-level validation function. It validates one record at a time, and is designed to be used with a suitable fold, in particular Kwrefer.fold. If you partially apply it with a compiled schema cschema (and optionally any of loc and strict) you have a function that can be passed directly to Kwrefer.fold:

Kwrefer.fold (validate ?loc ?strict (compile_file schemafile)) (SM.empty,[]) chan

The accumulator (keys,errs) is dual purpose: errs accumulates a list of validation errors, if any. keys accumulates the values from the KEY fields of the database for duplicate-checking; duplicate keys are reported as errors in errs, so you can discard keys after the fold, or use it as a handy set of all key values.

loc : location (typically filename)
strict : whether or not to parse the database in strict mode
val validate_record : ?loc:string ->
?strict:bool ->
schemas ->
(SM.key * SS.elt) list -> validation_error list
validate_record ?loc ?strict cschema alist: validate the refer record in alist

Convenience function to validate just one record.

loc : location (typically filename)
strict : whether or not to parse the database in strict mode
val validate_channel : ?strict:bool ->
schemas -> string -> Kwchan.src -> validation_error list
validate_channel ?strict cschema filename channel: validate the refer database open on channel.

filename is the name associated with channel, for error messages.

validate_channel, partially applied with a compiled schema and optional ~strict, is suitable for use with Kwio.with_open_in_file.

strict : whether or not to parse the database in strict mode
val validate_file : ?strict:bool -> schemas -> string -> validation_error list
validate_file ?strict cschema filename: validate the refer database in filename.
strict : whether or not to parse the database in strict mode
val validate_files : ?strict:bool ->
schemas -> string list -> validation_error list
validate_files ?strict cschema filenames: validate the refer database in filenames.
strict : whether or not to parse the database in strict mode

Keyed Databases, Refermaps and Multi-maps

A keyed refer database is one in which every record has a key field.

A key field is a distinguished field, which is required (REQ) and unique (UNIQ), and whose values are unique across the database (KEY): i.e., specifying such a value identifies a unique record in the database.

A multi-map is a Kwmap whose keys are strings (the keyfield values) and whose values are refer records represented as refermap's.

type refermap = string list SM.t 
A refer record can be represented as a refermap, which is a Kwmap whose keys are strings (representing field names) and whose values are lists of strings, representing the values of the possibly repeating occurrences.
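To make the refermap representation concrete, here is a sketch that converts a parsed alist into a map from field names to value lists, using the standard library's Map in place of Kwmap. The names StringMap and refermap_of_alist are hypothetical, and keeping values in record order is an assumption.

```ocaml
module StringMap = Map.Make (String)

(* Group the values of repeated fields under a single key,
   keeping values in the order they appear in the record. *)
let refermap_of_alist (alist : (string * string) list) : string list StringMap.t =
  List.fold_left
    (fun m (k, v) ->
      let prev = Option.value (StringMap.find_opt k m) ~default:[] in
      StringMap.add k (prev @ [ v ]) m)
    StringMap.empty alist
```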
val bucket : ?err:('a list option ->
'b ->
('c * 'a) list -> ('a * ('c * 'a) list) list * ('c * 'a) list list) ->
'c ->
('a * ('c * 'a) list) list * ('c * 'a) list list ->
'b -> ('c * 'a) list -> ('a * ('c * 'a) list) list * ('c * 'a) list list
bucket ?err field acc ln alist: group records by a distinguishing field into buckets.

bucket field returns a function suitable for Kwrefer.fold that groups records into equivalence classes based on the value of field. The return value is a pair, consisting of a Kwrefer.SM map whose keys are the unique values of field and whose values are lists of those records (as alists), and a list of the remaining records that had problems with field.

field is assumed to be a KEY field, i.e. UNIQ and REQ. The remainder list consists of records that are missing field, or whose field has no value or multiple (repeating) values.

The Kwrefer.unbucket function can conveniently raise an error if the result contains any unbucketed records. Alternatively, you can pass an err function that will be called with the offending values option, a line number and the alist for each problematic record.

err : error function called for each problematic record
exception Unbucketed of int
Exception raised by Kwrefer.unbucket when discarding N unbucketed records.
val unbucket : ?strict:bool -> 'a * 'b list -> 'a
unbucket ?strict pair: discard unbucketed remainder of result of Kwrefer.bucket

If strict = true (the default), an exception is raised when the remainder contains any unbucketed records.
Raises Unbucketed if remainder is not of length zero
Returns the bucket map
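The documented behaviour of unbucket is small enough to sketch in full. This is a hypothetical re-implementation based on the description and type above, not the library's source; the exception is redeclared locally so the example is self-contained.

```ocaml
(* Local stand-in for Kwrefer.Unbucketed; the int is the number of
   unbucketed records being discarded. *)
exception Unbucketed of int

(* Drop the remainder, raising in strict mode if it is non-empty. *)
let unbucket_sketch ?(strict = true) (buckets, remainder) =
  match remainder with
  | _ :: _ when strict -> raise (Unbucketed (List.length remainder))
  | _ -> buckets
```

Passing ~strict:false silently discards the remainder, which may hide records with a missing or repeated key field.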

val to_mmap : ?loc:string ->
?validate:(?loc:string ->
SS.t SM.t * validation_error list ->
int ->
(SM.key * SS.elt) list ->
SS.t SM.t * validation_error list) ->
?map:refermap SM.t ->
string -> Pervasives.in_channel -> refermap SM.t
Deprecated. You should use Kwrefer.bucket.
to_mmap ?loc ?validate ?map keyfield channel: convert a keyed refer database to a multi-map.
Returns a map of refermap's
validate : optional validation function (Kwrefer.validate partially-applied with a compiled schema is suitable)
map : optional multi-map to populate (allows you to load several database files into one multi-map)

Extract fields from refermaps

In the get*u functions, the fields are assumed to have been validated as UNIQ; if they are not, the returned values are the first occurrences of each field.

val get : string list -> refermap -> string list list
Get some fields from a refermap.
Returns list of field values (as string list's)
val getu : string list -> refermap -> string list
Get some unique fields from a refermap.
Returns list of field values (as string's)
val get1 : string -> refermap -> string list
Get one field from a refermap.
Returns field value (as a string list)
val get1u : string -> refermap -> string
Get one unique field from a refermap.
Returns field value (as a string)
val get2 : string * string -> refermap -> string list * string list
Get two fields from a refermap.
Returns field values (as a pair of string list's)
val get2u : string * string -> refermap -> string * string
Get two unique fields from a refermap.
Returns field values (as a pair of string's)
val get3 : string * string * string ->
refermap -> string list * string list * string list
Get three fields from a refermap.
Returns field values (as a triple of string list's)
val get3u : string * string * string -> refermap -> string * string * string
Get three unique fields from a refermap.
Returns field values (as a triple of string's)
val get4 : string * string * string * string ->
refermap -> string list * string list * string list * string list
Get four fields from a refermap.
Returns field values (as a quadruple of string list's)
val get4u : string * string * string * string ->
refermap -> string * string * string * string
Get four unique fields from a refermap.
Returns field values (as a quadruple of string's)
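The relationship between the plain and the *u getters can be sketched over a standard-library map. The names below are hypothetical, and the handling of absent fields (an empty list for get1_sketch, Not_found for get1u_sketch) is an assumption; the documentation above does not say what the library does in that case.

```ocaml
module StringMap = Map.Make (String)

(* All values of [field]; an absent field yields [] (assumption). *)
let get1_sketch field m =
  Option.value (StringMap.find_opt field m) ~default:[]

(* First value of [field]; raises Not_found if absent (assumption). *)
let get1u_sketch field m =
  match StringMap.find_opt field m with
  | Some (v :: _) -> v
  | _ -> raise Not_found

(* The tuple variants apply the single-field getter to each component. *)
let get2u_sketch (f1, f2) m = (get1u_sketch f1 m, get1u_sketch f2 m)
```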

Refer Data Format

To be written.