   #include <btparse.h>

   /* Basic library initialization / cleanup */
   void bt_initialize (void);
   void bt_free_ast (AST *ast);
   void bt_cleanup (void);

   /* Input / interface to parser */
   void   bt_set_stringopts (bt_metatype_t metatype, btshort options);
   AST * bt_parse_entry_s (char *    entry_text,
                           char *    filename,
                           int       line,
                           btshort   options,
                           boolean * status);
   AST * bt_parse_entry   (FILE *    infile,
                           char *    filename,
                           btshort   options,
                           boolean * status);
   AST * bt_parse_file    (char *    filename,
                           btshort   options,
                           boolean * overall_status);

   /* AST traversal/query */
   AST * bt_next_entry (AST * entry_list, AST * prev_entry);
   AST * bt_next_field (AST *entry, AST *prev, char **name);
   AST * bt_next_value (AST *head,
                        AST *prev,
                        bt_nodetype_t *nodetype,
                        char **text);

   bt_metatype_t bt_entry_metatype (AST *entry);
   char *bt_entry_type (AST *entry);
   char *bt_entry_key (AST *entry);
   char *bt_get_text (AST *node);

   /* Splitting names and lists of names */
   bt_stringlist * bt_split_list (char *   string,
                                  char *   delim,
                                  char *   filename,
                                  int      line,
                                  char *   description);
   void bt_free_list (bt_stringlist *list);
   bt_name * bt_split_name (char *  name,
                            char *  filename,
                            int     line,
                            int     name_num);
   void bt_free_name (bt_name * name);

   /* Formatting names */
   bt_name_format * bt_create_name_format (char * parts,
                                           boolean abbrev_first);
   void bt_free_name_format (bt_name_format * format);
   void bt_set_format_text (bt_name_format * format,
                            bt_namepart part,
                            char * pre_part,
                            char * post_part,
                            char * pre_token,
                            char * post_token);
   void bt_set_format_options (bt_name_format * format,
                               bt_namepart part,
                               boolean abbrev,
                               bt_joinmethod join_tokens,
                               bt_joinmethod join_part);
   char * bt_format_name (bt_name * name, bt_name_format * format);

   /* Construct tree from TeX groups */
   bt_tex_tree * bt_build_tex_tree (char * string);
   void bt_free_tex_tree (bt_tex_tree **top);
   void bt_dump_tex_tree (bt_tex_tree *node, int depth, FILE *stream);
   char * bt_flatten_tex_tree (bt_tex_tree *top);

   /* Miscellaneous string utilities */
   void bt_purify_string (char * string, btshort options);
   void bt_change_case (char transform, char * string, btshort options);
Note that the interface provided by btparse, while complete, is fairly low-level. If you have more sophisticated needs, you might be interested in my "Text::BibTeX" module for Perl 5 (available on CPAN).
In particular, you should have a good idea what's going on in the following:
   @string{and = { and },
           joe = "Blow, Joe",
           john = "John Smith"}

   @book(ourbook,
         author = joe # and # john,
         title = {Our Little Book})
If this looks like something you want to parse, but don't want to have to write your own parser for, you've come to the right place.
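To make the example above concrete, the following sketch mimics what the "author" field amounts to under ordinary BibTeX semantics: each macro name is replaced by its @string definition, and the pieces joined by "#" are pasted together. It does not call btparse and is not how the library is implemented; the table "demo_macros" and the functions "demo_lookup" and "demo_expand_and_paste" are hypothetical names invented for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a macro table; the definitions are the
   ones from the @string entry in the example. */
struct demo_macro { const char *name; const char *text; };

static const struct demo_macro demo_macros[] = {
    { "and",  " and " },
    { "joe",  "Blow, Joe" },
    { "john", "John Smith" },
};

/* Return the expansion of a macro name, or NULL if it is undefined. */
const char *demo_lookup (const char *name)
{
    size_t i;

    for (i = 0; i < sizeof demo_macros / sizeof demo_macros[0]; i++)
        if (strcmp (demo_macros[i].name, name) == 0)
            return demo_macros[i].text;
    return NULL;
}

/* Expand each macro name in turn and paste the results into out (the
   effect of the `#' operator).  Returns 0 on success, or -1 if a macro
   is undefined or out is too small. */
int demo_expand_and_paste (const char *const *names, size_t count,
                           char *out, size_t out_size)
{
    size_t i, used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';
    for (i = 0; i < count; i++)
    {
        const char *text = demo_lookup (names[i]);
        size_t len;

        if (text == NULL)
            return -1;
        len = strlen (text);
        if (used + len + 1 > out_size)
            return -1;
        memcpy (out + used, text, len + 1);   /* copies the NUL too */
        used += len;
    }
    return 0;
}
```

Called with the names "joe", "and", "john" (in that order) and a large enough buffer, "demo_expand_and_paste" leaves the string "Blow, Joe and John Smith" in the buffer.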
Before going much further, though, you're going to have to learn some of the terminology I use for describing BibTeX data. Most of it's the same as you'll find in any BibTeX documentation, but it's important to be sure that we're talking about the same things here. So, some definitions:
! $ & * + - . / : ; < > ? [ ] ^ _ ` |
A ``name'' is a catch-all used for entry types, entry keys, and field and macro names. For BibTeX compatibility, there are slightly different rules for these four entities; currently, the only such rule actually implemented is that field and macro names may not begin with a digit. Some names in the above example: "string", "and".
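As a rough illustration of the rule for field and macro names, here is a minimal sketch of a validity check. It assumes that a name consists only of letters, digits, and the punctuation characters listed above, and it applies the one implemented restriction (no leading digit). The function "demo_is_field_name" is a hypothetical helper, not part of btparse, and the library's own lexer may well differ in detail.

```c
#include <ctype.h>
#include <string.h>

/* Punctuation assumed to be legal inside a name (the list above). */
static const char demo_name_punct[] = "!$&*+-./:;<>?[]^_`|";

/* Return 1 if s looks like a legal field or macro name, 0 otherwise. */
int demo_is_field_name (const char *s)
{
    const unsigned char *p = (const unsigned char *) s;

    if (*p == '\0' || isdigit (*p))
        return 0;                   /* empty, or begins with a digit */
    for (; *p != '\0'; p++)
        if (!isalnum (*p) && strchr (demo_name_punct, *p) == NULL)
            return 0;               /* character outside the assumed set */
    return 1;
}
```

Under these assumptions "author" and "my_key:1" are accepted, while "9lives" (leading digit) and "a,b" (comma) are rejected.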
Working with btparse generally consists of passing the library some BibTeX data (or a source for some BibTeX data, such as a filename or a file pointer), which it then lexically scans and parses, building an abstract syntax tree (AST) as it goes. It returns this AST to you, and you call other btparse functions to traverse and query the tree.
The contents of AST nodes are the private domain of the library, and you shouldn't go poking into them. This being C, though, there's nothing to prevent you from doing so except good manners and the possibility that I might change the AST structure in future releases, breaking any badly-behaved code. Also, it's not necessary to know the structural relationships between nodes in the AST---that's taken care of by the query/traversal functions.
However, it's useful to know some of the things that btparse deposits in the AST and returns to you through those query/traversal functions. First off, each node has a ``node type,'' which records the syntactic element corresponding to each node. For instance, the entry
   @book{mybook,
         author = "Joe Blow",
         title = "My Little Book"}
is rooted by an ``entry'' node; under this would be found a ``key'' node (for the entry key), two ``field'' nodes (for the ``author'' and ``title'' fields); and associated with each field node would be a ``string'' node. The only time this concerns you is when you ask the library for a simple value; just looking at the text is not enough to distinguish quoted strings, numbers, and macro names, so btparse returns the nodetype as well.
In addition to the nodetype, btparse records the metatype of each ``entry'' node. This allows you (and the library) to distinguish, say, regular entries from comment entries. Not only do they have very different structures and must therefore be traversed differently by the library, but certain traversal functions make no sense on certain entry metatypes---thus it's necessary for you to be able to make the distinction as well.
That said, everything you need to know to work with the AST is explained in bt_traversal.
Next, two enumeration types are defined: "bt_metatype" and "bt_nodetype". Both of these are used extensively in the library itself, and are made available to users of the library because they can be found in nodes of the "btparse" AST (abstract syntax tree). (I.e., querying the AST can give you "bt_metatype" and "bt_nodetype" values, so the "typedef"s must be available to your code.)
which are determined by the ``entry type'' token. (@string entries have the "BTE_MACRODEF" metatype; @comment and @preamble correspond to "BTE_COMMENT" and "BTE_PREAMBLE"; and any other entry type has the "BTE_REGULAR" metatype.)
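The mapping from entry type to metatype described above can be sketched as follows. This is only an illustration: the enum "demo_metatype" is a local stand-in for the library's "bt_metatype" values, the helper names are hypothetical, and the case-insensitive comparison is an assumption based on BibTeX's own treatment of entry types.

```c
#include <ctype.h>

/* Local stand-ins; real code would use the bt_metatype values that
   come from <btparse.h>. */
typedef enum
{
    DEMO_REGULAR,
    DEMO_COMMENT,
    DEMO_PREAMBLE,
    DEMO_MACRODEF
} demo_metatype;

/* Case-insensitive (ASCII) string equality: 1 if equal, else 0. */
int demo_same_word (const char *a, const char *b)
{
    for (; *a != '\0' && *b != '\0'; a++, b++)
        if (tolower ((unsigned char) *a) != tolower ((unsigned char) *b))
            return 0;
    return *a == *b;            /* equal only if both strings ended */
}

/* Classify an entry type token (the word after the `@'). */
demo_metatype demo_classify (const char *entry_type)
{
    if (demo_same_word (entry_type, "string"))
        return DEMO_MACRODEF;
    if (demo_same_word (entry_type, "comment"))
        return DEMO_COMMENT;
    if (demo_same_word (entry_type, "preamble"))
        return DEMO_PREAMBLE;
    return DEMO_REGULAR;        /* book, article, anything else */
}
```

In a real program you would not classify entries yourself; you would ask the library with "bt_entry_metatype()" and compare the result against the "BTE_*" constants.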
Of these, you'll only ever deal with the last three. They are returned when you query the AST for a simple value---just seeing the text isn't enough to distinguish between a quoted string, a number, and a macro, so the AST nodetype is supplied along with the text.
There are three basic macros for constructing this bitmap:
For instance, supplying "BTO_CONVERT | BTO_EXPAND" as the string options bitmap for the "BTE_REGULAR" metatype means that all simple values in ``regular'' entries will be converted to strings: numbers will simply have their ``nodetype'' changed, and macros will be expanded. Nothing else will be done to the simple values, though---they will not be concatenated, nor will whitespace be collapsed. See the "bt_set_stringopts()" and "bt_parse_*()" functions in bt_input for more information on the various options for parsing; see bt_postprocess for details on the post-processing.
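The way such a bitmap combines independent processing steps can be sketched with ordinary bit flags. The "DEMO_*" constants below are stand-ins with made-up values, chosen only to show the mechanism; real programs should use the "BTO_*" constants from <btparse.h>, whose values may differ. "demo_has_option" is likewise a hypothetical helper.

```c
/* Stand-in flag values for illustration only; the real BTO_* constants
   are defined in <btparse.h>. */
enum
{
    DEMO_CONVERT  = 1 << 0,     /* numbers become strings */
    DEMO_EXPAND   = 1 << 1,     /* macros are expanded */
    DEMO_PASTE    = 1 << 2,     /* simple values are concatenated */
    DEMO_COLLAPSE = 1 << 3      /* whitespace is collapsed */
};

/* Return 1 if every bit of flag is set in options, 0 otherwise. */
int demo_has_option (unsigned options, unsigned flag)
{
    return (options & flag) == flag;
}
```

With "options" set to "DEMO_CONVERT | DEMO_EXPAND", "demo_has_option" reports the convert and expand steps as enabled and the paste and collapse steps as disabled, which mirrors the behaviour described above. In a real program the corresponding call would be "bt_set_stringopts (BTE_REGULAR, BTO_CONVERT | BTO_EXPAND)".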
   #include <stdlib.h>          /* for exit() */
   #include <btparse.h>

   int main (void)
   {
      bt_initialize ();

      /* process some data */

      bt_cleanup ();
      exit (0);
   }
Please note the call to "bt_initialize()"; this is very important! Without it, the library may crash or fail mysteriously. You must call "bt_initialize()" before calling any other btparse functions. "bt_cleanup()" just frees the memory allocated by "bt_initialize()"; if you are careful to call it before exiting, and "bt_free_ast()" on any abstract syntax trees generated by btparse when you are done with them, then your program shouldn't have any memory leaks. (Unless they're due to your own code, of course!)
Another limitation that is due to PCCTS: entries with a large number of fields (more than about 90, if each field value is just a single string) will cause the parser to crash. This is unavoidable due to the parser using statically-allocated stacks for attributes and abstract-syntax tree nodes. I could increase the static allocation, but that would just decrease the likelihood of encountering the problem, not make it go away. Again, the chances of this changing as long as I'm using PCCTS 1.x are nil.
Apart from those inherent limitations, there are no known bugs in btparse. Any segmentation faults or bus errors from the library should be considered bugs. They probably result from using the library incorrectly (e.g., attempting to interleave the parsing of two files), but I do make an attempt to catch all such mistakes, and if I've missed any I'd like to know about them.
Any memory leaks from the library are also a concern; as long as you are conscientious about calling the cleanup functions ("bt_free_ast()" and "bt_cleanup()"), then the library shouldn't leak.
To traverse the syntax tree that results, see bt_traversal.
To learn what is done to values in parsed entries, and how to customize that munging, see bt_postprocess.
To learn how btparse deals with strings, see bt_strings (oops, I haven't written this one yet!).
To manipulate and access the btparse macro table, see bt_macros.
For splitting author names and lists ``the BibTeX way'' using btparse, see bt_split_names.
To put author names back together again, see bt_format_names.
Miscellaneous functions for processing strings ``the BibTeX way'': bt_misc.
A semi-formal language definition is in bt_language.
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Library General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Library General Public License for more details.
You should have received a copy of the GNU Library General Public License along with this library; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
http://starship.python.net/~gward/btOOL/
You will also find the latest version of Text::BibTeX, the Perl library that provides a high-level front-end to btparse, there. btparse is needed to build "Text::BibTeX", and must be downloaded separately.
Both libraries are also available on CTAN (the Comprehensive TeX Archive Network, "http://www.ctan.org/tex-archive/") and CPAN (the Comprehensive Perl Archive Network, "http://www.cpan.org/"). Look in biblio/bibtex/utils/btOOL/ on CTAN, and authors/Greg_Ward/ on CPAN. For example,
   http://www.ctan.org/tex-archive/biblio/bibtex/utils/btOOL/
   http://www.cpan.org/authors/Greg_Ward
will both get you to the latest version of "Text::BibTeX" and btparse --- but of course, you should always access busy sites like CTAN and CPAN through a mirror.