TDParser internals
Beyond the Reference section, here is an in-depth description of tdparser’s internals.
Lexer helpers
This module holds the tdparser.Lexer class, which is available in the top-level tdparser module.
-
class tdparser.lexer.TokenRegistry
This class holds a set of (token, regexp) pairs, and selects the appropriate pair to extract data from a string.
Note
The TokenRegistry doesn’t interact with the Token subclasses provided through register(). This means that any kind of value can be provided for this field, and it will be returned as-is by the get_token() method.
-
_tokens
Holds a list of (Token, re.RegexObject) tuples. These are the tokens in the order they were inserted (insertion order matters).
Type: list of (Token subclass, re.RegexObject) tuples
-
register(self, token, regexp)
Register a Token subclass for the given regexp.
Parameters:
- token (tdparser.Token) – The Token subclass to register
- regexp (str) – The regular expression (as a string) associated with the token
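The registration step described above can be sketched as follows. This is a minimal illustration, assuming the registry compiles each expression once and appends it to a list; `SketchRegistry` is a hypothetical stand-in, not tdparser's own code. Plain strings are used as tokens to show that the registry never inspects the values it stores.

```python
import re


class SketchRegistry:
    """Hypothetical stand-in for illustration; not tdparser's own code."""

    def __init__(self):
        # (token, compiled regexp) pairs, kept in insertion order.
        self._tokens = []

    def register(self, token, regexp):
        # Assumption: the pattern string is compiled at registration time.
        self._tokens.append((token, re.compile(regexp)))

    def __len__(self):
        # Mirrors the documented behaviour: length of the _tokens list.
        return len(self._tokens)


registry = SketchRegistry()
registry.register("INTEGER", r"\d+")  # any value can stand in for a token
registry.register("PLUS", r"\+")

# Tokens come back in the order they were registered.
names = [token for token, _ in registry._tokens]
```

Keeping the pairs in a list rather than a dict preserves insertion order explicitly, which matters when two expressions match the same text.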
-
matching_tokens(self, text[, start=0])
Retrieve all tokens matching a given text. The optional start argument can be used to alter the start position for the match() call.
Parameters:
- text (str) – Text for which matching (Token, re.MatchObject) pairs should be searched
- start (int) – Optional start position within text for the regexp match() call
Returns: Yields tuples of (Token, re.MatchObject) for each token whose regexp matched the text.
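As a rough sketch of the behaviour described for matching_tokens(), the generator below tries every registered expression at the given position and yields each one that matches. The `TOKENS` list and the free function are assumptions made for the example (the real method lives on the registry); only the standard `re` module is used.

```python
import re

# Hypothetical registry contents: (token, compiled regexp) in insertion order.
TOKENS = [
    ("INTEGER", re.compile(r"\d+")),
    ("FLOAT", re.compile(r"\d+\.\d+")),
    ("PLUS", re.compile(r"\+")),
]


def matching_tokens(tokens, text, start=0):
    """Yield (token, match) for every regexp that matches text at start."""
    for token, regexp in tokens:
        # Pattern.match(string, pos) only succeeds if the match begins at pos.
        match = regexp.match(text, start)
        if match:
            yield token, match


# At position 0, both INTEGER ("1") and FLOAT ("1.5") match.
at_start = [(tok, m.group()) for tok, m in matching_tokens(TOKENS, "1.5+2")]
# At position 3, only PLUS matches.
at_three = [(tok, m.group()) for tok, m in matching_tokens(TOKENS, "1.5+2", start=3)]
```

Several tokens can match at the same position; choosing among them is left to get_token().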
-
get_token(self, text[, start=0])
Retrieve the best token class and the related match at the start of the given text.
The algorithm for choosing the “best” class is:
- Fetch all matching tokens (through matching_tokens())
- Select those with the longest match
- Return the first of those tokens
A different starting position for match() calls can be provided in the start parameter.
Parameters:
- text (str) – Text for which the (Token, re.MatchObject) pair should be returned
- start (int) – Optional start position within text for the regexp match() call
Returns: (Token, re.MatchObject) pair, the best match for the given text.
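The three selection steps above (collect matches, keep the longest, return the first among equals) can be sketched in a single pass. This is an illustrative version under stated assumptions, not tdparser's implementation: the `TOKENS` list is hypothetical, and returning `(None, None)` when nothing matches is a choice made for the sketch, since the text above does not specify that case.

```python
import re

# Hypothetical registry contents; FLOAT and NUMBER share the same expression
# on purpose, to show how ties are resolved.
TOKENS = [
    ("INTEGER", re.compile(r"\d+")),
    ("FLOAT", re.compile(r"\d+\.\d+")),
    ("NUMBER", re.compile(r"\d+\.\d+")),
]


def get_token(tokens, text, start=0):
    """Return the (token, match) pair with the longest match at start."""
    best_token = best_match = None
    for token, regexp in tokens:
        match = regexp.match(text, start)
        if match is None:
            continue
        # Strict comparison: an equally long later match never replaces an
        # earlier one, so the first registered token wins ties.
        if best_match is None or match.end() > best_match.end():
            best_token, best_match = token, match
    # Assumption: (None, None) signals that no expression matched.
    return best_token, best_match


token, match = get_token(TOKENS, "1.5+2")   # longest match is "1.5"
short_token, _ = get_token(TOKENS, "42")    # only INTEGER matches
missing = get_token(TOKENS, "+")            # nothing matches
```

Because every match starts at the same position, comparing `match.end()` is equivalent to comparing match lengths.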
-
__len__(self)
The len() of a TokenRegistry is the length of its _tokens attribute.
-