

text.SplitMergeFromLogitsTokenizer

Tokenizes a tensor of UTF-8 strings into words according to logits.

Inherits From: TokenizerWithOffsets, Tokenizer, SplitterWithOffsets, Splitter

text.SplitMergeFromLogitsTokenizer(
    force_split_at_break_character=True
)

Args:
  force_split_at_break_character: a bool indicating whether to force-start a new
    word after an ICU-defined whitespace character. Regardless of this
    parameter, a whitespace character is never included in a token, and the
    split/merge action for the whitespace character itself is always ignored;
    the parameter only controls what happens after a whitespace. If
    force_split_at_break_character is true, a new word starts at the first
    non-space character, regardless of the split/merge action predicted for that
    character. Otherwise, a new word is started or the current word is continued
    depending on the action for the first non-whitespace character (a concrete
    illustration is sketched below, after the method reference).

Methods

split
  Alias for Tokenizer.tokenize.

split_with_offsets
  Alias for TokenizerWithOffsets.tokenize_with_offsets.

tokenize(strings, logits)
  Tokenizes a tensor of UTF-8 strings according to logits. The logits refer to
  the split/merge action we should take for each character; for each character,
  the action with the larger logit is taken. For more info, see the doc for the
  logits argument below.

  Example:

    s = [5.0, -5.0]   # sample pair of logits indicating a split action
    m = [-5.0, 5.0]   # sample pair of logits indicating a merge action
    strings = ['IloveFlume!', 'and tensorflow']
    logits = [
        [
            # 'I'
            s,    # I: split
            # 'love'
            s,    # l: split
            m,    # o: merge
            m,    # v: merge
            m,    # e: merge
            # 'Flume'
            s,    # F: split
            m,    # l: merge
            m,    # u: merge
            m,    # m: merge
            m,    # e: merge
            # '!'
            s,    # !: split
            # padding (ignored)
            m, m, m,
        ],
        [
            # 'and'
            s,    # a: split
            m,    # n: merge
            m,    # d: merge
            # ' '
            m,    # <space>: merge
            # 'tensorflow'
            s,    # t: split
            m,    # e: merge
            m,    # n: merge
            m,    # s: merge
            m,    # o: merge
            m,    # r: merge
            m,    # f: merge
            m,    # l: merge
            m,    # o: merge
            m,    # w: merge
        ]]
    tokenizer = SplitMergeFromLogitsTokenizer()
    tokenizer.tokenize(strings, logits)
    # -> [[b'I', b'love', b'Flume', b'!'], [b'and', b'tensorflow']]

  Args:
    strings: a 1D Tensor of UTF-8 strings.
    logits: a 3D Tensor; logits[i, j, 0] is the logit for the split action for
      the j-th character of strings[i], and logits[i, j, 1] is the logit for the
      merge action for that same character. Split starts a new word at that
      character, and merge adds the character to the previous word. The shape of
      this tensor should be (n, m, 2), where n is the number of strings and m is
      greater than or equal to the number of characters in each string. As the
      elements of the strings tensor may have different lengths (in UTF-8
      characters), padding may be required to get a dense tensor; for each row,
      the extra (padding) pairs of logits are ignored.

  Returns:
    A RaggedTensor of strings where tokens[i, k] is the string content of the
    k-th token of strings[i].
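The following is a minimal sketch of the force_split_at_break_character behavior
described in the Args section above. The input string 'go now', the helper names
s and m, and the concrete logit values are illustrative assumptions (any pair
whose first entry is larger acts as a split, and vice versa for a merge); the
outputs in the comments follow from the behavior described above.

    import tensorflow_text as text

    s = [5.0, -5.0]   # logit pair favoring the split action (assumed values)
    m = [-5.0, 5.0]   # logit pair favoring the merge action (assumed values)

    strings = ['go now']            # characters: 'g', 'o', ' ', 'n', 'o', 'w'
    logits = [[s, m, m, m, m, m]]   # one (split, merge) pair per character

    # Default force_split_at_break_character=True: a new word starts at the
    # first non-space character after the whitespace even though its logits
    # favor "merge", so we expect [[b'go', b'now']].
    tokenizer = text.SplitMergeFromLogitsTokenizer(force_split_at_break_character=True)
    print(tokenizer.tokenize(strings, logits))

    # force_split_at_break_character=False: the action of the first
    # non-whitespace character decides; "merge" continues the previous word
    # (the whitespace itself is never part of a token), so we expect [[b'gonow']].
    tokenizer = text.SplitMergeFromLogitsTokenizer(force_split_at_break_character=False)
    print(tokenizer.tokenize(strings, logits))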

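As a follow-up to the split_with_offsets alias noted above, here is a hedged
sketch of the offsets variant. The input, the logits, and the exact values in
the comments are assumptions for illustration; tokenize_with_offsets is expected
to return the tokens together with the start and end byte offsets of each token
within the input string.

    import tensorflow_text as text

    s = [5.0, -5.0]   # split-flavored logit pair (assumed values)
    m = [-5.0, 5.0]   # merge-flavored logit pair (assumed values)

    strings = ['abcd']
    logits = [[s, m, s, m]]   # split at 'a' and 'c': tokens 'ab' and 'cd'

    tokenizer = text.SplitMergeFromLogitsTokenizer()
    tokens, starts, ends = tokenizer.tokenize_with_offsets(strings, logits)
    print(tokens)   # expected: [[b'ab', b'cd']]
    print(starts)   # expected: [[0, 2]]  (byte offset where each token starts)
    print(ends)     # expected: [[2, 4]]  (byte offset just past each token)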