Pattern Syntax

Cariochi Patterns syntax is built from three elements: classes, fields, and patterns. Together they identify and extract specific information from unstructured text. Each element is described below.

Patterns

Patterns in Cariochi Patterns are based on regular expressions. Each pattern matches a specific entity within the text. A pattern can also reference classes and declare fields, which give the extracted information its structure.

Fields

Fields are specific attributes or values within a class that help define the structure of the extracted information. In Cariochi Patterns, fields are indicated by the "#" symbol followed by the field identifier. They provide a way to extract and organize different components of an entity.

For instance, consider the @Date class with the following pattern:

Date:
  - "{#year: \d{4}}-{#month: \d{2}}-{#day: \d{2}}"

In this example, the pattern specifies the structure of a date with fields for the year, month, and day components. When the pattern matches the text, the year, month, and day values are extracted as separate fields within the @Date entity.
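
As a rough analogy only, fields behave much like named groups in an ordinary regular expression. The sketch below uses Python's standard re module, not the Cariochi Patterns engine; the function name extract_date_fields is hypothetical and exists only for illustration.

```python
import re

# Approximate regex analogue of "{#year: \d{4}}-{#month: \d{2}}-{#day: \d{2}}".
# Named groups stand in for the #year, #month, and #day fields.
DATE_PATTERN = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")


def extract_date_fields(text):
    """Return the field values of the first date found in text, or None."""
    match = DATE_PATTERN.search(text)
    return match.groupdict() if match else None


fields = extract_date_fields("The deadline for submission is 2023-07-15.")
print(fields)  # {'year': '2023', 'month': '07', 'day': '15'}
```

Each field value comes back as plain text, which is the gap that inner classes such as @Int are meant to fill.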

Inner Classes

Cariochi Patterns also introduces the concept of inner classes. Inner classes are classes defined within a pattern and are represented using the "@" symbol followed by the class name. They allow you to specify that text matched by the pattern should be interpreted as an instance of the inner class.

For instance, an inner class can be applied to each component of a date. Here the inner class @Int indicates that each extracted value should be treated as an integer:

Date:
  - "{@Int: \d{4}}-{@Int: \d{2}}-{@Int: \d{2}}"

Another approach involves combining both fields and inner classes within a pattern:

Date:
  - "{#year: {@Int: \d{4}}}-{#month: {@Int: \d{2}}}-{#day: {@Int: \d{2}}}"

In this case, the inner class @Int is used with the #year, #month, and #day fields to define that the matched text should be treated as integer values.

Example

Patterns
classes:

  Date:
    - "{#year: {@Int: \d{4}}}-{#month: {@Int: \d{2}}}-{#day: {@Int: \d{2}}}"
    
Input
The deadline for submission is 2023-07-15.
Structured Text Output
The deadline for submission is 
{@Date: {#year: {@Int: 2023}}-{#month: {@Int: 07}}-{#day: {@Int: 15}}}.
JSON Output
[
  {
    "text": "The deadline for submission is 2023-07-15.",
    "entities": [
      {
        "class": "Date",
        "text": "2023-07-15",
        "entities": [
          {
            "field": "year",
            "class": "Int",
            "text": "2023"
          },
          {
            "field": "month",
            "class": "Int",
            "text": "07"
          },
          {
            "field": "day",
            "class": "Int",
            "text": "15"
          }
        ]
      }
    ]
  }
]
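
The JSON output above can be consumed with any JSON library. The following is a minimal sketch in Python, assuming only the structure shown in this example ("entities", "field", "class", "text"). The helper fields_of is hypothetical, and converting Int-class text with int() is this sketch's own choice, not documented library behavior.

```python
import json

# JSON output copied from the Date example above.
OUTPUT = """
[
  {
    "text": "The deadline for submission is 2023-07-15.",
    "entities": [
      {
        "class": "Date",
        "text": "2023-07-15",
        "entities": [
          {"field": "year", "class": "Int", "text": "2023"},
          {"field": "month", "class": "Int", "text": "07"},
          {"field": "day", "class": "Int", "text": "15"}
        ]
      }
    ]
  }
]
"""


def fields_of(entity):
    """Map each field name of an entity to its value.

    Text of class "Int" is converted with int(); other text is kept as-is.
    """
    result = {}
    for child in entity.get("entities", []):
        if "field" in child:
            text = child["text"]
            result[child["field"]] = int(text) if child.get("class") == "Int" else text
    return result


sentences = json.loads(OUTPUT)
date_entity = sentences[0]["entities"][0]
print(fields_of(date_entity))  # {'year': 2023, 'month': 7, 'day': 15}
```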

Classes

Classes are represented by names preceded by the "@" symbol. They act as categories for different entities you want to recognize in the text. These classes can be used as building blocks within patterns to create more sophisticated and accurate data extraction rules.

Example:

Patterns
classes:

  Currency:
    - $
    - €
    - £

  Number:
    - "\d+(\.\d+)?"

  Money:
    - "{@Currency}{@Number}"     # pattern with classes
Input
The price of the product is $199.99. 
It is available at €149.99 in Europe.
Structured Text Output
The price of the product is {@Money: {@Currency: $}{@Number: 199.99}}.
It is available at {@Money: {@Currency: €}{@Number: 149.99}} in Europe.
JSON Output
[
  {
    "text": "The price of the product is $199.99. ",
    "entities": [
      {
        "class": "Money",
        "text": "$199.99",
        "entities": [
          {
            "class": "Currency",
            "text": "$"
          },
          {
            "class": "Number",
            "text": "199.99"
          }
        ]
      }
    ]
  },
  {
    "text": "It is available at €149.99 in Europe.",
    "entities": [
      {
        "class": "Money",
        "text": "€149.99",
        "entities": [
          {
            "class": "Currency",
            "text": "€"
          },
          {
            "class": "Number",
            "text": "149.99"
          }
        ]
      }
    ]
  }
]
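
To illustrate how classes compose, the sketch below rebuilds the Money example with Python's standard re module. It is an analogy, not the Cariochi Patterns implementation: named groups play the role of the @Currency and @Number classes, and find_money is a hypothetical helper. The dollar sign is escaped because it is an anchor in Python regular expressions; whether Cariochi Patterns requires the same escaping is not stated here.

```python
import re

# Each "class" becomes a named group; Money is the two groups in sequence,
# mirroring the pattern "{@Currency}{@Number}".
CURRENCY = r"(?P<Currency>\$|€|£)"
NUMBER = r"(?P<Number>\d+(?:\.\d+)?)"
MONEY = re.compile(CURRENCY + NUMBER)


def find_money(text):
    """Return Money entities shaped like the JSON output above."""
    return [
        {
            "class": "Money",
            "text": m.group(0),
            "entities": [
                {"class": "Currency", "text": m.group("Currency")},
                {"class": "Number", "text": m.group("Number")},
            ],
        }
        for m in MONEY.finditer(text)
    ]


for entity in find_money("The price is $199.99. It is €149.99 in Europe."):
    print(entity["text"])
```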

Abstract Classes

Cariochi Patterns supports a hierarchical structure for classes, which lets you organize and group related entities. An abstract class is a parent class that has no patterns of its own; its child classes each define their own patterns. As the example shows, a pattern that references the abstract class matches any of its children, and the output reports the child class that matched.

Example:

Patterns
classes:

  Number:               # abstract class {@Number}
    Real:               # class {@Number.Real}
      - "\d+\.\d+"
    Int:                # class {@Number.Int}
      - "\d+"

  Percent:
    - "{@Number}%"      # pattern with an abstract class
    
Input
The company's profit margin significantly increased from 0.1% to 3%.
Structured Text Output
The company's profit margin significantly increased from 
{@Percent: {@Number.Real: 0.1}%} to {@Percent: {@Number.Int: 3}%}.
JSON Output
[
  {
    "text": "The company's profit margin significantly increased from 0.1% to 3%.",
    "entities": [
      {
        "class": "Percent",
        "text": "0.1%",
        "entities": [
          {
            "class": "Number.Real",
            "text": "0.1"
          }
        ]
      },
      {
        "class": "Percent",
        "text": "3%",
        "entities": [
          {
            "class": "Number.Int",
            "text": "3"
          }
        ]
      }
    ]
  }
]
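
One way to picture an abstract class is as a set of child patterns that are tried in turn, with the matching child's name reported. The sketch below uses Python's standard re module and is an analogy only. Picking the longest child match is an assumption of this sketch, not documented Cariochi Patterns behavior, and match_number and find_percents are hypothetical names.

```python
import re

# Children of the abstract class Number, as in the config above.
NUMBER_CHILDREN = [
    ("Number.Real", re.compile(r"\d+\.\d+")),
    ("Number.Int", re.compile(r"\d+")),
]


def match_number(text, pos):
    """Return (class_name, matched_text) for the longest child match at pos."""
    best = None
    for name, pattern in NUMBER_CHILDREN:
        m = pattern.match(text, pos)
        if m and (best is None or len(m.group(0)) > len(best[1])):
            best = (name, m.group(0))
    return best


def find_percents(text):
    """Find "{@Number}%" occurrences and report which child class matched."""
    results = []
    pos = 0
    while pos < len(text):
        number = match_number(text, pos)
        if number is None:
            pos += 1
            continue
        end = pos + len(number[1])
        if text.startswith("%", end):
            results.append({
                "class": "Percent",
                "text": text[pos:end + 1],
                "entities": [{"class": number[0], "text": number[1]}],
            })
            end += 1
        pos = end  # continue scanning after the number (and "%" if present)
    return results


for entity in find_percents("Margin increased from 0.1% to 3%."):
    print(entity["text"], entity["entities"][0]["class"])
```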

Private Classes

Private classes let you define intermediate classes that take part in matching but are left out of the final output. Their names begin with the "_" symbol.

Example:

Patterns
classes:

  Number:
    - "\d+(\.\d+)?"
  
  Percent:
    - "{@Number}%"  
    
  FinIndicator:
    - "EPS"
    - "RPS"
    - "DPS"

  _action:                                    # private class
    - "grew by"
    - "growing by"
    - "rose to"
    - "surged to"
    - "remained stable at"

  FinData:
    - "{@FinIndicator} {@_action} {@Percent}"  # pattern with a private class
    
Input
EPS grew by 15% to $2.50 per share, reflecting strong financial performance.
Structured Text Output
{@FinData: {@FinIndicator: EPS} grew by {@Percent: {@Number: 15}%}} to 
${@Number: 2.50} per share, reflecting strong financial performance.
JSON Output
[
  {
    "text": "EPS grew by 15% to $2.50 per share, reflecting strong financial performance.",
    "entities": [
      {
        "class": "FinData",
        "text": "EPS grew by 15%",
        "entities": [
          {
            "class": "FinIndicator",
            "text": "EPS"
          },
          {
            "class": "Percent",
            "text": "15%",
            "entities": [
              {
                "class": "Number",
                "text": "15"
              }
            ]
          }
        ]
      },
      {
        "class": "Number",
        "text": "2.50"
      }
    ]
  }
]
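
In regular-expression terms, a private class resembles a non-capturing group: the text must match, but it is not reported as a separate result. The sketch below shows that analogy with Python's standard re module. It is not the Cariochi Patterns engine, and find_fin_data is a hypothetical helper.

```python
import re

FIN_DATA = re.compile(
    r"(?P<FinIndicator>EPS|RPS|DPS) "
    # Analogue of the private class _action: matched but not captured.
    r"(?:grew by|growing by|rose to|surged to|remained stable at) "
    r"(?P<Percent>\d+(?:\.\d+)?%)"
)


def find_fin_data(text):
    """Return the first FinData match with its captured parts, or None."""
    m = FIN_DATA.search(text)
    if m is None:
        return None
    return {"text": m.group(0), **m.groupdict()}


result = find_fin_data(
    "EPS grew by 15% to $2.50 per share, reflecting strong financial performance."
)
print(result)  # {'text': 'EPS grew by 15%', 'FinIndicator': 'EPS', 'Percent': '15%'}
```

The phrase "grew by" is part of the matched text but has no entry of its own, which mirrors how the private class is absent from the JSON output above.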

Sample-based Patterns

Sample-based patterns let you define a pattern from a concrete example. Such a pattern starts with the "~" symbol and acts as a template that recognizes similar occurrences in the text. In the example below, the sample values 2000, 05, and 10 are recognized as @Number.Int, so the pattern also matches other dates of the same shape.

Example

Patterns
classes:

  Number:
    Real:
      - "\d+\.\d+"
    Int:
      - "\d+"

  Date:
    - "~{#year: 2000}-{#month: 05}-{#day: 10}"
Input
The deadline is 2023-07-15.
Structured Text Output
The deadline is {@Date: {#year: {@Number.Int: 2023}}-{#month: {@Number.Int: 07}}-{#day: {@Number.Int: 15}}}.
JSON Output
[
  {
    "text": "The deadline is 2023-07-15.",
    "entities": [
      {
        "class": "Date",
        "text": "2023-07-15",
        "entities": [
          {
            "field": "year",
            "class": "Number.Int",
            "text": "2023"
          },
          {
            "field": "month",
            "class": "Number.Int",
            "text": "07"
          },
          {
            "field": "day",
            "class": "Number.Int",
            "text": "15"
          }
        ]
      }
    ]
  }
]
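
How the library generalizes a sample internally is not described here. As one possible mental model, the sketch below replaces every part of a sample that matches a known class with that class's regular expression and keeps the remaining characters as literals. The function generalize and the first-match-wins ordering are assumptions of this sketch, and field markers such as {#year: ...} are left out for brevity.

```python
import re

# Known classes, tried in this order (an assumption of this sketch).
CLASSES = [
    ("Number.Real", r"\d+\.\d+"),
    ("Number.Int", r"\d+"),
]


def generalize(sample):
    """Turn a concrete sample into a compiled regex that matches similar text."""
    parts = []
    pos = 0
    while pos < len(sample):
        for _name, pattern in CLASSES:
            m = re.compile(pattern).match(sample, pos)
            if m:
                # Replace the concrete value with the class pattern.
                parts.append("(" + pattern + ")")
                pos = m.end()
                break
        else:
            # No class matched here: keep the character as a literal.
            parts.append(re.escape(sample[pos]))
            pos += 1
    return re.compile("".join(parts))


date_regex = generalize("2000-05-10")
print(date_regex.fullmatch("2023-07-15").groups())  # ('2023', '07', '15')
```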

By combining classes, fields, and patterns, Cariochi Patterns extracts structured data from unstructured text, which makes that text easier to analyze.

More examples are available at demo.cariochi.com.