# Data Transformations

## Overview

Enterprise Dsync features a prebuilt YAML-based transformer that lets you add, remove, and modify data elements using mappings and Common Expression Language ([CEL](https://github.com/google/cel-spec)) expressions.

Transformations are applied on the fly during both initial sync and CDC. The transformer itself can be run as a standalone process that Dsynct connects to over gRPC. Detailed instructions are available in our [public repository](https://github.com/adiom-data/public/tree/main/dsync-transform).

For convenience, Dsynct workers can run the transformer as an embedded process by providing the `--transform` option along with the path to the config file as the third argument:

```bash
dsynct worker <OTHER_OPTIONS> --transform $SOURCE $DESTINATION dsync-transform://transform.yaml
```

When using Docker to run Dsynct, the transformer config file needs to be mounted to the container. For example:

```bash
docker run \
-v "./transform.yaml:/transform.yaml" \
markadiom/dsynct worker <OTHER_OPTIONS> \
--transform \
$SOURCE $DESTINATION dsync-transform://transform.yaml
```

## Writing Config Files

Dsync Transform runs off a YAML configuration file where the mappings are specified. Each source document is converted into an internal format, subjected to the mappings, and then converted back to the output document type.

Duplicate mappings are allowed and will fan out. Use the `filter` feature to avoid fan-out if necessary. Note that mapping IDs also requires mapping the ID keys. ID mappings only have access to the original ID, so that they can also be applied to deletes, which do not have access to the full data. If you need to convert ID types between systems (e.g., string to BSON ObjectID), see the [Transform Data Types](https://docs.adiom.io/enterprise/running-dsynct/data-types) page for detailed guidance and examples.
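
As a rough sketch of how `filter` can keep duplicate mappings from fanning out, the config below routes each source document to exactly one of two destinations. The namespace names and the `eu-` prefix are hypothetical, and the expressions assume that `id` is a plain string and that "duplicate mappings" means several mappings for the same source namespace; adjust them to your setup.

```yaml
# Hypothetical namespaces; assumes string IDs.
mappings:
  # Documents whose ID starts with "eu-" go to the EU namespace.
  - namespace: app.users
    mapnamespace: app_eu.users
    filter: 'id.startsWith("eu-")'
  # Every other document goes to the default namespace.
  - namespace: app.users
    mapnamespace: app_default.users
    filter: '!id.startsWith("eu-")'
```

Because the two filters are complementary, each document is retained by only one mapping. The filters use only `id`, which is what `filter` has access to besides `env`.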

### Example Config File

```yaml
mappings:
  - namespace: srcnamespace
    mapnamespace: dstnamespace
    map:
      should_be_int32: int32
    cel:
      name: self + "!"
      newfield: '"abcd"'
      should_be_int32: self + 5
    add: ["newfield"]
    delete: ["existingfield"]
```

Each mapping must specify the source `namespace`. If the destination namespace is different, use `mapnamespace`. The key fields work as follows:

* **`cel`** -- Specify a CEL expression for mapping each field. The variable `self` refers to the current value of that field. Note that CEL only supports a limited set of types (e.g., 64-bit integers but not 32-bit integers).
* **`map`** -- Apply a special type mapping *after* the `cel` expression. Use this when you need a type that CEL cannot represent directly, such as `int32` for a 32-bit integer.
* **`add`** -- List fields that should be created in the output even if they were not present in the source document.
* **`delete`** -- List fields that should be removed from the output.

### Configuration Reference

#### Top-Level Options

| Option            | Type              | Default | Description                                                                                                                                                 |
| ----------------- | ----------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `wild`            | string            | `*`     | When specifying a path, this matches anything.                                                                                                              |
| `delimiter`       | string            | `.`     | When specifying a path, this is the delimiter.                                                                                                              |
| `env`             | map\[string, any] |         | Variables available under the `env` variable in CEL expression mappings.                                                                                    |
| `unwrapbson`      | boolean           | `false` | If true, will automatically convert various BSON types to a more native type (e.g., ObjectIDs to strings).                                                  |
| `filtererrors`    | boolean           | `false` | If true, will not fail on errors during conversion and instead skip and log a warning. Errors encountered when retrieving the original ID still cause a failure. |
| `defaultmapping`  | string            |         | Name (namespace) of the mapping from `mappings` to use as a fallback.                                                                                       |
| `namespacemapper` | CEL string        |         | Default expression to automatically map all namespaces. Has `env` and `self` (namespace) available.                                                         |
| `idlist`          | boolean           | `false` | If true, the `id` variable will always be a list. When false, the `id` variable contains the first id value if it is the only id value.                     |
| `mappings`        | list\[mapping]    |         | List of mapping definitions (see below).                                                                                                                    |
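
A minimal sketch of how the top-level options might be combined. The values under `env`, the namespace, and the field name are made up for illustration, and the snippet assumes that `env` entries can be read in CEL with the usual `env.key` map access.

```yaml
# Illustrative values only.
unwrapbson: true      # convert BSON types such as ObjectIDs to native types
filtererrors: true    # skip and log a warning on conversion errors
env:
  region: us-east-1
  prefix: migrated_
# Default namespace mapping; `self` is the source namespace.
namespacemapper: env.prefix + self
mappings:
  - namespace: app.users
    cel:
      region: env.region    # constant read from `env`
    add: ["region"]
```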

#### Mapping Options

Each entry in `mappings` supports the following fields:

| Option         | Type                     | Description                                                                                                                                |
| -------------- | ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `namespace`    | string                   | Namespace this mapping applies to.                                                                                                         |
| `mapnamespace` | string                   | New namespace name for the output.                                                                                                         |
| `mapid`        | CEL string               | Expression to map the `id` for updates. Only has the original `id` field and `env` available.                                              |
| `filter`       | CEL string               | Expression that returns a boolean; if true, the document will be retained. Only has the original `id` field and `env` available.           |
| `idkeys`       | list\[string]            | Describes the original names of each part of the id.                                                                                       |
| `finalidkeys`  | list\[string]            | Describes the names of each part of the id after the mapping.                                                                              |
| `add`          | list\[string]            | Paths that will be added in the mapping if the parent exists.                                                                              |
| `delete`       | list\[string]            | Paths that will be deleted if they exist.                                                                                                  |
| `cel`          | map\[string, CEL string] | For each defined path, specify a CEL expression to perform a mapping. Has `env`, `id`, `doc`, `parent`, and `self` available as variables. |
| `map`          | map\[string, string]     | For each defined path, specify a special mapping function to apply. Applies after `cel`.                                                   |
| `self`         | CEL string               | An expression that serves as a mapping for the whole document.                                                                             |
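
To illustrate the variables available inside `cel`, here is a small sketch that derives one field from other fields of the same document and rewrites another in place. The namespace and field names are hypothetical, and it assumes that `doc` exposes the document's top-level fields as a map.

```yaml
mappings:
  - namespace: app.users
    cel:
      # `doc` refers to the whole document (assumed to be a map of fields).
      full_name: 'doc.first_name + " " + doc.last_name'
      # `self` is the current value of the field being mapped.
      status: 'self == "" ? "unknown" : self'
    # full_name does not exist in the source, so it must be added.
    add: ["full_name"]
```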

### Available Mappings

The following mappings can be used in the `map` configuration or inside a `cel` expression. In certain cases, it is advisable to use `map` to force a type that CEL cannot represent directly.

#### Type Conversions

| Mapping       | Description                                      |
| ------------- | ------------------------------------------------ |
| `int32`       | Converts to an int32. Use in `map` only.         |
| `float`       | Converts to a float32. Use in `map` only.        |
| `json_number` | Converts to a JSON Number.                       |
| `json_decode` | Decodes a JSON string or bytes into an object.   |
| `json_encode` | Encodes an object as a JSON string.              |

#### BSON Conversions

| Mapping                  | Description                                                     |
| ------------------------ | --------------------------------------------------------------- |
| `bson_decimal128`        | Converts a string to a BSON Decimal128.                         |
| `bson_decimal128_string` | Converts a BSON Decimal128 to a string.                         |
| `bson_object_id`         | Converts a string to a BSON ObjectID. Use in `map` only.        |
| `bson_uuid`              | Converts a UUID string to a BSON UUID.                          |
| `bson_object_id_string`  | Converts a BSON ObjectID to a string.                           |
| `bson_uuid_string`       | Converts a BSON UUID to a string.                               |
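
A short sketch of the BSON conversions applied to ordinary (non-ID) fields. The namespace and field names are hypothetical; converting IDs has extra requirements that this snippet does not cover.

```yaml
mappings:
  - namespace: app.orders
    cel:
      # BSON Decimal128 -> string
      total: bson_decimal128_string(self)
      # BSON UUID -> string
      session_id: bson_uuid_string(self)
    map:
      # string -> BSON ObjectID; this mapping is documented for use in `map` only
      customer_ref: bson_object_id
```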

#### Hash Functions

| Mapping  | Description                                                     |
| -------- | --------------------------------------------------------------- |
| `md5`    | Applies the MD5 hash to a string or bytes, returning bytes.     |
| `sha1`   | Applies the SHA-1 hash to a string or bytes, returning bytes.   |
| `sha256` | Applies the SHA-256 hash to a string or bytes, returning bytes. |
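
For instance, a sensitive field can be replaced by its hash, either as a function inside `cel` or as a plain `map` entry. This is a sketch with hypothetical namespace and field names; note that the result is bytes, not a hex string.

```yaml
mappings:
  - namespace: app.users
    cel:
      # Replace the e-mail address with its SHA-256 digest (bytes).
      email: sha256(self)
    map:
      # Same idea expressed as a mapping function.
      national_id: sha256
```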

#### Byte Mappings

| Mapping         | Description                                                                     |
| --------------- | ------------------------------------------------------------------------------- |
| `be_to_int32`   | Converts bytes to an int assuming big-endian format. Use in `map` to get int32. |
| `be_to_int64`   | Converts bytes to an int64 assuming big-endian format.                          |
| `to_be_int32`   | Converts data into bytes representing an int32 in big-endian format.            |
| `to_be_int64`   | Converts data into bytes representing an int64 in big-endian format.            |
| `reverse_bytes` | Reverses a byte array.                                                          |
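
The byte mappings might be used as in the following sketch, which assumes hypothetical fields that hold big-endian encoded integers (or integers to be encoded).

```yaml
mappings:
  - namespace: app.events
    cel:
      # big-endian bytes -> int64
      sequence: be_to_int64(self)
      # integer -> bytes representing an int32 in big-endian format
      shard_key: to_be_int32(self)
    map:
      # big-endian bytes -> int32; `map` is used to get a true int32
      counter: be_to_int32
```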

### Available Functions

All the available mappings above are usable as unary functions in CEL expressions. The following additional functions are also available:

| Function                     | Description                                                                           |
| ---------------------------- | ------------------------------------------------------------------------------------- |
| `now_millis()`               | Current time in milliseconds.                                                         |
| `now_nanos()`                | Current time in nanoseconds (resolution may be limited by your machine).              |
| `uuid_v4_bytes()`            | Generate a random UUID as bytes.                                                      |
| `uuid_v4_string()`           | Generate a random UUID as a string.                                                   |
| `uuid_v3_bytes(uuid, name)`  | Generate a deterministic UUID based on a UUID namespace and name as bytes (MD5).      |
| `uuid_v3_string(uuid, name)` | Generate a deterministic UUID based on a UUID namespace and name as a string (MD5).   |
| `uuid_v5_bytes(uuid, name)`  | Generate a deterministic UUID based on a UUID namespace and name as bytes (SHA-1).    |
| `uuid_v5_string(uuid, name)` | Generate a deterministic UUID based on a UUID namespace and name as a string (SHA-1). |
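
As an illustration, the zero-argument functions can populate fields that do not exist in the source. The namespace and field names below are hypothetical.

```yaml
mappings:
  - namespace: app.users
    cel:
      synced_at: now_millis()      # time of the transformation, in milliseconds
      trace_id: uuid_v4_string()   # random UUID as a string
    # Both fields are new, so they are listed under `add`.
    add: ["synced_at", "trace_id"]
```

Because `uuid_v4_string()` is random, the value differs each time a document is transformed. The `uuid_v3_*` and `uuid_v5_*` variants are deterministic for a given namespace and name.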

#### Fake Data Generation

Generate deterministic fake data seeded by the input value. The same seed always produces the same output, making these suitable for consistent data anonymization. Numeric types are used directly as seeds; other types (strings, etc.) are hashed to derive a seed.

| Function                | Description                     |
| ----------------------- | ------------------------------- |
| `fake_name(seed)`       | Generate a fake full name.      |
| `fake_first_name(seed)` | Generate a fake first name.     |
| `fake_last_name(seed)`  | Generate a fake last name.      |
| `fake_email(seed)`      | Generate a fake email address.  |
| `fake_phone(seed)`      | Generate a fake phone number.   |
| `fake_address(seed)`    | Generate a fake street address. |
| `fake_city(seed)`       | Generate a fake city name.      |
| `fake_state(seed)`      | Generate a fake state name.     |
| `fake_zip(seed)`        | Generate a fake postal code.    |
| `fake_country(seed)`    | Generate a fake country name.   |
| `fake_company(seed)`    | Generate a fake company name.   |
| `fake_username(seed)`   | Generate a fake username.       |
| `fake_ipv4(seed)`       | Generate a fake IPv4 address.   |
| `fake_sentence(seed)`   | Generate a fake sentence.       |
| `fake_word(seed)`       | Generate a fake word.           |
| `fake_url(seed)`        | Generate a fake URL.            |
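
A minimal anonymization sketch, assuming a hypothetical `app.users` namespace with `name`, `email`, and `phone` fields. Each field is seeded with its own original value.

```yaml
mappings:
  - namespace: app.users
    cel:
      name: fake_name(self)
      email: fake_email(self)
      phone: fake_phone(self)
```

Since the output depends only on the seed, identical source values map to identical fake values, which keeps the anonymized data consistent across documents and across repeated syncs.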

For the latest details, consult the README in our [public repository](https://github.com/adiom-data/public/tree/main/dsync-transform).

## Transform Studio

To make it easier to test transformations, you can run Dsynct in Transform Studio mode.

```bash
docker run -e DSYNCT_MODE=simple -p 8080:8080 markadiom/dsynct --host-port 0.0.0.0:8080 studio
```

Then open your browser at the specified address (with the port mapping above, `http://localhost:8080`). The interface lets you test a transform config against JSON or BSON documents. For BSON documents, use extended JSON encoding. The update keys should be specified in extended JSON encoding as well.

Example extended JSON:

```
{"$oid": "<24-character string>"} # Mongo Object ID
{"$date": "<ISO-8601 String>"} # Date
```
