Architecture

Learn more about how it works

The primary abstraction is a connector. Connectors are components that interface with data stores; they are responsible for fetching data from and writing data to those stores. Connectors conform to a defined gRPC interface, so any valid implementation is supported. A connector can be source only, destination only, or both. Dsync is the software that coordinates data activities between connectors. For example, data movement/replication runs from a connector with source capabilities into a connector with sink capabilities, and verification runs between two connectors that both have source capabilities. Dsync ensures that the connectors specified for any operation are compatible.

To support transforms, we also provide a gRPC interface for them. Transforms are useful when data needs to be mapped from a source to a destination.

                    TRANSFORM
                        ^
                        |
                        v
SOURCE CONNECTOR <--> DSYNC <--> SINK CONNECTOR
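The compatibility rule Dsync enforces can be sketched as a simple capability check. This is an illustrative model, not the actual Dsync API; the `Connector` type and function names are assumptions.

```python
# Illustrative sketch of how Dsync might validate connector compatibility
# for an operation. The Connector type and helpers are assumptions.
from dataclasses import dataclass

@dataclass
class Connector:
    name: str
    source: bool = False  # supports source capabilities
    sink: bool = False    # supports sink (destination) capabilities

def can_replicate(src: Connector, dst: Connector) -> bool:
    # Data movement needs a source-capable connector feeding a sink-capable one.
    return src.source and dst.sink

def can_verify(a: Connector, b: Connector) -> bool:
    # Verification compares two connectors that both support source capabilities.
    return a.source and b.source

mongo = Connector("mongodb", source=True, sink=True)
dynamo = Connector("dynamodb", source=True)

print(can_replicate(dynamo, mongo))  # True: dynamo reads, mongo writes
print(can_replicate(mongo, dynamo))  # False: dynamo has no sink capability
print(can_verify(mongo, dynamo))     # True: both can act as sources
```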

Connector Interface

For the specifics of the gRPC interface, refer to the protobuf definitions.

Connectors must report basic information and the capabilities they support via the GetInfo endpoint. For the basic information, the type generally names the underlying database, such as MongoDB. The Id is used to recognize a connector as the same one across runs, for resumability purposes.

A connector must provide a non-nil definition for at least one of the Source or Sink capabilities to describe whether it supports being a source, a destination, or both.

Within each of these capabilities, a connector must specify which data types it supports. Currently, the only supported type is MONGO_BSON, a type that uses BSON encoding and, by definition, has an _id field that acts as the primary key. This may not be true for all future types.
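The shape of this information can be sketched with plain dataclasses. Field and constant names here are illustrative assumptions based on the description above; refer to the protobuf definitions for the real message shapes.

```python
# Sketch of the information a connector reports via GetInfo. Field and
# enum names are illustrative assumptions, not the actual protobuf shapes.
from dataclasses import dataclass, field
from typing import List, Optional

MONGO_BSON = "MONGO_BSON"  # currently the only supported data type

@dataclass
class SourceCapabilities:
    data_types: List[str] = field(default_factory=list)

@dataclass
class SinkCapabilities:
    data_types: List[str] = field(default_factory=list)

@dataclass
class ConnectorInfo:
    id: str            # stable identity, used for resumability
    type: str          # names the underlying database, e.g. "MongoDB"
    source: Optional[SourceCapabilities] = None
    sink: Optional[SinkCapabilities] = None

    def validate(self) -> None:
        # At least one of Source or Sink must be defined (non-nil).
        if self.source is None and self.sink is None:
            raise ValueError("connector must support source, sink, or both")

info = ConnectorInfo(
    id="mongo-cluster-a",
    type="MongoDB",
    source=SourceCapabilities(data_types=[MONGO_BSON]),
)
info.validate()  # passes: source capabilities are defined
```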

Source Connector

Source connectors support a single specified namespace by default, but can additionally be configured to declare support for multiple namespaces or for no specified namespace. These options exist to provide flexibility; for example, no specified namespace can mean "everything", but the interpretation is up to the connector.

What string constitutes a namespace is also up to the individual connector. For example, it could be a table name for DynamoDB, or a fully qualified db.collection identifier for MongoDB.

Source connectors implement the GetNamespaceMetadata, GeneratePlan, ListData, StreamUpdates, and StreamLSN endpoints. If a connector does not provide the LSN stream capability, StreamLSN should simply return success immediately.

GetNamespaceMetadata currently asks only for the size of the namespace, which may be an estimate.

GeneratePlan is an important endpoint for specifying how the source can be partitioned. For the initial sync, a source could provide ranges for query parallelism. For updates, a single partition may be returned, representing everything, by not specifying a namespace.
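One way a source might produce range partitions for initial-sync query parallelism is sketched below. The partition shape is an illustrative assumption, not the actual GeneratePlan message; it simply splits a numeric _id space into contiguous ranges.

```python
# Sketch of a source connector partitioning a namespace's primary-key
# space into ranges for parallel initial-sync queries. The partition
# shape here is an illustrative assumption, not the real plan message.

def generate_partitions(min_id: int, max_id: int, num_partitions: int):
    """Split [min_id, max_id] into contiguous, non-overlapping ranges."""
    span = max_id - min_id + 1
    step = -(-span // num_partitions)  # ceiling division
    partitions = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step - 1, max_id)
        partitions.append({"namespace": "db.coll", "min": lo, "max": hi})
        lo = hi + 1
    return partitions

# Four ranges covering 0..99 with no gaps or overlaps.
for p in generate_partitions(0, 99, 4):
    print(p["min"], p["max"])
```

For the updates side, the analogous plan could be a single partition with no namespace set, meaning "everything".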

ListData and StreamUpdates fetch the data for the initial sync and update plans, respectively. ListData must preserve the output of the last cursor until the next cursor is used, to account for failure scenarios; it is acceptable to expire the last cursor's validity after enough time has passed. On the updates side, the next cursor is used to update a checkpoint on the stream. Ideally, every update should come with one so that we can resume from that point onwards.
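The cursor contract for ListData can be sketched as follows: the last cursor's page stays reproducible until the next cursor is used, so a caller that fails mid-batch can safely re-fetch. Class and method names are illustrative assumptions.

```python
# Sketch of the ListData cursor contract. A page is retained until the
# next cursor is actually used, so retries after a failure see identical
# output. Names and the cache strategy are illustrative assumptions.

class ListDataSource:
    def __init__(self, rows, page_size=2):
        self.rows = rows
        self.page_size = page_size
        self.cache = {}  # cursor -> (page, next_cursor), kept for replay

    def list_data(self, cursor=0):
        if cursor in self.cache:
            # Replay: the caller is retrying after a failure.
            return self.cache[cursor]
        page = self.rows[cursor:cursor + self.page_size]
        next_cursor = cursor + self.page_size if page else None
        self.cache[cursor] = (page, next_cursor)
        # Using this cursor implies the previous page was consumed, so its
        # cached copy may be dropped (or expired after enough time).
        self.cache.pop(cursor - self.page_size, None)
        return page, next_cursor

src = ListDataSource(["a", "b", "c", "d"])
page1, c1 = src.list_data(0)
assert src.list_data(0) == (page1, c1)  # re-fetch after a crash is identical
page2, c2 = src.list_data(c1)           # advancing expires the old page
```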

Destination Connector

Destination connectors implement the WriteData and WriteUpdates endpoints. Both are batch endpoints that should return success only once the provided data or updates are considered persisted. Furthermore, reapplying the same update must not create duplicate data in the underlying data store. Under the hood, both could share the same implementation; the primary difference is that WriteData may provide an opportunity to optimize a pure insert/upsert command.
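The idempotency requirement can be sketched by keying every write on the record's primary key (_id for the MONGO_BSON type), so reapplying a batch is a no-op. The store and method names are illustrative assumptions, not the actual endpoint implementations.

```python
# Sketch of idempotent destination writes: re-applying the same batch
# must not create duplicates, so writes upsert on the primary key (_id
# for MONGO_BSON). Names and the tombstone handling are assumptions.

class DestinationStore:
    def __init__(self):
        self.docs = {}  # _id -> document

    def write_data(self, batch):
        # Pure insert/upsert path: WriteData can optimize for this case.
        for doc in batch:
            self.docs[doc["_id"]] = doc  # upsert keyed on _id

    def write_updates(self, batch):
        # Same idempotent semantics; deletes are modeled as tombstones here.
        for update in batch:
            if update.get("deleted"):
                self.docs.pop(update["_id"], None)
            else:
                self.docs[update["_id"]] = update

store = DestinationStore()
batch = [{"_id": 1, "v": "x"}, {"_id": 2, "v": "y"}]
store.write_data(batch)
store.write_data(batch)  # reapplying the same batch changes nothing
assert len(store.docs) == 2
```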

Transform Interface

For the specifics of the gRPC interface, refer to the protobuf definitions.

GetTransformInfo is implemented to report which data-type-to-data-type mappings are supported. At a minimum, this could be the mapping from a data type to itself.

GetTransform is the endpoint that actually performs the transform. It should apply the requested transform to the data and the updates, whichever are present, and may also update the namespace.
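A transform that applies uniformly to data and updates, and optionally rewrites the namespace, can be sketched as below. The request/response shape, the field rename, and the namespace rewrite are all illustrative assumptions, not the real GetTransform messages.

```python
# Sketch of GetTransform semantics: apply the requested mapping to
# whichever of data/updates are present, optionally rewriting the
# namespace. All field names here are illustrative assumptions.

def rename_field(doc, old, new):
    doc = dict(doc)
    if old in doc:
        doc[new] = doc.pop(old)
    return doc

def get_transform(request):
    """Apply the transform to data and/or updates, whichever are present."""
    response = {
        # Hypothetical namespace mapping from a source db to a destination db.
        "namespace": request.get("namespace", "").replace("src_db", "dst_db"),
    }
    if "data" in request:
        response["data"] = [rename_field(d, "fullName", "name") for d in request["data"]]
    if "updates" in request:
        response["updates"] = [rename_field(u, "fullName", "name") for u in request["updates"]]
    return response

out = get_transform({
    "namespace": "src_db.users",
    "data": [{"_id": 1, "fullName": "Ada"}],
})
print(out["namespace"])        # dst_db.users
print(out["data"][0]["name"])  # Ada
```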
