From Cosmos DB NoSQL to MongoDB API
Near-zero downtime migration from Cosmos DB NoSQL to MongoDB API
Prerequisites
Obtain Cosmos DB credentials - URI and Primary Key ("Settings" -> "Keys" in the Azure Portal)
Obtain MongoDB connection string - for the destination cluster
Enable "All Versions and Deletes" for your Cosmos DB container ("Settings" -> "Features" in the Azure Portal)
Data types and ID considerations
Cosmos DB NoSQL uses JSON while MongoDB uses BSON, so a data transformation is required. As part of that transformation, you will commonly want to convert some JSON types into BSON types, such as timestamp strings into Dates. You may also want to carry over some internal Cosmos DB NoSQL fields, such as _ts, which is used for TTL.
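To make the type conversion concrete, here is a minimal Python sketch of the kind of mapping a custom transformer might perform. It is not dsync code: the function name `to_bson_ready` and the `createdAt` field are hypothetical, and it assumes timestamps are stored as ISO 8601 strings. It relies on two facts: `_ts` holds seconds since the Unix epoch, and MongoDB drivers such as PyMongo encode Python `datetime` values as BSON Dates.

```python
from datetime import datetime, timezone


def to_bson_ready(doc, date_fields=("createdAt",)):
    """Return a copy of a Cosmos DB JSON document with date-like values as datetimes.

    `date_fields` names the (hypothetical) string fields holding ISO 8601 timestamps.
    """
    out = dict(doc)  # shallow copy so the source document is left untouched
    for field in date_fields:
        value = out.get(field)
        if isinstance(value, str):
            # Older Python versions do not accept a trailing "Z" in fromisoformat()
            out[field] = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if "_ts" in out:
        # _ts is an integer number of seconds since the Unix epoch
        out["_ts"] = datetime.fromtimestamp(out["_ts"], tz=timezone.utc)
    return out
```

Converting `_ts` to a Date matters if you plan to use it for expiry on the destination, because MongoDB TTL indexes only act on Date fields.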
For the Open Source dsync, you can build your own custom transformer following this example in GitHub. You can implement it in the language of your choice (e.g. Java or Python) as long as it supports gRPC and implements the required Transform interface.
The Enterprise version of Dsync has a CEL-based transformer that you can try here. In the instructions below we will be using its format as an example given how intuitive it is.
The Cosmos DB NoSQL ID format is composed of the shard key followed by the id field. MongoDB uses a single _id field. A transform config must map between these ID formats.
See Transform Data Types for full details on JSON to BSON mappings.
Simple case: shard key is /id
When the shard key is /id, the Cosmos DB ID contains only the id field. The transform maps id to _id:
# transform.yaml
defaultmapping: default
mappings:
  - namespace: default
    delete: ["id"]
    add: ["_id"]
    mapid: id
    cel:
      _id: id
With a shard key prefix
When the shard key is a separate field (e.g. /region), the Cosmos DB ID is multi-part (e.g. ["us-east", "123"]). You need to use idkeys to declare the source ID fields and collapse them into a single _id:
Adjust idkeys and the cel expressions to match your Cosmos DB container's shard key configuration. See the multi-part ID examples for more patterns.
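As a conceptual illustration only, the following Python sketch shows what collapsing a multi-part ID into a single _id can look like. It is not the dsync transform config and does not use `idkeys` or CEL; the function names and the choice of an embedded document as the resulting _id are assumptions made for the example.

```python
def collapse_id(id_parts, key_names):
    """Combine a multi-part Cosmos DB ID into one value usable as a MongoDB _id.

    The embedded-document shape is an illustrative choice, not dsync's behavior.
    """
    if len(id_parts) != len(key_names):
        raise ValueError("expected one ID part per key name")
    # Pair each key name with its value, keeping shard key first, then id
    return dict(zip(key_names, id_parts))


def to_mongo_doc(doc, id_parts, key_names):
    """Return a copy of `doc` with `id` removed and a collapsed `_id` added."""
    out = {k: v for k, v in doc.items() if k != "id"}
    out["_id"] = collapse_id(id_parts, key_names)
    return out


# Example: shard key /region, so the source ID is ["us-east", "123"]
source = {"id": "123", "region": "us-east", "name": "example"}
migrated = to_mongo_doc(source, ["us-east", "123"], ["region", "id"])
```

If you use an embedded document as _id, keep the field order stable, since MongoDB compares embedded documents field by field in order.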
Step 1: Download dsync
Working on a large-scale production environment? Use our horizontally scalable Enterprise offering.
Use Docker (markadiom/dsync) or download the latest release from the GitHub Releases page. Note that on Mac devices you may need to configure a security exception to execute the binary by following these steps.
Alternatively, you can build dsync from the source code.
You can use Homebrew to easily install Dsync on your Mac:
We recommend using Docker for this tutorial.
Cosmos DB NoSQL Connector
The connector for Cosmos DB NoSQL runs as a separate process because it uses the optimized Java SDK. You can run it as a Docker container.
If you'd rather build it from the source and run as a regular process, you can check out the git repository, cd into the java directory and run mvn clean install. You will need Java JDK 21 or newer. This will create a jar in the java/target directory and for convenience you can set up an alias like so (replacing the path/to/dsync with the appropriate file):
See the README in the java directory for the most up-to-date setup instructions.
Step 2: Prepare the destination MongoDB instance
If you already have the desired destination MongoDB instance up and running, you can skip this step.
Install MongoDB
Start a local MongoDB instance:
For faster performance, we recommend creating any required secondary indexes after the initial data copy has completed.
Export MongoDB URI as a shell variable:
Step 3: Start the Cosmos NoSQL connector
You will need to set the URL and KEY environment variables to the URI and Primary Key of your Cosmos DB account. See Prerequisites for where to find them.
Then run cosmos-connector 8089 $URL $KEY & in the background. This starts a gRPC service (running without TLS) that will talk to Cosmos DB NoSQL.
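Before moving on, it can help to confirm that the connector is actually listening. The sketch below is a generic TCP check using only the Python standard library, not part of dsync; the helper name `is_port_open` is hypothetical, and port 8089 is the one used in the command above.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts
        return False


if __name__ == "__main__":
    # 8089 is the port passed to cosmos-connector in this step
    print("connector reachable:", is_port_open("localhost", 8089))
```

A successful connection only shows that something is listening on the port; it does not verify the Cosmos DB credentials.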
If you're building dsync from the source, follow the instructions here to build the connector.
For Cloud Marketplace images and Docker:
Step 4: Start the transformer
You can start your transformer gRPC server listening on a port like 8085.
When using the Enterprise CEL-based transformer, you will need to prepare the config file as described in Data types and ID considerations, save it as config.yml, and run the process as a Docker container:
Step 5: Start dsync
Run dsync --namespace <DB>.<CONTAINER> $COSMOS_NOSQL_GRPC_URI --insecure $MONGODB_URI $TRANSFORMER_GRPC_URI --insecure. Replace the two GRPC_URI variables with the addresses of the connector and the transformer, in the format grpc://localhost:port.
Replace <DB>.<CONTAINER> with the desired Cosmos DB NoSQL database and container names. We pass the --insecure flag because we are not using TLS for our connection to the Cosmos DB NoSQL connector.
You can migrate multiple containers at the same time by specifying multiple mappings in the --namespace parameter:
dsync --namespace "<DB1>.<CONTAINER1>,<DB2>.<CONTAINER2>"
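If you script the migration, the argument order described in this step can be assembled programmatically. The following Python sketch only builds the argument list; the helper name `build_dsync_args` and the example MongoDB URI are hypothetical, while the ports 8089 and 8085 are the ones used earlier in this guide.

```python
def build_dsync_args(namespaces, cosmos_grpc_uri, mongodb_uri, transformer_grpc_uri):
    """Build the dsync argument list in the order shown in this step."""
    return [
        "dsync",
        "--namespace", ",".join(namespaces),  # comma-separated <DB>.<CONTAINER> entries
        cosmos_grpc_uri, "--insecure",        # source: Cosmos DB NoSQL connector
        mongodb_uri,                          # destination: MongoDB
        transformer_grpc_uri, "--insecure",   # transformer gRPC server
    ]


args = build_dsync_args(
    ["db1.container1", "db2.container2"],
    "grpc://localhost:8089",
    "mongodb://localhost:27017",  # placeholder destination URI
    "grpc://localhost:8085",
)
# Pass `args` to subprocess.run(args, check=True) to launch dsync
```

Passing the arguments as a list avoids shell quoting issues with the comma-separated namespace value.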
Full command for Docker:
A web page showing migration progress will be available at localhost:8080.
Limitations
For Cosmos DB NoSQL sources, the Open Source version of Dsync only supports CDC for a single namespace. For multiple namespaces, you can either do the initial sync only (--mode InitialSync), run multiple Dsync processes (one for each namespace), or use the Enterprise version.