Transcribing Video With DDD Discussion | Domain Driven Design w/ TypeScript

Last updated Feb 22nd, 2021

A design discussion about transcribing video using DDD and the clean/hexagonal architecture.

I'm always interested in what blog readers are building using Domain-Driven Design and Clean Architecture practices.

In this post, I'd like to share an email exchange I held with a reader who is working on encapsulating the complexity of building an application capable of transcribing videos given their public URL.

Very neat!

If this is something you're looking into doing, awesome. If not, hopefully this discussion is enriching anyway. I often like to be a fly on the wall in technical discussions, and by the law of the internet - I assume someone else must as well!

Hi Khalil! I've been impressed by your articles and I bought your online book, solidbook.io. And also I've been practicing and introducing DDD in real projects.

Recently, I've taken a task to handle streaming media. Download media files and reformat them. And transform the audio of media into a text. And then combine them into one media streaming like HLS with VTT. All of the above processes are conducted with "stream".

At this point I wonder how I can handle "stream" in DDD? Should I have to handle stream as domain object? Or infra? With traditional REST API app, I've almost understood concept of DDD and applied several DDD to several projects. But it is not easy to handle things like stream :( Can you share your experience or inspiration? Thanks in advance :)

"Your question,

"Should I have to handle stream as domain object? Or infra?"

The answer to your question is likely both. Though it depends. I'll explain.

First, let's recall the concept of high-level vs. low-level policy.

High-level policy is what we do.
Low-level policy is how we do it.

The high-level policy is the domain layer, and the low-level policy is your infrastructure layer.

Domain layer:
- entities & aggregates — protects invariants around object behavior & state changes
- value objects — protects invariants around validation and object creation logic
Infrastructure:
- repositories — object persistence and retrieval
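To make that split concrete, here's a minimal sketch. Every name in it (MediaUrl, MediaJob, MediaJobRepo, InMemoryMediaJobRepo) is made up for illustration and isn't from a real project. The value object guards creation, the entity guards state changes, and the repository is just a contract that infrastructure implements.

```typescript
// Hypothetical value object: validation lives in the factory, so an invalid
// MediaUrl can never be constructed.
class MediaUrl {
  private constructor(public readonly value: string) {}

  static create(raw: string): MediaUrl {
    if (!/^https?:\/\/\S+$/.test(raw)) {
      throw new Error(`Invalid media URL: ${raw}`);
    }
    return new MediaUrl(raw);
  }
}

// Hypothetical entity: state only changes through methods that protect invariants.
class MediaJob {
  private started = false;

  constructor(public readonly id: string, public readonly source: MediaUrl) {}

  start(): void {
    if (this.started) {
      throw new Error("Job already started");
    }
    this.started = true;
  }

  get isStarted(): boolean {
    return this.started;
  }
}

// The contract the domain and use cases depend on.
interface MediaJobRepo {
  save(job: MediaJob): Promise<void>;
  getById(id: string): Promise<MediaJob | undefined>;
}

// Infrastructure: an in-memory adapter, handy for tests. A real one would
// talk to a database or the file system.
class InMemoryMediaJobRepo implements MediaJobRepo {
  private jobs = new Map<string, MediaJob>();

  async save(job: MediaJob): Promise<void> {
    this.jobs.set(job.id, job);
  }

  async getById(id: string): Promise<MediaJob | undefined> {
    return this.jobs.get(id);
  }
}
```

Swapping InMemoryMediaJobRepo for a database-backed version wouldn't touch MediaUrl or MediaJob at all, which is the point of the separation.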

Even highly technical domains centered around expensive I/O things can benefit from decoupling the domain from infrastructure. Consider building a new database technology, a virtual synthesizer, or in your case — streaming media.

These are all I/O-heavy problems to solve. They interact with files, upper-layer protocols, user inputs, and sound. A real-world example that comes to mind is actually what powers wiki.solidbook.io. I wrote a library that converts Notion markdown pages into a website, PDF, and EPUB.

The domain layer has objects that represent these I/O concepts as domain objects.

But the majority of the I/O lives in repositories. For example, if I wanted to create a file, I'd create a file entity (high-level), then pass it to the fileRepo to persist it to the file system (low-level) at the path specified by the getAbsoluteFilePath method.
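Here's a rough sketch of that file example. It's not the actual library code: the FileEntity shape, the FileRepo interface, and the in-memory repo are assumptions, and I'm assuming getAbsoluteFilePath lives on the entity.

```typescript
// Hypothetical domain entity: it knows what the file is and where it belongs,
// but performs no I/O itself.
class FileEntity {
  constructor(
    public readonly directory: string,
    public readonly name: string,
    public readonly contents: string
  ) {
    if (name.length === 0 || name.includes("/")) {
      throw new Error(`Invalid file name: ${name}`);
    }
  }

  getAbsoluteFilePath(): string {
    // Strip trailing slashes so we never produce "dir//file".
    const dir = this.directory.replace(/\/+$/, "");
    return `${dir}/${this.name}`;
  }
}

// The repository contract: the only place where persistence happens.
interface FileRepo {
  save(file: FileEntity): Promise<void>;
}

// In-memory stand-in. A real implementation would write to the file system
// at file.getAbsoluteFilePath().
class InMemoryFileRepo implements FileRepo {
  readonly written = new Map<string, string>();

  async save(file: FileEntity): Promise<void> {
    this.written.set(file.getAbsoluteFilePath(), file.contents);
  }
}
```

A use case would build the FileEntity, hand it to whichever FileRepo was injected, and never touch the file system directly.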

The whole reason we use abstraction layers and decouple domain from infrastructure is to separate concerns, decompose complexity, and increase cohesion by maintaining a single level of abstraction at a time.

OK, Khalil — I get it. Now, how would you do this? How would you go about solving this?

1. Event Storming or Event Modelling

Just like we walk through with the forum app we build in solidbook, I'd recommend you start with what you should do on any domain-driven design project: use Event Storming or Event Modelling to discover the domain (i.e., the high-level policy: what we do).

This is a collaborative activity that you'd normally do with your team and domain experts (the customer or a representative for the customer), but I'm making a ton of assumptions on my own here.

Here's what I've landed on.

Commands: StartStreamJob (this is the only public one done by the user — the rest are done by the system itself), ExtractTextFromAudio, CombineMedia, CompleteStreamJob.
Events: StreamJobStarted, ExtractedTextFromAudio, CombinedMedia, StreamJobCompleted.
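One way to carry those names into code is as plain TypeScript types. This is only a sketch: the command and event names come from the list above, but the payload fields (videoUrl, streamJobId) and the helper names are assumptions.

```typescript
// Commands discovered during event storming. Payload fields are assumed.
type StreamJobCommand =
  | { type: "StartStreamJob"; videoUrl: string }
  | { type: "ExtractTextFromAudio"; streamJobId: string }
  | { type: "CombineMedia"; streamJobId: string }
  | { type: "CompleteStreamJob"; streamJobId: string };

type StreamJobEventName =
  | "StreamJobStarted"
  | "ExtractedTextFromAudio"
  | "CombinedMedia"
  | "StreamJobCompleted";

// Each command, when it succeeds, results in exactly one event.
const eventForCommand: Record<StreamJobCommand["type"], StreamJobEventName> = {
  StartStreamJob: "StreamJobStarted",
  ExtractTextFromAudio: "ExtractedTextFromAudio",
  CombineMedia: "CombinedMedia",
  CompleteStreamJob: "StreamJobCompleted",
};

// Only StartStreamJob is issued by the user; the system issues the rest.
function isUserCommand(command: StreamJobCommand): boolean {
  return command.type === "StartStreamJob";
}
```

Because eventForCommand is typed as a Record over every command type, the compiler complains if a new command is added without deciding which event it produces.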

Now, this might not be exactly correct, but this is what I got from reading your email. You'll want to do this yourself with your team (which hopefully includes the customer or a representative for the customer like a product manager).

We now have all the use cases.

Next, we should probably discuss how much you want to handle vs. how much you want to outsource.

Do you want to write the transcoding, transcribing, and reformatting logic yourself or would you prefer to use a service that does it well already?

I'd personally recommend using a service like AWS. They have:

Transcode/reformat — https://aws.amazon.com/elastictranscoder/
Transcribe Audio (get the text from audio) — https://aws.amazon.com/transcribe/. For example, here's the API to start a transcription job using the Transcribe API.

This stuff is going to live in infrastructure. We'll write adapters for this, and if we want, we can switch any of these things out and write our own implementation. I imagine:

Transcoder interface — implemented by an AWSElasticTranscoder
Transcriber interface — implemented by an AWSTranscriber
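As a sketch of what those two interfaces could look like: the method signatures below are my assumptions, and the Fake adapters are hypothetical stand-ins. Real AWSElasticTranscoder and AWSTranscriber classes would call the AWS SDK behind the same contracts, which I'm not showing here.

```typescript
// Contracts the application layer depends on (signatures are assumed).
interface Transcoder {
  // Returns the URL of the reformatted media.
  transcode(sourceUrl: string): Promise<string>;
}

interface Transcriber {
  // Returns the text extracted from the audio at mediaUrl.
  transcribe(mediaUrl: string): Promise<string>;
}

// Hypothetical in-memory adapters, useful for tests and local development.
class FakeTranscoder implements Transcoder {
  async transcode(sourceUrl: string): Promise<string> {
    return `${sourceUrl}.reformatted`;
  }
}

class FakeTranscriber implements Transcriber {
  constructor(private readonly cannedText: string) {}

  async transcribe(_mediaUrl: string): Promise<string> {
    return this.cannedText;
  }
}

// Application code only sees the interfaces, so adapters can be swapped freely.
async function reformatAndTranscribe(
  transcoder: Transcoder,
  transcriber: Transcriber,
  sourceUrl: string
): Promise<{ mediaUrl: string; text: string }> {
  const mediaUrl = await transcoder.transcode(sourceUrl);
  const text = await transcriber.transcribe(mediaUrl);
  return { mediaUrl, text };
}
```

If you later decide to write your own transcription logic, you'd add another class that implements Transcriber and leave reformatAndTranscribe untouched.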

2. State changes flow through the aggregate as a saga

A "saga" is a series of local transactions.
A streamJob has several of these.
- The high-level policy is that our saga goes → reformatMedia, extractAudio, and combineMedia. That's how we complete a streamJob.
- The low-level policy is that, in AWS, all of this work is done in S3 buckets. So when our job is done, we'll likely want to do some cleanup. A streamJobCompleted event could be subscribed to in order to clean up those temporary files in our S3 buckets.
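The cleanup subscription could be sketched like this. The DomainEvents dispatcher and the TempStorage class are hypothetical: TempStorage stands in for the S3 buckets, and a real subscriber would delete objects through the AWS SDK instead.

```typescript
type DomainEvent = { name: string; streamJobId: string };
type Handler = (event: DomainEvent) => void;

// A tiny, synchronous event dispatcher (assumed, not a real library).
class DomainEvents {
  private handlers = new Map<string, Handler[]>();

  subscribe(eventName: string, handler: Handler): void {
    const existing = this.handlers.get(eventName) ?? [];
    this.handlers.set(eventName, [...existing, handler]);
  }

  dispatch(event: DomainEvent): void {
    for (const handler of this.handlers.get(event.name) ?? []) {
      handler(event);
    }
  }
}

// Hypothetical stand-in for the temporary files a job leaves in storage.
class TempStorage {
  private files = new Map<string, string[]>();

  add(streamJobId: string, path: string): void {
    const existing = this.files.get(streamJobId) ?? [];
    this.files.set(streamJobId, [...existing, path]);
  }

  removeAllFor(streamJobId: string): void {
    this.files.delete(streamJobId);
  }

  count(streamJobId: string): number {
    return (this.files.get(streamJobId) ?? []).length;
  }
}

const events = new DomainEvents();
const storage = new TempStorage();

// Low-level policy: when the saga finishes, clean up the temporary files.
events.subscribe("StreamJobCompleted", (event) => {
  storage.removeAllFor(event.streamJobId);
});
```

The aggregate never needs to know that cleanup exists. It just reports that the job completed, and infrastructure reacts.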
3. An example use case
- We might realize that the streamJob aggregate needs the following properties: JobState, InitialMedia, TextFromAudio, MediaAudio, TranscribedMedia, etc. These are all domain objects used to get from the beginning to the end of the saga. And they are created as we progress through.
- For example, the TranscribeAudioUseCase, which is invoked after a MediaReformatted event, might look something like this:

class TranscribeAudioUseCase {
  ...
  // `async` is required because the body awaits the repo and the transcriber.
  async execute (streamJobId: string): Promise<void> {
    const streamJob = await this.streamJobRepo.getJobById(streamJobId);
    
    if (!streamJob.shouldTranscribe()) {
      throw new Error("Not ready to transcribe");
    }

    // Gets the URL of the S3 bucket containing the reformatted media, then
    // transcribes it. The streamJob aggregate's `setTranscribedText()` method
    // is called, updating the state of the aggregate and saving the transcribed text
    // to the streamJob aggregate as well. A "MediaTranscribed" event is created and
    // added to the aggregate.
    await this.awsTranscriber.transcribe(streamJob);

    // The aggregate is saved and the "MediaTranscribed" event is dispatched.
    // Awaited so that a failed save rejects the use case instead of being lost.
    await this.streamJobRepo.save(streamJob);
  }
}

And here's a snippet of what that state-changing method might look like."

class StreamJob extends AggregateRoot<StreamJobProps> {
  ...

  setTranscribedText(transcription: Transcription): InvalidStateError | void {
    // Transcription is only allowed once the media has been reformatted.
    if (this.state !== "MediaReformatted") {
      return new InvalidStateError();
    }

    this.props.transcription = transcription;
    this.props.state = "MediaTranscribed";
    this.addDomainEvent(new MediaTranscribed());
  }
}
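For a version you can actually run, here's a self-contained sketch of the same idea. The AggregateRoot base class, the plain-string transcription, and the shape of InvalidStateError are simplified assumptions, not the real building blocks. Note that the method returns the error rather than throwing it, so callers have to check the result.

```typescript
// Simplified stand-ins for the real building blocks (assumptions).
class InvalidStateError {
  readonly message = "Aggregate is not in a valid state for this operation";
}

type DomainEvent = { name: string };

abstract class AggregateRoot<TProps> {
  private events: DomainEvent[] = [];

  constructor(protected props: TProps) {}

  protected addDomainEvent(event: DomainEvent): void {
    this.events.push(event);
  }

  get domainEvents(): readonly DomainEvent[] {
    return this.events;
  }
}

interface StreamJobProps {
  state: string;
  transcription?: string;
}

class StreamJob extends AggregateRoot<StreamJobProps> {
  get state(): string {
    return this.props.state;
  }

  setTranscribedText(transcription: string): InvalidStateError | void {
    if (this.props.state !== "MediaReformatted") {
      return new InvalidStateError();
    }

    this.props.transcription = transcription;
    this.props.state = "MediaTranscribed";
    this.addDomainEvent({ name: "MediaTranscribed" });
  }
}

// A job in the right state accepts the transcription...
const ready = new StreamJob({ state: "MediaReformatted" });
const ok = ready.setTranscribedText("hello world");

// ...and one in the wrong state hands back an error instead of changing.
const notReady = new StreamJob({ state: "StreamJobStarted" });
const failed = notReady.setTranscribedText("hello world");

if (failed instanceof InvalidStateError) {
  console.log(failed.message);
}
```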

Khalil! Thank you for your comprehensive reply. And I noticed I have to learn about saga in DDD. Your example is assuming that we outsource transcription to 3rd-party service. I need to explain my situation in detail. I need to handle logic - such as transcribing, combining, creating files, removing files, etc. - in a pipeline in stream.

A video source is out there. Video source is generating video chunks continuously. In my service, the system gets a video chunk from a stream and converts it into a proper audio format. And sends it to the transcribing service and gets transcripts. And then the system saves the audio chunk and transcript as files and updates the playlist. Lastly, the system removes old files that are not necessary anymore.

I also posted pseudo code for my current implementation.

import { Transform, Writable } from "stream";

class TranscribeAudioUseCase {
  ...
  execute(videoSourceUrl: Url) {
    const videoSourceStream = this.createVideoSourceStream(videoSourceUrl)
    videoSourceStream
      .pipe(
        // Intermediate stages are Transforms: a Writable cannot be piped onward.
        new Transform({
           transform: async (data, encoding, next) => {
             const converted = await this.audioConverter.convert(data)
             // Node-style callback: error first, then the output chunk.
             next(null, converted)
           }
        })
      )
      .pipe(
        new Transform({
          // This stage emits [audio, transcript] pairs, so its output is in object mode.
          readableObjectMode: true,
          transform: async (data, encoding, next) => {
            const transcript = await this.transcriptionService.transcribe(data)
            next(null, [data, transcript])
          }
        })
      )
      .pipe(
         new Writable({
            objectMode: true,
            write: async (data, encoding, next) => {
              // playlistManager adds new audio and transcript files and removes old files
              await this.playlistManager.update(data[0], data[1])
              next()
            }
         })
      )
  }
}

At this point, I have several questions.

1 - Do you think it's okay that a "stream" object is exposed to use cases or domain?

2 - In dealing files on domain, are domain objects allowed to have saved paths or not? For example, if we introduce AudioSegment in the domain layer, is it okay that AudioSegment has a path where it is saved? How can you decide a property is for domain or for infra?

3 - Sometimes, talking with domain experts is too abstract. I know we should develop that concept in detail and usually I do. But like this case, sometimes domain experts do not know about the concept of stream. And they just say "From this URL, get a video and extract text from it and combine them and show them to me". I know it is good to use those terms into domain objects. But I think in this case, it depends a lot on what protocol of video source and what protocol we should serve. If we use HLS we should save target duration, sequence number and so on. If we use RTMP we should save values for RTMP packets. I think serving protocols like RTMP, HLS are infra. Thus, they should not corrupt domain or be exposed to domain. But it affects too much what we should put in domain objects.

Do you think it's okay that a "stream" object is exposed to use cases or domain?

"The stream is just a functional way to handle logic. There's probably a way to do this with for or while-loops too. That being said, this approach is elegant and I see no problem with you doing this in the use case. You probably want to be able to promise-ify this and report when it's complete to the caller (controller, most likely).

As for making the concept of a stream a part of the domain, you could if that makes sense. You could model a stream as a domain object.
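On promise-ifying the pipeline so the caller knows when it's done: here's a minimal sketch using Node's built-in `pipeline` from `stream/promises`. The convert and transcribe functions are hypothetical stand-ins for the audio converter and transcription service, and the playlist is just an array.

```typescript
import { Readable, Transform, Writable } from "stream";
import { pipeline } from "stream/promises";

// Hypothetical stand-ins for the real converter and transcription service.
async function convert(chunk: string): Promise<string> {
  return chunk.toUpperCase();
}

async function transcribe(audio: string): Promise<string> {
  return `transcript of ${audio}`;
}

// Resolves once every chunk has flowed through all stages, and rejects if
// any stage reports an error.
async function runPipeline(chunks: string[]): Promise<Array<[string, string]>> {
  const playlist: Array<[string, string]> = [];

  await pipeline(
    Readable.from(chunks),
    new Transform({
      objectMode: true,
      transform(chunk, _encoding, callback) {
        // Pass rejections to the callback so the pipeline fails cleanly.
        convert(chunk).then((audio) => callback(null, audio), callback);
      },
    }),
    new Transform({
      objectMode: true,
      transform(audio, _encoding, callback) {
        transcribe(audio).then((text) => callback(null, [audio, text]), callback);
      },
    }),
    new Writable({
      objectMode: true,
      write([audio, text], _encoding, callback) {
        playlist.push([audio, text]);
        callback();
      },
    })
  );

  return playlist;
}
```

A use case's execute method could simply `await runPipeline(...)` and return, which gives the controller a single promise to wait on. Unlike chained `.pipe()` calls, `pipeline` also tears down every stream if one of them errors.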

In dealing files on domain, domain objects are allowed to have saved paths or not? For example, if we introduce AudioSegment in the domain layer, is it okay that AudioSegment has a path where it is saved?

If it makes sense to the integrity of the domain object, then yes. Again, domain objects can know about technical stuff like file paths, permissions, and such so long as it makes sense to the domain. Your domain is inherently technical. For a userProfile, I'd also save the profilePictureURL as a value prop. Similar situation.
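As a small illustration, here's a sketch where the saved path earns its place in the domain because a domain rule needs it: deciding which old segment files can be removed. FilePath, the AudioSegment fields, and expiredPaths are all assumptions made up for this example.

```typescript
// Hypothetical value object: only absolute paths are valid.
class FilePath {
  private constructor(public readonly value: string) {}

  static create(raw: string): FilePath {
    if (!raw.startsWith("/")) {
      throw new Error(`Expected an absolute path, got: ${raw}`);
    }
    return new FilePath(raw);
  }
}

// Hypothetical domain object: the path is part of what an AudioSegment is.
class AudioSegment {
  constructor(
    public readonly sequenceNumber: number,
    public readonly durationSeconds: number,
    public readonly savedAt: FilePath
  ) {}
}

// Domain rule: keep only the newest `keepLast` segments and report the paths
// of the rest, oldest first, so infrastructure can delete them.
function expiredPaths(segments: AudioSegment[], keepLast: number): string[] {
  const sorted = [...segments].sort((a, b) => a.sequenceNumber - b.sequenceNumber);
  const expiredCount = Math.max(0, sorted.length - keepLast);
  return sorted.slice(0, expiredCount).map((segment) => segment.savedAt.value);
}
```

The domain decides what is expired. Actually deleting the files is still a job for a repository or adapter.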

How can you decide a property is for domain or for infra?

I don't typically have any entities in infrastructure. My infrastructure usually consists of the building blocks on which the application can operate. We're talking repos, databases, controllers, web servers, adapters to external APIs like Stripe, Google Calendar, etc.

You said, "protocols like RTMP, HLS are infra". Protocols like HTTP are typically abstracted away from the domain layer on most of the things I build, but you are building something where the concept of protocols is inherent to the domain. The concept of HTTP probably belongs as far away from the domain as possible in a pet store application, but it probably belongs a lot closer if your domain is a file-sharing application, or better yet - a media-streaming application.

Sometimes, talking with domain experts is too abstract. I know we should develop that concept in detail and usually, I do. But like this case, sometimes domain experts do not know about the concept of a stream. And they just say "From this URL, get a video and extract text from it and combine them and show them to me". I know it is good to use those terms in domain objects. But I think in this case, it depends a lot on what protocol of video source and what protocol we should serve. If we use HLS we should save target duration, sequence number and so on. If we use RTMP we should save values for RTMP packets.

Yeah, that's complexity. That's why we want a domain layer.

I smell a practical application of the Strategy Pattern here. Consider an abstract StreamProtocol domain object. It defines the general algorithm necessary for what you need to do to deal with the stream in all cases, but leaves out specific details like handling sequence numbers for HLS or what you should do for RTMP.

This is also the Open-Closed Principle. Your Protocol object is high-level, and the concrete implementations are low-level. The Protocol domain object contains the law. The concrete implementations follow the law, but in their own unique ways. To add support for a new protocol, you merely need to write the new concrete implementation."
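A minimal sketch of that shape, under heavy assumptions: the Segment type and the describe/header/entry methods are invented for illustration, the HLS output is a simplified fragment (a real playlist needs more tags), and the RTMP class is a placeholder that does not reflect real RTMP packet handling.

```typescript
interface Segment {
  sequenceNumber: number;
  durationSeconds: number;
}

// The "law": the general algorithm every protocol follows.
abstract class StreamProtocol {
  describe(segments: Segment[]): string {
    if (segments.length === 0) {
      throw new Error("No segments to serve");
    }
    const lines = [this.header(segments), ...segments.map((s) => this.entry(s))];
    return lines.join("\n");
  }

  // Protocol-specific details are left to the concrete implementations.
  protected abstract header(segments: Segment[]): string;
  protected abstract entry(segment: Segment): string;
}

// HLS cares about target duration and media sequence numbers.
class HlsProtocol extends StreamProtocol {
  protected header(segments: Segment[]): string {
    const target = Math.ceil(Math.max(...segments.map((s) => s.durationSeconds)));
    const firstSequence = Math.min(...segments.map((s) => s.sequenceNumber));
    return `#EXT-X-TARGETDURATION:${target}\n#EXT-X-MEDIA-SEQUENCE:${firstSequence}`;
  }

  protected entry(segment: Segment): string {
    return `#EXTINF:${segment.durationSeconds.toFixed(1)},\nsegment-${segment.sequenceNumber}.ts`;
  }
}

// Placeholder only: real RTMP details would go here.
class RtmpProtocol extends StreamProtocol {
  protected header(segments: Segment[]): string {
    return `rtmp placeholder: ${segments.length} segment(s)`;
  }

  protected entry(segment: Segment): string {
    return `packet for segment ${segment.sequenceNumber}`;
  }
}
```

Adding support for another protocol means writing one more subclass. StreamProtocol and the code that calls describe stay closed to modification.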
