Conversation

@Chang-LeHung
Contributor

Issue: #604

I'll add the corresponding test case(s) later.

@codecov

codecov bot commented May 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (6f006ab) to head (47d86a5).
⚠️ Report is 26 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #605   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           19        19           
  Lines         2224      2226    +2     
=========================================
+ Hits          2224      2226    +2     


@gaogaotiantian
Owner

gaogaotiantian commented May 25, 2025

What is the actual impact of this issue besides "it is theoretically wrong"? This fix iterates over all of the existing nodes, which could number in the millions, and that introduces a very significant overhead after forking. I want to know whether the benefit outweighs the problem.

I mean, when we fork the process, the existing trace actually came from the old thread id, so there's no fully correct answer here. If it causes some real problem we can figure out a way to fix that; otherwise it's kind of benign.

@Chang-LeHung
Contributor Author

When we use multiprocessing in fork mode, we remove all nodes before calling update_tls. With forkserver or spawn, we do not have any tracing data at all. The ring buffer is non-empty if and only if we use os.fork directly, and I think we should clear the buffer in that case too, just as we do in the fork mode of multiprocessing.
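
A minimal sketch of the scenario in question, assuming only the public VizTracer API (VizTracer, start, stop, save); file names are arbitrary. It only illustrates how pre-fork entries end up in the child's buffer when os.fork is called directly, which is the case under discussion:

```python
import os
from viztracer import VizTracer

def traced_work():
    pass

tracer = VizTracer()
tracer.start()
traced_work()               # recorded in the ring buffer before the fork

pid = os.fork()
if pid == 0:
    # Child: it inherits a copy of the parent's ring buffer, so the
    # pre-fork entries still carry the parent's thread id unless the
    # buffer is cleared or the tid is updated.
    traced_work()
    tracer.stop()
    tracer.save("child.json")
    os._exit(0)
else:
    os.waitpid(pid, 0)
    tracer.stop()
    tracer.save("parent.json")
```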

@gaogaotiantian
Owner

That did not answer the question about what we gain. I don't think we should clear the trace after we fork; that's pretty wrong.

@Chang-LeHung
Contributor Author

Chang-LeHung commented May 29, 2025

Given that iterating the buffer introduces a significant overhead, why do we still clear the buffer in multiprocessing?

Is that a significant overhead?

And does that introduce a "significant" overhead? I did a test: calling an empty function with no arguments one million times took about 370 ms, while iterating a linked list with one million nodes took about 20 ms. That's roughly 5% overhead, and note that the baseline is an empty function; the percentage should be lower in real workloads. A child process's workload should be close to the parent's, because nobody forks a process for something trivial. Outside of a production environment, I think that is acceptable. In any case, we could add a command-line argument for users who want the more accurate result.
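
For reference, a rough sketch of the comparison described above, written in pure Python; the original numbers presumably came from the C-level buffer, so absolute timings here will differ, but the methodology (one million empty calls versus a walk over a one-million-node linked list) is the same:

```python
import time

def empty():
    pass

class Node:
    __slots__ = ("next",)
    def __init__(self, nxt=None):
        self.next = nxt

# Build a linked list with one million nodes.
head = None
for _ in range(1_000_000):
    head = Node(head)

t0 = time.perf_counter()
for _ in range(1_000_000):
    empty()
t1 = time.perf_counter()

# Walk the whole list once, mimicking the post-fork node iteration.
node = head
while node is not None:
    node = node.next
t2 = time.perf_counter()

print(f"1M empty calls:    {(t1 - t0) * 1000:.1f} ms")
print(f"1M-node list walk: {(t2 - t1) * 1000:.1f} ms")
```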

What we gain

Users' trust, and a clearer result. It's indeed fairly benign if you have mastered VizTracer's internals, because then you know exactly what happens: the tid is wrong but the tracing data is right, and the tid doesn't really matter, so it's fine. But users who are not that familiar with VizTracer will be confused by the result, and they may think, "Since the tid is wrong, should I trust the rest of the result, even though it looks right?"

Zero-overhead solution

If we replace the tid stored in each EventNode with a pointer to its ThreadInfo, then after a fork we only need to update the tid in the metadata, not in every node. We then change every place that uses EventNode->tid to EventNode->thread_info->tid, which is easy to implement.
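
A conceptual sketch of that indirection, written in Python purely for illustration; the real change would live in viztracer's C source, and the class names below just mirror the EventNode/ThreadInfo names used above:

```python
class ThreadInfo:
    def __init__(self, tid):
        self.tid = tid

class EventNode:
    def __init__(self, thread_info):
        # Keep a reference to the shared per-thread info instead of
        # copying the tid into every node.
        self.thread_info = thread_info

    @property
    def tid(self):
        # Every consumer of node.tid reads through the shared object,
        # i.e. the EventNode->thread_info->tid access described above.
        return self.thread_info.tid

info = ThreadInfo(tid=1234)
nodes = [EventNode(info) for _ in range(5)]

# After a fork, update one object instead of iterating millions of nodes.
info.tid = 5678
assert all(node.tid == 5678 for node in nodes)
```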

@Chang-LeHung
Contributor Author

Sorry for clicking "Close with comment" by mistake.

@Chang-LeHung
Contributor Author

@gaogaotiantian

@gaogaotiantian
Owner

Like I said, the result is not exactly "correct" even if we "fix" it - viztracer keeps the pre-fork data, and those functions did execute when the thread id still had its previous value. It might be less wrong, but I don't see it as a trust issue.

You made a good point that we did this for multiprocessing+fork, which did take a while. I'll take a look at the code soon and see if it should be merged.
