Conversation

@Chang-LeHung
Contributor

Issue: #604

I'll add the corresponding test case(s) later.

@codecov

codecov bot commented May 24, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (6f006ab) to head (47d86a5).
⚠️ Report is 26 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #605   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           19        19           
  Lines         2224      2226    +2     
=========================================
+ Hits          2224      2226    +2     


@gaogaotiantian
Owner

gaogaotiantian commented May 25, 2025

What is the actual impact of this issue besides "it is theoretically wrong"? This fix iterates over all of the existing nodes, which could number in the millions, and that introduces a very significant overhead after forking. I want to know whether the benefit outweighs the problem.

I mean, when we fork the process, the existing trace actually came from the old thread id, so there's no fully correct answer here. If it causes some real problem we can figure out a way to fix that; otherwise it's kind of benign.

@Chang-LeHung
Contributor Author

When we use multiprocessing in fork mode, we remove all nodes before calling update_tls. With forkserver or spawn, we do not have any tracing data at all. The ring buffer is non-empty if and only if we use os.fork directly, and I think we should clear the buffer in that case too, just as we do in the fork mode of multiprocessing.
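
A minimal sketch of the scenario in question, assuming only the public VizTracer API (VizTracer, start, stop, save); file names are arbitrary. It only illustrates how pre-fork entries end up in the child's buffer when os.fork is called directly, which is the case under discussion:

```python
import os
from viztracer import VizTracer

def traced_work():
    pass

tracer = VizTracer()
tracer.start()
traced_work()               # recorded in the ring buffer before the fork

pid = os.fork()
if pid == 0:
    # Child: it inherits a copy of the parent's ring buffer, so the
    # pre-fork entries still carry the parent's thread id unless the
    # buffer is cleared or the tid is updated.
    traced_work()
    tracer.stop()
    tracer.save("child.json")
    os._exit(0)
else:
    os.waitpid(pid, 0)
    tracer.stop()
    tracer.save("parent.json")
```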

@gaogaotiantian
Owner

That did not answer the question about what we gain. I don't think we should clear the trace after we fork; that's pretty wrong.

@Chang-LeHung
Contributor Author

Chang-LeHung commented May 29, 2025

Given that iterating the buffer introduces a significant overhead, why do we still clear the buffer in multiprocessing?

Is that a significant overhead?

And does that introduce a "significant" overhead? I did a test: calling an empty function with no arguments one million times took about 370 ms, while iterating a linked list with one million nodes took about 20 ms. That's roughly 5% overhead, and note that the baseline is an empty function; the percentage should be lower in real workloads. A child process's workload should be close to the parent's, because nobody forks a process for something trivial. Outside of a production environment, I think that is acceptable. In any case, we could add a command-line argument for users who want the more accurate result.
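
For reference, a rough sketch of the comparison described above, written in pure Python; the original numbers presumably came from the C-level buffer, so absolute timings here will differ, but the methodology (one million empty calls versus a walk over a one-million-node linked list) is the same:

```python
import time

def empty():
    pass

class Node:
    __slots__ = ("next",)
    def __init__(self, nxt=None):
        self.next = nxt

# Build a linked list with one million nodes.
head = None
for _ in range(1_000_000):
    head = Node(head)

t0 = time.perf_counter()
for _ in range(1_000_000):
    empty()
t1 = time.perf_counter()

# Walk the whole list once, mimicking the post-fork node iteration.
node = head
while node is not None:
    node = node.next
t2 = time.perf_counter()

print(f"1M empty calls:    {(t1 - t0) * 1000:.1f} ms")
print(f"1M-node list walk: {(t2 - t1) * 1000:.1f} ms")
```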

What we gain

Users' trust, and a clearer result. It's indeed fairly benign if you have mastered VizTracer's internals, because then you know exactly what happens: the tid is wrong but the tracing data is right, and the tid doesn't really matter, so it's fine. But users who are not that familiar with VizTracer will be confused by the result, and they may think, "Since the tid is wrong, should I trust the rest of the result, even though it looks right?"

Zero-overhead solution

If we replace the tid stored in each EventNode with a pointer to its ThreadInfo, then after a fork we only need to update the tid in the metadata, not in every node. We then change every place that uses EventNode->tid to EventNode->thread_info->tid, which is easy to implement.
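
A conceptual sketch of that indirection, written in Python purely for illustration; the real change would live in viztracer's C source, and the class names below just mirror the EventNode/ThreadInfo names used above:

```python
class ThreadInfo:
    def __init__(self, tid):
        self.tid = tid

class EventNode:
    def __init__(self, thread_info):
        # Keep a reference to the shared per-thread info instead of
        # copying the tid into every node.
        self.thread_info = thread_info

    @property
    def tid(self):
        # Every consumer of node.tid reads through the shared object,
        # i.e. the EventNode->thread_info->tid access described above.
        return self.thread_info.tid

info = ThreadInfo(tid=1234)
nodes = [EventNode(info) for _ in range(5)]

# After a fork, update one object instead of iterating millions of nodes.
info.tid = 5678
assert all(node.tid == 5678 for node in nodes)
```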

@Chang-LeHung
Contributor Author

Sorry for clicking "Close with comment" by mistake.

@Chang-LeHung
Contributor Author

@gaogaotiantian

@gaogaotiantian
Owner

Like I said, the result is not exactly "correct" even if we "fix" it - viztracer keeps the pre-fork data, and those functions did execute when the thread id still had its previous value. It might be less wrong, but I don't see it as a trust issue.

You made a good point that we did this for multiprocessing+fork, which did take a while. I'll take a look at the code soon and see if it should be merged.
