Long-horizon AI assistants can retrieve outdated, low-quality, or conflicting memories and then answer with too much confidence, which is dangerous.
GISA is a new test (benchmark) that checks how well AI assistants can search the web like real people do.
Multi-agent systems are like teams of expert helpers; the tricky part is choosing which helpers to ask for each question.