
How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas

Hugging Face


Published April 21, 2026
By Will Jennings, Hyunwoo Kim, Jinho Lee, jihyeonRyu, Kiran Praveen, Yev Meyer, Kirit Thadaka, and Shyamala Prayaga (NVIDIA)

The models powering most AI agents today were trained primarily on English web data. They miss Korean honorific structures, regional occupation patterns, and the cultural context that Korean users expect. An agent that applies U.S. healthcare workflows to the Korean public health system isn't ready for production.

Nemotron-Personas-Korea addresses this gap. The dataset provides 7 million fully synthetic personas grounded in official statistics and seed data from the Korean Statistical Information Service (KOSIS), the Supreme Court of Korea, the National Health Insurance Service, and the Korea Rural Economic Institute. NAVER Cloud contributed seed data and domain expertise during design.

Every persona is demographically accurate but contains zero personally identifiable information (PII). The dataset is designed with Korea's Personal Information Protection Act (PIPA) in mind. South Korea is also one of the few countries to publish an official Synthetic Data Generation guide, establishing governance for grounding models with synthetic versions of sensitive data. This dataset follows that approach.

In this tutorial, we'll turn a synthetic persona into a deployed Korean agent, from filtering the dataset to inference, in about 20 minutes using hosted APIs.
A Sovereign Dataset for South Korea

| Attribute | Detail |
|---|---|
| Total personas | 7 million (1 million records × 7 personas each) |
| Persona fields | 26 fields: 7 persona fields, 6 persona attribute fields, 12 demographic & geographic contextual fields, and 1 unique identifier |
| Geographic coverage | All 17 Korean provinces and 25 districts |
| Names | ~209K unique names (118 surnames, ~21.4K given names) |
| Occupations | 2K+ categories reflecting tech, manufacturing, public sector, etc. |
| Persona types | Professional, family, sports, arts, travel, culinary, concise |
| Life stages | Student, military service, employed, unemployed, retired |
| Language | Natural Korean |
| License | CC BY 4.0 |

Nemotron-Personas-Korea was generated using NeMo Data Designer, NVIDIA's open-source compound AI system for synthetic data. The pipeline pairs a Probabilistic Graphical Model (Apache-2.0) for statistical grounding with Gemma-4-31B for Korean-language narrative generation. Population data comes from KOSIS (2020–2026 releases); name distributions come from the Supreme Court of Korea.

Nemotron-Personas-Korea is the latest addition to the Nemotron-Personas Collection, which also covers the USA, Japan, India, Singapore (with AI Singapore), Brazil (with WideLabs), and France (with Pleias). If you're building a multilingual agent that serves Korean users alongside other markets, you can blend personas across countries in the same pipeline.

Why This Matters for Autonomous Agents

Most agents today are identity-blind. They follow instructions without any grounding in who they're serving. For example, an agent that books a Korean hospital appointment using US scheduling conventions, or addresses a 60-year-old patient in 반말 (“banmal,” informal language), doesn't just feel wrong. It fails.

Nemotron-Personas-Korea changes this by giving your agent a Korean operating context. Load a persona into the system prompt and the agent inherits that persona's region, occupation, communication norms, and domain expertise.
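To make the contrast concrete, here is a minimal sketch of an identity-blind request next to a persona-grounded one. The `build_messages` helper, the prompt wording, and the sample persona values are hypothetical, and the field names (`region`, `occupation`) follow this article's tutorial code rather than a confirmed dataset schema.

```python
from typing import Optional


def build_messages(question: str, persona: Optional[dict] = None) -> list:
    """Return chat messages, adding a persona-derived system message if given."""
    messages = []
    if persona is not None:
        # Hypothetical prompt wording; the field names are assumptions.
        system = (
            f"You are a {persona['occupation']} based in {persona['region']}. "
            "Reply in formal Korean (존댓말) and follow Korean conventions."
        )
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": question})
    return messages


# Illustrative values only, not a record from the dataset.
sample_persona = {"region": "제주", "occupation": "보건소 간호사"}
question = "독감 예방접종은 언제 맞아야 하나요?"  # "When should I get a flu shot?"

blind = build_messages(question)                     # identity-blind request
grounded = build_messages(question, sample_persona)  # persona-grounded request
```

Only the system message differs between the two requests; the user question is unchanged.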
This works across any agent framework. Deploy with NemoClaw (NVIDIA's open-source reference stack for always-on agents running in NVIDIA OpenShell sandboxes, on anything from RTX PCs to DGX Spark), serve through NVIDIA NIM for production inference, or call the NVIDIA API directly. The persona layer is framework-agnostic, acting as a well-structured system prompt grounded in real Korean demographics.

Tutorial: From Synthetic Persona to Sovereign Agent

🔗 Resources
- Nemotron-Personas-Korea for seeding training
- NeMo Data Designer for synthesizing domain-specific data
- NVIDIA NemoClaw for deploying always-on agents
- NVIDIA Developer Discord for community support

Step 1: Load and Explore the Dataset

Load the dataset and explore what's available. Each record contains structured demographic fields alongside rich, natural-language persona narratives.

    from datasets import load_dataset

    # Load the Korea personas dataset
    dataset = load_dataset("nvidia/Nemotron-Personas-Korea")

    # See all available fields
    print(dataset["train"].column_names)

    # Preview a single record to understand the schema
    print(dataset["train"][0])

Step 2: Filter and Select a Persona

Filter the dataset by occupation, region, age, or any combination of fields to find personas that match your target domain. Here we'll build a Korean public health agent.

    # Filter for healthcare-related occupations
    # "보건" = public health, "간호" = nursing, "의료" = medical, "의사" = doctors
    health_keywords = ["보건", "간호", "의료", "의사"]
    health_personas = dataset["train"].filter(
        lambda x: any(keyword in x["occupation"] for keyword in health_keywords)
    )
    print(f"Found {len(health_personas)} health personas")

    # Select one persona to ground your agent
    persona = health_personas[0]
    print(persona)

You can refine further by region (e.g., only Jeju-based health workers), education level, or life stage. The dataset is large enough to find highly specific slices.

Step 3: Define Your Agent Behavior

This is where persona data becomes agent behavior.
The structured fields (name, region, occupation, skills) become the agent's identity. You layer behavioral instructions and task scope on top. The result is an agent that reasons like a Korean professional in a specific role and region.

    # Build a system prompt from persona attributes
    # Field names follow this tutorial; confirm them against the
    # column_names output from Step 1.
    # The prompt below instructs the agent to:
    # - Respond using formal Korean (존댓말)
    # - Provide guidance on local public health clinics
    # - Base answers on Korean public health policy
    # - Consider cultural context in consultations
    system_prompt = f"""당신은 한국의 공중보건 상담 AI 에이전트입니다.

    [신원]  # Identity
    - 이름: {persona['name']}  # Name
    - 지역: {persona['region']}  # Region
    - 직업: {persona['occupation']}  # Occupation
    - 전문분야: {persona['skills']}  # Specialization

    [행동 지침]  # Behavior guidelines
    - 한국어 존댓말을 사용하여 응답하세요.  # Use formal Korean
    - 지역 보건소 및 공공 의료 체계에 대한 안내를 제공하세요.  # Guide on local clinics
    - 한국 공중보건 정책과 절차를 기반으로 정확한 정보를 제공하세요.  # Follow KR health policy
    - 문화적 맥락을 고려하여 상담하세요.  # Consider cultural context

    [업무 범위]  # Task scope
    - 예방접종 일정 안내  # Vaccination scheduling
    - 건강검진 절차 설명  # Health screening procedures
    - 지역 보건 자원 연결  # Connect to local health resources
    - 공중보건 관련 일반 상담  # General public health consultation
    """

Step 4: Deploy Your Agent

Connect your persona-grounded prompt to a model for inference. You have three options depending on your setup:

- NVIDIA API catalog: fastest way to test (shown below)
- NVIDIA NIM: self-hosted inference for production deployments
- NemoClaw: reference stack for deploying always-on agents that runs on anything from RTX PCs to DGX Spark

    from openai import OpenAI

    # NVIDIA API catalog (OpenAI-compatible)
    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",
        api_key="nvapi-YOUR_KEY"  # Get a key at build.nvidia.com
    )

    response = client.chat.completions.create(
        model="nvidia/nemotron-nano-8b-v1",
        messages=[
            {"role": "system", "content": system_prompt},
            # "When should I get a flu shot?"
            {"role": "user", "content": "독감 예방접종은 언제 맞아야 하나요?"}
        ],
        temperature=0.7,
        max_tokens=512
    )
    print(response.choices[0].message.content)

The same workflow applies to any domain. Swap the persona filter and task scope, and you have a new agent: a 금융 ("geumyung," finance) persona becomes a retail banking advisor, a 교육 ("gyoyuk," education) persona becomes a tutoring assistant, and a 공무원 ("gongmuwon," civil servant) persona becomes a government health services agent.

What Grounding Changes

Here's the same question, "독감 예방접종은 언제 맞아야 하나요?" (When should I get a flu shot?), answered with and without persona grounding.

| | Without Personas | With Korean Health Worker Personas |
|---|---|---|
| Language | Responds in English/generic Korean | Natural 존댓말 (formal Korean) appropriate for health consultation |
| Content | References CDC/global guidance | References Korean 보건소 (public health center) schedule, national vaccination program |
| Specificity | "Visit your local clinic" | "가까운 보건소에서 무료 접종이 가능합니다" ("Free vaccination is available at your nearest public health center") with regional context |
| Trust | None | Cites Korean public health policy, uses professional medical Korean |

The persona goes beyond translation: it adds local context, which yields an agent your users can trust.

Come Build with Us in Seoul

NVIDIA Nemotron Developer Days comes to Seoul today and tomorrow, April 21–22, 2026, the first time the event has been held outside GTC. The two days include technical sessions on sovereign AI and open models, plus a hands-on hackathon where you'll have an opportunity to use Nemotron-Personas-Korea to build domain-specific Korean agents and a claw. 🦞

Join in person or via livestream. Share what you build for a chance to be featured in a future NVIDIA tutorial.
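The domain swap and the region refinement mentioned in the tutorial can be sketched with a small lookup table and a predicate. This is a minimal illustration under stated assumptions: the `DOMAINS` table, the `matches` and `task_scope` helpers, the finance task wording, and the sample records are all hypothetical, and the `occupation` and `region` field names follow the tutorial code rather than a confirmed dataset schema.

```python
from typing import Optional

# Hypothetical domain table: occupation keywords plus a task scope per agent.
DOMAINS = {
    "health": {
        "keywords": ["보건", "간호", "의료", "의사"],
        "tasks": ["예방접종 일정 안내", "건강검진 절차 설명"],
    },
    "finance": {
        "keywords": ["금융"],
        "tasks": ["예금 및 대출 상품 안내"],  # illustrative task wording
    },
}


def matches(record: dict, domain: str, region: Optional[str] = None) -> bool:
    """True if the occupation hits a domain keyword and, if set, the region."""
    keywords = DOMAINS[domain]["keywords"]
    in_domain = any(keyword in record["occupation"] for keyword in keywords)
    in_region = region is None or region in record["region"]
    return in_domain and in_region


def task_scope(domain: str) -> str:
    """Render the domain's task list as bullet lines for a system prompt."""
    return "\n".join(f"- {task}" for task in DOMAINS[domain]["tasks"])


# Illustrative records, not taken from the dataset.
records = [
    {"occupation": "보건소 간호사", "region": "제주"},
    {"occupation": "금융 상담원", "region": "서울"},
    {"occupation": "간호조무사", "region": "부산"},
]

jeju_health = [r for r in records if matches(r, "health", region="제주")]
finance = [r for r in records if matches(r, "finance")]
```

With the dataset loaded in Step 1, the same predicate could be passed to `dataset["train"].filter(lambda r: matches(r, "health", region="제주"))`, assuming the columns exist under those names. Switching agents then means changing one dictionary key instead of rewriting the filter and the prompt.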