AI Data Lead
Our client is a global LegalTech organization delivering data-driven digital solutions that help businesses manage regulatory complexity, improve compliance processes, and make smarter decisions.
📁 About our client:
Our client is the global leader in regulatory and sustainability intelligence, helping the world's largest companies navigate Environment, Health & Safety, Corporate Sustainability, and Product Compliance.
They are strengthening their AI capabilities to maintain their lead as an AI-native compliance intelligence platform. The foundational AI platform is built around ontology, data, and platform surfaces, with strong CAIO sponsorship and direct executive backing for the product layer of the AI transformation.
🚀 Responsibilities:
Data Platform & Retrieval
Design, build, and own the Client & AI Data Vault end-to-end: document ingestion, clause-level parsing, chunking, versioning, and the query surface that every AI product consumes.
Design the data model that maps client artifacts to the regulatory ontology so applicability and gap-analysis products can actually answer client-specific questions.
Own the general database layer: SQL schema design, performance tuning, and the contract between operational stores and the AI retrieval surface.
Vector DB & RAG Operations
Operate the vector database tier in production: indexing strategy, embedding model selection, similarity search tuning, recall vs. latency trade-offs, and the RAG infrastructure on top.
Prototype chunking and embedding strategies rapidly against real client documents before committing to production patterns.
Carry strong opinions (loosely held) about where embeddings earn their keep versus where classical retrieval still wins.
Document Intelligence & Parsing
Build robust parsers for messy source formats (PDFs, DOCX, spreadsheets, scanned documents, and regulatory filings), with clause extraction, table handling, and layout awareness.
Data Quality & Governance
Define and enforce data quality: ingestion validation, embedding drift detection, versioning, and the audit trail every regulated AI product needs.
Team & Technical Leadership
Lead a small squad of data and document intelligence engineers as a player-coach: roughly half of your time on hands-on build, half on direction-setting, review, and growing the people around you.
Own hiring into the team and set the technical bar: code review culture, pipeline patterns, testing strategies, and operational discipline.
👤 Profile sought:
Experience:
Clear history of shipping data and retrieval systems into production with measurable impact. Prototypes that never left the lab don't count.
Hands-on experience building RAG systems or hybrid retrieval (vector + keyword + graph), including the failure modes and how to debug them.
Track record designing chunking strategies and embedding pipelines for real-world documents, especially for legal, regulatory, or technical content.
Production experience with relational databases: schema design, query optimization, and migration discipline.
Technical skills:
Deep, hands-on experience operating a vector database in production (Qdrant, Pinecone, Weaviate, pgvector, or equivalent). Index tuning, recall vs. latency management, and clear opinions on which tool to reach for when.
Strong Python skills, plus comfort with at least one compiled or systems-level language when performance matters.
Hands-on experience parsing messy PDF and DOCX files and extracting structured data from them. Familiarity with layout-aware models, OCR, or clause-level extraction tooling is a strong plus.
Production experience with cloud platforms (Azure preferred, AWS or GCP a plus). Comfortable with containerization, CI/CD, and the operational side of data infrastructure.
Active user of AI coding assistants, integrated into daily workflow, with a clear point of view on what they are good and bad at, especially for parser development and schema exploration.
Bonus:
Experience with multi-tenant data architectures and tenant isolation patterns.
Familiarity with Elasticsearch, OpenSearch, or comparable full-text search engines.
Background in NLP, information extraction, or document understanding.
Experience with Kafka or equivalent message brokers for ingestion pipelines.
Experience in regulated industries (EHS, legal, financial, healthcare) where auditability and versioning are non-negotiable.
Contributions to open-source retrieval, embedding, or document parsing tooling.
Languages:
Fluent English required.
Soft skills:
Player-coach leadership style: technically credible, willing to ship code, equally focused on growing the team.
Opinionated about tools, pragmatic about deadlines.
Strong communicator across technical and business audiences.
Bias for delivery and operational discipline over research polish.
🌍 Benefits & Culture:
Tech stack: Vector DBs (Qdrant / Pinecone / Weaviate / pgvector), Python, Azure, RAG, embeddings, document parsing, SQL, Kafka, Elasticsearch.
Direct access to leadership and the Chief AI Officer: short feedback loops, real influence on architecture and direction.
A seat on a small, high-impact AI team where the data layer is the single largest product-portfolio unlock.
Culture that treats AI tools as force multipliers, not novelties.
Competitive compensation, benefits, and flexibility.
Hybrid role based at the Lisbon office (3 days a week on-site).
💼 Department: AI & Engineering
📍 Location: Lisbon
📆 Start date: ASAP