Unstructured Data
Data that does not follow a predefined format — emails, documents, images, videos, and conversations — which AI can now analyse and extract value from.
Unstructured data is information that does not follow a predefined data model or format. It includes emails, documents, images, videos, audio recordings, social media posts, chat logs, and presentations — essentially, everything that does not fit neatly into a spreadsheet or database table.
Why unstructured data matters
Commonly cited industry estimates put 80-90% of the data organisations generate in the unstructured category. Before AI, this data was largely inaccessible to automated analysis. Companies stored it, but extracting insights required humans to read, watch, or listen to it manually. AI, particularly LLMs and multimodal models, now lets software do much of that reading, watching, and listening.
Structured vs unstructured data
- Structured data: Lives in databases with defined schemas. Customer records, transaction logs, inventory counts. Easy for machines to process but captures only a fraction of organisational knowledge.
- Semi-structured data: Has some organisation but not a rigid schema. Emails (to/from/date header fields plus a free-text body, which is itself unstructured), JSON files, XML documents.
- Unstructured data: No predefined schema. Word documents, PDFs, images, videos, voice recordings. Extracting meaning at scale requires AI rather than simple database queries.
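To make the three categories concrete, here is a minimal Python sketch of one piece of customer information in each form. The field names, addresses, and message text are invented for illustration; only the standard-library `email` parser is real.

```python
from email import message_from_string

# Structured: every field has a fixed name and type, like a database row.
structured_row = {"customer_id": 1042, "order_total": 59.90, "currency": "GBP"}

# Semi-structured: an email has machine-readable headers plus a free-text body.
raw_email = (
    "From: alice@example.com\n"
    "To: support@example.com\n"
    "Subject: Late delivery\n"
    "\n"
    "My order arrived a week late and the box was damaged.\n"
)
message = message_from_string(raw_email)
sender = message["From"]        # header field: directly addressable
subject = message["Subject"]    # header field: directly addressable
body = message.get_payload()    # body: free text with no schema

# Unstructured: plain prose. Nothing tells a program which words matter.
call_note = "Customer rang again about the damaged box and wants a refund."

print(structured_row["order_total"], sender, subject)
print(body.strip())
```

A program can read `order_total` or `sender` directly, but answering "what is the customer unhappy about?" from `body` or `call_note` means interpreting language, which is where AI comes in.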
How AI processes unstructured data
- Text documents: LLMs can read, summarise, extract data from, and answer questions about text documents, although very long documents may need to be split into sections first.
- Images and PDFs: Multimodal models can interpret charts, diagrams, photos, and scanned documents.
- Audio: Speech-to-text converts voice recordings into analysable text.
- Video: Vision models can describe, classify, and search video content.
- Mixed media: Modern AI handles documents that combine text, images, tables, and charts.
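In practice, a pipeline often begins by deciding which kind of model each file should go to. The sketch below routes files by extension to the processing steps listed above. The route names and the extension table are hypothetical, and a real system would usually inspect file contents rather than trust the extension.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a processing step.
ROUTES = {
    ".txt": "llm_text",          # text documents: read directly by an LLM
    ".pdf": "multimodal",        # PDFs and scans: multimodal model
    ".png": "multimodal",        # images: multimodal model
    ".mp3": "speech_to_text",    # audio: transcribe first, then analyse the text
    ".mp4": "vision_video",      # video: vision model
}


def route(filename: str) -> str:
    """Return the processing step for a file, or flag it for manual review."""
    suffix = Path(filename).suffix.lower()
    return ROUTES.get(suffix, "manual_review")


for name in ["minutes.txt", "contract.pdf", "call.MP3", "archive.zip"]:
    print(name, "->", route(name))
```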
Business applications
- Knowledge management: Making internal documents searchable and queryable using semantic search and RAG.
- Customer insight: Analysing support tickets, reviews, call transcripts, and social media for themes and sentiment.
- Compliance: Scanning contracts, emails, and communications for regulatory risks.
- Due diligence: Processing thousands of documents during mergers, acquisitions, or audits.
- Research: Synthesising insights from reports, papers, and articles.
The unstructured data opportunity
Most organisations are sitting on vast stores of unstructured data that contain valuable insights. The combination of LLMs, vector databases, and RAG makes it practical to search and query this data at a scale that manual review could never reach. A well-built system can answer questions across thousands of documents in seconds.
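The retrieval step at the heart of RAG can be illustrated with a toy example. The sketch below ranks documents against a question by cosine similarity. It makes a deliberate simplification: word counts stand in for the learned embeddings a real system would use, and a Python list stands in for a vector database. The documents and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def vectorise(text: str) -> Counter:
    """Turn text into a word-count vector (a stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k documents most similar to the query."""
    query_vector = vectorise(query)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vector, vectorise(doc)),
        reverse=True,
    )
    return ranked[:top_k]


documents = [
    "Refund policy: customers may return items within 30 days of delivery.",
    "Office opening hours are 9am to 5pm on weekdays.",
    "Supplier contract renewal is due at the end of the financial year.",
]
best = retrieve("How many days do customers have to return items?", documents)
print(best[0])
```

In a full RAG system, the retrieved passage would then be passed to an LLM together with the question so that the answer is grounded in the organisation's own documents.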
Challenges
- Data quality: Unstructured data is often messy, duplicated, or outdated.
- Privacy: Unstructured data frequently contains personal information that must be handled carefully.
- Volume: The sheer amount of unstructured data can be overwhelming without clear prioritisation.
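Two of these challenges, duplication and personal information, are often tackled with a cleaning pass before any AI model sees the data. The following is a minimal sketch under stated assumptions: duplicates are detected by hashing whitespace- and case-normalised text, and only email addresses are redacted, using a deliberately simple pattern. Real privacy tooling covers far more than this, and the sample documents are invented.

```python
import hashlib
import re

# Simplified pattern for illustration; it will not catch every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first copy of each document, comparing normalised content."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def redact_emails(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


documents = [
    "Contact jane.doe@example.com about the renewal.",
    "contact  jane.doe@example.com about the renewal.",  # near-identical copy
    "Invoice 17 was paid in March.",
]
cleaned = [redact_emails(doc) for doc in deduplicate(documents)]
print(cleaned)
```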
Why This Matters
Unstructured data represents the largest untapped data asset in most organisations. AI's ability to process it transforms previously inaccessible information into actionable insights. Understanding this capability helps you identify high-value opportunities where AI can extract value from data you already have but could never efficiently analyse before.