AI on Mobile: Two Architectural Patterns
Mobile AI applications fall into two architectural camps: those that process AI workloads in the cloud and stream results to the device, and those that run inference directly on-device using models optimized for mobile hardware. Each pattern has distinct tradeoffs around latency, privacy, cost, and capability.
Cloud AI gives you access to the most capable models (such as GPT-4o or Claude 3.5 Sonnet) with no device constraints, at the cost of network latency and per-request API fees. On-device AI with small models like Phi-3 Mini or Llama 3.2 (1B and 3B variants) runs offline and adds no per-request cost after deployment, but is limited to models that fit in device memory.
Most production apps use both: cloud for complex tasks, on-device for fast, private, offline-capable features.
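One way to picture the hybrid pattern is a small routing function that decides, per task, where inference should run. This is a minimal sketch under stated assumptions: the task kinds, the `mustStayOnDevice` flag, and the `chooseRoute` name are illustrative, not taken from any library.

```typescript
// Hypothetical task descriptor; the field names are illustrative assumptions.
type AiTask = {
  kind: 'classification' | 'intent' | 'search' | 'chat' | 'summarization'
  mustStayOnDevice: boolean // e.g. the input contains private user data
}

type Route = 'on-device' | 'cloud' | 'unavailable'

// Task kinds assumed to be small enough for a bundled on-device model.
const ON_DEVICE_KINDS = new Set<AiTask['kind']>(['classification', 'intent', 'search'])

export function chooseRoute(task: AiTask, isOnline: boolean): Route {
  // Prefer local inference whenever the local model can handle the task:
  // it is fast, private, and works offline.
  if (ON_DEVICE_KINDS.has(task.kind)) return 'on-device'
  // Complex tasks need the cloud, which requires connectivity and
  // permission for the data to leave the device.
  if (task.mustStayOnDevice || !isOnline) return 'unavailable'
  return 'cloud'
}
```

For example, `chooseRoute({ kind: 'chat', mustStayOnDevice: false }, true)` returns `'cloud'`, while an intent-detection task resolves to `'on-device'` whether or not the device is online.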
Cloud AI Integration in React Native
Never call LLM APIs directly from your React Native app — that would expose your API key in the client bundle. Instead, build a backend API that your app calls, and have the backend call the LLM provider.
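// hooks/useChat.ts
import { useState } from 'react'
// Hypothetical helper that returns the signed-in user's token for YOUR backend
// (not an LLM provider key). Replace with your own auth layer.
import { getUserToken } from '../auth'

export type ChatMessage = { role: 'user' | 'assistant'; content: string }

export function useChat() {
  const [messages, setMessages] = useState<ChatMessage[]>([])
  const [isLoading, setIsLoading] = useState(false)

  const sendMessage = async (content: string) => {
    setIsLoading(true)
    const userMessage: ChatMessage = { role: 'user', content }
    // Build the outgoing history once so the request body and state agree.
    const history = [...messages, userMessage]
    setMessages(prev => [...prev, userMessage])
    try {
      const response = await fetch('https://api.yourapp.com/chat', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${getUserToken()}`
        },
        body: JSON.stringify({ messages: history })
      })
      // fetch only rejects on network failure, so check the HTTP status explicitly.
      if (!response.ok) throw new Error(`Chat request failed: ${response.status}`)
      const data = await response.json()
      setMessages(prev => [...prev, { role: 'assistant', content: data.content }])
    } finally {
      setIsLoading(false)
    }
  }

  return { messages, sendMessage, isLoading }
}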
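The hook above assumes a backend endpoint that holds the provider key. A minimal, framework-agnostic sketch of that endpoint's logic might look like the following. Everything here is an assumption for illustration: `handleChat`, `ChatDeps`, `verifyUserToken`, the placeholder provider URL, and the `Bearer` scheme used for the provider key (real providers differ in header names and payload shapes).

```typescript
type ChatMessage = { role: 'user' | 'assistant'; content: string }

// Narrow fetch-like signature so the sketch needs no DOM or Node typings.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; json: () => Promise<unknown> }>

// Hypothetical dependencies, injected so the handler stays framework-agnostic.
type ChatDeps = {
  providerUrl: string // placeholder for your LLM provider's endpoint
  providerKey: string // loaded from server-side configuration only
  verifyUserToken: (token: string) => Promise<boolean>
  fetchImpl: FetchLike
}

export async function handleChat(
  authHeader: string | undefined,
  body: { messages: ChatMessage[] },
  deps: ChatDeps
): Promise<{ status: number; json: unknown }> {
  // Authenticate the app user before spending any provider quota.
  const prefix = 'Bearer '
  const token = authHeader !== undefined && authHeader.startsWith(prefix)
    ? authHeader.slice(prefix.length)
    : ''
  if (token === '' || !(await deps.verifyUserToken(token))) {
    return { status: 401, json: { error: 'unauthorized' } }
  }
  // The provider key is attached here, on the server, and never reaches the client.
  const upstream = await deps.fetchImpl(deps.providerUrl, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${deps.providerKey}` // assumed scheme; check your provider
    },
    body: JSON.stringify({ messages: body.messages })
  })
  if (!upstream.ok) return { status: 502, json: { error: 'upstream request failed' } }
  // Passed through unchanged; mapping to { content } is provider-specific.
  return { status: 200, json: await upstream.json() }
}
```

Wiring this into a real server means passing the request's Authorization header and parsed JSON body to `handleChat` and writing the returned status and JSON back to the client.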
Streaming Responses in React Native
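Streaming makes AI chat feel much faster: the user sees text appear as it is generated instead of waiting for the complete response. The code below reads the response incrementally through the Fetch API's response.body stream. Support for response.body in React Native's built-in fetch has been limited, so confirm that it is defined on the version you target. If it is not, you will need a streaming-capable fetch implementation (such as expo/fetch in recent Expo SDKs) or a polyfill.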
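// Lives inside useChat so it can reach setMessages.
const streamMessage = async (content: string) => {
  const response = await fetch('https://api.yourapp.com/chat/stream', {
    method: 'POST',
    body: JSON.stringify({ message: content }),
    headers: { 'Content-Type': 'application/json' }
  })
  // Fail loudly if the request failed or this runtime exposes no body stream.
  if (!response.ok || !response.body) {
    throw new Error(`Streaming unavailable (status ${response.status})`)
  }
  const reader = response.body.getReader()
  const decoder = new TextDecoder()
  let assistantMessage = ''
  // Add an empty assistant message, then overwrite it as chunks arrive.
  setMessages(prev => [...prev, { role: 'assistant', content: '' }])
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    // stream: true keeps multi-byte characters split across chunks intact.
    assistantMessage += decoder.decode(value, { stream: true })
    setMessages(prev => [
      ...prev.slice(0, -1),
      { role: 'assistant', content: assistantMessage }
    ])
  }
}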
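The read loop above runs until the server closes the stream, so it keeps going even if the user leaves the screen. A hedged sketch of a cancellable variant follows, using the standard AbortController API. The `readStream` helper and the `ChunkReader` type are hypothetical names introduced for this example.

```typescript
// Minimal reader shape, matching what response.body.getReader() returns.
type ChunkReader = {
  read: () => Promise<{ done: boolean; value?: Uint8Array }>
  cancel: () => Promise<void>
}

export async function readStream(
  reader: ChunkReader,
  signal: { readonly aborted: boolean }, // an AbortSignal satisfies this
  onText: (textSoFar: string) => void
): Promise<'completed' | 'cancelled'> {
  const decoder = new TextDecoder()
  let text = ''
  while (true) {
    // Stop between chunks once the caller has aborted.
    if (signal.aborted) {
      await reader.cancel()
      return 'cancelled'
    }
    const { done, value } = await reader.read()
    if (done) return 'completed'
    text += decoder.decode(value, { stream: true })
    onText(text)
  }
}
```

In a component, you would create `const controller = new AbortController()`, pass `controller.signal` both to `fetch` (so the in-flight request is aborted as well) and to `readStream`, and call `controller.abort()` from the cleanup function of a `useEffect`.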
Building the Chat UI
Use FlatList with the inverted prop for the message list. An inverted list renders its first item at the bottom, so the data is passed newest-first and new messages appear at the bottom without manual scroll management. Show a typing indicator component while a response is streaming.
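import { useState } from 'react'
import { FlatList, StyleSheet, Text, TextInput, TouchableOpacity, View } from 'react-native'
import { useChat } from './hooks/useChat' // adjust to where the hook lives

// Minimal placeholder bubble; replace with your own styled component.
function MessageBubble({ message }: { message: { role: string; content: string } }) {
  const style = message.role === 'user' ? styles.user : styles.assistant
  return <Text style={style}>{message.content}</Text>
}

function ChatScreen() {
  const { messages, sendMessage, isLoading } = useChat()
  const [input, setInput] = useState('')
  // Block empty messages and double sends while a request is in flight.
  const canSend = !isLoading && input.trim().length > 0
  // The inverted list expects newest-first data. Keys use each message's
  // original position so they stay stable as new messages are added.
  return (
    <View style={styles.container}>
      <FlatList
        data={[...messages].reverse()}
        inverted
        keyExtractor={(_, i) => String(messages.length - 1 - i)}
        renderItem={({ item }) => <MessageBubble message={item} />}
      />
      <View style={styles.inputRow}>
        <TextInput
          value={input}
          onChangeText={setInput}
          placeholder="Ask anything..."
          style={styles.input}
        />
        <TouchableOpacity
          onPress={() => { sendMessage(input.trim()); setInput('') }}
          disabled={!canSend}
        >
          <Text>Send</Text>
        </TouchableOpacity>
      </View>
    </View>
  )
}

const styles = StyleSheet.create({
  container: { flex: 1 },
  inputRow: { flexDirection: 'row', padding: 8 },
  input: { flex: 1 },
  user: { alignSelf: 'flex-end' },
  assistant: { alignSelf: 'flex-start' }
})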
On-Device AI with ONNX Runtime
For features that need to work offline or handle private data, ONNX Runtime for React Native enables running quantized ML models directly on device. Common use cases: text classification, intent detection, semantic search on local data.
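import { InferenceSession, Tensor } from 'onnxruntime-react-native'

// tokenIds: number[] of length 512 produced by your tokenizer (not shown here).
const session = await InferenceSession.create('model.onnx')
// Token IDs are integers; many exported transformer models expect int64 here.
// Check the input name and dtype your model actually declares.
const inputTensor = new Tensor('int64', BigInt64Array.from(tokenIds.map(BigInt)), [1, 512])
const output = await session.run({ input_ids: inputTensor })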
- Use INT8 quantized models: they are roughly 4x smaller than their FP32 equivalents and usually run faster on mobile CPUs
- Test on real devices, not simulators — performance differs dramatically
- Bundle the model with the app for small models (<50MB); download on first launch for larger models
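For the classification and intent-detection use cases mentioned above, the model's raw output is typically a vector of logits that still has to be turned into a label. The following is a minimal sketch of that post-processing step. The `softmax` and `topLabel` helpers are written for this example, and the output name and label set depend entirely on your model.

```typescript
// Convert raw logits into probabilities using a numerically stable softmax.
export function softmax(logits: ArrayLike<number>): number[] {
  const values = Array.from(logits)
  // Subtracting the max avoids overflow in Math.exp for large logits.
  const max = Math.max(...values)
  const exps = values.map(v => Math.exp(v - max))
  const sum = exps.reduce((a, b) => a + b, 0)
  return exps.map(e => e / sum)
}

// Pick the most probable label; labels[i] must correspond to logits[i].
export function topLabel(
  logits: ArrayLike<number>,
  labels: string[]
): { label: string; confidence: number } {
  const probs = softmax(logits)
  let best = 0
  for (let i = 1; i < probs.length; i++) {
    if (probs[i] > probs[best]) best = i
  }
  return { label: labels[best], confidence: probs[best] }
}
```

Assuming the model exposes an output named `logits`, a call might look like `topLabel(output.logits.data as Float32Array, ['greeting', 'question', 'complaint'])`. A confidence threshold can then decide whether to trust the on-device result or fall back to the cloud.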
Performance and UX Considerations
- Show a skeleton UI immediately while the AI call is in-flight — never show a blank screen
- Implement request cancellation when the user navigates away mid-stream
- Cache AI responses locally for repeated identical queries
- Use optimistic UI updates where the response is predictable
- Handle offline gracefully — queue requests and retry when connectivity returns
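Of the items above, response caching is the easiest to sketch in isolation. The example below is a small in-memory cache with a time-to-live and a size bound. The `ResponseCache` class, its defaults, and the whitespace normalization of prompts are assumptions made for illustration, and persisting entries to device storage is left out.

```typescript
type CacheEntry = { value: string; expiresAt: number }

export class ResponseCache {
  private entries = new Map<string, CacheEntry>()
  private maxEntries: number
  private ttlMs: number

  constructor(maxEntries = 100, ttlMs = 10 * 60 * 1000) {
    this.maxEntries = maxEntries
    this.ttlMs = ttlMs
  }

  // Prompts that differ only in surrounding or repeated whitespace share a key.
  private key(prompt: string): string {
    return prompt.trim().replace(/\s+/g, ' ')
  }

  get(prompt: string, now: number = Date.now()): string | undefined {
    const k = this.key(prompt)
    const entry = this.entries.get(k)
    if (entry === undefined) return undefined
    if (entry.expiresAt <= now) {
      // Expired entries are removed lazily on read.
      this.entries.delete(k)
      return undefined
    }
    return entry.value
  }

  set(prompt: string, value: string, now: number = Date.now()): void {
    const k = this.key(prompt)
    // Delete first so a refreshed key moves to the newest insertion position.
    this.entries.delete(k)
    this.entries.set(k, { value, expiresAt: now + this.ttlMs })
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
  }
}
```

A caller would check `cache.get(prompt)` before hitting the backend and call `cache.set(prompt, answer)` after a successful response. Caching only makes sense for queries whose answers do not depend on conversation history or fresh data.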
AI-enhanced mobile apps that handle the streaming, offline, and error states properly feel dramatically better than those that don't. These UX details matter as much as the underlying model quality.